arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.19679 2026-06-19 cs.LG cs.AI New submission

LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing

Masih Eskandar, Miquel Sirera Perelló, Stratis Ioannidis, Jennifer Dy

Affiliations Department of Electrical and Computer Engineering

AI summary Proposes LOKI, which selects layers dynamically using the Hilbert-Schmidt Independence Criterion and projects gradient updates onto the null space of the model weights, enabling lifelong knowledge editing without access to prior knowledge and improving average accuracy by up to 14%.

Abstract

Lifelong knowledge editing aims to efficiently and sequentially update language models over time, as new knowledge becomes available or when the model makes mistakes, while preserving acceptable performance on past knowledge. One unresolved challenge is that existing methods modify a fixed set of layers for all new knowledge samples, reducing flexibility and increasing catastrophic forgetting. Another is requiring access to previous knowledge and extensive pre-processing to obtain data statistics. To address these challenges, we introduce LOKI, a novel approach that uses dynamic layer selection based on the Hilbert-Schmidt Independence Criterion and projects gradient updates onto the null-space of the model weights, bypassing the requirement for previous knowledge access. We show that LOKI achieves superior performance to existing approaches across a wide variety of experiments, achieving up to a 14% improvement in average accuracy.
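
As a rough illustration of the null-space projection mentioned in the abstract, the sketch below removes from a candidate update every component that lies in the row space of a weight matrix W, so that what remains is orthogonal to all rows of W. The abstract does not say how LOKI builds its projector, so the Gram-Schmidt construction and the names `orthonormal_rows` and `project_to_null_space` are assumptions made here for illustration only.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_rows(W, tol=1e-10):
    """Orthonormal basis of the row space of W (modified Gram-Schmidt)."""
    basis = []
    for row in W:
        v = list(row)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > tol:  # skip rows that are linearly dependent
            basis.append([vi / norm for vi in v])
    return basis


def project_to_null_space(update, W):
    """Remove the row-space component of `update`, leaving x with W x = 0."""
    out = list(update)
    for b in orthonormal_rows(W):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Hypothetical toy data: W has a one-dimensional null space (the third axis).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
update = [3.0, 4.0, 5.0]
projected = project_to_null_space(update, W)
```

Because the projected vector is orthogonal to every row of W, applying W to it gives zero, which is the property a null-space constraint relies on; in a real model the projector would be built per layer and applied to gradient matrices rather than to a single vector.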

2606.19676 2026-06-19 cs.CV cs.AI New submission

TeleMorpher: Toward Robust Simultaneous Motion-Location Editing

Haengbok Chung

AI summary Proposes TeleMorpher, a diffusion-based one-shot framework that edits a protagonist's motion and location in a video simultaneously by combining motion priors, pose warping, and injection into a baseline motion editor; it performs strongly in both quantitative and qualitative evaluations.

Abstract

Diffusion models have achieved remarkable success in image and video generation and editing. While recent studies have extended these efforts toward motion editing, simultaneously transforming both motion and location remains largely unexplored despite its practical importance. To better understand robust motion-location editing, we first analyze the fundamental factors that degrade its quality. Based on this analysis, we propose TeleMorpher, to the best of our knowledge one of the first one-shot frameworks for simultaneous motion-location editing. Our approach leverages motion priors, a target motion-centric video generated from an off-the-shelf model as motion-editing guidance, and the ground truth motion to enable more controllable and precise motion-location editing. Our framework works as follows: (1) We first disentangle the protagonist and the background via pre-trained segmentation and inpainting models. (2) We then introduce a training-free pose warping that edits the protagonist's motion with the motion prior as guidance. (3) The resulting warped motion video is directly injected into a baseline motion editor during inference, mitigating the difference between source and target motions while preserving the appearance of the source video. (4) To enhance the reliability of quantitative evaluations, we propose two new LPIPS-based metrics: one measures background consistency before and after motion editing, and the other measures the fidelity of motion editing via the difference between the protagonist's skeletons extracted from the source and target videos. Experiments with in-the-wild videos and the TaiChi dataset demonstrate that TeleMorpher achieves superior performance across both quantitative and qualitative measurements (real-human evaluation), underscoring its effectiveness.

2606.19675 2026-06-19 cs.RO New submission

ForEnt: A Multi-Modal Dataset for Characterizing Quadruped Robot Entrapments in Forest Environments

Natapat Kirdwichai, Danesh Tarapore

Affiliations University of Southampton

AI summary Addresses quadruped robots toppling in forests after becoming entangled in vegetation by presenting ForEnt, a multi-modal dataset with RGB-D, LiDAR, proprioceptive data, and third-person video that records 69 entrapment events and supports reproducible benchmarking.

Comments 8 pages, 7 figures

Abstract

Legged robots are increasingly deployed in forests for ecological surveying and monitoring, yet their autonomy is often interrupted consequent to the challenges posed in traversing forest environments. Forest entrapments, for example, when a robot's legs are ensnared in vines or other vegetation, result in loss of stability and toppling. Such events not only disrupt the mission and require manual intervention, but also risk damage to the robot hardware. To address the absence of a dedicated dataset to investigate these failure modes in forest environments, we present ForEnt, a multi-modal dataset collected with the low-cost Unitree Go2 quadruped across eight forest sites in the Southampton Common Woodlands, UK. For our dataset, over approximately 1.7 km of traversals in 11 sequences were conducted, yielding 69 recorded entrapment events. ForEnt includes time-synchronized RGB-D images, LiDAR scans, proprioceptive data, and third-person video, enabling analysis of terrain factors contributing to entrapment and providing labeled sensor streams for reproducible benchmarking. By supporting the evaluation of entrapment detection strategies, ForEnt lowers the barrier to developing robust quadruped robot deployments in challenging forest environments.

2606.19672 2026-06-19 cs.RO New submission

Safe Local Navigation for Ackermann-Steered Robots in Unmapped Environments

Christian Schaible, Shahin Sirouspour

Affiliations McMaster University

AI summary Proposes a control framework that determines the safest heading angle from local obstacle detections, constructs bounding lines, and optimizes vehicle-to-obstacle clearance, enabling safe local navigation for Ackermann-steered robots in environments with no global goal.

Comments Presented at the 23rd Conference on Robots and Vision (CRV 2026)

Journal ref Proc. 23rd Conference on Robots and Vision (CRV), 2026

Abstract

A control framework is proposed for safe local navigation of mobile robots equipped with Ackermann steering in unmapped environments where a global goal is absent. Based on local obstacle detections, the safest heading angle is determined along the direction of the largest open space ahead of the vehicle. Guided by this direction, bounding lines are constructed on the left and right sides of the vehicle to achieve obstacle separation. These bounding lines are obtained by solving a convex quadratic optimization that maximizes vehicle-to-obstacle clearance. Optionally, conditions are imposed on the bounding lines to preserve parallelism and smooth abrupt changes from prior control steps. A feedback-linearizing controller is then used to regulate the vehicle's distance from one or both bounding lines, effectively enabling tracking of a local reference path that preserves safety through obstacle clearance maximization. Open-source code is included for the application of this control scheme. Experimental results demonstrate that the proposed method produces safer navigation paths with significantly shorter computation times, compared to some existing exploration-based planners.
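
The first step described above, choosing a heading along the largest open space ahead of the vehicle, can be sketched as a widest-gap search over obstacle bearings. This is only a plausible reading of that step: the function name `safest_heading`, the field-of-view limits, and the use of bearing angles alone (ignoring range) are assumptions, and the bounding-line optimization and feedback-linearizing controller from the abstract are not shown.

```python
def safest_heading(obstacle_bearings, fov_min, fov_max):
    """Return the midpoint (radians) of the widest obstacle-free angular gap.

    obstacle_bearings: bearings of detected obstacles in the vehicle frame.
    fov_min, fov_max:  limits of the sensed sector ahead of the vehicle.
    """
    inside = sorted(b for b in obstacle_bearings if fov_min <= b <= fov_max)
    edges = [fov_min] + inside + [fov_max]
    best_width, best_mid = -1.0, 0.0
    for left, right in zip(edges, edges[1:]):
        width = right - left
        if width > best_width:  # keep the first widest gap on ties
            best_width, best_mid = width, (left + right) / 2.0
    return best_mid


# Hypothetical detections at -0.5, 0.1 and 0.3 rad within a +/-1 rad sector.
heading = safest_heading([-0.5, 0.1, 0.3], fov_min=-1.0, fov_max=1.0)
```

With these toy detections the widest gap lies between 0.3 rad and the right edge of the sector, so the returned heading is the midpoint of that interval.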

2606.19668 2026-06-19 cs.CL New submission

Code-Switching Reveals Language Anchoring in Multilingual LLMs

Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee

Affiliations Chung-Ang University; Adobe Research

AI summary Uses grammar-forced code-switching to diagnose language anchoring in multilingual LLMs, introduces the Anchor Bias measure, and designs the CANVAS intervention, which mitigates the question-answering degradation caused by code-switching.

Comments 36 pages, 13 figures, 27 tables

Abstract

Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, whether a CS hidden state aligns closer to its source or target language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.
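
To make the idea of a geometric anchoring measure concrete, the sketch below scores whether a code-switched hidden state sits closer to its source-language or target-language counterpart. The abstract does not give the formula for Anchor Bias, so the cosine-similarity difference used here, and the name `anchor_bias_sketch`, are assumptions and should not be read as the paper's definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def anchor_bias_sketch(h_cs, h_src, h_tgt):
    """Positive: the CS state is closer to the source; negative: to the target."""
    return cosine(h_cs, h_src) - cosine(h_cs, h_tgt)


# Hypothetical 2-D hidden states, chosen only to show the sign convention.
source_anchored = anchor_bias_sketch([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
target_anchored = anchor_bias_sketch([0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

In practice such a score would be computed per layer on hidden states of parallel source, target, and code-switched inputs; the toy vectors here only show which sign corresponds to which anchor.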

2606.19667 2026-06-19 cs.CL New submission

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

Kaizhen Tan, Rong Gu, Mingyuan Li

Affiliations Heinz College of Information Systems and Public Policy, Carnegie Mellon University

AI summary Proposes CacheWeaver, a lightweight prompt-layer method that lowers time-to-first-token in RAG inference through cache-aware evidence ordering, without modifying the serving engine or the retrieved evidence set.

Abstract

Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap. We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, while leaving the serving engine and retrieved evidence set unchanged. Across three vLLM configurations, the method lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in our QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that most reusable prefix locality can be recovered by a simple scheduling layer between retrieval and inference.
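
The prefix tree and greedy walk described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CacheWeaver implementation: the class `TrieNode`, the functions `insert_sequence` and `greedy_order`, the use of visit counts as the reuse score, and the rule of appending leftover documents in retrieval order are all hypothetical choices.

```python
class TrieNode:
    """Node of a prefix tree over evidence (document) identifiers."""

    def __init__(self):
        self.children = {}
        self.count = 0  # how many served sequences passed through this node


def insert_sequence(root, doc_ids):
    """Record one served evidence ordering in the prefix tree."""
    node = root
    for doc_id in doc_ids:
        node = node.children.setdefault(doc_id, TrieNode())
        node.count += 1


def greedy_order(root, retrieved):
    """Reorder `retrieved` so its prefix follows the most-reused trie path."""
    remaining = list(retrieved)
    ordered = []
    node = root
    while remaining:
        candidates = [d for d in remaining if d in node.children]
        if not candidates:
            break  # no cached prefix can be extended further
        best = max(candidates, key=lambda d: node.children[d].count)
        ordered.append(best)
        remaining.remove(best)
        node = node.children[best]
    return ordered + remaining  # the evidence set itself is unchanged


# Hypothetical history of two served requests.
root = TrieNode()
insert_sequence(root, ["a", "b", "c"])
insert_sequence(root, ["a", "b", "d"])
reordered = greedy_order(root, ["d", "b", "a"])
```

The new request retrieves the same documents as an earlier one but in a different order; the walk restores the previously served order so that the token prefix can match the cache, while documents with no cached path keep their retrieval order.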

2606.19662 2026-06-19 cs.CV New submission

Learning When to Denoise: Optimizing Asynchronous Schedules for Latent Diffusion

Bingshuo Qian, Xiang Cheng

Affiliations Department of Electrical and Computer Engineering

AI summary Proposes learning the asynchronous schedule of multi-representation diffusion models with a schedule-corrected objective; on ImageNet 256x256 it uses less than 1% additional training compute, matches the baseline with 4x less training, and reaches FID 1.02 at 600 epochs.

Comments 25 pages, 9 figures, 4 tables

Abstract

Multi-representation diffusion models can improve visual synthesis by denoising complementary views of an image, but their performance depends critically on the asynchronous schedule that determines when each representation is denoised. We propose to learn this schedule. Our method formulates asynchronous flow matching over multiple representation spaces and uses a schedule-corrected objective that keeps each representation's local noising-time weights fixed as the schedule changes. We instantiate the schedule with a flexible parametric class that is convex and monotone by construction, and learn it using a fast joint probe with less than 1% additional training compute. On ImageNet 256x256, the learned schedule substantially improves both convergence speed and final quality under a matched 675M-parameter XL backbone. With AutoGuidance, our 200-epoch model reaches FID 1.05, matching the 800-epoch SFD-XL baseline with 4x less training. Training to 600 epochs further improves to FID 1.02, outperforming the 1B-parameter SFD-XXL result of FID 1.04 while using a smaller model. In the unguided setting, our 200-epoch model reaches FID 2.37, already below the best 800-epoch SFD-XL result (2.54) at 4x less training, and improves to FID 2.14 at 600 epochs. Code is available at https://github.com/bsq532087/LWD
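
One way to obtain a schedule that is convex and monotone by construction is a convex combination of power functions with exponents of at least one. The abstract does not describe the parametric class the authors use, so the construction below, including the name `make_schedule` and its parameters, is purely an illustrative assumption.

```python
import math


def make_schedule(raw_weights, powers):
    """Build s(t) = sum_k w_k * t**p_k on [0, 1] with s(0) = 0 and s(1) = 1.

    raw_weights: unconstrained reals, mapped to positive weights that sum to 1.
    powers:      exponents, each >= 1 so every term is convex and nondecreasing.
    """
    if any(p < 1.0 for p in powers):
        raise ValueError("powers must be >= 1 for convexity")
    weights = [math.exp(w) for w in raw_weights]
    total = sum(weights)
    weights = [w / total for w in weights]

    def schedule(t):
        return sum(w * t ** p for w, p in zip(weights, powers))

    return schedule


# Hypothetical parameters; in a learned setting raw_weights would be trained.
s = make_schedule(raw_weights=[0.0, 1.0, -0.5], powers=[1.0, 2.0, 4.0])
```

Each term is convex and nondecreasing on the unit interval, and a convex combination preserves both properties, so any setting of the unconstrained parameters yields a valid schedule.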

2606.19660 2026-06-19 cs.CR cs.CL New submission

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

Gulshan Saleem, Nisar Ahmed, Muhammad Imran Zaman, Ali Hassan

AI summary Proposes a three-layer defense framework combining input filtering, a context instruction hierarchy, and output auditing, which lowers the prompt-injection attack success rate from 71.4% to 11.3% with a 4.8% false positive rate and a 61.2 ms median latency overhead.

Comments Submitted in ICCK Transactions on Information Security and Cryptography

Abstract

Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.

2606.19659 2026-06-19 cs.CL New submission

SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

Affiliations Meta AI

AI summary Proposes the SAGE-OPD framework, which uses environment feedback and teacher judgment to intervene selectively on student responses and combines confidence weighting with loss normalization to address error compounding in multi-turn on-policy distillation, achieving up to a 13.3% relative improvement on ALFWorld.

Comments 21 pages, 3 figures

Abstract

On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on single-turn settings, while realistic LLM agents interact with environments over multiple turns. In this regime, early errors can alter future observations and compound across the trajectory, and standard dense token-level OPD becomes brittle, as it may over-penalize semantically valid alternatives, reinforce local degeneracies such as repeated actions, and propagate unreliable teacher supervision on off-distribution histories. We propose SAGE-OPD, a verifier-free selective intervention framework specifically designed for multi-turn OPD. Instead of applying teacher supervision uniformly across all turns, SAGE-OPD first observes environment feedback and uses teacher judgment to decide whether each student response should be skipped or intervened on. To further address compounding errors, SAGE-OPD weights token-level distillation by teacher confidence, reducing the influence of uncertain teacher distributions on corrupted or ambiguous histories. Finally, SAGE-OPD applies loss normalization to preserve the overall loss scale of standard OPD while retaining selective turn-level weighting. Experiments on agent tasks show that SAGE-OPD consistently improves over baselines, achieving up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies further demonstrate that turn-level intervention, teacher confidence weighting, and loss normalization provide complementary benefits. Our results suggest that effective multi-turn OPD should remain on-policy, but teacher supervision should be selectively allocated to turns where intervention is necessary and reliable.
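
The combination of turn-level selection, teacher-confidence weighting, and loss normalization can be sketched as a weighted average over the tokens of the turns selected for intervention. The abstract does not give the exact weighting or normalization, so dividing by the total weight (which keeps the result on the scale of a per-token average) and the name `selective_distillation_loss` are assumptions made for illustration.

```python
def selective_distillation_loss(token_losses, intervene, teacher_conf):
    """Confidence-weighted mean of token losses over intervened turns.

    token_losses: per-turn lists of token-level distillation losses.
    intervene:    per-turn booleans; False means the turn is skipped.
    teacher_conf: per-turn lists of teacher confidences in [0, 1].
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for losses, flag, confs in zip(token_losses, intervene, teacher_conf):
        if not flag:
            continue  # skipped turns contribute no supervision
        for loss, conf in zip(losses, confs):
            weighted_sum += conf * loss
            weight_total += conf
    if weight_total == 0.0:
        return 0.0  # nothing selected: no distillation signal this step
    return weighted_sum / weight_total


# Hypothetical two-turn trajectory: the second turn is skipped.
loss = selective_distillation_loss(
    token_losses=[[1.0, 1.0], [3.0]],
    intervene=[True, False],
    teacher_conf=[[0.5, 0.5], [0.9]],
)
```

Tokens from skipped turns are dropped entirely, and low-confidence teacher tokens are down-weighted, while the division by the total weight prevents the loss from shrinking simply because fewer tokens were selected.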

2606.19658 2026-06-19 cs.AI cs.IR cs.MM New submission

Denoising Implicit Feedback for Cold-start Recommendation

Gaode Chen, Shicheng Wang, Shikun Li, Rui Huang, Xinghua Zhang, Yunze Luo, Shipeng Li, Shiming Ge, Ruina Sun, Yinjie Jiang, Jun Zhang

Affiliations Hong Kong Baptist University; Independent Researcher; Peking University; Nanjing University; Institute of Information Engineering, Chinese Academy of Sciences

AI summary Targets noise in implicit feedback for cold-start recommendation with DIF, a model-agnostic denoising method that infers pseudo-labels from content similarity and models their confidence and uncertainty; deployed in the Kuaishou app, it significantly improves commercial metrics in cold-start scenarios.

Comments Accepted by KDD 2026 ADS Track

Abstract

Implicit feedback is widely used in recommender systems due to its accessibility and generality, yet it usually presents noisy samples (e.g., clickbait, position bias). Meanwhile, recommenders inevitably face the item cold-start problem due to the continuous influx of new items. We identify that cold items are more prone to noisy samples due to the aforementioned factors, and researchers often overlook the significance of denoising implicit feedback for cold items. Previous denoising studies usually identify noisy samples based on heuristic patterns, such as higher loss values, and mitigate noise through sample selection or re-weighting. However, these methods have limited adaptability and are ineffective in cold-start scenarios. To achieve denoising implicit feedback for cold-start recommendation, we propose a model-agnostic denoising method called DIF. First, user preferences for content remain stable, which allows us to infer pseudo-labels indicating whether a user is interested in a cold item through content-similar warm items. Furthermore, to improve pseudo-label accuracy, we model the confidence of pseudo-labels based on the content similarity between the cold item and warm items, and then aggregate multiple pseudo-labels for each sample. Finally, we explicitly estimate the uncertainty of the noisy sample label by considering its relative entropy and the cold-start status of the item, which adaptively guides the role of pseudo-labels to correct the noisy labels at the sample level. DIF's superiority is supported by both theoretical justification and extensive experiments on real-world datasets. The method has been deployed on a billion-user scale short video application Kuaishou and has significantly improved various commercial metrics within cold-start scenarios.

2606.19656 2026-06-19 cs.RO cs.LG New submission

DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning

Calvin Luo, Chen Sun, Shuran Song

Affiliations Stanford University; Brown University

AI summary Proposes DF-ExpEnse, an exploration technique that uses the multimodal modeling capability of generative control policies together with an ensemble of critics to collect online experience efficiently during finetuning, improving sample efficiency.

Comments ICML 2026

Abstract

A natural recipe for intelligent robotic decision-making is initializing from pretrained generative control policies, which have summarized offline experience, and adapting them to self-collected online experience. We present DF-ExpEnse, an exploration technique that improves the quality of online experience collection, thus increasing finetuning sample-efficiency. DF-ExpEnse leverages the multimodal modeling capabilities of the generative control policy to create an expressive and tractably evaluatable candidate set. It then utilizes an ensemble of critics to identify the action that best balances quality with high exploration interest. In fleet settings, DF-ExpEnse further enables cross-agent communication to facilitate collaborative exploration as a group. DF-ExpEnse can be seamlessly integrated with existing strategies that finetune pretrained generative control policies via reinforcement learning. We experimentally validate consistent sample-efficiency benefits through DF-ExpEnse across a variety of manipulation and locomotion tasks, compared to default finetuning and alternative action selection schemes. Project can be found at https://df-expense.github.io.

2606.19654 2026-06-19 cs.CR cs.SE New submission

PUFFERDOS: Efficient and Effective Attack String Generation for Regular Expression Denial of Service Vulnerabilities

Shangzhi Xu, Ziqi Ding, Xiao Cheng, Yuekang Li, Nan Sun, Benjamin Turnbull, Shuangxiang Kan, Siqi Ma

AI summary Proposes PUFFERDOS, which defines three vulnerable patterns and uses a synthesis technique with compositional concolic execution to generate ReDoS attack strings that fit realistic length budgets and are validated at the program level.

Comments Accepted by S&P'26

Abstract

ReDoS attacks constitute a critical class of resource-exhaustion vulnerabilities. In such attacks, adversaries exploit the pathological worst-case execution behavior of regular expression (regex) engines to induce highly asymmetric computational workloads, ultimately exhausting system resources and degrading service availability. To protect systems against ReDoS attacks, numerous detection techniques have been proposed that simulate the attack process by generating attack strings to proactively exploit ReDoS vulnerabilities at the early development stage and facilitate remediation. Existing techniques broadly fall into two classes: static analyses that search for pathological regex structures, and dynamic exploration methods that synthesize candidate attack strings. However, the generated attack strings are often impractical for real-world exploitation because they usually assume unrealistic input-length budgets and do not validate the effectiveness and efficiency of the attack at the program level. Therefore, many generated strings fail to trigger vulnerable regexes when applied to real-world programs, further limiting the practical utility. To address these shortcomings, we introduce an effective and efficient attack string generator, PUFFERDOS, designed to synthesize attack inputs that are both feasible within realistic length budgets and validated at the program level, enabling effective exploitation of ReDoS vulnerabilities in real-world programs. Specifically, we first define three vulnerable patterns based on our observation and formal verification. According to the patterns, PUFFERDOS conducts a synthesis technique to generate attack strings, and then refines and validates the strings with ReDoS-specific compositional concolic execution to guarantee real-world exploitability.
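
For readers unfamiliar with ReDoS, the sketch below shows the general shape of an attack string (a repeated "pump" followed by a suffix that forces the match to fail) against a textbook vulnerable pattern with nested quantifiers. It is not PUFFERDOS and does not use the paper's three patterns; the regex `^(a+)+$` and the helper names `build_attack_string` and `time_match` are illustrative choices.

```python
import re
import time

# Classic nested-quantifier pattern that backtracking engines handle poorly.
VULNERABLE = re.compile(r"^(a+)+$")


def build_attack_string(pump, repeats, suffix):
    """Repeat the pump, then append a suffix that makes the match fail."""
    return pump * repeats + suffix


def time_match(pattern, text):
    """Return the match result and the wall-clock time the attempt took."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


# Small sizes only: on a backtracking engine the time typically grows
# very quickly as the number of repeats increases.
for n in (10, 14, 18):
    _, seconds = time_match(VULNERABLE, build_attack_string("a", n, "!"))
    print(f"n={n:2d}  {seconds:.4f}s")
```

The length of the pump needed to cause noticeable delay is exactly the kind of input-length budget the abstract refers to: an attack string is only practical if the target program accepts inputs that long and actually routes them to the vulnerable regex.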

2606.19652 2026-06-19 cs.LG New submission

Convex training of Lipschitz-regularized shallow neural networks

Chao Yin, Antoine Lesage-Landry

Affiliations Polytechnique Montréal, GERAD & Mila, Montréal, QC, Canada

AI summary Proposes a convex restriction of the non-convex Lipschitz-regularized training problem that can be solved to global optimality and can serve as a post-processing step for pre-trained networks, improving adversarial robustness and, on certain datasets, accuracy.

Abstract

In this work, we introduce a training procedure for shallow neural networks that promotes robustness against adversarial attacks. We solve a non-convex Lipschitz-regularized training program by introducing a convex restriction that can be efficiently solved to global optimality. Our approach can be employed as a post-processing step by taking a pre-trained network as an initial solution to then solving the convex program whose optimal network is guaranteed to be no worse than the initial one. We illustrate the improvements of our training procedure with experiments using real world datasets for regression tasks under an adversarial setting. We show numerically that solving our proposed convex program yields networks with lower objective values on the Lipschitz-regularized program compared to existing methods. Additionally, we show that on certain datasets, networks obtained using our convex training program are both more accurate and robust with respect to adversarial attacks.
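
To give a sense of what a Lipschitz regularizer for a shallow network can look like, the sketch below computes a standard upper bound on the Lipschitz constant of a one-hidden-layer ReLU network, f(x) = sum_j a_j * relu(w_j . x + b_j), namely sum_j |a_j| * ||w_j||_2, and adds it to a data-fit term. The abstract does not state which Lipschitz measure or norm the authors regularize, so this particular bound and the names `lipschitz_upper_bound` and `penalized_loss` are assumptions.

```python
import math


def lipschitz_upper_bound(output_weights, hidden_weights):
    """Upper bound sum_j |a_j| * ||w_j||_2 for a shallow ReLU network.

    It holds because ReLU is 1-Lipschitz, so each hidden unit contributes
    at most |a_j| * ||w_j||_2 to the overall l2 Lipschitz constant.
    """
    return sum(
        abs(a) * math.sqrt(sum(w * w for w in row))
        for a, row in zip(output_weights, hidden_weights)
    )


def penalized_loss(data_loss, output_weights, hidden_weights, lam):
    """Data-fit term plus lam times the Lipschitz upper bound."""
    return data_loss + lam * lipschitz_upper_bound(output_weights, hidden_weights)


# Hypothetical two-unit network.
a = [1.0, -2.0]
W = [[3.0, 4.0], [0.0, 1.0]]
bound = lipschitz_upper_bound(a, W)
```

A smaller bound limits how much the output can change under a bounded input perturbation, which is the link between Lipschitz regularization and adversarial robustness that the abstract relies on.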

2606.19651 2026-06-19 cs.AI cs.CV cs.LG New submission

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

Max Van Puyvelde, Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert

Affiliations Department of Biomedical Data Science, Stanford University School of Medicine; Department of Mathematical Modelling, Statistics & Bioinformatics, Ghent University; Department of Electrical Engineering, Stanford University

AI summary Proposes a tokenizer based on a 3D masked autoencoder that decouples the encoder from the decoder, matches or outperforms SOTA models on 21 of 23 linear-probing tasks, and supports conditional generation and longitudinal forecasting.

Abstract

Three-dimensional (3D) brain MRI is central to clinical neurology and neuro-oncology, where generative models could augment under-represented cohorts, simulate disease trajectories, and support privacy-preserving data sharing. Latent diffusion has been the go-to solution for modeling imaging data, but it places two competing demands on the tokenizer: encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes. Existing reconstruction-driven tokenizers achieve the second at the expense of the first. To address this, we introduce a fully volumetric masked-autoencoder (MAE) based tokenizer for 3D brain MRI latent diffusion, decoupling encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200+ acquisition sites, and demonstrate its dual utility in two settings. First, on a 23-task linear-probing benchmark, the encoder outperforms or matches SOTA models (i.e., BrainIAC, BrainSegFounder, and MedicalNet) on 21 of 23 tasks. Second, a conditional diffusion transformer (DiT) trained on these clinically informative embeddings supports both conditional generation across six variables and patient-specific longitudinal forecasting. Together these results establish a single 3D brain-MRI embedding space capable of both downstream clinical tasks and controllable generation.

2606.19647 2026-06-19 cs.CL cs.CY cs.SI New submission

From 50K to 8.2 Million in 24 Hours: Vozinha's Algorithmic Consecration and the Multilingual Making of World Cup Visibility

Vinicius Covas

发表机构 * Universidad Anáhuac México(墨西哥阿纳瓦克大学)

AI总结 通过多语言语料库和九框架叙事分类法,分析2026年世界杯后Vozinha的算法封圣过程,揭示不同语言承载不同叙事框架,将平台粉丝数作为语言对象研究可见性构建。

Comments 11 pages, 4 figures, 3 tables; v0.1 pilot preprint. Dataset and evidence package available at https://doi.org/10.5281/zenodo.20722235

Abstract

We present a multilingual computational discourse analysis of how language constructed the algorithmic consecration of Vozinha, the 40-year-old Cape Verde goalkeeper, after Spain 0-0 Cape Verde at the 2026 FIFA World Cup. The study contributes a multilingual corpus in Portuguese, Spanish, English, and French; a nine-frame narrative taxonomy with cue-based frame annotation; a reproducible annotation pipeline combining LLM-assisted suggestion with human validation; and an analysis of cross-lingual narrative diffusion across discourse phases. We treat the platform follower count itself, narrated as "50k to 8M", as a linguistic object: a circulating and narratable proof of visibility rather than a mere measurement. The follower-growth timeline is used only as contextual metadata: we reconstruct a conservative phase structure, not a continuous API-native series, and type every datapoint by value class, confidence, and evidence type. The only exact primary scraper anchor is 8,235,652 followers at 2026-06-16 15:47 UTC; all other figures are reported as estimated ranges or thresholds, including an estimated pre-match baseline of 45k-56k. Findings suggest that distinct languages carried distinct frames: Portuguese mobilization, Spanish crisis, English nation-making, and a shared platform-metric spectacle through which peripheral athletic performance became globally visible. As a v0.1 pilot, the paper releases the corpus schema, frame taxonomy, annotation guidelines, hashed visual-evidence log, and typed timeline, while flagging full double annotation and inter-annotator agreement as planned work.

2606.19646 2026-06-19 cs.IR cs.CV New submission

SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li, Animesh Mahapatra, Neeraj Agrawal, Xintao Wu

Affiliations University of Arkansas

AI summary Presents SAFE-Cascade, a system that first answers with OCR and a lightweight text-only language model, then lets a learned router decide whether to call a VLM. On ChartQA it reaches 69.1% accuracy at a 73.1% VLM invocation rate, reducing VLM calls by 26.9% and estimated cost by 9.3%.

Comments Demo paper submitted at CIKM 2026. 4 pages, 2 figures

Abstract

Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.
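The routing step described in this abstract can be illustrated with a minimal Python sketch. It assumes the router outputs a probability that the text-only answer is acceptable and that escalation happens when this probability falls below a tunable threshold. The names `cascade_answer` and `vlm_invocation_rate`, the default threshold, and the toy scores are hypothetical and are not taken from the SAFE-Cascade implementation.

```python
from typing import Callable, Tuple


def cascade_answer(
    text_answer: str,
    p_text_ok: float,
    call_vlm: Callable[[], str],
    threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Return (final_answer, escalated).

    p_text_ok is a hypothetical router score: the estimated probability
    that the text-only answer is acceptable. Below the threshold we
    escalate to the VLM; otherwise we keep the cheaper text answer.
    """
    if p_text_ok < threshold:
        return call_vlm(), True
    return text_answer, False


def vlm_invocation_rate(router_scores, threshold: float = 0.5) -> float:
    """Fraction of queries that would be escalated at this threshold."""
    escalated = sum(1 for p in router_scores if p < threshold)
    return escalated / len(router_scores)


# Toy router scores (made up for illustration only).
scores = [0.9, 0.2, 0.7, 0.4]
rate = vlm_invocation_rate(scores, threshold=0.5)
```

Raising the threshold escalates more queries, which is the accuracy-cost trade-off the demo lets users explore. A 73.1% invocation rate corresponds to 100% - 73.1% = 26.9% fewer VLM calls than the full-VLM baseline, matching the reduction reported in the abstract.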

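2606.19644 2026-06-19 cs.SE New submission

Prompt Quality and Pull Request Outcomes: A Stage-Based Empirical Study of LLM-Assisted Development

Richard Sserunjogi, Daniel Ogenrwot, John Businge

AI summary Analyzes 265 developer-ChatGPT interactions to study how prompt structure (Context, Specificity, Verification) relates to actionable code generation, code adoption, and integration depth in LLM-assisted development, finding that each dimension matters at a different stage.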

Comments 48 pages, 2 figures

Abstract

Large language model (LLM)-powered tools such as ChatGPT are increasingly used in collaborative software engineering workflows, yet little is known about how prompt structure influences downstream pull request (PR) outcomes. Prior studies primarily examine conversational helpfulness, productivity, or coarse-grained adoption metrics, leaving the role of prompt structure in collaborative integration behavior insufficiently understood. We analyze 265 manually validated developer-ChatGPT interactions derived from self-admitted ChatGPT usage in open-source pull requests. Building on prior research on developer-facing artifacts and prompt engineering, we operationalize prompt structure using three dimensions: Context, Specificity, and Verification. We first evaluate whether LLM-assisted annotation can reliably reproduce human judgments of prompt structure, finding substantial variation across dimensions and workflow contexts. Specificity shows the most stable agreement with human judgments; Context is systematically under-scored by the LLM; and Verification remains difficult to assess consistently, motivating a hybrid human-LLM annotation strategy. Using this validated framework, we then examine how prompt structure influences actionable code generation, code adoption, and integration depth across AI-assisted PR workflows. Specificity and Context are most strongly associated with actionable code generation; Verification emerges as the primary predictor of code adoption; and integration depth is most strongly associated with Context. Overall, our findings show that prompt characteristics exert distinct, stage-dependent effects across AI-assisted software engineering workflows, influencing downstream adoption and integration through contextual grounding, task specificity, and evaluability cues.

2606.19641 2026-06-19 cs.RO cs.CV New submission

Scaling Self-Play for End-to-End Driving

Luke Rowe, Roger Girgis, Rodrigue de Schaetzen, Daphne Cornelisse, Alaap Grandhi, Felix Heide, Eugene Vinitsky, Christopher Pal, Liam Paull

Affiliations Mila, Université de Montréal, Polytechnique Montréal, Torc Robotics, NYU Tandon School of Engineering, McMaster University, Princeton University

AI summary Proposes large-scale self-play for training end-to-end driving models: the high-throughput simulator Gigapixel enables self-play directly from pixels, and the resulting policies are trained with DAgger-style distillation and transferred to real sensor data through perception adaptation.

Abstract

End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird's-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.

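2606.19640 2026-06-19 cs.CL cs.AI cs.HC New submission

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

Yunkai Xu, Saeed Abdullah

Affiliations Pennsylvania State University

AI summary Generates clinical dialogues in Mandarin, Bengali, and Hindi by modifying the nationality and language parameters of personas, and finds that adding these parameters alone can introduce clinical inconsistency across languages and that LLM judges are often inaccurate when rating depression severity in non-English text.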

Comments 15 pages, 4 figures. Accepted to the 2026 Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), co-located with ACL 2026

Abstract

AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.

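2606.19638 2026-06-19 cs.CL New submission

MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection

David M. Smiley

AI summary Presents MiqraBERT, a Sentence-BERT model finetuned with cosine-similarity regression to detect textual parallels in Biblical Hebrew; it improves distributional separation 2.7-fold and shrinks the overlap region from roughly 24% to about 6%.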

Abstract

Textual reuse pervades the Hebrew Bible, yet the computational methods used to detect it still rest largely on lexical overlap, and they falter once a parallel involves paraphrase, lexical substitution, or syntactic reworking. This paper introduces MiqraBERT, a Sentence-BERT model finetuned from AlephBERT (a Modern Hebrew encoder) for verse-level semantic similarity in Biblical Hebrew. The training set comprises 1,650 labeled verse and half-verse pairs: 825 true parallels drawn from the Chronicles synoptic material and from foundational studies of poetic parallelism, balanced against 825 randomly sampled negatives. Through cosine-similarity regression, the model learns an embedding space in which parallel verses cluster together and unrelated verses move apart. We evaluate separation with distribution-based metrics, Wasserstein distance and the overlap coefficient, across ten random seeds. MiqraBERT improves distributional separation 2.7-fold over the pre-trained baseline and reduces the ambiguous overlap region from roughly 24% to about 6%. Narrative synoptic parallels reach a recall@10 of 87.1%; poetic parallels remain difficult, below 9%. This genre-dependent asymmetry confines the model's reliable scope to narrative textual reuse. MiqraBERT is publicly available at https://huggingface.co/davidmsmiley/MiqraBERT
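As a rough illustration of the two ingredients named in this abstract, the Python sketch below shows a cosine-similarity regression loss and a histogram estimate of the overlap between two score distributions. It is written under stated assumptions: labels of 1.0 for parallels and 0.0 for negatives, scores in [0, 1], and a simple binned overlap estimate. The function names are hypothetical, and the paper's exact loss and overlap computation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_regression_loss(pairs):
    """Mean squared error between cosine(u, v) and the target label.

    pairs: iterable of (u, v, label), with label 1.0 for a parallel
    and 0.0 for a negative (an assumed labeling scheme).
    """
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)


def overlap_coefficient(pos_scores, neg_scores, bins=10):
    """Histogram estimate of the overlap between two score distributions.

    Scores are assumed to lie in [0, 1]. Returns a value in [0, 1]:
    0 means the histograms are disjoint, 1 means they coincide.
    """
    def hist(scores):
        counts = [0] * bins
        for s in scores:
            idx = min(int(s * bins), bins - 1)
            counts[idx] += 1
        return [c / len(scores) for c in counts]

    p, q = hist(pos_scores), hist(neg_scores)
    return sum(min(a, b) for a, b in zip(p, q))
```

A lower overlap value means fewer verse pairs fall in the ambiguous region where parallels and negatives receive similar scores, which is the quantity the abstract reports dropping from roughly 24% to about 6%.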

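2606.19637 2026-06-19 cs.CL cs.AI New submission

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

Priyanshi Garg, Ishita Rao, Jieqiong Ding, Amandalynne Paullada

Affiliations University of Washington

AI summary Through a case study of the ScAN dataset, shows that EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved, and that identical labels cover heterogeneous clinical framings.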

Comments To appear in the Proceedings of the 11th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

Abstract

Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.

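2606.19636 2026-06-19 cs.LG cs.AI New submission

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

Luca Zhou, Sajel Shah, Emanuele Rodolà, Roberto Dessì

Affiliations Sapienza University of Rome

AI summary Identifies a blind spot in pass@k as a difficulty signal for math reasoning: a deterministic regime of greedy decoding plus activation-grafting perturbations solves 10.3-22.9% of the examples that sampling fails to solve in six tries, indicating that this stratum is structurally identifiable.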

Comments 9 pages of main paper, 4 figures and 5 tables in the main paper, with more in the appendix

Abstract

Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic curricula, and verifier training. We show this proxy has a persistent blind spot on its hardest stratum: on the eight free-form math cells we test (GSM8K and MATH across four open-weight models), 10.3-22.9% of the examples that no sampling seed solves in six tries are instead solved at matched compute by a six-chain deterministic regime. These are greedy decoding plus five cheap residual-stream perturbations applied via activation grafting, while greedy alone solves at most 6% on these math cells. Recovery scales with the additional budget, across perturbations whose mechanistic distinctness we verify across all twelve cells (cross-kind fix-set Jaccard <= 0.47 in every setup). Activation grafting is used as an intervention on internal representations, not a decoding method; we use it purely as a diagnostic and diversification tool, and our recovered items show that the pass@k = 0% stratum is structurally identifiable in the residual stream rather than that the unmodified model reaches them under ordinary inference.
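Two quantities in this abstract are simple enough to sketch in Python: the per-example pass rate (the fraction of sampled chains that reach gold, as the abstract defines it) and the Jaccard similarity used to compare the sets of examples fixed by different perturbation kinds. The helper names and the toy example IDs are hypothetical and serve only as illustration.

```python
def empirical_pass_rate(chain_correct):
    """Fraction of sampled chains that reach the gold answer."""
    return sum(chain_correct) / len(chain_correct)


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical example: six sampled chains, none correct -> zero-pass stratum.
sampled = [False] * 6
is_zero_pass = empirical_pass_rate(sampled) == 0.0

# Hypothetical fix-sets: example IDs recovered by two perturbation kinds.
fixed_by_kind_a = {1, 2, 3, 4}
fixed_by_kind_b = {3, 4, 5, 6}
overlap = jaccard(fixed_by_kind_a, fixed_by_kind_b)
```

A low Jaccard value between fix-sets, such as the reported bound of 0.47, indicates that different perturbation kinds tend to recover different examples rather than the same ones.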

2606.19635 2026-06-19 cs.IR cs.AI cs.LG New submission

Token Factory: Efficiently Integrating Diverse Signals into Large Recommendation Models

Xilun Chen, Shao-Chuan Wang, Baykal Cakici, Lukasz Heldt, Lichan Hong, Raghu Keshavan, Aniruddh Nath, Li Wei, Xinyang Xi

AI summary Proposes Token Factory, a framework that converts traditional signals into soft tokens so they can be integrated efficiently into transformer-based large recommendation models, avoiding prompt-length explosion while improving performance.

Comments 8 pages, 10 figures

Abstract

Large Recommendation Models (LRMs) have demonstrated promising capabilities in industry-scale recommendation tasks. However, holistically integrating traditional signals into these transformer-based architectures effectively and efficiently remains a major challenge. Conventional approaches that "textualize" these signals directly or create discrete item representations often lead to excessively long prompts, substantial memory footprints, and high computational overhead. To overcome these limitations, we propose "Token Factory", a framework designed to transform traditional signals into "soft tokens" that can be directly processed by LRMs. This approach enables efficient integration and compression of heterogeneous input features, preventing prompt length explosion while enhancing model performance. We detail the architecture of Token Factory and present experimental results validating its effectiveness in a production-scale recommendation environment.

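2606.19633 2026-06-19 cs.RO cs.AI New submission

CTS-MoE: Implicit Terrain Adaptation via Mixture-of-Experts for Perceptive Locomotion

Francisco Affonso, Matheus P. Angarola, Ana Luiza Mineiro, Aditya Potnis, Marcelo Becker, Girish Chowdhary

Affiliations University of Illinois Urbana-Champaign, University of São Paulo

AI summary For perceptive locomotion over discontinuous terrain, proposes CTS-MoE, which composes shared behaviors with a dense mixture-of-experts policy and perception-based gating and uses a multi-critic to prevent value interference. It trains end-to-end, adapts to terrain implicitly, and outperforms monolithic baselines in simulation and on hardware.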

Abstract

Perceptive legged locomotion over discontinuous terrain (e.g., stairs, gaps, and obstacles) requires adaptive behavior, as a single conservative gait cannot produce the anticipatory maneuvers needed for abrupt topology changes. Cast as multi-task reinforcement learning, this problem introduces a tension between sharing and separation. Tasks use a common locomotion base but have conflicting rewards, so a policy must share behavior while avoiding value interference. Prior work addresses only one side, with monolithic policies sacrificing specialization and hierarchical sub-policies sacrificing generalization across transitions and unseen terrain. We propose CTS-MoE, which combines a dense mixture-of-experts actor with perception-based gating to compose shared behaviors and a multi-critic with task-specific value heads to prevent interference. The model is trained end-to-end in a single-stage concurrent teacher-student setup that handles partial observability and avoids sequential distillation, with task labels used only during training. At deployment, routing depends solely on perception, allowing terrain adaptation without a high-level selector or terrain classifier. Experiments on a Unitree Go1 in simulation and on hardware across seen and unseen terrains show task-aware specialization, with lower tracking error and higher success rates than monolithic baselines. Project website: https://cts-moe.github.io/
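The idea of a dense mixture-of-experts actor can be sketched in a few lines of Python. The sketch assumes "dense" means that every expert contributes to the output, weighted by a softmax over gate logits, and that those logits are computed from perception elsewhere. The function names and toy values are hypothetical and do not reflect the CTS-MoE architecture in detail.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_moe_action(gate_logits, expert_actions):
    """Blend every expert's action with softmax gate weights.

    gate_logits: one logit per expert (assumed to come from perception).
    expert_actions: list of per-expert action vectors of equal length.
    """
    weights = softmax(gate_logits)
    dim = len(expert_actions[0])
    return [
        sum(w * action[d] for w, action in zip(weights, expert_actions))
        for d in range(dim)
    ]


# Two hypothetical experts with equal gate logits -> simple average.
action = dense_moe_action([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because the gate weights vary smoothly with the logits, the blended action can shift gradually between experts as the perceived terrain changes, without a discrete terrain classifier.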

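2606.19632 2026-06-19 cs.RO cs.AI cs.LG cs.LO cs.MA New submission

Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation

Ahmad Farooq, Kamran Iqbal

Affiliations University of Arkansas at Little Rock

AI summary Distills multi-agent reinforcement learning policies into interpretable decision trees, formally verifies them with PRISM, and empirically checks that the verified safety properties transfer to the original networks, reaching 88.9% property satisfaction on a multi-drone coordination task.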

Comments 9 pages, 3 figures, 7 tables. Accepted at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026), Pittsburgh, Pennsylvania, USA, September 27-October 1, 2026

Abstract

Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with <=0.6 percentage-point deviation (95% CI). Discrete VQ-VIB messages provide +11.6 to +13.6 percentage-point fidelity advantages over continuous methods, enabling 3-4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
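The union-bound aggregation mentioned in this abstract can be illustrated with a short Python sketch. It assumes each agent pair has a separately verified collision probability and bounds the probability that any pair collides by the sum of those values, which holds without any independence assumption. The function names and the per-pair probability of 0.0002 are made up for illustration and are not results from the paper.

```python
from itertools import combinations


def union_bound(pairwise_probs):
    """Upper bound on P(at least one event) from per-event probabilities."""
    return min(1.0, sum(pairwise_probs))


def system_collision_bound(agent_ids, pair_collision_prob):
    """Bound the probability that any pair of agents collides.

    pair_collision_prob(i, j) is a hypothetical callable returning the
    verified collision probability for the pair (i, j).
    """
    pairs = list(combinations(agent_ids, 2))
    return union_bound(pair_collision_prob(i, j) for i, j in pairs)


# Toy numbers: 5 agents, each pair assumed to collide with probability 0.0002.
bound = system_collision_bound(range(5), lambda i, j: 0.0002)
```

With 5 agents there are 10 pairs, so the toy bound is 10 x 0.0002 = 0.002. The bound grows with the number of pairs, which is why pairwise decomposition stays useful mainly for small teams or very small per-pair probabilities.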

2606.19630 2026-06-19 cs.AI cs.DL cs.SY eess.SY New submission

AI4SE and SE4AI Exploration: A Decade Looking Back and Forward

H. Sinan Bank, Daniel R. Herber, Thomas Bradley

Affiliations Colorado State University

AI summary Reviews progress in AI and systems engineering across three phases, identifies five critical research gaps through a human-AI agreement literature review, and offers guidance on AI adoption, assurance, and workforce transformation.

Comments 10 pages, 5 figures

Abstract

The March 2020 INCOSE INSIGHT special issue on AI and Systems Engineering (SE) became the most downloaded issue in the publication's history and launched a research community that now draws over 250 registrants to its annual workshop. In this article, we trace the progress in AI and SE across three phases (labeled here foundational, applied, and LLM inflection) based on the authors' reading of the field's core papers, and describe our opinions of where the community has converged and where critical gaps remain. Separately, a human-AI agreement literature review leveraging both human expertise and six AI models was performed to assess the relevance of 1,712 INCOSE INSIGHT articles and 889 SERC publications. The results identify five critical research gaps and offer guidance for practitioners navigating AI adoption, assurance, and workforce transformation in SE. We share the agreement data and the AI4SE/SE4AI Explorer web application so readers can compare their own relevance judgments with the human and AI raters.

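2606.19629 2026-06-19 cs.SD cs.AI cs.LG New submission

RIVET: Robust Idempotent Voice Attribute Editing

Dareen Alharthi, Bhuvan Koduru, Rita Singh, Bhiksha Raj

Affiliations Carnegie Mellon University

AI summary Proposes RIVET, a training framework that uses idempotency regularization to make voice attribute editing models more robust to label noise; it outperforms standard training under both controlled label noise and naturally noisy annotations.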

Abstract

Voice attribute editing models modify characteristics such as age and gender while preserving speaker identity. In large-scale speech datasets, however, attribute annotations are often noisy or inconsistent, which can cause conditional generative models to produce unstable edits. In this work, we show that idempotency provides an effective mechanism for improving robustness to noisy labels. An idempotent operator is one for which repeated application does not change the result, i.e., f(f(x)) = f(x). Enforcing this property acts as an implicit regularizer that reduces sensitivity to mislabeled examples. We introduce RIVET, a training framework that incorporates an idempotency objective to improve robustness to label noise. We evaluate RIVET under controlled label noise and on the GLOBE dataset with naturally noisy annotations. RIVET improves editing success and better preserves speaker identity than standard training, showing that idempotency improves robustness in voice editing models.
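The idempotency property f(f(x)) = f(x) stated in this abstract can be checked numerically with a small Python sketch. The squared-distance penalty below is one assumed way to measure how far an operator is from idempotent; the abstract does not specify RIVET's actual objective, and the two toy operators are hypothetical stand-ins for an editing model.

```python
def idempotency_gap(f, x):
    """Squared distance between f(f(x)) and f(x) for a list-valued f."""
    once = f(x)
    twice = f(once)
    return sum((a - b) ** 2 for a, b in zip(twice, once))


def clip_unit(x):
    """Clamp each coordinate to [-1, 1]; clamping twice changes nothing."""
    return [max(-1.0, min(1.0, v)) for v in x]


def halve(x):
    """Scale by 0.5; applying it twice keeps shrinking, so not idempotent."""
    return [0.5 * v for v in x]


x = [2.0, -0.5, 0.25]
gap_clip = idempotency_gap(clip_unit, x)   # clamp is idempotent
gap_halve = idempotency_gap(halve, x)      # halving is not
```

Here `gap_clip` is zero because clamping an already clamped vector leaves it unchanged, while `gap_halve` is positive. Adding a term of this kind to a training loss would push a model toward edits that are stable under repeated application.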

2606.19627 2026-06-19 cs.IR cs.AI cs.LG New submission

VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions

Katya Mirylenka, Egor Malykh, Mahdyar Ravanbakhsh, Michael Gygli, Marco-Andrea Buchmann, Andrew Dzhoha, Svitlana Borzenko, Francesca Catino, Mohamed Gaafar, Maarten Versteegh, Thomas Kober, Dario d'Andrea, Ellie Langhans

Affiliations Zalando Switzerland AG, TU Wien, Zalando SE

AI summary To address extreme cold-start and bias in e-commerce video feeds, presents VCG, a scalable multimodal retrieval system built on a domain-adapted vision-language model (based on CLIP) that enables zero-shot retrieval; online A/B testing shows a 50% uplift in deep video completion.

Abstract

The digital commerce landscape is shifting from static, search-driven catalogs to dynamic, immersive video feeds. This transition introduces an "extreme cold-start" problem: unlike traditional items, new short-form videos lack the dense interaction history required for collaborative filtering. Furthermore, immersive feeds introduce strong position and duration biases that distort standard engagement signals. In this paper, we demonstrate the Video Candidate Generation (VCG) system, a scalable multimodal retrieval engine designed to solve these challenges in a large-scale e-commerce environment. By leveraging a domain-adapted vision-language model (based on CLIP), we map users and videos into a shared semantic space, enabling zero-shot retrieval based on visual content rather than behavioral history. We detail the system's architecture and present a rigorous evaluation comparing generative (LLM) vs. discriminative (CLIP) embeddings. Our results show that while generative models excel at attribute prediction, they suffer from embedding space collapse in retrieval tasks. Online A/B testing demonstrates that VCG effectively mitigates engagement biases, yielding a 50% uplift in deep video completion. To showcase the system's capabilities, we present an interactive demonstration featuring three bi-directional retrieval scenarios: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.

2606.19626 2026-06-19 cs.AI cs.CL New submission

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa

Affiliations * Aia Context; Universidade Federal do Maranhão; Universidade de São Paulo

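AI summary: Proposes the TOTEN framework, which uses an ontology of engineering entities to classify physical quantities and technical notation declaratively, in place of statistical tokenization, and achieves high-atomicity tokenization and numerical reconstruction on Brazilian Portuguese corpora.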

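Details
AI abstract (translated from Chinese)

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression but semantically blind to structured technical entities: it splits physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We propose TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple <O, classify, {inst_tau}>: the ontology collects types, structural principles, composition relations, and preservable invariants; the classification function maps raw text to typed regions; and the family of instantiators produces a self-describing structured representation. Robustness comes from deterministic coupling with three external oracles: Pint (dimensional), the Unicode Character Database (typographic), and RSLP (Portuguese morphology). The intrinsic evaluation covers four properties verifiable by construction (ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction) on an internal, physically validated benchmark (EngQuant, N=800) and four external Brazilian Portuguese corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on the external corpora, versus 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. The differences are statistically significant (McNemar test with Holm correction). The Spearman correlation between internal and external rankings confirms the concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.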

English abstract

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple <O, classify, {inst_tau}>: the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction -- ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction -- over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.
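
To make the contrast with subword tokenization concrete, here is a minimal sketch of typed classification, written under stated assumptions and not taken from TOTEN. A regular expression stands in for the classification function, a small hard-coded unit list stands in for an oracle such as Pint, and the `Region` dataclass stands in for an instantiator's structured output. All of these names are hypothetical. The sketch keeps each quantity as one atomic region and reconstructs its numeric value, including the decimal comma used in Brazilian Portuguese.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical unit list (assumption); a real system would consult a unit registry.
UNITS = ["kV", "mm", "kg", "m", "V", "A"]
# Try longer unit symbols first so that "mm" is preferred over "m".
_UNIT_ALT = "|".join(sorted((re.escape(u) for u in UNITS), key=len, reverse=True))
QUANTITY = re.compile(r"(?P<num>\d+(?:,\d+)?)\s*(?P<unit>" + _UNIT_ALT + r")\b")


@dataclass
class Region:
    kind: str                      # "quantity" or "text"
    text: str                      # raw surface form, kept for lossless round-trips
    value: Optional[float] = None  # reconstructed number (quantities only)
    unit: Optional[str] = None     # unit symbol (quantities only)


def classify(raw):
    """Split raw text into typed regions, keeping each quantity atomic."""
    regions, pos = [], 0
    for match in QUANTITY.finditer(raw):
        if match.start() > pos:
            regions.append(Region("text", raw[pos:match.start()]))
        # Decimal comma to decimal point before numeric reconstruction.
        value = float(match.group("num").replace(",", "."))
        regions.append(Region("quantity", match.group(0), value, match.group("unit")))
        pos = match.end()
    if pos < len(raw):
        regions.append(Region("text", raw[pos:]))
    return regions


sentence = "O cabo suporta 13,8 kV e pesa 2 kg."
regions = classify(sentence)
quantities = [(r.value, r.unit) for r in regions if r.kind == "quantity"]
print(quantities)
```

Because every region keeps its surface form, concatenating the `text` fields reproduces the input, while the quantity regions expose a value and unit that a subword split such as "13", ",", "8", " k", "V" would not.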

2606.19625 2026-06-19 cs.CL cs.LG New submission

Where Does Social Reasoning Come From? Capability Provenance in Language Models

Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla, Louis Jaburi, Alvin Deng, Taywon Min, Lucia Quirke, Stella Biderman, Mark Riedl

Affiliations * Georgia Institute of Technology, College of Computing; MATS Program; EleutherAI; KAIST AI; Georgia Tech AI Safety Initiative

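AI summary: Using training-data attribution, the paper finds that social reasoning and STEM reasoning in OLMo3-7B rely on different regions of the pretraining corpus, and that the difference is more pronounced at the reasoning level than at the knowledge level.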

Comments Under review at COLM 2026 (Conference)

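Details
AI abstract (translated from Chinese)

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social reasoning versus STEM reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences the model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has focused on factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) on a working set sampled from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning rely on qualitatively different corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (for example, Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines. We open-source all code, sampling manifests, the bin-level influence matrix, and the unlearning checkpoints.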

English abstract

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.
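
The aggregation step described above, pooling noisy document-level scores into format-by-topic bins and then contrasting two benchmarks, can be sketched as follows. This is an illustration under stated assumptions, not the authors' released code: the abstract does not say whether influence is summed or averaged per bin, so the mean is used here as an assumption, and the format labels, topic labels, and scores are invented toy values rather than WebOrganizer categories or real attributions.

```python
from collections import defaultdict


def aggregate_by_bin(docs):
    """Average document-level influence within each (format, topic) bin.

    Assumption: the mean is used as the pooling rule; a sum would also be plausible.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fmt, topic, score in docs:
        sums[(fmt, topic)] += score
        counts[(fmt, topic)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def contrast(bins_a, bins_b):
    """Per-bin difference a - b; a bin missing on one side counts as zero."""
    keys = set(bins_a) | set(bins_b)
    return {key: bins_a.get(key, 0.0) - bins_b.get(key, 0.0) for key in keys}


# Hypothetical (format, topic, influence) triples for two benchmarks.
social_docs = [
    ("fiction", "literature", 0.5),
    ("fiction", "literature", 0.25),
    ("tutorial", "science", 0.125),
]
stem_docs = [
    ("fiction", "literature", 0.125),
    ("tutorial", "science", 0.5),
    ("tutorial", "science", 0.75),
]

social_bins = aggregate_by_bin(social_docs)
stem_bins = aggregate_by_bin(stem_docs)
diff = contrast(social_bins, stem_bins)

# The bin that favors the social benchmark most strongly over the STEM one.
top_social_bin = max(diff, key=diff.get)
print(top_social_bin, diff[top_social_bin])
```

Pooling over a bin trades document-level resolution for a less noisy signal, which is the motivation the abstract gives for working at the level of corpus regions.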