New Benchmarking Shows Limited Generalization Power of TCR Antigenic Epitope Prediction Models
Yiming Liao, Yiheng Li, Ning Jiang, Bo Li, Keke Chen
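AI summary: This paper evaluates T cell receptor (TCR) antigen-specificity prediction models on two classes of rigorously defined, unseen benchmark datasets, finds that existing models generalize poorly, and proposes a framework for improvement.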
Accurate computational prediction of T cell receptor (TCR) antigen specificity would transform the study of T cell biology and enable scalable immune engineering, yet existing models lack sufficient sensitivity and specificity for broad applications. A major limitation is the absence of rigorously defined, unseen benchmark datasets that allow unbiased evaluation of model performance and generalizability. Here, we describe two complementary classes of datasets that meet this criterion and argue that they provide both a robust framework for model assessment and a foundation for next-generation TCR-antigen prediction algorithm development.
A Multi-Camera AR Guidance System for Surgical Instrument Handling and Assembly: A Study of Workload and Efficiency
Shiyu Li, Julian Kreimeier, Hannah Schieber, Dirk Müller, Bernhard Kainz, Rüdiger von Eisenhart-Rothe, Daniel Roth
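AI summary: Proposes a marker-free multi-camera augmented reality guidance system that combines 6D pose estimation with a head-mounted display, significantly reducing workload and improving efficiency in surgical instrument handling.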
The handling and assembly of instruments during surgery imposes high cognitive demands on scrub nurses, particularly when instruments are unfamiliar. We present a supporting guidance system for surgical instrumentation that combines multi-camera 6D pose estimation with augmented reality in-situ visualization on a head-mounted display without the requirement for additional markers. Pose estimation and consecutive camera calibration are achieved through known objects. The 6D pose estimation network is trained purely on synthetic data, aiming for better generalizability and real-world applicability. The AR guidance displays tooltip localization cues and step-wise assembly animations. Via gaze-based selection and a foot pedal, users can switch between assembly steps in intraoperative use. In a technical evaluation, our approach outperforms state-of-the-art 6D pose estimation. A user study with 29 scrub nurses was conducted in a surgical simulation of knee arthroplasty, comparing the system against a paper manual. AR guidance significantly reduced the perceived workload compared with the paper manual. Objectively, AR guidance reduced task completion time by 21.3% (4.76 minutes). Specifically, scrub nurses less experienced with the instrument set benefited when using the system. Error frequencies were comparable between conditions. Qualitative feedback highlighted improved process clarity, reduced information overload, and perceived independence. To summarize, our marker-free multi-camera AR guidance approach for surgical instruments can, subjectively and objectively, improve intraoperative instrumentation performance, particularly for untrained scrub nurses.
From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents
Yiqi Wang, Jiaqi Zhang, Taotao Cai, Zirui Liu, Qingqiang Sun, Zequn Sun, Zhangkai Wu, Mingkai Zhang, Yanming Zhu
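AI summary: This survey systematically reviews evidence tracing and execution provenance methods for LLM agents, connects retrieval, tool use, memory, and related stages through a unified provenance perspective, proposes a taxonomy, and discusses open challenges.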
Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected throughout agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation can move from final-answer correctness toward process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.
What Can Eye Gaze Teach Us About Real-World Cycling? Insights from the Oxford RobotCycle Project
Benjamin Hardin, Efimia Panagiotaki, Daniele De Martini, Lars Kunze
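AI summary: Using wearable eye-tracking glasses, this study analyzes gaze patterns across environments (bike lanes, car lanes, and shared bus lanes) and events (passes and pedestrians), reveals subconscious differences in perceived danger while cycling, and assesses the potential of eye tracking for estimating cyclist stress and cognitive load.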
Although much is known about the physical danger of cycling situations, less is understood about the perceived danger of cycling. Furthermore, perception of danger may be filtered at a subconscious level and is therefore difficult to self-report. These subconscious perceptions can, however, be revealed through physiological metrics such as eye gaze. This paper explores the perceived safety of cycling in Oxford, United Kingdom, and examines the ability of wearable eye-tracking glasses to produce insights about differences in perception under different environments and events. It finds that eye gaze patterns change between bike lanes, car lanes, and shared bus lanes, reflecting the different cognitive challenges of each lane type. It also shows that different intersections have significantly different eye gaze patterns, which may have implications for cyclist stress. Finally, eye gaze patterns differ in the presence of events such as passes and pedestrians in the road compared with cycling with no events. The paper draws conclusions on the benefits and limitations of using wearable eye trackers to estimate stress and cyclist workload.
DeliChess: A Multi-Party Dialogue Dataset for Deliberation in Chess Puzzle Solving
Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos
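AI summary: Introduces DeliChess, a dataset of dialogues in which groups collaboratively solve chess puzzles; discussion significantly improves group accuracy, and the role of probing utterances is analyzed.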
Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. The members of each group first complete the puzzle individually, then engage in a multi-party discussion before submitting a revised collective answer. The dataset includes 107 dialogues with full transcripts, pre- and post-discussion choices, and metadata on puzzle difficulty and move quality. We evaluate performance using three metrics based on chess engine evaluations, and find that deliberation significantly improves group accuracy. We further analyse the role of probing utterances (i.e., messages that elicit proposals, justifications, or strategic reflection) using a classifier trained on prior deliberation data. While probing makes group performance more variable after discussion, it does not consistently lead to better performance. Our dataset offers a rich testbed for modelling group reasoning, dialogue dynamics, and the resolution of differing perspectives and opinions in a well-defined strategic domain.
Food-R1: A Unified Multi-Task Food Vision-Language Model Based on Reinforcement Learning
Yu Zhu, Yongkang Li, Wenjie Zhu, Haoyi Jiang, Wenyu Liu, Wei Yang, Bin Li, Xinggang Wang
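AI summary: Existing food vision-language models rely on supervised fine-tuning, which limits reasoning and generalization, and nutritional annotations are scarce. To address this, the paper proposes CalorieBench-80K, a large-scale benchmark with chain-of-thought annotations, and Food-R1, a unified multi-task food vision-language model trained with reinforcement fine-tuning (GRPO), which consistently outperforms strong baselines on food-related tasks.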
Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization
Wanqi Yang, Yuexiao Ma, Alexander Conzelmann, Xiawu Zheng, Michael W. Mahoney, T. Konstantin Rusch, Shiwei Liu
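AI summary: Reliance on calibration data leads to suboptimal bit allocation when quantizing Mixture-of-Experts models. AlphaQ is a calibration-free bit-allocation method based on Heavy-Tailed Self-Regularization theory: it assigns bit-widths according to how heavy-tailed each expert's weight spectrum is, minimizes quantization error under a budget constraint, and achieves near full-precision performance.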
Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bit-widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation. For frontier MoE LLMs, the original training data, and hence the true training distribution, is proprietary and inaccessible. As a result, calibration sets are inevitably imperfect surrogates, and this can misestimate expert utilization and lead to suboptimal bit allocation. Motivated by the substantial cross-expert quality variability observed in modern MoE models, and by the success of Heavy-Tailed Self-Regularization (HT-SR) theory at predicting neural network model quality without access to training or testing data, we propose AlphaQ, a calibration-free bit-allocation method for MoE quantization. AlphaQ draws on HT-SR theory and follows a simple principle: experts with more heavy-tailed weight spectra are typically better trained and hence should receive higher bit-widths, while experts with weaker heavy-tailed structure can be quantized more aggressively. AlphaQ operationalizes this principle by measuring expert-wise spectral heavy-tailedness and solving a budget-constrained optimization problem that minimizes total quantization error under a global bit-budget constraint. Across several MoE models, AlphaQ consistently outperforms calibration-based baselines under matched bit budgets. Notably, on Qwen1.5-MoE, AlphaQ achieves near full-precision accuracy with an average expert precision of only 3.5 bits, while delivering more than 4$\times$ memory compression. Our code is available at https://github.com/Superone77/AlphaQ.
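The allocation principle described above (heavier-tailed experts receive more bits, subject to a global budget) can be illustrated with a small sketch. This is not the AlphaQ optimization itself, which minimizes total quantization error: it is a simple greedy stand-in. The function name `allocate_bits`, the candidate widths `(2, 3, 4)`, and the use of a per-expert power-law exponent `alpha` (where a smaller value is taken to mean a heavier tail) are all assumptions made for illustration.

```python
def allocate_bits(alphas, avg_bits, choices=(2, 3, 4)):
    """Greedy stand-in for budget-constrained bit allocation (hypothetical).

    alphas:   one heavy-tail exponent per expert; smaller is assumed to
              mean a heavier-tailed weight spectrum.
    avg_bits: target average bit-width across all experts.
    choices:  candidate bit-widths (assumed, not taken from the paper).
    """
    n = len(alphas)
    budget = avg_bits * n               # total bits available across experts
    bits = [min(choices)] * n           # every expert starts at the lowest width
    used = sum(bits)
    # Visit experts from heaviest tail (smallest alpha) to lightest.
    order = sorted(range(n), key=lambda i: alphas[i])
    for i in order:
        # Give this expert the widest format that still fits in the budget.
        for level in sorted(choices, reverse=True):
            step = level - bits[i]
            if step > 0 and used + step <= budget:
                bits[i] = level
                used += step
                break
    return bits


if __name__ == "__main__":
    print(allocate_bits([2.1, 3.5, 2.8, 4.0], avg_bits=3.0))
```

With four experts and an average budget of 3 bits, the two experts with the smallest exponents are promoted to 4 bits and the other two stay at 2 bits, so the average budget is met exactly.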
Probing Outcome-Level Resemblance and Mechanism-Level Consistency in LLM Risk Decision-Making: Evidence from the St. Petersburg Game
Chensong Huang, Changyu Chen, Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo
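AI summary: Experiments with the St. Petersburg game show that LLMs exhibit human-like behavior at the outcome level in risk decision-making but differ substantially from human decision-making at the mechanism level, suggesting that behavioral alignment may be only superficial.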
LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed, a classical paradox in which the expected payoff is infinite, yet humans typically report low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt that asks models to reason as human decision makers; and paired comparisons between base models and their instruction-tuned counterparts. In the original game, most models generate finite bids, creating the appearance of human-like risk behavior. However, this outcome-level resemblance masks substantial mechanism-level differences. The controlled variants reveal that rather than maintaining human-like behavior seen in the original game, models often shift to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risk decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms. High-stakes evaluations of LLM decision-making should therefore move beyond outcome similarity and examine whether the alignment is supported by mechanism-level consistency.
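To see why the expected payoff is infinite, and why the truncation variant matters, consider a worked sketch. It assumes the common formulation in which a fair coin is tossed until the first head and a head on toss k pays 2^k; the abstract does not state the payout rule, so this convention and the function name are assumptions. Every term of the sum contributes exactly 1, so a game capped at n tosses has expected payoff n, which grows without bound as the cap is lifted.

```python
from fractions import Fraction


def truncated_expected_payoff(max_tosses: int) -> Fraction:
    """Expected payoff when the game is cut off after max_tosses tosses.

    Assumed convention: the first head on toss k (probability 2**-k)
    pays 2**k; if no head appears within max_tosses, the payoff is 0.
    """
    return sum(
        (Fraction(1, 2**k) * 2**k for k in range(1, max_tosses + 1)),
        Fraction(0),
    )


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, truncated_expected_payoff(n))
```

Under this convention, a risk-neutral expected-value maximizer would bid up to n in a game truncated at n tosses, which is one way a truncation perturbation can separate computed responses from human-like ones.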
SAID: Accelerating Diffusion-Based Language Models via Scaffold-Aware Iterative Decoding
Na Li, Chengda Wang, Mingju Gao, Hao Tang
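AI summary: Proposes SAID, a framework that accelerates diffusion language models by reallocating denoising computation to scaffold tokens, and introduces CHLG to assign additional steps to low-confidence tokens, achieving up to 9.1x speedup on LLaDA models while maintaining competitive performance.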
Diffusion large language models (DLLMs) enable non-autoregressive generation by iteratively denoising corrupted token sequences with bidirectional context. Despite their ability to update multiple positions in parallel, inference remains costly due to the many denoising steps required for high-quality generation. We propose SAID, a Scaffold-Aware Iterative Decoding framework that accelerates DLLMs by reallocating computation across tokens. SAID first spends denoising computation on scaffold tokens to establish the coarse semantic structure, and then completes predictable detail tokens with fewer steps. We further adapt SAID to block-wise diffusion decoding and introduce Confidence-Hierarchical Layered Generation (CHLG), which assigns additional steps only to low-confidence tokens. Experiments on LLaDA-8B and LLaDA 1.5 across math, coding, and knowledge benchmarks show that SAID significantly accelerates DLLM inference with a maximum speedup of 9.1x while maintaining competitive performance. Our code is publicly available: https://github.com/TH-AI-Lab-PKU/SAID.
Is It Fair? Can Machine Learning Engineering Agents Comply with Fairness Constraints?
Anna Richter, Julia Stoyanovich, Sebastian Schelter
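AI summary: This paper studies whether machine learning engineering agents can satisfy fairness constraints when automating ML pipeline development; a melanoma classification experiment finds that agent-generated pipelines fall below manually designed baselines in both predictive quality and fairness.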
Machine learning engineering (MLE) agents promise to automate end-to-end ML pipeline development from raw data and natural language instructions, potentially making ML accessible to non-technical domain experts. However, in sensitive and regulated domains, this abstraction creates a responsibility gap: end-users may lack visibility into design choices that affect correctness, robustness, fairness, and regulatory compliance. We argue that existing benchmarks are insufficient to assess whether MLE agents can be safely applied in such settings. We propose desiderata for a responsibility-centered evaluation framework and conduct an exploratory study on melanoma classification, focusing on fairness across skin tones as a responsibility constraint. When evaluating two recent MLE agents, we find that agent-generated pipelines show high variance and consistently underperform manually designed baselines in both predictive quality and fairness, despite fairness-oriented prompts. These preliminary results suggest that further research is needed towards redesigning MLE agents to allow humans to guide the search process and reliably assess the compliance and quality of the generated ML pipelines.
Plan, Observe, Recover: A Benchmark and Architecture for Proactive Procedural Assistance
Kaustav Kundu, Ritvik Shrivastava, Maxim Arap, Nanshu Wang, Xianhui Zhu, Quintin Fettes, Gautam Tiwari, Parth Suresh, Théo Moutakanni, Alejandro Castillejo Munoz, Allen Bolourchi, Pascale Fung, Pinar Donmez, Babak Damavandi, Anuj Kumar, Seungwhan Moon
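AI summary: Introduces the EgoProactive dataset and the Pro²Bench benchmark, and designs a decoupled planner-interaction architecture for real-time guidance and anomaly recovery in proactive procedural assistance.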
We envision a proactive multi-modal assistant system that gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and an open-weight baseline (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.
Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement
Yunpeng Mei, Jiakai He, Hongjie Cao, Chenyu Wang, Xiaowen Zhu, Yihan Zhou, Jiamin Wang, Chenbo Xin, Peng Cheng, Yuxuan Yang, Yijie Wang, Xinhu Zheng, Gao Huang, Jie Chen, Gang Wang
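AI summary: Proposes ForesightFlow, a self-guided flow-matching policy that improves vision-language-action policies without an external critic, using decoupled advantage-weighted flow matching and a one-step boundary estimator.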
Large vision-language-action (VLA) policies are increasingly trained as conditional generative models over action chunks. Yet deployment produces mixed-quality experience (successful demonstrations, partial completions, recoverable mistakes, and failures) that is difficult to use with standard imitation. Full behavior cloning (BC) imitates failures, filtered BC discards useful sub-trajectories, and offline reinforcement learning adds a large critic. We introduce ForesightFlow, a self-guided flow-matching policy that augments each generated action chunk with a learned success-potential trajectory. The same flow proposes and scores candidate actions, enabling best-of-$K$ inference without an external critic. The key issue is that policy improvement and value calibration require different supervision: advantage weighting should emphasize high-quality actions, but applying the same weights to potential coordinates suppresses failure gradients and creates overconfident scores. We address this with decoupled advantage-weighted flow matching, applying exponentiated advantage weights only to action velocities while training potential velocities uniformly. We further derive a one-step boundary estimator for conditional flow matching, allowing advantage computation with a single stop-gradient forward pass. Across five BEHAVIOR-1K simulation tasks and five real-world bimanual tasks, ForesightFlow improves over imitation baselines, matches the strongest separate-critic baseline in simulation success, improves real-world success, and reduces training compute by $38\%$. Ablations show that decoupling prevents value hallucination, the one-step estimator preserves candidate-ranking fidelity, and self-guided sampling improves long-horizon execution.
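The decoupling idea can be made concrete with a minimal sketch of the loss. It assumes that per-sample squared errors of the predicted action velocity and potential velocity are already available, and it introduces a temperature `beta` and a weight cap `max_weight` that the abstract does not mention; those names, the clipping, and the equal weighting of the two terms are illustrative assumptions, not the authors' implementation.

```python
import math


def decoupled_loss(action_err, potential_err, advantages,
                   beta=1.0, max_weight=20.0):
    """Decoupled advantage-weighted regression loss (illustrative sketch).

    action_err[i]:    squared error of the action-velocity prediction.
    potential_err[i]: squared error of the potential-velocity prediction.
    advantages[i]:    advantage estimate for sample i.
    beta, max_weight: assumed temperature and clipping value.
    """
    n = len(advantages)
    # Exponentiated advantage weights, clipped for numerical stability.
    weights = [min(math.exp(a / beta), max_weight) for a in advantages]
    # Action term: high-advantage samples are emphasized.
    action_loss = sum(w * e for w, e in zip(weights, action_err)) / n
    # Potential term: uniform weights, so failures keep their full gradient.
    potential_loss = sum(potential_err) / n
    return action_loss + potential_loss


if __name__ == "__main__":
    # A failed sample (negative advantage) is down-weighted only in the
    # action term; its potential error still counts in full.
    print(decoupled_loss([1.0], [1.0], [-1.0]))
```

Applying `weights` to `potential_err` as well would shrink the contribution of low-advantage samples to the potential head, which is the failure-gradient suppression the abstract warns about.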
From Prompt to Process: A Process Taxonomy and Comparative Evaluation of Frameworks Supporting AI Software Development Agents
Sanderson Oliveira de Macedo
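AI summary: Proposes a six-dimension process taxonomy, scores and compares six AI software development frameworks, and reveals a structural trade-off between process depth and portability.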
AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with process, roles, artifacts and verification. Recent surveys map agents and LLMs for software engineering, but a study centered on the operational frameworks that turn these capabilities into process is missing. We ran a directed search of primary sources, with a functional inclusion criterion and traction measurement, and selected six frameworks: GitHub Spec Kit, OpenSpec, BMAD Method, Get Shit Done (GSD), Spec Kitty and Reversa. Each approaches AI development through a different path: spec-driven development in full and lightweight variants, agent-driven agile planning, context engineering over the agent, worktree isolation and review, and recovery of operational specifications from legacy systems. Our central contribution is a six-dimension process taxonomy (specification, context, roles, execution, validation and portability) with a scoring rubric that turns it into a replicable instrument. We apply it to the six frameworks and an out-of-sample case, Spec-Flow. Two results stand out. First, among frameworks that already adopt some process, there is convergence: the isolated prompt loses centrality, and persistent artifacts, work contracts, traceability and human review become mechanisms that reduce ambiguity and coordinate agents. Second, no framework strongly covers all six dimensions, exposing a structural trade-off between process depth and portability across agents. We also found recurring risks: drift between specification and code, excessive trust in generated artifacts, fragility of community extensions, platform dependence and a lack of benchmarks for the complete process. We close with a research agenda for empirical evaluation, focused on intermediate-quality metrics, context governance, installation security and reproducibility.
SemBlock: Semantic-Boundary Dynamic Blocks for Diffusion Language Models
Xinrui Song, Zhuoran Wang, Mingju Gao, Hao Tang
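AI summary: Proposes SemBlock, a framework that builds decoding blocks dynamically by predicting semantic boundaries, trains lightweight predictors on frozen LLaDA hidden states, and outperforms fixed-block decoding and AdaBlock on natural language, math, and code tasks.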
Diffusion language models (DLMs) generate text through iterative denoising, and blockwise decoding improves their practicality by committing tokens in local blocks. However, existing blockwise methods typically rely on fixed block sizes or delimiter-based runtime signals, which do not necessarily align with semantic boundaries. In this paper, we propose SemBlock, a semantic-boundary-driven dynamic block decoding framework for diffusion LLMs. SemBlock formulates dynamic block construction as semantic boundary prediction and trains lightweight predictors on frozen LLaDA hidden states. To provide supervision, we construct SemBound, a semantic-boundary dataset that derives boundary labels from discourse units, reasoning steps, and implementation spans across natural language, math, and code tasks. During inference, SemBlock uses predicted boundary probabilities to select the ending position of each dynamic block. Experiments on GSM8K, IFEval, MATH, and HumanEval show that SemBlock consistently improves over fixed-block decoding and AdaBlock. Our code is publicly available: https://github.com/TH-AI-Lab-PKU/SemBlock.
NLLog: Lightweight, Interpretable SOC Anomaly Detection via Log-to-Language Rewriting
Samuel Ndichu, Tao Ban, Seiichi Ozawa, Takeshi Takahashi, Daisuke Inoue
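AI summary: Proposes NLLog, a pipeline that rewrites log templates into natural-language sentences, combines TF-IDF weighting with tree-ensemble classification, and uses TreeSHAP to provide interpretable anomaly detection, achieving low false-positive rates and low latency on the HDFS, BGL, and AIT datasets.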
System-generated logs underpin security monitoring, yet their rigid template-based format hinders both automated analysis and human comprehension. We present NLLog (Natural-Language Log), a lightweight pipeline that deterministically rewrites parsed templates into WHO-WHAT-SEVERITY sentences, pools them with term-frequency-inverse-document-frequency weighting, classifies sessions with tree ensembles, and back-projects evidence with TreeSHAP for analyst review. On Hadoop Distributed File System (HDFS) and Blue Gene/L (BGL) corpora, NLLog exceeds two reproduced matched-protocol baselines; across HDFS, BGL, and the AIT Alert Data Set, it sustains low false-positive rates with commodity-hardware latency suitable for security operations center triage. Coverage, sparse-versus-dense, faithfulness, and adversarial ablations show that fallback sufficiency is corpus-dependent, that an enrollment-time coverage check can surface refinement requirements before deployment, and that an auditable deterministic rewrite combined with lightweight dense encoding provides a measurable representation layer for log-anomaly detection and triage.
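The first two pipeline stages (deterministic rewriting, then TF-IDF weighting per session) might look roughly like the sketch below. It is one possible reading of the abstract, not the NLLog code: the sentence pattern in `rewrite`, the whitespace tokenizer, the smoothed IDF formula, and the sparse per-session dictionaries are all assumptions, and the classification and TreeSHAP stages are omitted.

```python
import math
from collections import Counter


def rewrite(who, what, severity):
    """Render one parsed template as a WHO-WHAT-SEVERITY sentence (assumed pattern)."""
    return f"{who} {what} with severity {severity}"


def tfidf_vectors(sessions):
    """Turn each session (a list of sentences) into a sparse TF-IDF dict."""
    # Term counts per session, using a plain whitespace tokenizer.
    docs = [Counter(w for s in sess for w in s.lower().split())
            for sess in sessions]
    n = len(docs)
    # Document frequency: number of sessions containing each term.
    df = Counter(term for d in docs for term in d)
    # Smoothed inverse document frequency (an assumed variant).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [
        {t: (c / sum(d.values())) * idf[t] for t, c in d.items()}
        for d in docs
    ]


if __name__ == "__main__":
    sessions = [
        [rewrite("datanode", "received block", "info")],
        [rewrite("datanode", "failed to write block", "error")],
    ]
    for vec in tfidf_vectors(sessions):
        print(sorted(vec.items(), key=lambda kv: -kv[1])[:3])
```

Terms that occur in only one session, such as `error`, receive a higher weight than terms shared by every session, such as `datanode`, which is what lets a downstream classifier focus on the distinguishing words.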
Clinical Assistant for Remote Engagement (CARE-link): A Web-Based Electronic Health Record Software for Managing Diabetes
Prince Ebenezer Adjei, Joshua Teye Tettey, Toufiq Musah, Audrey Agbeve, John Amuasi
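AI summary: CARE-link is an open-source, web-based clinical support platform that links clinicians and patients through an LLM-mediated workflow to improve gestational diabetes management; it aggregates patient-generated data from outside the hospital, provides clinical decision support, and gives patients explanations of management plans and lifestyle guidance through a WhatsApp interface.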
CARE-link is an open-source, web-based clinical support platform designed to improve the management of gestational diabetes by linking clinicians and patients through an LLM-mediated workflow. The system aggregates patient-generated data outside the hospital, summarizes relevant clinical information, and delivers context-aware decision support to clinicians. For patients, CARE-link provides clear explanations of management plans and delivers timely lifestyle guidance through a WhatsApp interface. The integrated dual-facing design aims to promote continuous monitoring, support individualized care, and reduce the burden of in-clinic follow-ups. Built with a modular architecture, the platform can be adapted to other chronic conditions requiring longitudinal tracking and behavioral support. CARE-link has the potential to enhance clinical oversight, promote patient compliance, and strengthen continuity of care, particularly in resource-constrained settings.
A General Framework for Dynamic Consistent Submodular Maximization
Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Ola Svensson, Morteza Zadimoghaddam
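AI summary: For submodular maximization in the fully dynamic setting, the paper proposes a general algorithmic framework that yields the first constant-factor approximations with sublinear consistency.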
Consistency is an important property in dynamic submodular maximization and entails maintaining a near-optimal solution at all times, making only a small number of adjustments to the solution in each step. Prior work has explored this question for the insertion-only case, where the algorithm faces a stream of $n$ insertions, and has established lower and upper bounds for the cardinality-constrained version of the problem. We consider this question in the fully dynamic setting, where the stream of operations may contain both insertions and deletions. We develop a general framework for designing algorithms for this setting, and instantiate it to obtain the first constant-factor approximations with sublinear consistency. For cardinality constraints, we propose a $\frac 12 - O(\varepsilon)$ approximation that is $O\left(\frac{1}{\varepsilon^2}\right)$ consistent. For rank-$k$ matroid constraints, we construct a $\frac 14 - O(\varepsilon)$ approximation to the dynamic optimum that is $O\left(\frac{\log k}{\varepsilon^2}\right)$ consistent.
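To make the notion of consistency concrete, the sketch below measures how much a maintained solution changes per update. It is not the framework from the abstract: it uses a naive baseline that reruns greedy after every insertion or deletion, a toy coverage objective, and it assumes consistency is counted as the size of the symmetric difference between consecutive solutions. The names `coverage`, `greedy`, and `run_stream` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def coverage(solution: Set[str], ground: Dict[str, FrozenSet[int]]) -> int:
    """Toy monotone submodular objective: number of items covered."""
    covered: Set[int] = set()
    for name in solution:
        covered |= ground[name]
    return len(covered)


def greedy(ground: Dict[str, FrozenSet[int]], k: int) -> Set[str]:
    """Standard greedy under a cardinality constraint k (recomputed from scratch)."""
    chosen: Set[str] = set()
    for _ in range(k):
        best, best_gain = None, 0
        for name in sorted(ground):  # sorted for deterministic tie-breaking
            if name in chosen:
                continue
            gain = coverage(chosen | {name}, ground) - coverage(chosen, ground)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no element adds positive gain
            break
        chosen.add(best)
    return chosen


def run_stream(ops: List[Tuple[str, str, Set[int]]], k: int) -> Tuple[Set[str], int]:
    """Process insert/delete operations; return the final solution and the
    largest per-step change (assumed here to be the symmetric difference)."""
    ground: Dict[str, FrozenSet[int]] = {}
    prev: Set[str] = set()
    worst_change = 0
    for kind, name, items in ops:
        if kind == "insert":
            ground[name] = frozenset(items)
        else:  # "delete"
            ground.pop(name, None)
        cur = greedy(ground, k)
        worst_change = max(worst_change, len(cur ^ prev))
        prev = cur
    return prev, worst_change
```

On a stream that inserts a dominant element and then deletes it, this baseline swaps its whole solution twice, which is the kind of churn a consistent algorithm is meant to bound.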
Mean-Based Algorithms: Lower Bounds and Regret
Julius Durmann, Amelie Kleber
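AI summary: For the setting with an unknown time horizon and bandit feedback only, the paper gives the first lower bound on the sequence γ_t that defines mean-based algorithms and proposes two new algorithms. Experiments show that they are competitive with existing algorithms, and the paper also analyzes the relationship to no-regret algorithms.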
Mean-based algorithms are a class of online learning algorithms that assign low probability to actions with low average rewards. Recent work indicates these algorithms converge favorably to serially undominated actions, which approximate Nash equilibria in economic games. However, empirical studies also show slower convergence compared to established algorithms in bandit-feedback scenarios. We study mean-based algorithms when the time horizon is unknown and only bandit feedback is available. In this setting, we provide the first lower bound on the algorithm-defining sequence $\gamma_t$ that formally establishes a limit on how fast these algorithms can learn. Additionally, we propose two mean-based algorithms: one generalizes $\varepsilon$-greedy, and the other extends the mean-based Exp3 to unknown horizons. Our experiments show that mean-based algorithms, although slightly slower, can perform competitively with other bandit-feedback algorithms. We further analyze the relationship to no-regret algorithms. Depending on the choice of $\gamma_t$, the intersection with no-regret algorithms is non-trivial, and we show that algorithms exist that are both mean-based and no-regret. This adds context to the "exploitability" of this class of algorithms that previous contributions suggest.
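As a rough illustration of the mean-based idea with bandit feedback, the following sketch implements a plain ε-greedy rule whose exploration rate decays with the round index, so no horizon needs to be known in advance. It is not one of the two algorithms proposed in the abstract; the schedule 1/sqrt(t) and the names `action_probabilities`, `exploration_rate`, and `run_bandit` are assumptions made for the example.

```python
import random
from typing import List


def action_probabilities(avg_rewards: List[float], eps: float) -> List[float]:
    """Epsilon-greedy distribution: arms with lower average reward than the
    leader receive only the exploration mass eps / K."""
    k = len(avg_rewards)
    best = max(range(k), key=lambda a: avg_rewards[a])
    probs = [eps / k] * k
    probs[best] += 1.0 - eps
    return probs


def exploration_rate(t: int) -> float:
    """Assumed anytime schedule (rounds start at t = 1); no horizon required."""
    return min(1.0, 1.0 / (t ** 0.5))


def run_bandit(arm_means: List[float], horizon: int, seed: int = 0) -> List[int]:
    """Simulate Bernoulli arms; only the pulled arm's reward is observed."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        avgs = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(k)]
        probs = action_probabilities(avgs, exploration_rate(t))
        arm = rng.choices(range(k), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts  # number of pulls per arm
```

The averages here are computed only from observed pulls, which is what makes the feedback "bandit"; a full-information variant would update every arm each round.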
AdaKoop: Efficient Modeling of Nonlinear Dynamics in Nonstationary Data Streams via Koopman Operator Regression
Naoki Chihara, Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai
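AI summary: Proposes AdaKoop, a streaming algorithm based on Koopman operator theory and a probabilistic framework that represents nonlinear dynamics as a linear system, enabling efficient and stable modeling of nonstationary data streams and outperforming existing methods on 71 benchmark datasets.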
Real-time data analysis requires the ability to accurately and adaptively address nonlinear dynamics in a nonstationary data stream while preserving computational efficiency. However, nonlinear dynamics are so complex that capturing dynamically changing nonlinear patterns and utilizing them for downstream tasks under strict time constraints is nontrivial. To bridge the gap between nonlinear complexity and computational tractability, this study applies Koopman operator theory, which states that nonlinear dynamics can be represented as linear transitions in an infinite-dimensional space. Building upon finite-dimensional approximations of this operator, we present AdaKoop, an efficient streaming algorithm for modeling nonlinear dynamics over nonstationary data streams. Our approach utilizes a probabilistic framework grounded in Koopman operator theory, treating both raw observations and reproducing kernel Hilbert space (RKHS) features as emissions from latent vectors. This dual-view formulation allows nonlinear dynamics to be expressed as a tractable linear system. Therefore, AdaKoop enables the efficient and stable modeling of nonlinear dynamics in a streaming fashion, avoiding the prohibitive computational costs of iterative nonlinear optimization. Furthermore, to address nonstationarity in data streams, AdaKoop adaptively detects the switching of patterns via statistical hypothesis testing for abrupt pattern shifts and incrementally updates model parameters to handle continuous changes. Extensive experiments on a total of 71 practical benchmark datasets across various domains demonstrate that AdaKoop outperforms state-of-the-art methods in terms of real-time forecasting accuracy and computational efficiency.
Sequential Data Poisoning in LLM Post-Training
Jack Sanderson, Yihan Wang, Xiaoqian Lu, Gautam Kamath, Yiwei Lu
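AI summary: Proposes the sequential data poisoning threat model, in which multiple attackers separately poison different stages of LLM post-training (SFT and preference data). Each attacker alone appears to pose little threat, but multi-stage collaboration exposes the true vulnerability, and the contributions are additive or complementary depending on the pipeline.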
LLM post-training proceeds through multiple stages, e.g., supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), where each stage draws data from different, potentially untrusted sources. Existing literature assumes data poisoning attacks may occur at each training stage, but neglects the possibility of multiple attackers. To study the trustworthiness of the entire post-training pipeline, we propose the threat model of sequential data poisoning, where multiple adversaries separately poison the SFT and preference datasets. Under this threat model, we identify the single-attacker illusion: each adversary, evaluated in isolation, appears to pose a negligible threat. Yet when adversaries collaborate across stages, the true vulnerability is revealed. In the SFT $\to$ DPO pipeline, their contributions are additive: splitting a fixed poison budget across stages outperforms concentrating it in either stage alone. In the SFT $\to$ PPO pipeline, their contributions are complementary: neither SFT nor reward model poisoning succeeds individually, yet their combination does. These findings show that security analyses of individual post-training stages systematically underestimate compound vulnerabilities that emerge only from their interaction. Code is available at https://github.com/jcksanderson/sequential-poisoning.
Data Attribution in Large Language Models via Bidirectional Gradient Optimization
Frédéric Berdoz, Luca A. Lanzendörfer, Kaan Bayraktar, Roger Wattenhofer
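AI summary: Proposes a training data attribution method for autoregressive large language models based on bidirectional gradient optimization, identifying the training data that most influences model outputs and improving interpretability.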
Large Language Models (LLMs) are increasingly deployed across diverse applications, raising critical questions for governance, accountability, and data provenance. Understanding which training data most influenced a model's output remains a fundamental open problem. We address this challenge through training data attribution (TDA) for auto-regressive LLMs by expanding upon the inverse formulation: How would training data be affected if the model had seen the generated output during training? Our method perturbs the base model using bidirectional gradient optimization (gradient ascent and descent) on a generated text sample and measures the resulting change in loss across training samples. Our framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. We evaluate our method against baselines on pretrained models with known datasets, and show that it outperforms previous work on influence metrics, thereby enhancing model interpretability, an essential requirement for accountable AI systems.
Scene-Centric Unsupervised Video Panoptic Segmentation
Christoph Reich, Oliver Hahn, Nikita Araslanov, Laura Leal-Taixé, Christian Rupprecht, Daniel Cremers, Stefan Roth
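AI summary: Introduces the task of unsupervised video panoptic segmentation and its first method, VideoCUPS, which generates pseudo-labels from depth, motion, and visual cues and trains with Video DropLoss to achieve accurate segmentation without supervision.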
Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing work on unsupervised scene understanding has mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.
Can Crowdsourcing Survive the LLM Era? A Community Survey on Human Data Collection
Aswathy Velutharambath, Neele Falk, Sofie Labat, Tarun Tater, Amelie Wuehrl
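AI summary: Surveys 155 researchers in NLP and related fields on how LLMs challenge the validity of crowdsourced data, along with detection strategies and countermeasures; 44% of respondents observed LLM usage, but existing efforts remain insufficient.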
The widespread use of Large Language Models (LLMs) as writing tools challenges the validity of crowdsourced data, as crowdworkers may outsource tasks to models. To better understand how this is addressed, we surveyed 155 researchers in NLP and related disciplines about their experiences and opinions on collecting free-text responses via crowdsourcing. This paper provides an overview of practitioners' challenges, mitigation strategies, and the foreseen implications on data quality. 44% of respondents reported observing LLM usage in their crowdsourced data. While 93% of them had anticipated this, half were unsure what precautions to take. The most prevalent detection strategies are distinctive textual style patterns and unusually fast completion times. Overall, survey responses show that the research community is aware of the problem and taking measures, but existing efforts remain insufficient to fully address it. Finally, we derive a set of considerations to guide future crowdsourced free-text data collection in the era of LLMs.
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng, Juanzi Li, Xiaozhi Wang
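AI summary: Proposes CHERRL, a controllable hacking environment that reproduces reward hacking by injecting known biases, analyzes their discoverability and exploitability, and explores agent-based automatic detection.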
Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.
Geometry-Aware Distillation for Prompt-Tuning Biomedical Vision-Language Models
Tran Dinh Tien, Zhiqiang Shen
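AI summary: Proposes the Omni-Geometry Knowledge Distillation (OGKD) framework, which injects class-relation structure into the teacher model to produce directional targets that preserve the ground-truth label while respecting inter-class geometry, and designs Global Geometry-Aware Distillation (GAD) and Label-Guided Geometry Distillation (LGD) losses, improving average accuracy by 1.7%-2.8% on 11 medical datasets.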
Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.
SURF: Separation via Unsupervised Remixing Flow
Henry Li, Robin Scheibler, Efthymios Tzinis, Matt Shannon, Arnaud Doucet, John R. Hershey
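AI summary: Proposes SURF, an unsupervised flow matching method that learns source separation directly from mixtures, combining supervised flow matching with self-supervised regression and using a remixing step to bootstrap a student model; it sets a new state of the art on image and audio benchmarks.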
The goal of single-channel source separation is to reconstruct $K$ sources given their mixture. In supervised settings where vast amounts of clean source data are available, this challenging, ill-posed problem has been addressed successfully by generative diffusion and flow-based prior models. However, access to such clean source samples is often limited, and even when available, supervised models are vulnerable to domain shifts. To bridge this gap, we present Separation via Unsupervised Remixing Flow (SURF), an unsupervised flow matching approach for source separation that learns directly from observed mixtures. This method relies on a novel combination of state-of-the-art supervised flow matching and regression-based self-supervised techniques. At a high level, starting from a teacher model, we utilize a "remixing" step to bootstrap the learning of a student flow model from the teacher's estimates. We provide insights into the objectives optimized by this approach and draw a novel connection to the Wake-Sleep algorithm. Empirical evaluations on image and audio benchmarks demonstrate that SURF establishes a new state of the art, significantly outperforming existing unsupervised methods. See our demo page for examples: https://google.github.io/df-conformer/surf/
Worker Utility as Hysteresis: A Preisach Model of Transaction Acceptance in Gig Labor Markets
Piotr Frydrych
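AI summary: Proposes a Preisach hysteresis model to represent the hidden preferences of gig workers, estimating acceptance and rejection utilities with a dual-output neural network combined with an XGBoost classifier. On 36,891 transactions it achieves Jaccard = 0.827 and ROC AUC = 0.799, and it shows that price decreases affect completion rates more than increases.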
Worker utility is not observed -- only its consequence is. Each gig transaction produces a single bit: accepted or rejected. We argue this structure points directly to the Preisach hysteresis model as the natural representation of latent worker preferences. The Preisach operator models aggregate output as an integral over a population of binary threshold elements -- precisely the structure that emerges when heterogeneous workers each carry a private acceptance wage. We estimate two latent utility surfaces: acceptance utility U_1(X) and rejection utility U_0(X), via a dual-output neural network (shared layers 256->128, margin loss enforcing U_1 >= U_0). Classification reduces to the Preisach gap U_1(X) - U_0(X), passed into an XGBoost classifier alongside clip-stabilised price-to-threshold encodings. On 36,891 gig transactions, this pipeline achieves Jaccard = 0.827 and ROC AUC = 0.799. The price-to-threshold encoding accounts for +11.0 pp AUC over raw utility features. The model confirms the directional asymmetry hysteresis predicts: price decreases depress completion rates more than equivalent increases raise them. Applied to the full dataset, the model's recommendations simultaneously reduce the total wage bill by 21.3% and increase expected fill rate by 9.7 pp. For 74.2% of transactions, P(accept) already exceeds 0.80; reducing the wage keeps it above threshold (mean post-cut P = 0.972), releasing cost savings (median 31%). For the remaining 25.4%, a median 7% wage increase recovers +43 pp acceptance. A model without an explicit indifference zone cannot execute both moves simultaneously.
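The Preisach construction mentioned above can be sketched in a few lines: each binary threshold element (a relay) switches on at an upper threshold, switches off at a lower one, and otherwise keeps its previous state, and the aggregate output is a weighted sum over the population. This is a generic discrete Preisach sketch, not the neural estimation pipeline from the abstract; the thresholds, weights, and the names `Relay`, `preisach_output`, and `sweep` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Relay:
    """One binary threshold element (e.g. one worker's accept/quit wages)."""
    up: float      # switches on when the input reaches this level
    down: float    # switches off when the input falls to this level (down <= up)
    state: int = 0

    def step(self, x: float) -> int:
        if x >= self.up:
            self.state = 1
        elif x <= self.down:
            self.state = 0
        # between the thresholds the relay remembers its previous state
        return self.state


def preisach_output(relays: List[Relay], weights: List[float], x: float) -> float:
    """Discrete Preisach operator: weighted sum of relay outputs at input x."""
    return sum(w * r.step(x) for r, w in zip(relays, weights))


def sweep(relays: List[Relay], weights: List[float], inputs: List[float]) -> List[float]:
    """Apply a sequence of inputs in order; relay states carry over between steps."""
    return [preisach_output(relays, weights, x) for x in inputs]
```

With relays `(up, down)` set to `(10, 6)`, `(14, 8)`, `(18, 12)` and unit weights, the input sequence `[12, 20, 12]` yields `[1.0, 3.0, 2.0]`: the same input of 12 produces different outputs depending on the path taken, which is the history dependence the abstract relies on.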
Caliper: Probing Lexical Anchors and Causal Structure in LLMs
Zhenyu Yu, Shuigeng Zhou
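AI summary: Using a lexical anonymization perturbation, shows that the performance of large language models on causal reasoning benchmarks relies mainly on lexical pattern matching rather than structural causal reasoning.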
Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.
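The perturbation described above, replacing semantic variable names with placeholder tokens while leaving the question structure intact, might look roughly like the sketch below. It is an assumption-laden illustration rather than the Caliper implementation: the placeholder format `X1, X2, ...`, the case-insensitive whole-word matching, and the function name `anonymize` are all hypothetical.

```python
import re
from typing import Dict, List, Tuple


def anonymize(text: str, variables: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each listed variable name with a placeholder token.

    Returns the rewritten text and the name -> placeholder mapping, so the
    causal graph can be relabeled with the same tokens.
    """
    if not variables:
        return text, {}
    mapping = {name: f"X{i}" for i, name in enumerate(variables, start=1)}
    lookup = {name.lower(): token for name, token in mapping.items()}
    # Longest names first so multi-word names win over shorter overlaps.
    alternatives = "|".join(
        re.escape(v) for v in sorted(variables, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", flags=re.IGNORECASE)
    rewritten = pattern.sub(lambda m: lookup[m.group(1).lower()], text)
    return rewritten, mapping
```

For example, with variables `["smoking", "tar deposits", "lung cancer"]`, the question "Smoking causes tar deposits. Tar deposits cause lung cancer. Does smoking raise the chance of lung cancer?" becomes "X1 causes X2. X2 cause X3. Does X1 raise the chance of X3?", which keeps the chain structure while removing the lexical cues.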
BreastGPT: A Multimodal Large Language Model for the Full Clinical Workflow of Breast Cancer
Yang Liu, Jiajin Zhang, Danyang Tu, Yaojun Hu, Jiao Qu, Jiuyu Zhang, Yu Shi, Wei Fang, Shi Gu, Ling Zhang, Yingda Xia
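AI summary: Proposes BreastGPT, a multimodal large language model that, through the workflow-aligned instruction corpus BreastStage and a dual-branch visual encoder, performs multimodal reasoning across breast cancer screening, diagnosis, and treatment planning, achieving 75.66% closed-ended accuracy and an 89.92% open-ended score on BreastStage-Bench.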
Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis, and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, constrained by data scarcity and model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, limiting their ability to support workflow-level clinical reasoning. In this work, we first introduce BreastStage, a workflow-aligned breast imaging instruction corpus comprising 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. Its held-out split, BreastStage-Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual-branch visual encoder and concept-preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage-Bench, BreastGPT achieves 75.66% closed-ended accuracy and 89.92% open-ended score, outperforming both general-purpose and medical-specific MLLMs across clinical stages and task formats. These results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at https://yangyy-liu.github.io/BreastGPT.io.
BEATS: Bootstrapping E-Commerce Search Attribute Taxonomies via Iterative Human-AI Collaboration
Yung-Yu Shih, Shang-Yu Su, Tzu-I Ho, Dongzhe Wang, Yun-Nung Chen
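AI summary: For e-commerce platforms in emerging markets that lack structured attribute schemas, proposes BEATS, a framework that uses a human-in-the-loop LLM pipeline to build product attribute taxonomies from scratch and improves search system performance through attribute tagging.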
E-commerce platforms in emerging markets often operate with underdeveloped product catalogs that contain only category taxonomies but lack structured attribute schemas. This absence of fine-grained product attributes limits search capabilities -- preventing faceted filtering, degrading query understanding, and weakening semantic representations used by search systems. We present BEATS, a human-in-the-loop LLM framework for bootstrapping product attribute taxonomies entirely from scratch. Our approach extends a multi-stage LLM generation pipeline with two critical production stages: (1) proactive quality checking by model developers to filter erroneous outputs, and (2) human annotation by domain-expert local staff to validate generated attributes. The framework operates iteratively -- prompts at each generation stage are refined based on quality check observations and annotator feedback across successive rounds, progressively improving attribute quality. Once the attribute taxonomy is established, we employ LLMs to perform structured attribute tagging on individual product items, enriching their contextual representations. The enriched catalog directly benefits multiple components of the search system: enabling granular attribute-based filtering, providing structured features for ranking models, and improving semantic representations for dense retrieval. We validate the generated taxonomy by training dense retrieval models on attribute-enriched product data, demonstrating consistent improvements over baselines using original catalog information. Our system has been deployed at Rakuten Taiwan, enriching 9 major categories spanning 2,694 sub-categories with 67,277 generated attributes, and over 5.4 million products have been tagged with the generated attributes, with plans to enrich the entire product catalog.