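arXivDaily: a daily academic digest synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.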

2507.03373 2026-06-04 cs.CL

WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl

Affiliations: King’s College London; Wikimedia Foundation

AI summary: Proposes WETBench, a multilingual, multi-generator, task-specific benchmark for detecting machine-generated text in Wikipedia editing scenarios. In the experiments, training-based detectors average 78% accuracy and zero-shot detectors average 58%.

Abstract

Given Wikipedia's role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors' perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.

2506.05233 2026-06-04 cs.LG cs.AI cs.CL

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Sarthak Mittal, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, João Sacramento

Affiliations: Google; Paradigms of Intelligence Team; Google DeepMind; MIT CSAIL

AI summary: Proposes a Mesa layer that performs locally optimal test-time training with a conjugate gradient solver. While keeping constant inference cost, it surpasses existing RNN models in language-modeling perplexity and downstream benchmark performance.

Comments: Published at ICLR 2026

Abstract

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require memory and compute that scale linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), which could only run sequentially in time and was therefore not scalable. This layer again stems from an in-context loss, which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments up to the billion-parameter scale, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional FLOPs spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
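
The abstract describes minimizing an in-context regression loss to optimality with a conjugate gradient solver. As a rough illustration only (this is not the authors' Mesa layer, which is chunkwise parallel and numerically stabilized), the sketch below solves a small ridge-regression system (K^T K + lam*I) w = K^T v with plain conjugate gradient. The names `keys`, `values`, `lam`, and the scalar-target setup are assumptions made for the example.

```python
def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def conjugate_gradient(A, b, iters=50, tol=1e-12):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x starting at zero
    p = list(r)  # first search direction
    rs = sum(v * v for v in r)
    for _ in range(iters):
        if rs < tol:  # converged: squared residual norm is negligible
            break
        Ap = matvec(A, p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def ridge_system(keys, values, lam):
    """Build A = K^T K + lam * I and b = K^T v for scalar targets (illustrative setup)."""
    d = len(keys[0])
    A = [[sum(k[i] * k[j] for k in keys) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(k[i] * v for k, v in zip(keys, values)) for i in range(d)]
    return A, b


# Toy in-context data: targets generated by the linear map w = [2, -1].
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [2.0, -1.0, 1.0]
A, b = ridge_system(keys, values, lam=1e-6)
w = conjugate_gradient(A, b)
```

Conjugate gradient only needs matrix-vector products with a symmetric positive-definite matrix, which is why it suits a regularized least-squares problem of this shape.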

2506.04281 2026-06-04 cs.LG

Uncovering Insights of Compound Flooding with Data-Driven AI

Xu Zheng, Chaohao Lin, Sipeng Chen, Zhuomin Chen, Jimeng Shi, Jayantha Obeysekera, Jingchao Ni, Wei Cheng, Jason Liu, Dongsheng Luo

Affiliations: Florida International University; Florida State University; UIUC; University of Houston; NEC Lab America; Singapore Management University

AI summary: Analyzes compound flooding in South Florida with data-driven methods that integrate tides, rainfall, groundwater levels, and human water management activities. Groundwater level emerges as the dominant predictor, and spatially coupled states matter more than long-term temporal dependencies.

Comments: Accepted to SIGKDD 2026 AI for Science Track; 12 Pages, 5 Figures, 6 Tables

Abstract

Compound flooding, driven by nonlinear interactions between multiple hydrometeorological factors, poses a significant challenge to hazard prevention. Existing forecasting approaches, whether physics-based or data-driven, often emphasize temporal patterns while underexploring how multiple interacting factors jointly shape flood dynamics. To address this problem, we conduct a large-scale data-driven analysis of compound flooding in South Florida, a typical area for compound flooding, by integrating tidal conditions, rainfall, groundwater stage, and human water management activities. Our analysis reveals three key findings: (i) models that capture temporal dynamics alone fail to represent multi-factor interactions during compound events; (ii) subsurface saturation, as reflected by groundwater levels, emerges as a dominant predictor of flood severity, often outweighing immediate rainfall intensity in this porous coastal region; and (iii) the spatial state of surrounding monitoring stations within a finite effective radius provides critical causal context for flooding, while extending temporal history yields diminishing returns during extreme events. These findings suggest that compound flooding is governed more by spatially coupled system states than by long-term temporal dependencies, challenging rain-centric and sequence-dominated forecasting paradigms. By framing data-driven models as tools for scientific inquiry rather than prediction alone, this study offers new insights into the mechanisms of compound flooding and informs the design of more physically grounded early-warning systems for coastal environments. Our dataset and code are publicly available at https://github.com/AslanDing/SFBench.

2505.19293 2026-06-04 cs.CL cs.AI cs.LG

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

Affiliations: Case Western Reserve University; Texas A&M University; Rice University; University of California, Los Angeles; Meta

AI summary: Existing long-context benchmarks cannot separate baseline ability from true long-context ability, and they use fixed input lengths. To address this, the paper proposes a length-controllable long-context benchmark and a new metric for evaluating the long-context ability of large language models.

Abstract

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.

2505.17315 2026-06-04 cs.AI cs.CL cs.LG

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han

Affiliations: Case Western Reserve University; University of Minnesota - Twin Cities; Texas A&M University

AI summary: Experiments show that strengthening a model's long-context ability before supervised fine-tuning significantly improves reasoning performance, with gains that carry over even to short-input tasks, indicating that long-context modeling is a key foundation for reasoning.

Abstract

Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) higher context window length often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compared models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.

2502.17956 2026-06-04 cs.CL

Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

Affiliations: School of Information Science and Technology, VISTEC; KAIST; Cohere; SCB 10X; AI Singapore; Department of Computer Engineering, Chulalongkorn University

AI summary: Proposes a framework that evaluates Program-of-Thought prompting by separating reasoning from code execution. Fine-tuning substantially improves multilingual reasoning, and reasoning quality correlates strongly with answer accuracy.

DOI: 10.18653/v1/2025.findings-acl.817
Journal ref: Findings of the Association for Computational Linguistics: ACL 2025

Abstract

Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.
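
To make the reasoning/execution split concrete, here is a minimal sketch of the Program-of-Thought idea: the model's "reasoning" is a program, and a separate interpreter step produces the answer. The `generated` string stands in for model output and the `run_program` helper is hypothetical; neither comes from the paper, and emptying `__builtins__` is not a real security sandbox.

```python
def run_program(program: str):
    """Execute a generated program and return the value it binds to `answer`."""
    namespace = {}
    # Empty builtins keep the example small; this is NOT a secure sandbox.
    exec(program, {"__builtins__": {}}, namespace)
    return namespace.get("answer")


# Stand-in for a program a model might emit for a word problem asked in any language:
# "There are 3 boxes of 4 apples each; 5 apples are eaten. How many remain?"
generated = "apples = 3 * 4\neaten = 5\nanswer = apples - eaten\n"

result = run_program(generated)
```

Because the arithmetic is carried out by the interpreter, an evaluation can score the program (the reasoning) separately from the final answer, which is the separation the abstract relies on.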

2505.15354 2026-06-04 cs.LG stat.ML

Post-Training Corrections for Improved Time-Series Forecasting

Human-in-the-Loop Adaptive Optimization for Improved Time-Series Forecasting

Hamza Cherkaoui, Malik Tiomoko, Giuseppe Paolo, Zhang Yili, Yu Meng, Zhang Keli, Hafiz Tiomoko Ali

Affiliations: SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris; Noah's Ark Lab; Independent Researcher

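AI summary: Proposes a lightweight post-training adaptive optimization framework that needs no retraining or architecture changes. It uses reinforcement learning, contextual bandits, or genetic algorithms to automatically learn expressive transformations that correct model outputs, and lets human experts guide the corrections in natural language, consistently improving forecasting accuracy on multiple benchmarks with minimal computational overhead.

AI Chinese abstract (translated)

Time-series forecasting models often produce systematic, predictable errors, even in critical domains such as energy, finance, and healthcare. We introduce a novel post-training adaptive optimization framework that improves forecast accuracy without retraining or architectural changes. Our method automatically applies expressive transformations, optimized via reinforcement learning, contextual bandits, or genetic algorithms, to correct model outputs in a lightweight and model-agnostic way. Theoretically, we prove that affine corrections always reduce the mean squared error; practically, we extend this idea with dynamic action-based optimization. The framework also supports an optional human-in-the-loop component: domain experts can guide corrections using natural language, which a language model parses into actions. Across multiple benchmarks (e.g., electricity, weather, traffic), we observe consistent accuracy gains with minimal computational overhead. Our interactive demo shows the framework's real-time usability. By combining automatic post-hoc refinement with interpretable and extensible mechanisms, our approach offers a powerful new direction for practical forecasting systems.

Abstract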

Time-series forecasting is a critical task in various business domains, but it remains inherently challenging. Typically, large forecasting models are trained in a single, resource-intensive run. Once training is completed, a natural question arises: is there still potential for meaningful improvement in the model's performance? Motivated by techniques from boosting, we introduce the concept of post-training corrections. This approach enhances a trained forecaster by sequentially applying a carefully selected set of corrections to its predictions. Our method offers a lightweight, model-agnostic, and scalable strategy to improve forecasting performance in practical settings. We provide theoretical foundations for the approach, starting with the affine correction case, and analyze the expected performance gains and computational costs in more general settings. Across a range of benchmark datasets, our method consistently delivers up to a 30% improvement in forecasting accuracy over existing state-of-the-art models, with minimal computational overhead.
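
The affine correction case mentioned above can be illustrated with a small sketch: fit a scale `a` and offset `b` by least squares so that `a * prediction + b` tracks the targets. On the data used for fitting, this can never increase the mean squared error, because the identity correction (a = 1, b = 0) is one of the candidates. The function names and toy numbers below are assumptions for illustration, not the paper's procedure or data.

```python
def fit_affine(preds, targets):
    """Least-squares fit of targets ~ a * preds + b (closed form for one variable)."""
    n = len(preds)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    var = sum((p - mean_p) ** 2 for p in preds)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets))
    a = cov / var if var > 0 else 1.0  # constant predictions: only the offset is fitted
    b = mean_t - a * mean_p
    return a, b


def mse(preds, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


# Hypothetical forecasts that systematically under-shoot the observed values.
preds = [1.0, 2.0, 3.0, 4.0]
targets = [1.5, 2.9, 4.6, 6.0]

a, b = fit_affine(preds, targets)
corrected = [a * p + b for p in preds]
```

In practice the correction would be fitted on held-out data and then applied to new forecasts, where the improvement is no longer guaranteed.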

2504.15587 2026-06-04 cs.LG cs.AI

MetaMolGen: A Neural Graph Motif Generation Model for De Novo Molecular Design

Zimo Yan, Jie Zhang, Zheng Xie, Chang Liu, Yizhen Liu, Yiping Song

Affiliations: National University of Defense Technology

AI summary: Proposes MetaMolGen, a meta-learning-based molecular generation model that standardizes the distribution of graph motifs and uses a lightweight autoregressive sequence model for few-shot and property-conditioned molecular generation.

DOI: 10.46793/match.97-2.11226

Abstract

Molecular generation plays an important role in drug discovery and materials science, especially in data-scarce scenarios where traditional generative models often struggle to achieve satisfactory conditional generalization. To address this challenge, we propose MetaMolGen, a first-order meta-learning-based molecular generator designed for few-shot and property-conditioned molecular generation. MetaMolGen standardizes the distribution of graph motifs by mapping them to a normalized latent space, and employs a lightweight autoregressive sequence model to generate SMILES sequences that faithfully reflect the underlying molecular structure. In addition, it supports conditional generation of molecules with target properties through a learnable property projector integrated into the generative process. Experimental results demonstrate that MetaMolGen consistently generates valid and diverse SMILES sequences under low-data regimes, outperforming conventional baselines. This highlights its advantage in fast adaptation and efficient conditional generation for practical molecular design.

2504.12329 2026-06-04 cs.CL cs.AI

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

Wang Yang, Xiang Yue, Vipin Chaudhary, Xiaotian Han

Affiliations: Case Western Reserve University; Carnegie Mellon University

AI summary: Proposes Speculative Thinking, a training-free framework in which a large reasoning model guides a small model at the reasoning level, raising the small model's reasoning accuracy while shortening its output.

Abstract

Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, a substantial improvement of 6.2 percentage points. Simultaneously, the average output length is reduced from 5439 tokens to 4583 tokens, representing a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy from 74.0% to 81.8% on the same benchmark, an absolute improvement of 7.8 percentage points.
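
The accuracy gains quoted above are differences in accuracy points, while the length reduction is a relative change. A quick check of the arithmetic, using only the numbers from the abstract (the helper names are made up for this example):

```python
def point_gain(before, after):
    """Absolute difference, e.g. accuracy in percentage points."""
    return after - before


def relative_change(before, after):
    """Change relative to the starting value, in percent."""
    return (after - before) / before * 100.0


# 1.5B model on MATH500, guided by the 32B reasoning model.
small_points = point_gain(83.2, 89.4)
small_relative = relative_change(83.2, 89.4)

# Average output length in tokens.
length_relative = relative_change(5439, 4583)

# Qwen-2.5-7B-Instruct on the same benchmark.
instruct_points = point_gain(74.0, 81.8)
instruct_relative = relative_change(74.0, 81.8)

print(round(small_points, 1), round(small_relative, 1))
print(round(length_relative, 1))
print(round(instruct_points, 1), round(instruct_relative, 1))
```

Measured relative to the starting accuracy, the two gains come out larger than their point differences, so it matters which convention a reported percentage follows.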

2405.08036 2026-06-04 cs.LG cs.AI

Potentially Optimal Joint Actions Recognition for Cooperative Multi-Agent Reinforcement Learning

Chang Huang, Shatong Zhu, Junqiao Zhao, Hongtu Zhou, Di Zhang, Hai Zhang, Chen Ye, Ziqiao Wang, Guang Chen

Affiliations: School of Computer Science and Technology, Tongji University; Stanford University; MOE Key Lab of Embedded System and Service Computing, Tongji University, Shanghai, China; The University of Hong Kong; Shanghai Innovation Institute

AI summary: Monotonicity constraints in value function factorization limit expressiveness. The paper proposes Potentially Optimal Joint Actions Weighting (POW), whose iterative weighted training guarantees recovery of the optimal policy and which outperforms existing methods on multiple tasks.

Comments: ICLR 2026

Journal ref: ICLR 2026

Abstract

Value function factorization is widely used in cooperative multi-agent reinforcement learning (MARL). Existing approaches often impose monotonicity constraints between the joint action value and individual action values to enable decentralized execution. However, such constraints limit the expressiveness of value factorization, restricting the range of joint action values that can be represented and hindering the learning of optimal policies. To address this, we propose Potentially Optimal Joint Actions Weighting (POW), a method that ensures optimal policy recovery where existing approximate weighting strategies may fail. POW iteratively identifies potentially optimal joint actions and assigns them higher training weights through a theoretically grounded iterative weighted training process. We prove that this mechanism guarantees recovery of the true optimal policy, overcoming the limitations of prior heuristic weighting strategies. POW is architecture-agnostic and can be seamlessly integrated into existing value factorization algorithms. Extensive experiments on matrix games, difficulty-enhanced predator-prey tasks, SMAC, SMACv2, and a highway-env intersection scenario show that POW substantially improves stability and consistently surpasses state-of-the-art value-based MARL methods.

2408.01382 2026-06-04 cs.LG cs.GT

Explaining a probabilistic prediction on the simplex with Shapley compositions

Paul-Gauthier Noé, Miquel Perelló-Nieto, Jean-François Bonastre, Peter Flach

Affiliations: Laboratoire Informatique d’Avignon, Avignon Université, France; University of Bristol, United Kingdom

AI summary: Introduces Shapley compositions, which use the Aitchison geometry from compositional data analysis to provide an axiomatically grounded way of explaining multiclass probabilistic predictions.

Comments: Published in ECAI2024's proceedings

DOI: 10.3233/FAIA240605

Abstract

Originating in game theory, Shapley values are widely used for explaining a machine learning model's prediction by quantifying the contribution of each feature's value to the prediction. This requires a scalar prediction as in binary classification, whereas a multiclass probabilistic prediction is a discrete probability distribution, living on a multidimensional simplex. In such a multiclass setting the Shapley values are typically computed separately on each class in a one-vs-rest manner, ignoring the compositional nature of the output distribution. In this paper, we introduce Shapley compositions as a well-founded way to properly explain a multiclass probabilistic prediction, using the Aitchison geometry from compositional data analysis. We prove that the Shapley composition is the unique quantity satisfying linearity, symmetry and efficiency on the Aitchison simplex, extending the corresponding axiomatic properties of the standard Shapley value. We demonstrate this proper multiclass treatment in a range of scenarios.
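
For readers unfamiliar with the Aitchison geometry, the sketch below shows its basic operations on the simplex: closure (renormalization), perturbation (the analogue of vector addition), and the centred log-ratio (clr) transform. The per-feature compositions `phi_a` and `phi_b` are invented numbers, not Shapley compositions computed from a model; they only illustrate how, under an efficiency-style decomposition, a base distribution perturbed by per-feature contributions yields the prediction.

```python
import math


def closure(x):
    """Rescale positive components so they sum to one."""
    total = sum(x)
    return [v / total for v in x]


def perturb(x, y):
    """Aitchison perturbation: component-wise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)])


def clr(x):
    """Centred log-ratio transform: maps a composition to a zero-sum real vector."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [l - mean for l in logs]


# Three-class example: a uniform base distribution and two made-up feature contributions.
base = [1 / 3, 1 / 3, 1 / 3]
phi_a = closure([2.0, 1.0, 1.0])  # pushes probability towards class 0
phi_b = closure([1.0, 1.0, 0.5])  # pushes probability away from class 2

pred = perturb(perturb(base, phi_a), phi_b)

# In clr coordinates, perturbation becomes ordinary vector addition.
lhs = clr(perturb(phi_a, phi_b))
rhs = [u + v for u, v in zip(clr(phi_a), clr(phi_b))]
```

Working in clr coordinates turns perturbation into ordinary addition, which is what lets additive, Shapley-style decompositions carry over to whole probability distributions.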

2408.11121 2026-06-04 cs.LG cs.AI cs.CL cs.CR

DOMBA: Double Model Balancing for Access-Controlled Language Models via Minimum-Bounded Aggregation

Tom Segal, Asaf Shabtai, Yuval Elovici

Affiliations: Ben-Gurion University

AI summary: Proposes DOMBA, which uses a min-bounded average function to aggregate the probability distributions of two language models trained on documents with different access levels, achieving high utility while guaranteeing security.

Comments: Code: https://github.com/ppo1/DOMBA; 11 pages, 3 figures

DOI: 10.1609/aaai.v39i23.34695
Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 25101-25109, 2025

Abstract

The utility of large language models (LLMs) depends heavily on the quality and quantity of their training data. Many organizations possess large data corpora that could be leveraged to train or fine-tune LLMs tailored to their specific needs. However, these datasets often come with access restrictions that are based on user privileges and enforced by access control mechanisms. Training LLMs on such datasets could result in exposure of sensitive information to unauthorized users. A straightforward approach for preventing such exposure is to train a separate model for each access level. This, however, may result in low utility models due to the limited amount of training data per model compared to the amount in the entire organizational corpus. Another approach is to train a single LLM on all the data while limiting the exposure of unauthorized information. However, current exposure-limiting methods for LLMs are ineffective for access-controlled data, where sensitive information appears frequently across many training examples. We propose DOMBA - double model balancing - a simple approach for training and deploying LLMs that provides high utility and access-control functionality with security guarantees. DOMBA aggregates the probability distributions of two models, each trained on documents with (potentially many) different access levels, using a "min-bounded" average function (a function that is bounded by the smaller value, e.g., harmonic mean). A detailed mathematical analysis and extensive evaluation show that DOMBA safeguards restricted information while offering utility comparable to non-secure models.
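
A minimal sketch of the aggregation idea, assuming the harmonic mean (which the abstract gives as an example of a "min-bounded" function) and a final renormalization step that is our own assumption. The token distributions are invented, and this is not the DOMBA implementation from the linked repository.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; never exceeds 2 * min(a, b)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def aggregate(p1, p2):
    """Combine two next-token distributions token by token, then renormalize."""
    scores = [harmonic_mean(a, b) for a, b in zip(p1, p2)]
    total = sum(scores)
    if total == 0:
        raise ValueError("the two distributions share no token with non-zero probability")
    return [s / total for s in scores]


# Hypothetical three-token vocabulary. Token 0 is likely only under model 1 and
# token 2 only under model 2, as if each had seen restricted text the other had not.
p_model_1 = [0.70, 0.29, 0.01]
p_model_2 = [0.01, 0.29, 0.70]

combined = aggregate(p_model_1, p_model_2)
```

Because the harmonic mean stays close to the smaller of its two inputs, a token that only one model considers likely ends up with low combined probability, while tokens both models agree on dominate.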

2412.06095 2026-06-04 cs.CL cs.FL cs.IT math.IT

Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance

Fermin Moscoso del Prado Martin

Affiliations: Department of Computer Science and Technology & Jesus College, University of Cambridge

AI summary: Demonstrates theoretically and empirically that a grammar's derivational entropy is fundamentally linked to the mean length of the utterances (MLU) it generates, proposes the derivational entropy rate as a new measure of grammatical complexity, and introduces Smoothed Induced Treebank Entropy (SITE) for estimating these measures accurately from small treebanks.

DOI: 10.1162/COLI.a.15
Journal ref: Computational Linguistics (2025) 51 (4): 1191-1233

Abstract

In many fields, such as language acquisition, neuropsychology of language, the study of aging, and historical linguistics, corpora are used for estimating the diversity of grammatical structures that are produced during a period by an individual, community, or type of speakers. In these cases, treebanks are taken as representative samples of the syntactic structures that might be encountered. Generalizing the potential syntactic diversity from the structures documented in a small corpus requires careful extrapolation whose accuracy is constrained by the limited size of representative sub-corpora. In this article, I demonstrate -- theoretically, and empirically -- that a grammar's derivational entropy and the mean length of the utterances (MLU) it generates are fundamentally linked, giving rise to a new measure, the derivational entropy rate. The mean length of utterances becomes the most practical index of syntactic complexity; I demonstrate that MLU is not a mere proxy, but a fundamental measure of syntactic diversity. In combination with the new derivational entropy rate measure, it provides a theory-free assessment of grammatical complexity. The derivational entropy rate indexes the rate at which different grammatical annotation frameworks determine the grammatical complexity of treebanks. I introduce the Smoothed Induced Treebank Entropy (SITE) as a tool for estimating these measures accurately, even from very small treebanks. I conclude by discussing important implications of these results for both NLP and human language processing.

2411.19758 2026-06-04 cs.CV cs.AI cs.LG

LaVIDE: Language-Prompted Satellite Change Detection via Map-Image Alignment

Shuguo Jiang, Fang Xu, Chuandong Liu, Hong Tan, Shengyang Li, Lei Yu, Wen Yang, Sen Jia, Gui-Song Xia

Affiliations: School of Computer Science, Wuhan University; School of Artificial Intelligence, Wuhan University; Technology and Engineering Center for Space Utilization and the Key Laboratory of Space Utilization, Chinese Academy of Sciences; School of Aeronautics and Astronautics, University of Chinese Academy of Sciences; School of Electronic Information, Wuhan University; College of Computer Science and Software Engineering, Shenzhen University

AI summary: Proposes the LaVIDE framework, which uses restricted prompt learning and object-aware embedding enhancement to bridge, through language, the semantic gap between high-level map categories and low-level image details and achieve cross-modal alignment, improving IoU by 18.4% and 5.2% on multi-class and single-class change detection, respectively.

Abstract

Remote sensing change detection based on a map reference and an up-to-date image boosts timely observation of the Earth's surface when earlier images are lacking for comparison. However, the semantic gap between high-level map categories and low-level image details hinders the extraction of homogeneous features for robust temporal association in change detection. Unlike conventional approaches that either compare pixel-level visual similarity or propagate segmentation errors, we propose a novel framework, Language-VIsion Discriminator for dEtecting changes (LaVIDE), which bridges the semantic gap between high-level map categories and low-level image details using language as an intermediary. Specifically, we introduce restricted prompt learning to generate context-aware textual prompts that align map semantics with image content, and an object-aware embedding enhancement strategy to integrate object-level attributes (e.g., shape, boundary) into map representations. These components enable robust cross-modal alignment within a unified language-vision feature space. Extensive experiments on four benchmarks, DynamicEarthNet, HRSCD, BANDON, and SECOND, demonstrate that LaVIDE outperforms state-of-the-art methods by significant margins, achieving 18.4% and 5.2% improvements in IoU on multi-class and single-class change detection tasks, respectively. Our framework not only advances the accuracy of map-image change detection but also provides a practical solution for rapid map updating with minimal human intervention, promising broad impacts in urban planning, disaster assessment, and ecological conservation. Code and datasets are available at: https://github.com/ShuGuoJ/LAVIDE.git.
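
The reported gains are measured in IoU (intersection over union). For reference, a minimal sketch of IoU on flattened binary change masks follows; the masks are toy data, and returning 1.0 when both masks are empty is a convention assumed here, not something taken from the paper.

```python
def iou(pred, target):
    """Intersection over union for two equal-length binary masks (1 = changed pixel)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Toy flattened masks with six pixels each.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 1, 1, 0]

score = iou(pred_mask, true_mask)
```

For multi-class change detection, a common choice is to compute this per class and average the results, though the abstract does not say which averaging the benchmarks use.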

2409.11901 2026-06-04 cs.CL

LLMs + Persona-Plug = Personalized LLMs

LLMs + Persona-Plug = 个性化大语言模型

Jiongnan Liu, Yutao Zhu, Shuting Wang, Xiaochi Wei, Erxue Min, Yu Lu, Shuaiqiang Wang, Dawei Yin, Zhicheng Dou

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学高瓴人工智能学院）； Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE（教育部下一代智能搜索与推荐工程研究中心）； Baidu Inc.（百度公司）

AI总结提出一种轻量级插件式用户嵌入模块Persona-Plug（PPlug），通过建模用户历史上下文生成个性化嵌入，无需微调即可提升大语言模型输出个性化程度。

AI中文摘要

个性化在众多语言任务和应用中扮演着关键角色，因为具有相同需求的用户可能根据个人兴趣偏好不同的输出。这促使了各种个性化方法的发展，旨在使大语言模型（LLMs）适应生成符合用户偏好的定制化输出。其中一些方法涉及为每个用户微调一个独特的个性化LLM，这过于昂贵而难以广泛应用。替代方法以即插即用的方式引入个性化信息，通过检索用户相关历史文本作为示例。然而，这种基于检索的策略可能会破坏用户历史的连续性，无法捕捉用户的整体风格和模式，从而导致次优性能。为了解决这些挑战，我们提出了一种新颖的个性化LLM模型PPlug。它通过一个轻量级的插件式用户嵌入模块，对每个用户的所有历史上下文进行建模，构建用户特定的嵌入。通过将该嵌入附加到任务输入中，LLMs可以更好地理解和捕捉用户的习惯与偏好，从而在不调整自身参数的情况下生成更个性化的输出。在语言模型个性化（LaMP）基准上的各种任务上的大量实验表明，所提出的模型显著优于现有的个性化LLM方法。

英文摘要

Personalization plays a critical role in numerous language tasks and applications, since users with the same requirements may prefer diverse outputs based on their individual interests. This has led to the development of various personalized approaches aimed at adapting large language models (LLMs) to generate customized outputs aligned with user preferences. Some of them involve fine-tuning a unique personalized LLM for each user, which is too expensive for widespread application. Alternative approaches introduce personalization information in a plug-and-play manner by retrieving the user's relevant historical texts as demonstrations. However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user's overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, PPlug. It constructs a user-specific embedding for each individual by modeling all her historical contexts through a lightweight plug-in user embedder module. By attaching this embedding to the task input, LLMs can better understand and capture user habits and preferences, thereby producing more personalized outputs without tuning their own parameters. Extensive experiments on various tasks in the language model personalization (LaMP) benchmark demonstrate that the proposed model significantly outperforms existing personalized LLM approaches.

2406.09407 2026-06-04 cs.CV

Towards Evaluating the Robustness of Visual State Space Models

评估视觉状态空间模型的鲁棒性

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, Salman Khan

发表机构 * Mohamed Bin Zayed University of AI（穆罕默德·本·扎耶德人工智能大学）； Center of Secure Cyber-Physical Security Systems（安全网络物理安全系统中心）； Linköping University（林雪平大学）； Australian National University（澳大利亚国立大学）

AI总结本文全面评估了视觉状态空间模型（VSSMs）在遮挡、图像结构、常见损坏和对抗攻击等多种扰动下的鲁棒性，并与Transformer和CNN等架构进行比较，揭示了其优势和局限性。

Comments Accepted at The 5th Workshop of Adversarial Machine Learning on Computer Vision (CVPRW 2025)

AI中文摘要

视觉状态空间模型（VSSMs）是一种结合了循环神经网络和潜变量模型优势的新型架构，通过有效捕捉长程依赖和建模复杂视觉动态，在视觉感知任务中表现出色。然而，它们在自然和对抗扰动下的鲁棒性仍然是一个关键问题。在这项工作中，我们全面评估了VSSMs在各种扰动场景下的鲁棒性，包括遮挡、图像结构、常见损坏和对抗攻击，并将其性能与Transformer和卷积神经网络等成熟架构进行比较。此外，我们研究了VSSMs在复杂视觉场景中针对物体-背景组合变化的鲁棒性，使用了专门设计用于测试模型性能的复杂基准。我们还使用模拟真实场景的损坏数据集评估了它们在目标检测和分割任务上的鲁棒性。为了更深入地理解VSSMs的对抗鲁棒性，我们进行了基于频率的对抗攻击分析，评估了它们对低频和高频扰动的性能。我们的发现突出了VSSMs在处理复杂视觉损坏方面的优势和局限性，为未来研究提供了宝贵的见解。我们的代码和模型将在 https://github.com/HashmatShadab/MambaRobustness 提供。

英文摘要

Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In this work, we present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios, including occlusions, image structure, common corruptions, and adversarial attacks, and compare their performance to well-established architectures such as transformers and Convolutional Neural Networks. Furthermore, we investigate the resilience of VSSMs to object-background compositional changes on sophisticated benchmarks designed to test model performance in complex visual scenes. We also assess their robustness on object detection and segmentation tasks using corrupted datasets that mimic real-world scenarios. To gain a deeper understanding of VSSMs' adversarial robustness, we conduct a frequency-based analysis of adversarial attacks, evaluating their performance against low-frequency and high-frequency perturbations. Our findings highlight the strengths and limitations of VSSMs in handling complex visual corruptions, offering valuable insights for future research. Our code and models will be available at https://github.com/HashmatShadab/MambaRobustness.

2404.11309 2026-06-04 cs.CV

Achieving Rotation-Invariant Convolution via Non-Learnable Orientation Alignment Operators

通过不可学习的朝向对齐算子实现旋转不变卷积

Hanlin Mo, Peihong Lei, You Hao, Guoying Zhao

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出基于不可学习算子的旋转不变卷积（RIConvs），其参数量和计算过程与标准卷积相同，在多个视觉任务中提升准确率，尤其在数据有限时效果显著。

AI中文摘要

在深度神经网络中实现旋转不变性而无需数据增强是一个研究热点。内在不变性使特征能够捕捉目标的固有属性，从而提升深度学习在视觉任务中的性能。基于多种类型的不可学习算子，本文提出了一套对任意旋转自然不变的卷积操作。与大多数先前方法不同，这些旋转不变卷积（RIConvs）具有与标准卷积相同的可学习参数数量和相似的计算过程，因此可以互换。使用MNIST-Rot数据集，我们验证了它们在不同旋转角度下的不变性，并与先前的旋转不变CNN进行了比较，其中两种基于梯度的RIConvs取得了最先进的结果。然后，我们将RIConvs与经典CNN骨干网络集成，并在纹理识别、飞机类型识别和遥感图像分类任务上进行了评估。结果表明，RIConvs显著提高了准确率，特别是在训练数据有限的情况下，并且即使在使用数据增强时也能提升性能。

英文摘要

Achieving rotational invariance in deep neural networks without data augmentation is a research hotspot. Intrinsic invariance enables features to capture targets' inherent properties, enhancing deep learning performance in visual tasks. Based on various types of non-learnable operators, this paper proposes a comprehensive set of convolution operations that are naturally invariant to arbitrary rotations. Unlike most prior methods, these rotation-invariant convolutions (RIConvs) have the same number of learnable parameters and a similar computational process as standard convolutions, making them interchangeable. Using the MNIST-Rot dataset, we validate their invariance across rotation angles and compare them with previous rotation-invariant CNNs, where two gradient-based RIConvs achieve state-of-the-art results. Then, we integrate RIConvs with classic CNN backbones and evaluate them on texture recognition, aircraft type recognition, and remote sensing image classification tasks. Results show that RIConvs significantly improve accuracy, particularly with limited training data, and enhance performance even with data augmentation.
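
As a rough illustration of how a fixed, non-learnable alignment step can make a convolution-like response insensitive to rotation, the toy sketch below cyclically shifts the 8-neighbour ring of a 3x3 patch so that it starts at its largest value before fixed weights are applied. This is not the paper's operator set: the names `align_ring` and `ri_response` are hypothetical, the alignment rule is an assumption chosen for simplicity, and the invariance only covers ring shifts (rotations by multiples of 45 degrees) with a unique maximum.

```python
def align_ring(ring):
    """Cyclically shift the neighbour ring so it starts at its largest value."""
    start = max(range(len(ring)), key=lambda i: ring[i])
    return ring[start:] + ring[:start]


def ri_response(center, ring, w_center, w_ring):
    """Weighted sum over the centre pixel and the aligned neighbour ring."""
    aligned = align_ring(ring)
    return w_center * center + sum(w * v for w, v in zip(w_ring, aligned))


# Neighbours of a 3x3 patch listed clockwise, and the same patch rotated
# by 90 degrees (a shift of two positions along the ring).
ring = [3, 9, 1, 4, 2, 7, 5, 6]
rotated = ring[2:] + ring[:2]

# Illustrative fixed weights; in a real layer these would be learned.
weights = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5, -0.5, 1.0]

a = ri_response(8, ring, 1.0, weights)
b = ri_response(8, rotated, 1.0, weights)
print(a == b)
```

The weights stay an ordinary learnable vector of the usual size, which mirrors the abstract's point that only the alignment step is non-learnable.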

1905.04235 2026-06-04 cs.RO cs.SY eess.SY

Autonomous Locomotion Mode Transition in Quadruped Track-Legged Robots: A Simulation-Based Analysis for Step Negotiation

四足履轮腿机器人自主运动模式切换：基于仿真的步阶跨越分析

Jie Wang, Krispin Davies

发表机构 * University of Cambridge（剑桥大学）； ClearPath AI

AI总结本文提出了一种用于四足混合机器人自主切换运动模式的方法，特别是在跨越不同高度台阶时，通过能量效率评估机制实现平稳过渡。

DOI: 10.1016/j.simpat.2024.102893

AI中文摘要

混合履轮腿机器人结合了轮式和腿式运动的优势，通过高效切换滚动和行走模式，在多种地形中实现适应性。然而，自动实现这些切换仍然是重大挑战。本文介绍了一种用于四足混合机器人自主模式切换的方法，特别是在跨越台阶时。我们的方法基于一种决策机制，利用所提出的基于能量的准则评估两种运动模式的能量效率。为了确保平稳跨越台阶，我们结合了两种攀爬步态，用于评估行走运动的能量使用情况。仿真结果验证了该方法的有效性，显示在不同高度的台阶上实现了成功的自主切换。我们提出的方法具有通用性，可以修改以适应类似机械配置的其他混合机器人，前提是其运动能量性能已先进行研究。

英文摘要

Hybrid track/wheel-legged robots combine the advantages of wheel-based and leg-based locomotion, granting adaptability across varied terrains through efficient transitions between rolling and walking modes. However, automating these transitions remains a significant challenge. In this paper, we introduce a method designed for autonomous mode transition in a quadruped hybrid robot with a track/wheel-legged configuration, especially during step negotiation. Our approach hinges on a decision-making mechanism that evaluates the energy efficiency of both locomotion modes using a proposed energy-based criterion. To guarantee a smooth negotiation of steps, we incorporate two climbing gaits designated for the assessment of energy usage in walking locomotion. Simulation results validate the method's effectiveness, showing successful autonomous transitions across steps of diverse heights. Our suggested approach has universal applicability and can be modified to suit other hybrid robots of similar mechanical configuration, provided their locomotion energy performance is studied beforehand.

2402.02555 2026-06-04 cs.CV cs.CL

High-Quality Entity Segmentation and Grounding

高质量实体分割与定位

Lu Qi, Yi-Wen Chen, Tao Zhang, Xiangtai Li, Xu Yang, Bo Du, Ming-Hsuan Yang

发表机构 * Wuhan University（武汉大学）； Insta360 Research（Insta360研究院）； Department of EECS, University of California, Merced（加州大学默塞德分校电子工程与计算机科学系）； Nanyang Technological University（南洋理工大学）； Institute of Automation of the Chinese Academy of Sciences（中国科学院自动化研究所）

AI总结提出ESG流水线，通过新数据集EntitySeg和两阶段解耦设计（CropFormer高质量分割+GELLA精确名词提取与语义匹配），实现高质量实体分割与定位，在五项任务上有效。

AI中文摘要

在这项工作中，我们提出了ESG，一个由新数据集EntitySeg支持的高质量实体分割与定位流水线。首先，所提出的数据集命名为EntitySeg，包含跨越各种图像域和实体的图像，以及用于训练和测试的大量高分辨率图像和高质量掩码标注。然后，ESG主要由两个模块组成：用于高质量实体分割的CropFormer，以及用于从句子中精确提取名词并在语言和视觉区域之间进行语义匹配的GELLA。与现有联合训练分割和大语言模型的定位方法不同，ESG采用两阶段解耦设计，保留了高质量掩码和定位鲁棒性，避免了联合训练通常带来的权衡。CropFormer确保高质量实体分割结果，然后可以编码到GELLA模型中进行有效定位。大量实验结果表明，我们提出的流水线在五项任务上有效，包括实体分割、全景分割、开放词汇分割、指代分割和全景定位叙述。此外，ESG流水线的GELLA模块高度灵活，能够处理来自任何分割框架的掩码输入，这得益于其轻量级的颜色图/视觉编码器、语言/掩码解码器和关联模块。实体分割数据集和定位代码将在https://github.com/qqlu/Entity发布。

英文摘要

In this work, we propose ESG, a pipeline for high-quality entity segmentation and grounding supported by a new dataset, EntitySeg. First, the proposed EntitySeg dataset contains images spanning various image domains and entities, along with plentiful high-resolution images and high-quality mask annotations for training and testing. Second, ESG mainly consists of two modules: CropFormer for high-quality entity segmentation, and GELLA for accurate noun extraction from sentences and semantic matching between language and visual regions. Unlike existing grounding methods that jointly train a segmentation model and a large language model, ESG adopts a two-stage decoupled design, preserving high-quality masks and grounding robustness without the trade-offs often introduced by joint training. CropFormer ensures high-quality entity segmentation results, which can then be encoded into the GELLA model for effective grounding. Extensive experimental results demonstrate the effectiveness of our proposed pipeline across five tasks, including entity segmentation, panoptic segmentation, open-vocabulary segmentation, referring segmentation, and panoptic localized narratives. Furthermore, the GELLA module of the ESG pipeline is highly flexible and capable of processing mask inputs from any segmentation framework, thanks to its lightweight colormap/vision encoder, language/mask decoder, and association module. The entity segmentation dataset and grounding code will be released at https://github.com/qqlu/Entity.

2209.15448 2026-06-04 cs.LG math.ST stat.ME stat.TH

Blessing from Human-AI Interaction: Super Reinforcement Learning in Confounded Environments

人机交互的福音：混杂环境下的超级强化学习

Jiayi Wang, Zhengling Qi, Chengchun Shi

发表机构 * Department of Mathematical Sciences, University of Texas at Dallas（德克萨斯大学达拉斯分校数学科学系）； Department of Statistics, London School of Economics and Political Science（伦敦政治经济学院统计系）； Department of Decision Sciences, George Washington University（乔治华盛顿大学决策科学系）

AI总结提出利用人机交互中的观察动作进行超级策略学习，在存在未测量混杂的情况下，通过近端因果推断实现优于标准最优策略和行为策略的超级策略。

AI中文摘要

随着人工智能在社会中越来越普遍，整合人类和AI系统以发挥各自优势并降低风险的有效方法已成为重要优先事项。在本文中，我们引入了超级策略学习的范式，该范式利用人机交互进行数据驱动的序贯决策。这种方法将来自AI或人类的观察动作作为输入，以实现决策者（人类或AI）在策略学习中更强的oracle。在存在未测量混杂的决策过程中，过去智能体采取的动作可以揭示未公开信息的有价值见解。通过以一种新颖且合法的方式将这些信息纳入策略搜索，所提出的超级策略学习将产生一个超级策略，该策略保证优于标准最优策略和行为策略（例如，过去智能体的动作）。我们将这种更强的oracle称为人机交互的福音。此外，为了解决使用批处理数据寻找超级策略时的未测量混杂问题，在近端因果推断框架下建立了一系列非参数和因果识别。基于这些新颖的识别结果，我们开发了几种超级策略学习算法，并系统研究了它们的理论性质，例如有限样本遗憾保证。最后，通过大量模拟和实际应用说明了我们方法的有效性。

英文摘要

As AI becomes more prevalent throughout society, effective methods of integrating humans and AI systems that leverage their respective strengths and mitigate risk have become an important priority. In this paper, we introduce the paradigm of super policy learning that takes advantage of Human-AI interaction for data-driven sequential decision making. This approach utilizes the observed action, either from AI or humans, as input for achieving a stronger oracle in policy learning for the decision maker (humans or AI). In the decision process with unmeasured confounding, the actions taken by past agents can offer valuable insights into undisclosed information. By including this information for the policy search in a novel and legitimate manner, the proposed super policy learning will yield a super-policy that is guaranteed to outperform both the standard optimal policy and the behavior one (e.g., past agents' actions). We call this stronger oracle a blessing from human-AI interaction. Furthermore, to address the issue of unmeasured confounding in finding super-policies using the batch data, a number of nonparametric and causal identifications are established under the framework of proximal causal inference. Building upon these novel identification results, we develop several super-policy learning algorithms and systematically study their theoretical properties such as finite-sample regret guarantee. Finally, we illustrate the effectiveness of our proposal through extensive simulations and real-world applications.

2606.05129 2026-06-04 cs.CR cs.LG

Preserving Data Privacy in Learning Causal Structure with Fully Homomorphic Encryption

在全同态加密下学习因果结构时保护数据隐私

Jian Yang, Yuan Tong, Qinbin Li, Zeyi Wen, Xiaofang Zhou

发表机构 * Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Hong Kong University of Science and Technology（香港科技大学）； University of California, Berkeley（加州大学伯克利分校）

AI总结针对分布式因果结构学习中的隐私泄露问题，提出基于全同态加密的方法，通过电路简化、除法和对数近似以及SIMD批处理技术，在加密数据上高效完成因果结构学习，并支持扩展到差分隐私。

AI中文摘要

保护数据隐私是结构数据管理和数据挖掘中的重要课题。然而，分布式因果结构学习中的隐私泄露问题是一个持续的挑战，特别是在需要数据传输和计算的情况下。在本文中，我们提出了一种基于全同态加密（FHE）的方法，该方法在密文上进行计算，保持数据在传输和计算过程中加密。然而，由于FHE计算成本高且对除法和对数运算的支持有限，将FHE应用于因果结构学习具有挑战性。为了应对这一挑战，我们提出了一系列新颖的技术，包括（i）电路简化以提高效率，（ii）通过牛顿-拉夫森倒数和泰勒展开近似除法和对数，以及（iii）使用SIMD加速的批处理技术来增强整个学习过程。此外，我们的方法可以轻松扩展到FHE之外，通过展示其可移植性来支持差分隐私。实验结果表明，我们的方法在测试的数据集上实现了与明文版本高度一致且可比的因果结构。最后，即使在FHE的隐私保护下，我们的方法也能在几十分钟内高效且实际地完成因果结构学习。

英文摘要

Preserving data privacy is an important topic in structural data management and data mining. However, the issue of privacy leakage in distributed causal structure learning is a persistent challenge, especially in cases where data transmission and computation are required. In this paper, we propose a method based on fully homomorphic encryption (FHE) that performs calculations on ciphertexts, keeping data encrypted in transit and during computation. Nevertheless, applying FHE to causal structure learning is challenging due to the high computation cost and limited support for division and logarithm operations in FHE. To tackle this challenge, we propose a series of novel techniques including (i) circuit simplification for better efficiency, (ii) approximation of division and logarithm through the Newton-Raphson reciprocal and Taylor expansion, and (iii) a batching technique with SIMD acceleration to enhance the whole learning process. Additionally, our method can be easily extended beyond FHE, as we demonstrate by porting it to support differential privacy. Empirical results show that our method achieves high consistency and comparable causal structure with the plaintext version in the datasets tested. Last, our method is efficient and practical, completing causal structure learning in tens of minutes even under the privacy protection of FHE.
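
The two approximations named in item (ii) can be sketched in plaintext Python, where ordinary floats stand in for ciphertexts. The point is that both routines use only additions and multiplications, the operations FHE schemes evaluate natively. The starting guess, iteration count, and number of series terms below are illustrative assumptions, not values from the paper, and the function names are hypothetical.

```python
import math


def nr_reciprocal(a, x0, iters=6):
    """Approximate 1/a with Newton-Raphson: x <- x * (2 - a * x).

    Converges when 0 < x0 < 2/a; each step uses one subtraction and
    two multiplications, with no division.
    """
    x = x0
    for _ in range(iters):
        x = x * (2.0 - a * x)
    return x


def taylor_log(a, terms=12):
    """Approximate ln(a) for 0 < a < 2 via the series of ln(1 + t), t = a - 1."""
    t = a - 1.0
    total, power = 0.0, 1.0
    for k in range(1, terms + 1):
        power *= t
        # The coefficient is a public plaintext constant, so the division
        # here would not be performed on encrypted data.
        coeff = (1.0 if k % 2 == 1 else -1.0) / k
        total += coeff * power
    return total


approx_inv = nr_reciprocal(4.0, 0.1)
approx_log = taylor_log(1.5)
print(abs(approx_inv - 0.25), abs(approx_log - math.log(1.5)))
```

Both approximations are only valid on a bounded input range, so inputs would presumably need to be scaled into that range first; how the paper handles this is not stated in the abstract.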

2606.05124 2026-06-04 cs.GR cs.CV cs.LG

Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting

几何高斯：在高斯泼溅中解耦外观与几何

Hongyu Zhou, Zorah Lähner

发表机构 * University of Bonn（波恩大学）； Lamarr Institut（拉马尔研究所）

AI总结针对3D高斯泼溅在几何表示与外观渲染间的冲突，提出通过为每个溅射添加几何不透明度参数并配合透明度优化流程，实现几何与外观的解耦，提升复杂场景（尤其是透明物体）的渲染与几何性能。

AI中文摘要

在3D高斯泼溅（3DGS）成功用于新视角合成后，许多工作探索了如何将其用于几何表面表示。然而，直接从3DGS中提取准确的几何信息仍然具有挑战性，且往往会降低外观渲染质量。在这项工作中，我们通过使用完整的地面真值纹理和几何信息进行训练，证明了默认形式的3DGS本质上不适合同时表示纹理和几何。我们还提出了一种简单的解决方案，即为每个溅射应用一个额外的几何不透明度参数，并配合可选的透明度策划优化流程。我们的实验，无论是使用地面真值还是视觉基础模型的几何输入，都表明这一改变在多种数据集上提高了渲染和几何性能，尤其是对于包含透明物体的复杂场景，我们的方法带来了显著提升。

英文摘要

After the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and can often reduce the appearance rendering quality. In this work, we show that 3DGS in its default form is inherently unsuited to represent texture and geometry at the same time, by training with complete ground-truth texture and geometry information. We also propose a simple solution by applying a single additional geometry opacity parameter to each splat, together with an optional transparency-curated optimization pipeline. Our experiments, both with ground-truth and vision foundation model geometric input, show that this change leads to improved rendering and geometry performance on a wide variety of datasets, and especially complex scenes with transparent objects benefit significantly from our method.
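
One way to see why a single opacity per splat struggles with transparent objects is to composite colour and depth along one ray. The sketch below uses standard front-to-back alpha compositing and then, as an assumed reading of the abstract rather than the paper's actual formulation, composites depth with a separate geometry opacity. All values and names are illustrative.

```python
def composite(values, alphas):
    """Front-to-back alpha compositing: sum_i v_i * a_i * prod_{j<i} (1 - a_j)."""
    out, transmittance = 0.0, 1.0
    for v, a in zip(values, alphas):
        out += v * a * transmittance
        transmittance *= 1.0 - a
    return out


# Two splats on one ray, sorted near to far: a glass pane, then a wall.
colors = [0.9, 0.2]
depths = [1.0, 3.0]
appearance_alpha = [0.1, 1.0]  # the glass is nearly see-through
geometry_alpha = [1.0, 1.0]    # but it is still a solid surface

pixel = composite(colors, appearance_alpha)
depth_shared = composite(depths, appearance_alpha)    # one opacity for both
depth_decoupled = composite(depths, geometry_alpha)   # separate geometry opacity
print(pixel, depth_shared, depth_decoupled)
```

With a shared opacity the rendered depth lands near the wall, because the glass contributes almost nothing; with a separate geometry opacity the depth sits on the glass surface while the rendered colour is unchanged.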

2606.05045 2026-06-04 math.DS cs.LG

Learning Control-Affine Reduced-Order Models via Autoencoders

通过自编码器学习控制仿射降阶模型

Ali Mjalled, Martin Mönnigmann

发表机构 * Automatic Control and Systems Theory, Ruhr-Universität Bochum（波鸿鲁尔大学，自动控制与系统理论）

AI总结提出一种利用自编码器同时学习降阶潜在空间和控制仿射状态空间动力学的框架，并扩展为序列模型以提高预测精度，通过反馈线性化验证其有效性。

AI中文摘要

本文提出了一种用于识别控制仿射降阶模型（ROM）的框架。该方法利用自编码器（AE）将高维状态以及潜在的高维输入变换为适合控制仿射状态空间动力学的降维潜在变量。这是通过同时训练AE和状态空间模型实现的。此外，我们将离散ROM公式扩展为基于序列的模型，该模型处理状态和输入历史以提高预测精度，同时保持控制仿射结构。我们通过对导出的模型应用反馈线性化来激励我们的框架，并提出了有效使用它的指南。所提出的框架在两个数值示例上进行了评估，并将其性能与基线模型（其中AE识别具有线性状态空间动力学的潜在空间）进行了比较。评估涉及测试数据上ROM的预测精度及其将系统控制到期望状态或轨迹的有效性。

英文摘要

We present in this paper a framework for the identification of control-affine reduced-order models (ROMs). The proposed method utilizes autoencoders (AEs) to transform the high-dimensional states, and potentially the high-dimensional inputs, into reduced latent ones suitable for control-affine state-space dynamics. This is achieved by simultaneous training of the AE and the state-space model. In addition, we extend the discrete ROM formulation to a sequence-based model, which processes state and input histories to improve prediction accuracy while preserving the control-affine structure. We motivate our framework by applying feedback linearization to the derived models, and we present guidelines for its efficient use. The proposed framework is assessed on two numerical examples and its performance is compared to a baseline model, where the AE identifies a latent space with linear state-space dynamics. The assessment involves evaluating the prediction accuracy of the ROM on test data and its effectiveness in controlling the system to a desired state or trajectory.
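
To make the motivation concrete: a discrete control-affine latent model has the form z[k+1] = f(z[k]) + g(z[k]) * u[k], and feedback linearization picks u[k] so that the nonlinear terms cancel. The scalar sketch below uses hand-written stand-ins for `f` and `g`; in the paper these would be learned jointly with the autoencoder, and the latent state would generally be a vector. The specific functions, gain, and step count are assumptions for illustration only.

```python
import math


def f(z):
    """Stand-in for the learned drift term (hypothetical)."""
    return z + 0.1 * math.sin(z)


def g(z):
    """Stand-in for the learned input gain; strictly positive, so invertible."""
    return 0.5 + 0.1 * z * z


def step(z, u):
    """Control-affine latent dynamics: z_next = f(z) + g(z) * u."""
    return f(z) + g(z) * u


def linearizing_input(z, v):
    """Choose u so that the next latent state equals the desired value v."""
    return (v - f(z)) / g(z)


z, target = 2.0, 0.0
for _ in range(20):
    v = z + 0.5 * (target - z)  # desired linear closed loop: error halves each step
    z = step(z, linearizing_input(z, v))
print(z)
```

Because the input enters affinely, solving for u is a single division (a linear solve in the vector case), which is what makes this model structure convenient for control.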

2606.05037 2026-06-04 cs.SE cs.AI

Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

自反式API：结构优于冗长，助力AI代理恢复

Arquimedes Canedo, Grama Chethan

发表机构 * Siemens Digital Industries Software, USA（西门子数字工业软件公司）

AI总结提出自反式API，在验证失败时返回机器可读的结构化建议，使AI代理无需外部推理即可修复请求并重试，在Anthropic模型上将任务完成率提升36.7-40.0个百分点，且每成功令牌效率提升1.8-2.2倍。

AI中文摘要

当AI代理调用API并遇到验证错误时，它需要的不仅仅是哪里出错了——它需要下一步该做什么。自反式API在验证失败时返回一个机器可读的 recovery_feedback.suggestions[] 负载，足以让代理修复请求并在无需外部推理的情况下重试。在一个经过泄露审计的试点实验（每单元N=30，3个LLM，10个对抗性任务）中，结构化建议在Anthropic模型上将任务完成率提升了+36.7至40.0个百分点（Fisher精确检验 p ≤ 0.0022），每成功令牌效率提高了1.8至2.2倍。在gpt-4o-mini上提升不显著（p=0.435）；在计费API上的第二个领域复制确认了这一模式。该比较仅在审计了LLM基准测试中两个未记录的答案泄露类别后才成立。我们提供了 audit_prompt_leakage.py 作为可重用的CI基础设施。代码和数据：https://github.com/arquicanedo/self-reflective-apis。

英文摘要

When an AI agent calls an API and hits a validation error, it needs more than what went wrong -- it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot (N=30 per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by +36.7 to 40.0 pp over plain-English diagnoses on Anthropic models (Fisher's exact p ≤ 0.0022), at 1.8 to 2.2× better per-success token efficiency. The lift is not significant on gpt-4o-mini (p=0.435); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit_prompt_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.
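
A minimal sketch of the repair-and-retry loop described above. Only the `recovery_feedback.suggestions[]` path comes from the abstract; the fields inside each suggestion (`op`, `field`, `value`), the validation rules, and the function names are hypothetical and chosen only to show how a structured payload lets a client fix a request mechanically.

```python
def validate(request):
    """Hypothetical validator: returns None on success, else an error payload."""
    suggestions = []
    if "currency" not in request:
        suggestions.append({"op": "set", "field": "currency", "value": "USD"})
    if isinstance(request.get("amount"), str):
        suggestions.append(
            {"op": "set", "field": "amount", "value": float(request["amount"])}
        )
    if not suggestions:
        return None
    return {
        "error": "validation_failed",
        "recovery_feedback": {"suggestions": suggestions},
    }


def apply_suggestions(request, payload):
    """Apply each structured suggestion to a copy of the request."""
    fixed = dict(request)
    for s in payload["recovery_feedback"]["suggestions"]:
        if s["op"] == "set":
            fixed[s["field"]] = s["value"]
    return fixed


request = {"amount": "12.50"}
error = validate(request)                 # first attempt fails
retry = apply_suggestions(request, error) # repaired without free-text parsing
print(retry, validate(retry))
```

The client never has to interpret an English error message; it only walks a list of edits, which is the contrast with plain-English diagnoses that the abstract measures.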

2606.05004 2026-06-04 cs.CR cs.AI

SharedRequest: Privacy-Preserving Model-Agnostic Inference for Large Language Models

SharedRequest: 面向大型语言模型的隐私保护模型无关推理

Peihua Mai, Xuanrong Gao, Youlong Ding, Xianglong Du, Wei Liu, Yan Pang

发表机构 * National University of Singapore (Chongqing) Research Institute（新加坡国立大学（重庆）研究院）； Chongqing Key Laboratory of Trusted Perception and Interaction Technology for Intelligent and Connected Vehicles（重庆智能网联车辆可信感知与交互技术重点实验室）； National University of Singapore（新加坡国立大学）； Hebrew University of Jerusalem（耶路撒冷希伯来大学）； State Key Laboratory of Intelligent Vehicle Safety Technology, Chongqing, China（重庆智能车辆安全技术国家重点实验室）； CHONGQING CHANGAN AUTOMOBILE Co., Ltd（重庆长安汽车有限公司）

AI总结提出一种模型无关的隐私保护推理框架SharedRequest，通过批量级别混淆和语义分组实现高效隐私保护，相比差分隐私基线效用提升20%以上，查询成本最多降低5倍。

Comments accepted by ACL 2026 (main)

AI中文摘要

随着ChatGPT等公共大型语言模型（LLMs）的广泛部署，保护用户提示隐私已成为一个日益关键的问题。现有的隐私保护推理方法要么牺牲效用，要么牺牲效率，并且通常需要特定于模型的修改，限制了其兼容性。在本文中，我们提出了SharedRequest，一个模型无关的隐私保护LLM推理框架，它将隐私保护重新定义为批量级别而非单个提示级别。关键思想是通过将原始提示与噪声变体混合来混淆敏感信息，同时将语义等效的指令分组，以在大量查询批次中分摊推理成本，对LLM响应质量影响最小。该设计独立于LLM架构，无需访问模型参数或进行架构修改。实验结果表明，与先前的差分隐私基线相比，SharedRequest实现了超过20%的效用提升，并且其共享提示机制相比非批量推理将查询成本最多降低5倍。

英文摘要

With the widespread deployment of public large language models (LLMs) such as ChatGPT, protecting user prompt privacy has become an increasingly critical issue. Existing privacy-preserving inference methods sacrifice either utility or efficiency, and often require model-specific modifications that limit their compatibility. In this paper, we propose SharedRequest, a model-agnostic framework for privacy-preserving LLM inference that reformulates privacy protection at the batch level rather than the individual-prompt level. The key idea is to obscure sensitive information by mixing original prompts with noisy variants, while grouping semantically equivalent instructions to amortize the inference cost over a large batch of queries with minimal impact on LLM response quality. This design is independent of the LLM architecture, requiring no access to model parameters or architectural modification. Empirical results demonstrate that SharedRequest achieves over 20% higher utility compared to prior differential privacy baselines, and its shared-prompt mechanism reduces query cost by up to 5× compared to non-batched inference.

2606.04989 2026-06-04 cs.HC cs.RO

What Can Eye Gaze Teach Us About Real-World Cycling? Insights From the Oxford RobotCycle Project

眼动能教会我们关于真实世界骑行的什么？来自牛津RobotCycle项目的见解

Benjamin Hardin, Efimia Panagiotaki, Daniele De Martini, Lars Kunze

发表机构 * University of Oxford（牛津大学）； University of the West of England（西英格兰大学）

AI总结本研究利用可穿戴眼动追踪眼镜，通过分析不同环境（如自行车道、汽车道和共享公交车道）和事件（如超车和行人）下的眼动模式，揭示了骑行中感知危险的潜意识差异，并评估了眼动追踪在估计骑行压力和认知负荷方面的潜力。

详情

AI中文摘要

尽管对骑行情境的身体危险已有较多了解，但对骑行的感知危险知之甚少。此外，危险感知可能在潜意识层面被过滤，因此难以自我报告。为此，这些潜意识感知可以通过眼动等生理指标揭示。本文探讨了英国牛津骑行的感知安全性，并研究了可穿戴眼动追踪眼镜在不同环境和事件下产生关于感知差异见解的能力。本文发现，在自行车道、汽车道和共享公交车道之间，眼动模式发生变化，代表了每种车道类型的不同认知挑战。本文表明，不同交叉路口的眼动模式显著不同，这可能对骑行者的压力有影响。最后，与无事件骑行相比，在超车和道路行人等事件发生时，眼动模式存在差异。本文总结了使用可穿戴眼动追踪器估计压力和骑行者工作量的优点和局限性。

英文摘要

Although much is known about the physical danger of cycling situations, less is understood about the perceived danger of cycling. Furthermore, perception of danger may be filtered at a subconscious level and therefore difficult for one to self-report. To this end, these subconscious perceptions can be revealed through physiological metrics such as eye gaze. This paper explores the perceived safety of cycling in Oxford, United Kingdom and examines the ability of wearable eye tracking glasses to produce insights about the differences in perception under different environments and events. This paper finds that eye gaze patterns change between using bike lanes, car lanes and shared bus lanes, representing different cognitive challenges of each lane type. It also shows that different intersections have significantly different eye gaze patterns, which may have implications for cyclist stress. Finally, eye gaze patterns differ in the presence of events such as passes and pedestrians in the road compared to when cycling with no events. This paper draws conclusions on the benefits and limitations of using wearable eye trackers to estimate stress and cyclist workload.

2606.04967 2026-06-04 cs.SE cs.AI

From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents

从提示到流程：支持AI软件开发智能体的框架流程分类与比较评估

Sanderson Oliveira de Macedo

发表机构 * Federal Institute of Goias（戈亚斯联邦理工学院）

AI总结提出六维流程分类法，对六个AI软件开发框架进行评分比较，揭示流程深度与可移植性之间的结构性权衡。

AI中文摘要

AI编程工具不再仅仅是自动补全或聊天助手：它们组织为开发框架，包含流程、角色、工件和验证。最近的调查绘制了用于软件工程的智能体和LLM，但缺少一项以将这些能力转化为流程的操作框架为中心的研究。我们对主要来源进行了定向搜索，采用功能性纳入标准和牵引力测量，选择了六个框架：GitHub Spec Kit、OpenSpec、BMAD Method、Get Shit Done (GSD)、Spec Kitty和Reversa。每个框架通过不同路径攻击AI开发：完整和轻量变体的规范驱动开发、智能体驱动的敏捷规划、智能体上的上下文工程、工作树隔离与审查，以及从遗留系统中恢复操作规范。我们的核心贡献是一个六维流程分类法：规范、上下文、角色、执行、验证和可移植性，并附带一个评分标准，使其成为可复制的工具。我们将其应用于六个框架和一个样本外案例Spec-Flow。两个结果突出。在已经采用某种流程的框架中，存在趋同：孤立的提示失去中心地位，持久工件、工作合同、可追溯性和人工审查成为减少歧义和协调智能体的机制。并且没有框架强覆盖所有六个维度，暴露了流程深度与跨智能体可移植性之间的结构性权衡。我们还发现了反复出现的风险：规范与代码之间的漂移、对生成工件的过度信任、社区扩展的脆弱性、平台依赖性以及缺乏完整流程的基准测试。我们以一个研究议程结束，侧重于中间质量指标、上下文治理、安装安全性和可重复性。

英文摘要

AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with process, roles, artifacts and verification. Recent surveys map agents and LLMs for software engineering, but a study centered on the operational frameworks that turn these capabilities into process is missing. We ran a directed search of primary sources, with a functional inclusion criterion and traction measurement, and selected six frameworks: GitHub Spec Kit, OpenSpec, BMAD Method, Get Shit Done (GSD), Spec Kitty and Reversa. Each attacks AI development through a different path: spec-driven development in full and lightweight variants, agent-driven agile planning, context engineering over the agent, worktree isolation and review, and recovery of operational specifications from legacy systems. Our central contribution is a six-dimension process taxonomy: specification, context, roles, execution, validation and portability, with a scoring rubric that turns it into a replicable instrument. We apply it to the six frameworks and an out-of-sample case, Spec-Flow. Two results stand out. Among frameworks that already adopt some process there is convergence: the isolated prompt loses centrality, and persistent artifacts, work contracts, traceability and human review become mechanisms that reduce ambiguity and coordinate agents. And no framework strongly covers all six dimensions, exposing a structural trade-off between process depth and portability across agents. We also found recurring risks: drift between specification and code, excessive trust in generated artifacts, fragility of community extensions, platform dependence and a lack of benchmarks for the complete process. We close with a research agenda for empirical evaluation, focused on intermediate-quality metrics, context governance, installation security and reproducibility.

2606.04957 2026-06-04 cs.CR cs.IR cs.LG

NLLog: Lightweight, Explainable SOC Anomaly Detection via Log-to-Language Rewriting

NLLog: 通过日志到语言重写的轻量级、可解释的SOC异常检测

Samuel Ndichu, Tao Ban, Seiichi Ozawa, Takeshi Takahashi, Daisuke Inoue

发表机构 * University of Tokyo（东京大学）； National Institute of Information and Communications Technology（日本信息通信技术研究所）

AI总结提出NLLog流水线，将日志模板重写为自然语言句子，结合TF-IDF加权和树集成分类，利用TreeSHAP提供可解释的异常检测，在HDFS、BGL和AIT数据集上实现低误报率和低延迟。

Comments 15 pages, 11 figures, 12 tables; submitted to ACSAC 2026

AI中文摘要

系统生成的日志是安全监控的基础，但其僵化的基于模板的格式阻碍了自动化分析和人类理解。我们提出NLLog（自然语言日志），一个轻量级流水线，它确定性地将解析后的模板重写为WHO-WHAT-SEVERITY句子，通过词频-逆文档频率加权进行池化，使用树集成对会话进行分类，并通过TreeSHAP反向投影证据供分析师审查。在Hadoop分布式文件系统（HDFS）和Blue Gene/L（BGL）语料库上，NLLog超过了两个复现的匹配协议基线；在HDFS、BGL和AIT警报数据集上，它保持了低误报率，且延迟适用于安全运营中心分类。覆盖度、稀疏与密集、忠实性和对抗性消融实验表明，回退充分性依赖于语料库，部署前的覆盖度检查可以揭示细化需求，并且可审计的确定性重写结合轻量级密集编码为日志异常检测和分类提供了可测量的表示层。

英文摘要

System-generated logs underpin security monitoring, yet their rigid template-based format hinders both automated analysis and human comprehension. We present NLLog (Natural-Language Log), a lightweight pipeline that deterministically rewrites parsed templates into WHO-WHAT-SEVERITY sentences, pools them with term-frequency-inverse-document-frequency weighting, classifies sessions with tree ensembles, and back-projects evidence with TreeSHAP for analyst review. On Hadoop Distributed File System (HDFS) and Blue Gene/L (BGL) corpora, NLLog exceeds two reproduced matched-protocol baselines; across HDFS, BGL, and the AIT Alert Data Set, it sustains low false-positive rates with commodity-hardware latency suitable for security operations center triage. Coverage, sparse-versus-dense, faithfulness, and adversarial ablations show that fallback sufficiency is corpus-dependent, that an enrollment-time coverage check can surface refinement requirements before deployment, and that an auditable deterministic rewrite combined with lightweight dense encoding provides a measurable representation layer for log-anomaly detection and triage.
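
The TF-IDF weighting step can be sketched with the standard library alone. The example sessions, the whitespace tokenisation, and the particular IDF variant (log of N over document frequency) are assumptions for illustration; the abstract does not specify NLLog's rewriting rules or its exact weighting formula, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(sessions):
    """Return one {token: weight} dict per session (a session is a token list)."""
    n = len(sessions)
    # Document frequency: in how many sessions each token appears.
    df = Counter(tok for s in sessions for tok in set(s))
    weighted = []
    for s in sessions:
        tf = Counter(s)
        weighted.append(
            {t: (c / len(s)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weighted


# Toy sessions, as if each log template had already been rewritten
# into a short natural-language sentence and tokenised.
sessions = [
    ["datanode", "served", "block"],
    ["datanode", "received", "block"],
    ["datanode", "exception", "block", "error"],
]

weights = tfidf(sessions)
print(weights[2])
```

Tokens that occur in every session receive zero weight, so the rare tokens of the third session dominate its representation, which is the behaviour one would want when pooling sentences for anomaly detection.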

2606.04946 2026-06-04 cs.DS cs.LG stat.ML

A General Framework for Dynamic Consistent Submodular Maximization

动态一致子模最大化的通用框架

Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Ola Svensson, Morteza Zadimoghaddam

发表机构 * ETH Zurich（苏黎世联邦理工学院）； KTH Royal Institute of Technology（皇家理工学院）； University of Toronto（多伦多大学）

AI总结针对全动态环境下的子模最大化问题，提出一个通用算法框架，首次实现具有次线性一致性的常数因子近似解。

Comments Accepted at ICML 2026

AI中文摘要

一致性是动态子模最大化中的一个重要性质，它要求算法始终维持一个接近最优的解，并且在每一步只对解进行少量调整。先前的工作仅在仅插入的情况下探讨了这个问题，其中算法面临 $n$ 个插入的流，并建立了基数约束版本的下界和上界。我们在全动态设置中考虑这个问题，其中操作流可能同时包含插入和删除。我们开发了一个通用框架来设计该设置下的算法，并通过实例化得到了首个具有次线性一致性的常数因子近似。对于基数约束，我们提出了一个 $\frac 12 - O(\varepsilon)$ 近似，其一致性为 $O\left(\frac{1}{\varepsilon^2}\right)$。对于秩-$k$ 拟阵约束，我们构造了一个 $\frac 14 - O(\varepsilon)$ 近似于动态最优解，其一致性为 $O\left(\frac{\log k}{\varepsilon^2}\right)$。

英文摘要

Consistency is an important property in dynamic submodular maximization and entails maintaining a near-optimal solution at all times, making only a small number of adjustments to the solution in each step. Prior work has explored this question for the insertion-only case, where the algorithm faces a stream of $n$ insertions, and has established lower and upper bounds for the cardinality-constrained version of the problem. We consider this question in the fully dynamic setting, where the stream of operations may contain both insertions and deletions. We develop a general framework for designing algorithms for this setting, and instantiate it to obtain the first constant-factor approximations with sublinear consistency. For cardinality constraints, we propose a $\frac 12 - O(\varepsilon)$ approximation that is $O\left(\frac{1}{\varepsilon^2}\right)$ consistent. For rank-$k$ matroid constraints, we construct a $\frac 14 - O(\varepsilon)$ approximation to the dynamic optimum that is $O\left(\frac{\log k}{\varepsilon^2}\right)$ consistent.
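
To illustrate the two quantities being traded off, the sketch below uses a coverage function (a standard example of a submodular objective), the classical greedy algorithm under a cardinality constraint, and a simple count of how many elements of the solution change after each update. Recomputing greedy from scratch is only a naive baseline shown for contrast; it is not the paper's framework and comes with no consistency guarantee. All names and data are illustrative.

```python
def coverage(solution, sets):
    """Submodular objective: number of items covered by the chosen sets."""
    covered = set()
    for name in solution:
        covered |= sets[name]
    return len(covered)


def greedy(sets, k):
    """Classical greedy for a cardinality constraint (ties broken by name)."""
    chosen = []
    for _ in range(min(k, len(sets))):
        best = max(
            (n for n in sorted(sets) if n not in chosen),
            key=lambda n: coverage(chosen + [n], sets),
        )
        chosen.append(best)
    return set(chosen)


def changes(old, new):
    """Consistency cost of one update: size of the symmetric difference."""
    return len(old ^ new)


sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
before = greedy(sets, 2)

sets["d"] = {4, 5, 6, 7}        # insertion
after_insert = greedy(sets, 2)

del sets["a"]                   # deletion
after_delete = greedy(sets, 2)

print(changes(before, after_insert), changes(after_insert, after_delete))
```

A consistent algorithm has to keep the per-update change count small while still staying within a constant factor of the optimum, which is the guarantee the abstract states for both the cardinality and matroid settings.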


2606.04909 2026-06-04 cs.IR cs.CL

BEATS: Bootstrapping E-commerce Attribute Taxonomies for Search through Iterative Human-AI Collaboration


Yung-Yu Shih, Shang-Yu Su, Tzu-I Ho, Dongzhe Wang, Yun-Nung Chen

Affiliations: National Taiwan University; Rakuten Group, Inc.; Taiwan Rakuten Ichiba, Inc.; Rakuten Asia Pte. Ltd.

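AI summary: To address the lack of structured attribute schemas on e-commerce platforms in emerging markets, the paper proposes BEATS, a framework that uses a human-in-the-loop LLM pipeline to build product attribute taxonomies from scratch and improves search system performance through attribute tagging.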

Comments: 6 pages, 1 figure, 5 tables. Accepted to SIGIR 2026 Industry Track. Official version: https://doi.org/10.1145/3805712.3808520


DOI: 10.1145/3805712.3808520


Abstract

E-commerce platforms in emerging markets often operate with underdeveloped product catalogs that contain only category taxonomies but lack structured attribute schemas. This absence of fine-grained product attributes limits search capabilities -- preventing faceted filtering, degrading query understanding, and weakening semantic representations used by search systems. We present BEATS, a human-in-the-loop LLM framework for bootstrapping product attribute taxonomies entirely from scratch. Our approach extends a multi-stage LLM generation pipeline with two critical production stages: (1) proactive quality checking by model developers to filter erroneous outputs, and (2) human annotation by domain-expert local staff to validate generated attributes. The framework operates iteratively -- prompts at each generation stage are refined based on quality check observations and annotator feedback across successive rounds, progressively improving attribute quality. Once the attribute taxonomy is established, we employ LLMs to perform structured attribute tagging on individual product items, enriching their contextual representations. The enriched catalog directly benefits multiple components of the search system: enabling granular attribute-based filtering, providing structured features for ranking models, and improving semantic representations for dense retrieval. We validate the generated taxonomy by training dense retrieval models on attribute-enriched product data, demonstrating consistent improvements over baselines using original catalog information. Our system has been deployed at Rakuten Taiwan, enriching 9 major categories spanning 2,694 sub-categories with 67,277 generated attributes, and over 5.4 million products have been tagged with the generated attributes, with plans to enrich the entire product catalog.
