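arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.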

2605.28340 2026-05-28 stat.ML cs.LG

Decision-focused learning for optimal PV-Battery scheduling

面向决策的光伏-电池调度优化学习

Joris Depoortere, Hussain Kazmi, Johan Driesen

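AI summary: Proposes a decision-focused learning framework that trains an LSTM photovoltaic generation forecaster on the downstream battery-scheduling cost. Compared with the standard two-phase approach, it lowers average electricity cost by 3.6% (normalized against the perfect-forecast and no-optimization bounds), supporting the case for aligning forecasts with optimization goals.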

DOI: 10.1016/j.est.2026.121152
Journal ref: Journal of Energy Storage Volume 154, Part A, 10 April 2026, 121152

AI中文摘要

近年来住宅光伏的使用急剧增加。随着电池系统变得更加经济实惠，光伏-电池系统的最优运行可以为家庭带来显著节省。最优控制需要正确预测底层参数（如光伏发电量）以调度电池。尽管由于算法进步和数据可用性，预测模型变得越来越准确，但准确性通常以通用指标衡量，这些指标可能与下游应用不一致。本研究提出了一种决策聚焦学习框架，通过在下游电池系统最优调度上训练长短期记忆光伏能量预测器，将优化和预测集成在一起。将所提出的方法与标准两阶段方法进行比较。在14个月的评估期内，决策聚焦方法在根据完美预测和无优化基线定义的性能界限归一化后，将20栋建筑的平均电费降低了3.6%。关键的是，尽管该模型的均方根误差为19.9%，显著高于解耦模型的8.2%，但仍实现了这一财务改善。对决策聚焦模型进行热启动进一步改善了结果，平均成本降低约8%，同时减轻了对统计准确性的负面影响（均方根误差为13.7%）。这些发现在20个家庭以及每个家庭单独在0.001水平上具有统计显著性。这些结果表明，将预测模型与优化目标对齐对于在光伏-电池系统中实现成本优势至关重要。未来的研究应在其他数据集、替代预测模型和替代优化算法上重复这些发现。

英文摘要

The use of residential photovoltaics has increased dramatically in recent years. With battery systems becoming more affordable, the optimal operation of a photovoltaic-battery system can bring significant savings to households. Optimal control requires correct forecasts of underlying parameters, such as photovoltaic power generation, to schedule the battery. While forecasting models have become increasingly accurate due to algorithmic advances and data availability, accuracy is typically measured in generic metrics which might not align with the downstream application. This study proposes a decision-focused learning framework that integrates optimization and prediction by training a Long Short-Term Memory photovoltaic energy forecaster on the downstream optimal scheduling of a battery system. The proposed methodology is compared against a standard two-phase approach. Across a 14-month evaluation period, the decision-focused method reduced average electricity costs across twenty buildings by 3.6% when normalized against performance bounds defined by a perfect forecast and a baseline of no optimization. Critically, this financial improvement was achieved despite the model exhibiting a root mean squared error of 19.9%, significantly higher than the decoupled model's 8.2%. Warm-starting the decision-focused model further improves results, lowering average cost by approximately 8%, while also mitigating the negative impact on statistical accuracy (root mean squared error of 13.7%). The findings are statistically significant at the 0.001 level across the twenty households and for each household individually. These results demonstrate that aligning forecast models with optimization goals is key for achieving cost advantages in PV-battery systems. Future research should replicate these findings on other datasets, alternate forecasting models and alternate optimization algorithms.
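
The abstract reports costs "normalized against performance bounds defined by a perfect forecast and a baseline of no optimization" but does not spell out the formula. A minimal sketch of one plausible reading follows, in which 0 is the perfect-forecast cost and 1 is the no-optimization cost. The function name and all bill values are hypothetical and are not taken from the paper.

```python
def normalized_cost(cost: float, perfect_cost: float, baseline_cost: float) -> float:
    """Scale a cost so 0.0 is the perfect-forecast bound and 1.0 is the no-optimization baseline."""
    span = baseline_cost - perfect_cost
    if span <= 0:
        raise ValueError("baseline cost must exceed the perfect-forecast cost")
    return (cost - perfect_cost) / span


# Hypothetical bills for one household (made-up numbers, not results from the paper).
perfect, baseline = 400.0, 600.0
two_stage = normalized_cost(520.0, perfect, baseline)
decision_focused = normalized_cost(510.0, perfect, baseline)

# Gap between the two methods, expressed in points of the normalized range.
improvement_points = (two_stage - decision_focused) * 100
```

Normalizing this way makes households with very different bills comparable, because each method is placed on the same scale between the best and worst achievable cost for that household.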

2605.28321 2026-05-28 cs.SE cs.AI

Multi-Agent LLM-based Metamorphic Testing for REST APIs

基于多智能体LLM的REST API蜕变测试

Shehroz Khan, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Dragos Truscan

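AI summary: Proposes ARMeta, which uses an LLM-based multi-agent workflow to identify metamorphic test scenarios and generate executable tests automatically, addressing the test oracle problem in REST API testing.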

Comments: Author-submitted version accepted for publication at the IEEE Conference on Computers, Software, and Applications (COMPSAC 2026), July 7-11, 2026, Madrid, Spain

AI中文摘要

随着REST API在软件系统中日益重要，其验证也变得更为关键。因此，测试和发现潜在问题对于提高软件质量至关重要。然而，测试REST API的主要挑战在于难以评估API调用的输出是否正确，即测试预言问题。蜕变测试是一种基于规约的测试方法，适用于正确输出未知或未明确指定的情况。为了检查系统的正确性，需要指定不同输出之间的关系。我们提出了ARMeta，一种支持工具的方法，利用基于LLM的多智能体工作流来支持使用OpenAPI文档化的REST API的蜕变测试。该智能体工作流用于识别蜕变测试场景，并以Given-When-Then格式进行规约。这些场景自动实现为可执行测试，并针对被测系统执行。我们在两个公开的暴露REST接口的Web应用程序上评估了ARMeta，并将其性能与基于场景的测试基线进行了比较。结果表明，ARMeta探索的行为可作为现有基于场景的测试方法的补充。

英文摘要

As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and uncovering underlying issues are of utmost importance for improving software quality. However, testing REST APIs is challenging mainly due to the difficulty of assessing whether the output of an API call is correct, i.e., the test oracle problem. Metamorphic testing is a specification-based testing approach for situations where correct outputs are unknown or not specified explicitly. To check the correctness of a system, relations between the different outputs are specified. We present ARMeta, a tool-supported approach that uses an LLM-based multi-agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given-When-Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario-based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario-based testing approaches.
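
To make the idea of a metamorphic relation concrete, here is a small sketch of one such relation in Given-When-Then form: adding a filter to a listing endpoint may only shrink the result set. This is an illustration of the general technique, not ARMeta's output. The endpoint, the `fake_list_products` stand-in, and the record layout are all assumptions, and a real test would issue HTTP requests instead.

```python
from typing import Optional


def fake_list_products(category: Optional[str] = None) -> list:
    """Hypothetical stand-in for GET /products?category=...; no network call is made."""
    catalog = [
        {"id": 1, "category": "books"},
        {"id": 2, "category": "games"},
        {"id": 3, "category": "books"},
    ]
    if category is None:
        return catalog
    return [p for p in catalog if p["category"] == category]


def check_filter_subset(list_products, category: str) -> bool:
    """Metamorphic relation: a filtered listing is a subset of the unfiltered one."""
    # Given the unfiltered listing (source test case)
    source_ids = {p["id"] for p in list_products()}
    # When the same endpoint is called with a filter (follow-up test case)
    follow_up = list_products(category)
    follow_up_ids = {p["id"] for p in follow_up}
    # Then every filtered item appears in the unfiltered listing and matches the filter
    return follow_up_ids <= source_ids and all(p["category"] == category for p in follow_up)


relation_holds = check_filter_subset(fake_list_products, "books")
```

The check never needs to know the correct catalog contents. It only compares two outputs against each other, which is how metamorphic testing sidesteps the oracle problem.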

2605.28258 2026-05-28 cs.SE cs.AI cs.CV cs.HC

GUI Agents for Continual Game Generation

面向持续游戏生成的GUI智能体

Yixu Huang, Bo Li, Na Li, Zhe Wang, Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, Hongcheng Guo

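AI summary: Uses GUI agents as both objective evaluators and subjective playtesters, through the PlaytestArena environment and the Play2Code framework, to enable continual game generation, substantially improving playability (66.8% rubric pass rate for Play2Code).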

AI中文摘要

生成一个游戏与制作一个可玩的游戏不同。尽管代码生成取得了进展，现有方法将游戏生成视为从提示到产物的单次翻译，导致交互层面的失败未被检测。我们认为评估和改进游戏生成需要一个玩家，并研究了图形用户界面（GUI）智能体在此过程中的两个角色：（1）作为客观评估者，为此我们引入了PlaytestArena，这是一个新的评估环境，将8个游戏类型的200个基于浏览器的游戏生成任务与预期的游戏行为准则配对，由GUI智能体在浏览器中加载每个构建并玩它来裁决；（2）作为主观测试者，为此我们提出了Play2Code，其中游戏智能体和GUI智能体在共享内存的持续循环中运行，将游戏生成转化为编码和游戏之间的对话。我们的实验表明，即使是前沿模型也难以直接生成可玩的游戏，而Play2Code达到了66.8%的准则通过率，分别比单次传递和智能体编码基线提高了37.1和14.6个百分点。进一步分析表明，GUI测试者的反馈比人类报告更可追溯，但在某些方面具有类似人类测试者的特质，将游戏测试确立为交互式代码生成的关键测试平台。我们的项目网站位于https://continual-game-generation.vercel.app/。

英文摘要

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at https://continual-game-generation.vercel.app/.

2605.28251 2026-05-28 stat.ML cs.CY cs.LG

Counterfactually Fair Regression via Optimal Transport

通过最优传输实现反事实公平回归

M. Generali Lince, S. Gaucher, J-J. Vie, P. Loiseau

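AI summary: Takes a causal uncertainty view in which counterfactual fairness is defined with resampled noise, proposes an optimal-transport-based post-processing estimator, and proves finite-sample fairness guarantees and risk bounds for it.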

AI中文摘要

我们考虑学习一个反事实公平回归器的问题。我们采用因果不确定性视角，其中反事实公平性通过重采样噪声定义。我们专注于为一种新的后处理估计器获得理论公平性保证。我们首先证明反事实公平性等价于满足以潜在变量为条件的群体均等。这使我们能够通过重心分位数映射提供最优公平回归器的闭式表达式。为了处理连续潜在变量，我们提出了一种离散化的后处理方法。然后，在温和的正则性假设下，我们证明了我们的估计器具有高概率的有限样本公平性保证，不公平性衰减率为 $\tilde O(n^{-1/3})$，并建立了匹配的风险界 $\tilde O(n^{-1/3})$。我们给出了几乎公平预测的过剩风险的下界。最后，我们将结果扩展到宽松反事实公平性的设置。我们在真实世界和合成数据上验证了我们的方法。

英文摘要

We consider the problem of learning a counterfactually fair regressor. We adopt a causal uncertainty view in which counterfactual fairness is defined with resampled noise. We focus on obtaining theoretical fairness guarantees for a new post-processing estimator. We begin by showing that counterfactual fairness is equivalent to satisfying demographic parity conditional on the latent variable. This allows us to provide a closed-form expression of the optimal fair regressor via a barycentric quantile map. In order to handle continuous latent variables, we propose a discretized post-processing method. Then, under mild regularity assumptions, we prove high-probability finite-sample fairness guarantees for our estimator, providing an unfairness decay at rate $\tilde O(n^{-1/3})$, and establishing a matching risk bound of order $\tilde O(n^{-1/3})$. We provide a matching lower bound on the excess risk of almost fair predictions. Finally, we extend our results to the setting of relaxed counterfactual fairness. We validate our approach on real-world and synthetic data.
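
The "barycentric quantile map" can be illustrated in its simplest form: plain demographic parity over two observed groups, without the conditioning on a latent variable that the paper requires. The sketch below is an assumption-laden toy, not the authors' estimator. It sends each score to its within-group quantile level and returns the size-weighted average of all groups' empirical quantile functions at that level. The function name and the sample scores are hypothetical.

```python
from bisect import bisect_right


def fair_scores(scores_by_group):
    """Map each group's scores through the weighted average of all groups' quantile functions."""
    sorted_by_group = {g: sorted(v) for g, v in scores_by_group.items()}
    total = sum(len(v) for v in sorted_by_group.values())
    weights = {g: len(v) / total for g, v in sorted_by_group.items()}

    def transport(score, group):
        own = sorted_by_group[group]
        # Rank of the score within its own group (count of own-group scores <= score).
        rank = max(bisect_right(own, score), 1)
        value = 0.0
        for other, vals in sorted_by_group.items():
            # Integer form of ceil(rank * len(vals) / len(own)) - 1: the index of the
            # other group's empirical quantile at level rank / len(own).
            idx = (rank * len(vals) + len(own) - 1) // len(own) - 1
            value += weights[other] * vals[idx]
        return value

    return {g: [transport(s, g) for s in v] for g, v in scores_by_group.items()}


# Hypothetical raw predictions for two groups of equal size.
raw = {"a": [1.0, 2.0, 3.0], "b": [3.0, 4.0, 5.0]}
fair = fair_scores(raw)
```

After the map, both groups share the same score distribution while each individual keeps their rank within their own group.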

2605.28233 2026-05-28 stat.ML cs.CY cs.LG

Geometry of Relaxed Fair Regression: A Unified Framework for Aware and Unaware Settings

松弛公平回归的几何：统一感知与无感知设置框架

M. Generali Lince, V. Divol, R. Flamary, S. Gaucher, P. Loiseau

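AI summary: Uses optimal transport to unify fair regression in the aware and unaware settings, and proposes an algorithm based on squared Wasserstein-2 and Total Variation penalties that gives accurate predictions under relaxed fairness constraints.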

AI中文摘要

公平-准确权衡是部署公平感知机器学习方法的核心问题。当敏感属性在推理时不可用——即所谓的无感知设置时，在松弛公平约束下获得准确预测的原则性方法基本缺失。在这项工作中，我们通过将人口统计平价惩罚下的回归问题表述为最优传输问题来填补这一空白。我们的框架统一了感知和无感知设置，并通过最优传输映射刻画了在平方Wasserstein-2和全变差惩罚下的最优预测函数。这些结果表明，惩罚的选择反映了根本不同的公平哲学：Wasserstein惩罚诱导出平滑的、群体范围内的妥协，而全变差惩罚则对个体子集强制执行精确的平价。基于这些理论刻画，我们提出了一种易于实现、计算高效且在实际基准测试中始终匹配或超越最先进基线的算法。

英文摘要

Fairness-accuracy trade-offs are a central concern in the deployment of fairness-aware machine learning methods. When sensitive attributes are unavailable at inference time (the so-called unawareness setting), principled methods for obtaining accurate predictions under relaxed fairness constraints are largely missing. In this work, we address this gap by formulating regression under a demographic parity penalty as an optimal transport problem. Our framework unifies both the aware and unaware settings and characterizes optimal prediction functions via optimal transport maps, under both squared Wasserstein-2 and Total Variation penalties. These results reveal that the choice of penalty reflects fundamentally different fairness philosophies: the Wasserstein penalty induces a smooth, population-wide compromise, while Total Variation enforces exact parity for a subset of individuals. Building on these theoretical characterizations, we propose an algorithm that is simple to implement, computationally efficient, and consistently matches or outperforms state-of-the-art baselines on real-world benchmarks.
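
The two penalties measure the gap between group-wise prediction distributions in different ways. As a rough illustration only, the sketch below computes both quantities for two small one-dimensional samples. It assumes equal sample sizes for the Wasserstein case and discrete values for the Total Variation case. Function names and data are hypothetical, and this is not the paper's algorithm.

```python
from collections import Counter


def squared_w2(xs, ys):
    """Squared Wasserstein-2 distance between two equal-size 1D samples (sorted matching)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def total_variation(xs, ys):
    """Total variation distance between the empirical distributions of two discrete samples."""
    px, py = Counter(xs), Counter(ys)
    support = set(px) | set(py)
    return 0.5 * sum(abs(px[v] / len(xs) - py[v] / len(ys)) for v in support)


# Hypothetical predictions for two groups; group1 is group0 shifted by one unit.
group0 = [0.0, 1.0, 2.0]
group1 = [1.0, 2.0, 3.0]
w2_gap = squared_w2(group0, group1)
tv_gap = total_variation(group0, group1)
```

The Wasserstein gap grows with how far predictions must move, whereas the Total Variation gap only counts how much probability mass disagrees, which is consistent with the abstract's contrast between a population-wide compromise and exact parity for a subset.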

2605.28219 2026-05-28 cs.HC cs.AI cs.LG

SmartIterator: Visual Analytics Workflows for Supervising Unsupervised Data Grouping

SmartIterator: 监督无监督数据分组的可视化分析工作流

Gennady Andrienko, Natalia Andrienko

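AI summary: Presents SmartIterator, a visual analytics approach that uses six-phase workflows and the IteraScope coordinated views to explore grouping results across a parameter sweep systematically, helping analysts understand data structure and make informed decisions.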

AI中文摘要

无监督学习方法——主题建模、基于划分和基于密度的聚类——在没有人类指导的情况下产生数据分组，但选择和评估这些分组本身不应是无监督的。我们提出了SmartIterator（SI），一种可视化分析方法，将参数扫描中分组结果的完整序列视为一等分析对象。对于每个方法族，SI提供了一个结构化的六阶段工作流，引导分析师系统地探索分组结果——从质量指标概览，经过过渡稳定性评估、成员置信度评估、内容和上下文检查、循环原型验证，到知情决策——在此过程中逐步建立对数据结构的累积理解。这些工作流通过IteraScope（IS）实现，这是一个协调的可视化显示，结合了质量指标图表与语义颜色编码、带有桑基式过渡流和成员置信度小提琴图的一维组嵌入、带有HDBSCAN检测的循环原型的二维组嵌入（突出显示捕获所有持久模式的迭代），以及用于上下文解释的特定领域链接视图。我们在以下三个场景中演示了这些工作流：（1）来自VAST Challenge 2011的模拟社交媒体消息（基于密度的聚类，根据真实情况进行验证），（2）约1500个NUTS-3区域的欧盟人口统计数据（基于划分的聚类），以及（3）30年的IEEE VIS论文（NMF主题建模）。这些工作流构成了主要贡献：它们提供了可操作的、针对特定方法的指导，用于导航参数空间、研究数据结构如何随配置变化，以及将分析理解扎根于领域背景——从而产生关于数据的知识，这是任何单个“最佳”结果都无法提供的。

英文摘要

Unsupervised learning methods -- topic modeling, partition-based and density-based clustering -- produce data groupings without human guidance, yet choosing and evaluating those groupings should not itself be unsupervised. We present SmartIterator (SI), a visual analytics approach that treats the full sequence of grouping results across a parameter sweep as a first-class analytical object. For each method family, SI provides a structured six-phase workflow that guides the analyst through systematic exploration of grouping results -- from quality-metric overview through transition-stability assessment, membership-confidence evaluation, content and context inspection, and recurrent-archetype verification to an informed decision -- building cumulative understanding of data structure along the way. The workflows are operationalized through IteraScope (IS), a coordinated visual display combining quality-metric charts with semantic color encoding, a 1D group embedding with Sankey-style transition flows and violin plots of membership confidence, a 2D group embedding with HDBSCAN-detected recurrent archetypes that highlights iterations capturing all persistent patterns, and domain-specific linked views for contextualized interpretation. We demonstrate the three workflows on: (1) simulated social-media messages from the VAST Challenge 2011 (density-based clustering, validated against ground truth), (2) EU population statistics across about 1,500 NUTS-3 regions (partition-based clustering), and (3) 30 years of IEEE VIS papers (NMF topic modeling). The workflows constitute the main contribution: they provide actionable, method-specific guidance for navigating parameter spaces, studying how data structure evolves across configurations, and grounding analytical understanding in domain context -- yielding knowledge about the data that no single "best" result can provide.
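
The Sankey-style transition flows between consecutive runs of a parameter sweep reduce to a simple count of how items move between groups. The sketch below shows that count together with a hypothetical stability score (the share of items that follow their group's largest outgoing flow). The score is an illustrative assumption, not a metric defined in the abstract, and the label vectors are made up.

```python
from collections import Counter


def transition_flows(labels_before, labels_after):
    """Count how many items move from each group in one run to each group in the next."""
    if len(labels_before) != len(labels_after):
        raise ValueError("both runs must label the same items in the same order")
    return Counter(zip(labels_before, labels_after))


def stability(labels_before, labels_after):
    """Hypothetical score: share of items that follow their source group's largest flow."""
    flows = transition_flows(labels_before, labels_after)
    largest = {}
    for (src, _dst), n in flows.items():
        largest[src] = max(largest.get(src, 0), n)
    return sum(largest.values()) / len(labels_before)


# Made-up labels for six items under k=2 and then k=3.
run_k2 = [0, 0, 0, 1, 1, 1]
run_k3 = [0, 0, 2, 1, 1, 2]
flows = transition_flows(run_k2, run_k3)
score = stability(run_k2, run_k3)
```

Here the new third group draws one item from each old group, so the flows show a split instead of a clean refinement of either original group.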

2605.28214 2026-05-28 cs.CR cs.LG cs.MA

Out of Sight, Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems

眼不见，心未忘：揭示基于潜在表示的多智能体系统中的潜在攻击

Chenxi Wang, Ruiyang Huang, Jiayan Sun, Lei Wei, Yifan Wu

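AI summary: Studies whether latent representations can carry attack information and proposes a framework that reactivates attack effects through latent interventions. Experiments show that latent-only attacks substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs.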

Comments 27 pages, 7 figures, 3 tables. Preprint

AI中文摘要

基于潜在的多智能体系统用隐藏表示替代部分显式智能体间通信，为高效灵活的智能体协作提供了新方向。然而，将协调移至潜在空间也可能将攻击移至可见文本检查范围之外。本文研究潜在状态能否携带在清洁执行期间仍然有效的攻击相关信息。为探究此问题，我们引入了一个潜在攻击框架，通过潜在干预重新激活攻击诱导的效果，而无需重用对抗性文本。大量实验表明，由此产生的纯潜在攻击在清洁执行中能显著降低任务性能，尤其当应用于智能体间KV缓存传递而非局部隐藏状态时。进一步的控制分析表明，这种性能下降不能归因于任意扰动或无效生成。总体而言，我们的发现表明基于潜在的协作并未消除攻击风险，而是将部分风险转移至较不可见的执行状态，这要求超越可见文本检查的安全防护措施。

英文摘要

Latent-based multi-agent systems replace parts of explicit inter-agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent space may also move attacks beyond the reach of visible-text inspection. In this paper, we study whether latent states can carry attack-associated information that remains effective during clean executions. To examine this question, we introduce a latent attack framework that reactivates attack-induced effects through latent interventions without reusing adversarial text. Extensive experiments show that the resulting latent-only attacks can substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs rather than local hidden states. Further control analyses indicate that this degradation cannot be reduced to arbitrary perturbations or invalid generation. Overall, our findings suggest that latent-based collaboration does not remove attack risk. It shifts part of the risk into less observable execution states, calling for safeguards beyond visible-text inspection.

2605.28187 2026-05-28 cs.IR cs.AI cs.CY cs.SI

Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

谁的名字会出现？III：基于LLM的学者推荐中的人设提示效应

Annabella Sánchez-Guzmán, Lukas Eberhard, Denis Helic, Lisette Espín-Noboa

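AI summary: Builds a benchmark that separates the effects of model choice and prompt design on LLM-based scholar recommendation, and finds that prompt design (language, location, role-and-task) affects both technical quality (factuality, coverage) and social representativeness (diversity, parity).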

Comments 25 pages (10 main, 2 references, 13 appendix), 6 figures in main, 13 figures in appendix (under-review)

AI中文摘要

大型语言模型（LLM）越来越多地被用作学者推荐系统，塑造了学术界中被视为专家的人选。现有的审计仍然以英语为中心、单一学科且忽略人设，导致输出变异性的来源尚不明确。为此，我们提出了一个基准测试，以分离模型选择和提示设计对推荐的影响。我们通过改变人设提示（语言、地点、角色与任务）和上下文（领域、资历、k）审计了43个LLM。将推荐的学者与Semantic Scholar在六个科学学科上进行比较，以衡量技术质量（事实性、覆盖度）和社会代表性（多样性、均等性）。基本技术质量由模型选择驱动，事实性和均等性由上下文驱动，多样性由地点驱动。南非提示产生的事实性较低的列表，而日本提示产生的事实性高但同质化的列表，偏向高产的学者。因此，提示设计是基于LLM的学者发现中一个不可忽视的维度，应与模型选择一起系统审计。

英文摘要

Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits remain English-centric, single discipline, and persona-agnostic, leaving the source of output variability poorly understood. To this end, we propose a benchmark that disentangles the effects of model choice and prompt design on recommendations. We audit 43 LLMs by varying persona prompts (language, location, role-and-task) and context (field, seniority, k). Recommended scholars are compared against Semantic Scholar over six scientific disciplines to measure technical quality (factuality, coverage) and social representativeness (diversity, parity). Basic technical quality is driven by model choice, factuality and parity by context, and diversity by location. South Africa prompts yield less factual lists, while Japan prompts yield highly factual but homogeneous lists skewed toward highly productive scholars. Prompt design is thus a non-trivial axis of LLM-based scholar discovery and should be systematically audited alongside model choice.

2605.28164 2026-05-28 cs.NE cs.AI

Performance and Explainability Requirements of Evolutionary Algorithms in Real-World Physics-Informed Optimization

进化算法在实际物理信息优化中的性能和可解释性要求

Helena Stegherr, Michael Heider, Nils Meyer, Tobias Thummerer, Thomas Wendler, Pierre Aublin, Ennio Idrobo-Àvila, Lars Mikelsons, Sebastian Zaunseder, Jörg Hähner

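AI summary: Using five real-world physics-based optimization problems, analyzes what domain experts require of evolutionary algorithms in terms of performance and explainability, and points to a gap: existing methods that could meet these needs have not been applied in complex real-world scenarios.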

AI中文摘要

进化计算提供了多种工具来解决复杂的实际优化问题。然而，研究通常集中在较小、简化的问题和优化算法上，这些算法在实际场景中有时无法满足期望。此外，在此类设置中，对应用算法及其提供的解决方案的信任通常至关重要，但这需要理解搜索过程本身。这导致在许多应用背景下（包括基于物理的建模）实践者往往不会认真考虑进化计算。本文详细介绍了可以缓解这些问题的进化计算技术。首先，由领域专家介绍并描述了五个实际的基于物理的优化问题。针对每个问题，提出了进化算法在性能和可解释性方面的要求，以增加信任和可用性。我们发现，所有领域专家都期望快速收敛到良好解决方案，并希望获得关于结果如何形成的一些解释，而其他要求则强烈依赖于具体问题。最后，我们介绍了现有方法，这些方法可用于改进进化算法的这些方面，但据我们所知，从未在复杂的实际场景中使用过。这意味着两个领域之间存在需要弥合的差距，以充分发挥进化计算的潜力。

英文摘要

Evolutionary computation offers a variety of tools to solve complex real-world optimization problems. However, research often focuses on smaller, simplified problems and optimization algorithms that sometimes miss expectations in real-world scenarios. Additionally, trust in the applied algorithm and the solutions it provides is often essential in such settings, but requires an understanding of the search process itself. This leads to evolutionary computation often not being seriously considered by practitioners in many application contexts, among them physics-based modeling. In this article, techniques from evolutionary computation are detailed that can alleviate these problems. First, five real-world physics-based optimization problems are introduced and described by domain experts. For each of these, the requirements for the evolutionary algorithm regarding performance and explainability to increase trust and usability are presented. We found that all domain experts expect fast convergence to a good solution and want some explanations for how the results were formed, while other requirements strongly depend on the respective problem. Finally, we present existing approaches that can be leveraged to improve those aspects of evolutionary algorithms but have to our knowledge never been employed in complex real-world scenarios. This implies a gap between both domains that needs to be closed to exploit the full potential of evolutionary computation.

2605.28154 2026-05-28 cs.HC cs.RO

Robo-Blocks: Generative Scaffolding in End-User Design and Programming of Social Robots

Robo-Blocks：社交机器人终端用户设计与编程中的生成式支架

Arissa J. Sato, Callie Y. Kim, Nathan Thomas White, Abhinav Maneesh, Yuqing Wang, Hui-Ru Ho, Bilge Mutlu

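AI summary: Through a Research through Design (RtD) process, presents Robo-Blocks, an LLM-based block programming environment whose generative scaffolding turns high-level ideas into executable robot behaviors to support novice programmers, and reports the user personas and usage patterns that emerged.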

DOI: 10.1145/3800645.3812997

AI中文摘要

由于需要规划、交互设计和编程方面的专业知识，编程社交机器人对新手机器人程序员来说具有挑战性。虽然大型语言模型（LLM）通过从自然语言描述生成代码具有巨大潜力，但它们可能掩盖编程的关键元素并取代设计者的意图，最终导致过度依赖而非发展编程技能。在本文中，我们通过研究通过设计（RtD）过程，探索基于LLM的社交机器人编程工具如何支持新手机器人程序员。我们设计并原型化了Robo-Blocks，这是一个基于积木的编程环境，利用LLM通过结构化叙述为新手机器人程序员提供生成式支架，将高级想法连接到可执行的机器人行为。通过与新手的部署，我们发现了生成式支架的新兴用户角色和使用模式，并展示了这种支架如何塑造终端用户的设计和编程策略。我们提出了有效使用生成式支架及其融入社交机器人编程实践的设计见解。

英文摘要

Programming social robots is challenging for novice robot programmers due to required expertise in planning, interaction design, and programming. While large language models (LLMs) hold significant promise through code generation from natural-language descriptions, they can obscure critical elements of programming and supplant designer intent, eventually resulting in over-reliance instead of developing programming skills. In this paper, we explore how LLM-based social-robot-programming tools can support novice robot programmers through a Research through Design (RtD) process. We designed and prototyped Robo-Blocks, a block-based programming environment that leverages LLMs to offer novice robot programmers generative scaffolding through structured narratives that connect high-level ideas to executable robot behaviors. Through deployment with novices, we discovered emerging user personas and usage patterns for generative scaffolding and showed how this scaffolding shapes end-user design and programming strategies. We present design insights for the effective use of generative scaffolding and its integration into the practice of social-robot programming.

2605.28153 2026-05-28 physics.ao-ph cs.LG

Skillful high-resolution weather forecasting independent of physical models

独立于物理模型的高分辨率天气预报

Pengcheng Zhao, Siqi Xiang, Weixin Jin, Zekun Ni, Jiang Bian, Zuliang Fang, Hongyu Sun, Bin Zhang, Richard E. Turner, Jonathan Weyn, Haiyu Dong, Kit Thambiratnam, Qi Zhang

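AI summary: Presents ObsCast, a regional system trained only on observations, with no numerical weather prediction (NWP) data, that produces short-term high-resolution regional forecasts and outperforms operational NWP for near-surface variables through 18 h.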

Comments 26 pages, 10 figures

AI中文摘要

准确及时的天气预报对现代社会的高影响决策至关重要。基于机器学习的天气预报正在成为一种替代方案，用于生成初始条件、预报，甚至在端到端系统中同时生成两者。这些方法比传统数值天气预报（NWP）更快，且通常具有更高的技能。然而，即使是端到端模型也通常依赖NWP生成的再分析数据进行监督，从而继承了这些NWP的偏差和分辨率限制，并限制了在缺乏合适再分析产品、更新频率低或生产成本高的环境中的适应性。在此，我们介绍ObsCast，一个区域系统，它在训练和推理中均不使用任何NWP派生数据，同时生成分析和预报，并在短期高分辨率区域建模中实现了最先进的性能。在美国本土和欧洲，ObsCast在近地面变量方面优于业务NWP，预报时效达18小时，并产生有技巧的降水预报。它提供了一种更简单、更适应的方法，直接从本地观测构建和完善区域预报服务，无需开发复杂且昂贵的传统预报流程。

英文摘要

Accurate and timely weather forecasts are critical for high-impact decisions in modern society. Machine-learning-based weather prediction is emerging as an alternative for producing initial conditions, forecasts, and even both in end-to-end systems. These methods deliver predictions faster and often with higher skill than traditional numerical weather prediction (NWP). However, even end-to-end models typically rely on NWP-generated reanalyses for supervision, thereby inheriting the biases and resolution limitations of those NWPs, and limiting adaptation to settings where suitable reanalysis products are unavailable, infrequently updated, or expensive to produce. Here we introduce ObsCast, a regional system that generates both analysis and predictions, without using any NWP-derived data in either training or inference, while still achieving state-of-the-art performance in short-term high-resolution regional modeling. Over the contiguous United States and Europe, ObsCast outperforms operational NWP for near-surface variables through 18 h and produces skillful precipitation forecasts. It provides a simpler and more adaptable route to build and refine regional forecasting services directly from local observations, without the need to develop complex and costly traditional forecasting pipelines.

2605.28148 2026-05-28 cs.SE cs.AI

DeltaMCP: Incremental Regeneration via Spec-Aware Transformation for MCP servers

DeltaMCP: 通过规范感知转换实现MCP服务器的增量再生

Aditya Pujara, Xiaogang Zhu, Hsiang-Ting Chen

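AI summary: To address the challenge of keeping enterprise-level APIs and their MCP toolsets in sync, proposes DeltaMCP, a specification-aware incremental regeneration tool that updates only the affected MCP server tools; experiments indicate lower developer overhead and better maintainability and version consistency.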

AI中文摘要

LLM的快速发展以及模型上下文协议（MCP）的引入，通过确定性和结构化方法彻底改变了智能代理与API交互的方式。虽然一些现有系统（如AutoMCP）试图自动化之前完全手动生成MCP服务器的过程，但它们未能解决不断发展的企业级API与其相应MCP工具集实现之间保持同步的反复挑战。本文介绍了DeltaMCP，一种面向企业级MCP服务器的规范感知增量再生工具。DeltaMCP使开发者能够在给定其对应服务的OpenAPI规范新版本时，仅更新MCP服务器中受影响的工具。使用Azure REST API规范作为评估数据集，DeltaMCP在生成质量和系统性能方面与基线全量生成方法进行了基准测试。结果表明，DeltaMCP减少了开发者开销，同时提高了可维护性和版本一致性。这项研究为企业寻求为基于LLM的系统维护高保真、最新MCP服务器基础设施提供了一种可扩展的方法。

英文摘要

The rapid development of LLMs coupled with the introduction of Model Context Protocol (MCP) has revolutionized how intelligent agents interact with APIs through deterministic and structured methods. While some existing systems like AutoMCP attempt to automate a previously completely manual process of generating MCP servers, they fail to address the recurring challenge of maintaining synchronization between evolving enterprise-level APIs and their corresponding MCP toolset implementation. This paper introduces DeltaMCP, a specification-aware, incremental regeneration tool for enterprise-grade MCP servers. DeltaMCP enables developers to update only the affected tooling of MCP servers, given a new release of the corresponding service's OpenAPI specification. Using Azure REST API specifications as the evaluation dataset, DeltaMCP is benchmarked against baseline full generation methods on generation quality and system performance. The results demonstrate the reduction in developer overhead through DeltaMCP whilst improving maintainability and version consistency. This research offers a scalable approach for enterprises seeking to maintain high-fidelity, up-to-date MCP server infrastructures for LLM-based systems.
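
Incremental regeneration starts from knowing which operations changed between two releases of a specification. The sketch below is a simplified assumption of what such a diff could look like, not DeltaMCP's implementation. It compares the operation objects under `paths` in two already-parsed OpenAPI documents and ignores `$ref` resolution and path-level parameters. The function names and the sample specs are hypothetical.

```python
# HTTP methods that may appear as operation keys in an OpenAPI Path Item Object.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operations(spec):
    """Flatten an OpenAPI document into {(path, method): operation object}."""
    ops = {}
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method in HTTP_METHODS:
                ops[(path, method)] = op
    return ops


def affected_operations(old_spec, new_spec):
    """Classify operations as added, removed, or changed between two spec versions."""
    old_ops, new_ops = operations(old_spec), operations(new_spec)
    shared = old_ops.keys() & new_ops.keys()
    return {
        "added": sorted(new_ops.keys() - old_ops.keys()),
        "removed": sorted(old_ops.keys() - new_ops.keys()),
        "changed": sorted(k for k in shared if old_ops[k] != new_ops[k]),
    }


# Made-up spec fragments for illustration only.
old_spec = {"paths": {"/vms": {"get": {"operationId": "listVms"}},
                      "/disks": {"get": {"operationId": "listDisks"}}}}
new_spec = {"paths": {"/vms": {"get": {"operationId": "listVms", "deprecated": True}},
                      "/vms/{id}": {"delete": {"operationId": "deleteVm"}}}}
delta = affected_operations(old_spec, new_spec)
```

Only the tools backing the operations listed in `delta` would need regeneration under this scheme, and every other tool could be left untouched.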

2605.23933 2026-05-28 cs.CY cs.AI

KT4EQG: Personalized Exercise Question Generation via Knowledge Tracing

KT4EQG: 通过知识追踪实现个性化习题生成

Xinyi Gao, Qiucheng Wu, Lu Ding, Q. Vera Liao, Kaizhi Qian, Ying Xu, Shiyu Chang, Yang Zhang

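AI summary: Proposes the KT4EQG framework, which uses a knowledge tracing model to select the most suitable concept and trains an LLM-based question generator to produce personalized exercises; experiments on XES3G5M and MOOCRadar support its effectiveness.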

AI中文摘要

教育题目生成（EQG）旨在合成定制的习题以增强学生学习。一个有效的EQG系统应理想地通过建模学生的知识状态并为每个学生个性化题目，从而提供最大的学习收益。然而，现有的EQG方法很少能实现如此细粒度的个性化。在本文中，我们探讨了EQG如何从知识追踪（KT）中受益，KT基于历史表现建模学生的知识状态并预测未来表现。我们提出了KT4EQG，一个在KT模型指导下为个体学生生成有效题目的个性化EQG框架。具体来说，KT4EQG通过利用KT模型选择最适合学生练习的知识概念，以最大化学生在整体知识掌握上的潜在提升。然后训练一个基于LLM的题目生成器，以忠实于所选概念生成题目。在XES3G5M和MOOCRadar上的实验结果表明，KT4EQG始终比有限或没有个性化的方法生成更有效的题目。

英文摘要

Educational Question Generation (EQG) aims to synthesize customized exercise questions that enhance student learning. An effective EQG system should ideally personalize questions for each student by modeling the student's knowledge state and generating questions that provide the greatest learning benefit. However, few existing EQG approaches are able to achieve such fine-grained personalization. In this paper, we explore how EQG can benefit from knowledge tracing (KT), which models students' knowledge states based on historical performance and predicts future performance. We propose KT4EQG, a personalized EQG framework that generates effective questions for individual students under the guidance of a KT model. Specifically, KT4EQG seeks to maximize a student's potential improvement in overall knowledge mastery by leveraging the KT model to select the most suitable knowledge concept for the student to practice. An LLM-based question generator is then trained to produce a question faithfully grounded in the selected concept. Experimental results on XES3G5M and MOOCRadar show that KT4EQG consistently generates more effective questions than methods with limited or no personalization.

2605.28122 2026-05-28 cs.CR cs.AI cs.CL

SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

SNARE: 自适应场景合成以诱发编码代理中的过度行为

Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang, Yuekang Li, Ying Zhang, Leo Yu Zhang

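AI summary: Presents the SNARE pipeline, which composes benign scenarios from reusable fragments, scores runs with a judge-free oracle, and uses Thompson sampling to elicit overeager behavior in coding agents adaptively. On a 4x5 agent-model matrix, 19.51% of benign runs trigger overeager behavior, and the agent framework matters more than the model.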

AI中文摘要

编码代理以一系列shell、文件和网络操作执行良性任务，其中任何操作都可能悄然超出授权范围而任务仍完成。我们称此为过度行为：提示并非对抗性且运行成功，但超出范围的操作可能泄露凭据或删除文件。现有基准未能捕捉：任务完成套件认可任何完成的运行，越狱套件探测对抗性提示，而先前唯一的过度行为基准对每个代理-模型对应用单一固定提示集，导致其最易和最难的配对测量不足。我们提出SNARE（为非对抗场景合成自适应奖励引导诱发），该流水线从可重用范围和陷阱片段组合良性场景，用无评判器预言机对每次运行评分，标记陷阱模式匹配及未经请求的文件添加或删除，并使用汤普森采样将每对运行预算导向最常触发它的场景。在24个过度行为原型上实例化得到OverEager，我们在四个编码代理和五个基础模型的4×5矩阵上运行。在10,000次良性运行中，19.51%触发过度行为，每对比率跨度达11.9倍。这种变化由代理框架驱动，而非模型：框架占56%而模型占21%，因此任何单一框架或单一模型评估都会低估矩阵约五分之一。

英文摘要

A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized scope while the task still completes. We call this overeager behavior: the prompt is not adversarial and the run succeeds, yet an out-of-scope step can leak credentials or delete files. Existing benchmarks miss it: task-completion suites credit any finished run, jailbreak suites probe adversarial prompts, and the one prior overeager benchmark applies a single fixed prompt set to every agent-model pair, leaving its easiest and most resistant pairs under-measured. We present SNARE (Synthesizing Non-adversarial scenarios for Adaptive Reward-guided Elicitation), a pipeline that composes benign scenarios from reusable scope and trap fragments, scores each run with a judge-free oracle flagging trap-pattern matches and unsolicited file additions or deletions, and uses Thompson sampling to steer each pair's run budget toward the scenarios that most often trigger it. Instantiating it over 24 overeager archetypes yields OverEager, which we run across a 4x5 matrix of four coding agents and five base models. Across 10,000 benign runs, 19.51% trigger overeager behavior, with per-pair rates spanning 11.9x. This variation is driven by the agent framework, not the model: the framework accounts for 56% of it against the model's 21%, so any single-framework or single-model evaluation undercounts the matrix by about a fifth.
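
The budget-steering step can be sketched with standard Beta-Bernoulli Thompson sampling. The version below is a generic textbook form under stated assumptions and is not SNARE's code: each scenario has a fixed, hidden trigger rate, the oracle verdict is simulated with a random draw instead of a real agent run, and the priors are uniform. The function name and the rates are hypothetical.

```python
import random


def thompson_allocate(trigger_rates, budget, seed=0):
    """Spend `budget` runs across scenarios, favoring those that trigger most often."""
    rng = random.Random(seed)
    n = len(trigger_rates)
    hits = [0] * n    # runs the oracle flagged as overeager
    misses = [0] * n  # runs that stayed in scope
    for _ in range(budget):
        # Draw a plausible trigger rate per scenario from its Beta(1 + hits, 1 + misses) posterior.
        samples = [rng.betavariate(1 + hits[i], 1 + misses[i]) for i in range(n)]
        chosen = max(range(n), key=samples.__getitem__)
        # Simulated oracle verdict; a real pipeline would execute the agent here.
        if rng.random() < trigger_rates[chosen]:
            hits[chosen] += 1
        else:
            misses[chosen] += 1
    return hits, misses


# Made-up trigger rates for three scenarios and a budget of 300 runs.
hits, misses = thompson_allocate([0.05, 0.2, 0.6], budget=300)
runs_per_scenario = [h + m for h, m in zip(hits, misses)]
```

Because each draw reflects the remaining uncertainty, scenarios with few observations still get explored, while runs tend to concentrate on whichever scenario has triggered most often so far.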

2605.28116 2026-05-28 cs.CR cs.AI cs.CL

MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

MIRAGE：通过用户生成内容对移动GUI代理的上下文感知提示注入

Ruoqi Guo, Yi Liu, Gelei Deng, Yiheng Xiong, Yuekang Li, Ying Zhang, Leo Yu Zhang, Lida Zhao, Ji Jie, Yuxiao Lu

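AI summary: Presents the MIRAGE pipeline, which embeds attacker-controlled text in user-generated content regions to mount prompt-injection attacks on VLM-driven mobile GUI agents without modifying the agent, the application, or the operating system; attack success rates are 23%-30% across the five evaluated agents.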

AI中文摘要

由视觉语言模型（VLM）驱动的移动图形用户界面（GUI）代理将屏幕视为渲染像素并根据所见选择动作，因此无法可靠地将受信任的界面元素与用户生成内容区分开来。我们提出MIRAGE（移动逼真对抗性GUI示例注入），这是一个管道，通过将攻击者控制的文本放入普通用户生成内容区域，将良性移动截图转化为提示注入样本，而无需修改代理、应用程序或操作系统。MIRAGE分三个阶段运行：定位器识别截图上用户可控制的区域，生成器合成上下文感知的有效载荷并以应用程序的原生风格渲染，策展人调节逼真度并在应用程序、区域类型和攻击意图之间平衡样本。一个关键挑战是，注入的截图必须在视觉上与真实用户内容难以区分，同时仍能转移代理的注意力；我们通过分离控制可达性、逼真度和分布平衡的阶段来解决这一问题。在一个涵盖十个应用程序和十一种攻击意图的1,111样本基准测试中，所有五个被评估的VLM代理都易受攻击，攻击成功率为23%-30%，并且MIRAGE在人类逼真度评分上高于最强的先前攻击（3.02对比2.52，满分5分）。我们进一步发现，每个样本的逼真度和攻击成功率不相关，因此仅靠视觉质量过滤无法可靠防御此威胁。

英文摘要

Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. MIRAGE operates in three stages: a Localizer identifies user-controllable regions on the screenshot, a Generator synthesises context-aware payloads and renders them in the application's native style, and a Curator moderates realism and balances the samples across applications, region types, and attack intents. A key challenge is that an injected screenshot must stay visually indistinguishable from genuine user content while still diverting the agent; we address this by separating the stages that control reach, realism, and distributional balance. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, all five evaluated VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE scores higher on human realism ratings than the strongest prior attack (3.02 versus 2.52 out of 5). We further find that per-sample realism and attack success are uncorrelated, so visual-quality filtering alone cannot reliably defend against this threat.

2605.28112 2026-05-28 cs.CR cs.CL cs.IR

A Wolf in Sheep's Clothing: Targeted Routing Hijacking in Federated RAG

披着羊皮的狼：联邦RAG中的目标路由劫持

Junjie Mu, Qiongxiu Li

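AI summary: This paper introduces Routing Hijacking, an attack in which a malicious client forges its semantic profile to attract target queries and thereby corrupts retrieval results, and designs a trust-aware post-routing framework to mitigate it.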

Comments Under review. Code available at https://github.com/Junjie-Mu/routing-hijacking-fedrag

AI中文摘要

联邦检索增强生成（FedRAG）对隐私敏感的应用具有吸引力，因为原始数据保留在本地。因此，路由必须依赖客户端提供的语义配置文件，这为操纵创造了新的机会。我们引入了路由劫持，一种路由阶段的攻击，其中恶意客户端伪造其配置文件以吸引目标查询，尽管其底层数据不相关。我们表明这种漏洞是严重的。在三种代表性的FedRAG路由架构中，路由劫持一致地错误路由目标查询，并导致下游中断和失败，包括证据缺失、投毒、错误答案和幻觉。在一个高风险的MedQA-USMLE案例研究中，我们进一步表明，投毒的检索证据可以误导不同规模的模型，导致错误答案、幻觉和谄媚失败。现有防御无法弥补这一差距：加密路由保留了被利用的排名，而拜占庭鲁棒的联邦学习（FL）规则难以迁移到异构的路由配置文件。为了解决这一差距，我们提出了一种基于信任的后路由框架，该框架使用返回证据反馈（包括检索相关性、配置文件一致性和跨客户端一致性）对客户端进行重新加权；在线实验表明，它抑制了重复查询上的持续劫持，并迁移到学习的神经路由器。我们的发现将路由完整性确立为FedRAG中一个新的安全挑战，并强调需要更强的防御来确保安全的联邦检索。

英文摘要

Federated Retrieval-Augmented Generation (FedRAG) is attractive for privacy-sensitive applications because raw data remain local. As a result, routing must rely on client-provided semantic profiles, creating a new opportunity for manipulation. We introduce Routing Hijacking, a routing-stage attack in which a malicious client forges its profile to attract target queries despite having irrelevant underlying data. We show that this vulnerability is severe. Across three representative FedRAG routing architectures, Routing Hijacking consistently misroutes target queries and leads to downstream disruptions and failures, including missing evidence, poisoning, incorrect answers, and hallucinations. In a high-stakes MedQA-USMLE case study, we further show that poisoned retrieved evidence can mislead models across scales, leading to incorrect answers, hallucinations, and sycophantic failures. Existing defenses do not close this gap: encrypted routing preserves the exploited ranking, and Byzantine-robust Federated Learning (FL) rules transfer poorly to heterogeneous routing profiles. To address this gap, we propose a trust-aware post-routing framework that reweights clients using returned-evidence feedback, including retrieval relevance, profile consistency, and cross-client agreement; online experiments show that it suppresses persistent hijacking over recurring queries and transfers to a learned neural router. Our findings establish routing integrity as a new security challenge in FedRAG and highlight the need for stronger defenses for secure federated retrieval.
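
The trust-aware reweighting idea can be illustrated with a toy sketch. The abstract names three feedback signals (retrieval relevance, profile consistency, cross-client agreement) but not how they are combined, so the equal-weight average, the exponential update, the `rate` parameter, and the function names below are all assumptions made for illustration, not the authors' framework.

```python
def update_trust(trust, relevance, profile_consistency, agreement, rate=0.5):
    """Blend the previous trust value with the mean of three evidence signals in [0, 1]."""
    # Hypothetical combination rule: equal-weight average of the three signals.
    evidence = (relevance + profile_consistency + agreement) / 3.0
    return (1.0 - rate) * trust + rate * evidence


def rank_clients(routing_scores, trust):
    """Order clients by routing score multiplied by their current trust."""
    return sorted(routing_scores, key=lambda c: routing_scores[c] * trust[c], reverse=True)


# A forged profile gives the hijacker the highest raw routing score.
routing_scores = {"hijacker": 0.9, "honest": 0.7}
trust = {"hijacker": 1.0, "honest": 1.0}
before = rank_clients(routing_scores, trust)

# Returned evidence from the hijacker is irrelevant, so its feedback is poor.
feedback = {"hijacker": (0.1, 0.1, 0.1), "honest": (0.8, 0.8, 0.8)}
for client, signals in feedback.items():
    trust[client] = update_trust(trust[client], *signals)
after = rank_clients(routing_scores, trust)
```

In this toy run the hijacker is ranked first before any feedback and second after one update, which mirrors the claim that persistent hijacking is suppressed over recurring queries.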

2605.28078 2026-05-28 cs.CR cs.AI cs.LG stat.ML

Mind the Gap: Mixtures of Gaussians in Approximate Differential Privacy

注意差距：近似差分隐私中的高斯混合机制

Huikang Liu, Aras Selvi, Wolfram Wiesemann

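AI summary: For scalar, real-valued query functions with known sensitivity, designs a class of Gaussian-mixture additive noise mechanisms that substantially reduce noise amplitude and variance in moderate and low-privacy regimes and approach optimality.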

Comments ICML 2026 style: 9 main pages followed by acknowledgements, references, appendices

AI中文摘要

我们设计了一类加性噪声机制，针对具有已知敏感度的标量实值查询函数满足 $(\varepsilon, δ)$-差分隐私（DP），特别关注中等和低隐私情形。这些机制称为\textit{混合机制}，通过混合多个高斯分布构建，这些高斯分布共享相同的方差，但均值和混合权重不同。得到的分布可以解释为零均值高斯（如解析高斯机制中所用）和额外高斯（其均值取决于查询函数的敏感度）的凸组合。我们推导了 $(\varepsilon, δ)$-DP 所需方差的紧条件，并提供了高效算法来计算它们。与解析高斯机制相比，我们的机制产生了显著更低的期望噪声幅度（$l_1$-损失）和方差（零均值分布的 $l_2$-损失）。在激励我们设计的低隐私情形下，我们的机制接近最优性，几乎消除了解析高斯机制的全部最优性差距。

英文摘要

We design a class of additive noise mechanisms that satisfy $(\varepsilon, δ)$-differential privacy (DP) for scalar, real-valued query functions with known sensitivities, with a particular focus on moderate and low-privacy regimes. These mechanisms, which we call \textit{mixture mechanisms}, are constructed by mixing multiple Gaussian distributions that share the same variance but differ in their means and mixture weights. The resulting distributions can be interpreted as convex combinations of a zero-mean Gaussian (as used in the analytic Gaussian mechanism) and additional Gaussians whose means depend on the sensitivity of the query function. We derive tight conditions on the variances required for $(\varepsilon, δ)$-DP and provide efficient algorithms to compute them. Compared to the analytic Gaussian mechanism, our mechanisms yield substantially lower expected noise amplitudes ($l_1$-loss) and variances ($l_2$-loss for zero-mean distributions). In the low-privacy regime that motivates our design, our mechanisms approach optimality, mitigating nearly all of the optimality gap of the analytic Gaussian mechanism.
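
To make the construction concrete, here is a minimal sketch of sampling from a Gaussian mixture whose components share one standard deviation and differ in mean and weight. The specific means, weights, and `sigma` are hypothetical and uncalibrated: they do not certify any $(\varepsilon, δ)$ guarantee, since the calibration conditions are the paper's contribution and are not given in the abstract.

```python
import numpy as np


def sample_mixture_noise(rng, sigma, means, weights, size):
    """Draw additive noise from a Gaussian mixture with a shared standard deviation."""
    means = np.asarray(means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Pick a mixture component for every draw, then add Gaussian noise around its mean.
    components = rng.choice(len(means), size=size, p=weights)
    return means[components] + sigma * rng.standard_normal(size)


def mixture_variance(sigma, means, weights):
    """Variance of the mixture: sigma^2 + sum_i w_i mu_i^2 - (sum_i w_i mu_i)^2."""
    means = np.asarray(means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.sum(weights * means)
    return sigma**2 + np.sum(weights * means**2) - mean**2


# Hypothetical parameters: a zero-mean component plus two components shifted
# by the query sensitivity. NOT calibrated for differential privacy.
sensitivity = 1.0
sigma = 0.8
means = [0.0, -sensitivity, sensitivity]
weights = [0.6, 0.2, 0.2]

rng = np.random.default_rng(0)
noise = sample_mixture_noise(rng, sigma, means, weights, size=200_000)
analytic_var = mixture_variance(sigma, means, weights)
empirical_var = noise.var()
```

With a symmetric choice of means the mixture stays zero-mean, so its variance is the shared $\sigma^2$ plus the weighted squared shifts. This is why the comparison with the analytic Gaussian mechanism depends on how small $\sigma$ can be made under the privacy constraint.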

2605.28074 2026-05-28 cs.CR cs.CL cs.IR

SilentRetrieval: Hijacking Retrieval-Augmented Generation via Semantically-Preserving Adversarial Data Poisoning

SilentRetrieval：通过语义保持的对抗性数据投毒劫持检索增强生成

Jiachen Qian

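AI summary: Proposes SilentRetrieval, a two-stage data poisoning attack that uses Coordinated Beam Search and Context-Adaptive Trigger Generation to achieve high retrieval hit rates and attack success rates while keeping documents fluent, and evaluates the effectiveness of defenses.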

Comments 12 pages, 4 figures, KDD '26 camera-ready version

DOI: 10.1145/3770855.3818186
Journal ref: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26), August 09--13, 2026, Jeju Island, Republic of Korea

AI中文摘要

检索增强生成（RAG）缓解了LLM的幻觉问题，但引入了一个关键漏洞：语料库完整性。我们提出SilentRetrieval，一种两阶段数据投毒攻击，通过对抗性构造但流畅的文档劫持RAG系统。第一阶段使用协调束搜索，一种具有流畅性-相似性目标的多令牌联合优化方法，在约束困惑度的同时保持有毒宿主文档的可检索性。第二阶段使用上下文自适应触发生成，一种由冻结LLM驱动的轻量级触发融合步骤，将操纵触发器集成到文档内容中。在合成目标答案的单投毒文档每查询评估下，SilentRetrieval在Natural Questions和MS MARCO上分别达到84.6%/81.3%的HR@10和57.5%/54.8%的ASR-LLM，同时保持接近良性的困惑度。跨四个目标LLM的模型评估显示，在固定触发器生成器下具有非平凡的有效性，针对未见检索器（包括ColBERT和商业嵌入模型）的迁移测试在相同注入语料库协议下平均HR@10为64.7%。在采样维基百科规模评估中，SilentRetrieval在0.016%投毒率下保持74.2%的HR@10。结合检索侧和生成侧防御可大幅降低攻击成功率，但会引入延迟权衡。人工评估显示，与不流畅基线相比，标记率显著降低，但在当前样本量下数值上仍比良性内容更可疑。

英文摘要

Retrieval-Augmented Generation (RAG) mitigates LLM hallucinations but introduces a critical vulnerability: corpus integrity. We present SilentRetrieval, a two-stage data poisoning attack that hijacks RAG systems through adversarially crafted yet fluent documents. Stage 1 uses Coordinated Beam Search, a multi-token joint optimization method with a fluency-similarity objective, to keep a poisoned host document retrievable while constraining perplexity. Stage 2 uses Context-Adaptive Trigger Generation, a lightweight trigger-fusion step driven by a frozen LLM, to integrate manipulation triggers into document content. Under a one-poisoned-document-per-query evaluation with synthetic target answers, SilentRetrieval achieves 84.6%/81.3% HR@10 and 57.5%/54.8% ASR-LLM on Natural Questions and MS MARCO, while maintaining near-benign perplexity. Cross-model evaluation across four target LLMs shows nontrivial effectiveness under a fixed trigger generator, and transfer tests against unseen retrievers, including ColBERT and commercial embedding models, yield 64.7% average HR@10 under the same injected-corpus protocol. In a sampled Wikipedia-scale evaluation, SilentRetrieval retains 74.2% HR@10 at a 0.016% poisoning ratio. Combined retrieval-side and generation-side defenses reduce attack success substantially but incur a latency trade-off. Human evaluation shows substantially lower flag rates than disfluent baselines, while remaining numerically more suspicious than benign content at the current sample size.
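
For readers unfamiliar with the HR@10 metric, the sketch below shows one plausible way to compute it under the one-poisoned-document-per-query protocol: the fraction of queries whose poisoned document appears among the top 10 retrieved results. The abstract does not define the metric precisely, so this definition, the function name, and the toy document ids are assumptions.

```python
def hit_rate_at_k(rankings, poisoned_ids, k=10):
    """Fraction of queries whose poisoned document appears in the top-k retrieved ids."""
    if len(rankings) != len(poisoned_ids):
        raise ValueError("need one poisoned id per query")
    hits = sum(pid in ranked[:k] for ranked, pid in zip(rankings, poisoned_ids))
    return hits / len(rankings)


# Toy retrieval results: one ranked id list per query (hypothetical ids).
rankings = [
    ["d3", "p1", "d7"],
    ["d2", "d9", "d4"],
    ["p3", "d1", "d5"],
    ["d8", "d6", "p4"],
]
poisoned_ids = ["p1", "p2", "p3", "p4"]

hr_at_10 = hit_rate_at_k(rankings, poisoned_ids, k=10)  # 3 of 4 queries hit
hr_at_2 = hit_rate_at_k(rankings, poisoned_ids, k=2)    # 2 of 4 queries hit
```

The same function is useful on the defense side: tracking how often known canary documents enter the top-k is a cheap way to monitor corpus integrity.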

2605.28064 2026-05-28 eess.AS cs.AI cs.HC

I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

我听见，故我信任：人类作为合成语音检测器的社会技术研究

Lelia Erscoi, Tomi Kinnunen

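AI summary: Using a localization task, studies how well humans detect voice deepfakes as a perceptual and contextual process. Utterance class is the primary determinant of detection accuracy and perceived quality; trust cues show no main effects but influence detection behavior; fully synthetic speech is detected at below-chance levels.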

Comments To be included in Odyssey 2026: The Speaker and Language Recognition Workshop, Session 4.2, 23-26 June, Lisbon, Portugal

AI中文摘要

自动深度伪造检测已受到大量研究关注，然而人类实际遇到合成语音的社会技术环境仍知之甚少。我们将语音深度伪造检测作为感知和语境过程进行研究，呈现一个定位任务，其中47名参与者在三种操纵的信任线索下（指导框架、情感启动和来源标注）标记真实、完全合成和部分合成话语中的疑似合成片段。参与者提供了关于机械性、表现力、可懂度、清晰度、平静度和评估自信度的质量评分。话语类别是检测准确性和感知质量的主要决定因素；信任线索未产生主效应，但激发了检测行为。完全合成语音的检测低于随机水平。质量评分与话语类型相关，表明在显性检测失败时存在隐性区分。

英文摘要

Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a perceptual and contextual process, presenting a localization task in which 47 participants marked suspected synthetic segments across authentic, fully synthetic, and partially synthetic utterances under three manipulated trust cues: instructional framing, affective priming, and provenance labeling. Participants provided quality ratings on mechanicalness, expressiveness, intelligibility, clarity, calmness, and confidence of evaluation. Utterance class was the primary determinant of detection accuracy and perceptual quality; trust cues produced no main effects but motivated detection behavior. Fully synthetic speech was detected at below-chance levels. Quality ratings tracked utterance type, indicating implicit discrimination where overt detection failed.

2605.28000 2026-05-28 cs.SE cs.AI

Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution

Tool Forge：一种用于受控代理执行的携带验证的工具链

Swanand Rao

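AI summary: This paper presents Tool Forge, a toolchain that converts natural-language capability intent into governed, sandbox-verified, cataloged tool artifacts and exposes them to agents through a token-efficient routing layer, addressing the lack of governance and validation in the tool layer of agentic execution.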

Comments 9 pages, 2 figures, 3 tables. Code: https://github.com/nextmoca/tool-forge

AI中文摘要

大型语言模型代理越来越被期望执行操作工作：调用API、操作文件、组装工作流以及在企业系统内行动。然而，这种执行所依赖的工具层仍然通常被视为手工编写的集成工件或暴露给模型的静态模式列表。本文介绍了Tool Forge，一种携带验证的工具链，用于将自然语言能力意图转换为受控、沙盒验证、编目的工具制品，并通过令牌高效的路由层将这些制品暴露给代理。Tool Forge将工具视为一个包含意图、能力契约、实现、依赖策略、测试、文档、运行时验证证据、生命周期状态、凭证绑定和路由元数据的胶囊。它还引入了一个路由器，该路由器暴露意图限定的工具会话，而不是将完整的目录模式加载到模型上下文中。我们描述了系统架构、验证流水线、面向MCP的路由模型、治理控制以及来自开源实现的初始可重复基准测试。在83个路由器基准测试案例中，Tool Forge Router实现了0.901的聚合微平均F1，同时相对于朴素的全目录模式暴露，估计将任务流工具上下文减少了99.2%。在25个本地工具任务的端到端生成探测中，Tool Forge生成了25个工具包中的25个，在确定性接受检查中达到了0.940的微平均F1，并通过了25个沙盒验证中的23个。这些结果作为初始系统基准测试呈现，而非最先进的主张。论文指出了对抗性路由、更广泛的API接地、沙盒隔离和跨系统评估方面的剩余挑战。

英文摘要

Large language model agents are increasingly expected to perform operational work: calling APIs, manipulating files, assembling workflows, and acting inside enterprise systems. Yet the tool layer on which this execution depends is still commonly treated as either a hand-written integration artifact or a static list of schemas exposed to a model. This paper introduces Tool Forge, a validation-carrying toolchain for converting natural-language capability intent into governed, sandbox-verified, cataloged tool artifacts and exposing those artifacts to agents through a token-efficient routing layer. Tool Forge treats a tool as a capsule containing intent, capability contract, implementation, dependency policy, tests, documentation, runtime validation evidence, lifecycle state, credential bindings, and routing metadata. It also introduces a Router that exposes intent-scoped tool sessions instead of loading full catalog schemas into the model context. We describe the system architecture, validation pipeline, MCP-facing routing model, governance controls, and initial reproducible benchmarks from the open-source implementation. Across 83 Router benchmark cases, Tool Forge Router achieves aggregate micro-F1 of 0.901 while reducing estimated task-flow tool context by 99.2% relative to naive full-catalog schema exposure. In a 25-case end-to-end generation probe over local-tool tasks, Tool Forge generates 25 of 25 tool bundles, reaches micro-F1 of 0.940 against deterministic acceptance checks, and passes 23 of 25 live sandbox validations. These results are presented as an initial systems benchmark, not as a state-of-the-art claim. The paper identifies remaining challenges in adversarial routing, broader API grounding, sandbox isolation, and cross-system evaluation.
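
The reported micro-F1 figures pool true positives, false positives, and false negatives across all benchmark cases before computing F1. A minimal sketch of that pooling follows, assuming each case compares a set of expected tools with the set the router selected. The case format and tool names are hypothetical, and the benchmark's actual scoring script may differ.

```python
def micro_f1(cases):
    """cases: iterable of (expected_tools, routed_tools) pairs, each a set of names."""
    tp = fp = fn = 0
    for expected, routed in cases:
        tp += len(expected & routed)   # tools correctly selected
        fp += len(routed - expected)   # tools selected but not expected
        fn += len(expected - routed)   # tools expected but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy routing results with hypothetical tool names.
toy_cases = [
    ({"read_file"}, {"read_file"}),
    ({"http_get", "parse_json"}, {"http_get"}),
    ({"send_email"}, {"send_email", "read_file"}),
]
score = micro_f1(toy_cases)  # tp=3, fp=1, fn=1
```

Because counts are pooled before the ratio is taken, cases with many expected tools weigh more than single-tool cases, which is worth keeping in mind when reading an aggregate such as 0.901.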

2605.27999 2026-05-28 cs.HC cs.AI

Learning to Assign Prediction Tasks to Agents with Capacity Constraints

学习将预测任务分配给具有容量限制的智能体

Shang Wu, Saatvik Kher, Padhraic Smyth

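AI summary: For multiple capacity-constrained agents (human or AI), proposes a framework that learns a sequential explore-exploit policy to maximize overall predictive performance.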

2605.27967 2026-05-28 stat.ME cs.AI cs.LG stat.ML

Multi-Teacher Knowledge Distillation via Teacher-Informed Mixture Priors

通过教师引导的混合先验进行多教师知识蒸馏

Luyang Fang, Yongkai Chen, Jiazhang Cai, Ping Ma, Wenxuan Zhong

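AI summary: Proposes the Multi-Teacher Bayesian Knowledge Distillation (MT-BKD) framework, which uses Bayesian inference and a teacher-informed prior together with an entropy-based weighting mechanism to combine knowledge from multiple teachers efficiently and to quantify uncertainty.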

AI中文摘要

知识蒸馏是一种强大的模型压缩方法，能够高效部署复杂的深度学习模型（教师模型），包括大型语言模型。然而，其潜在的统计机制尚不明确，且不确定性评估常被忽视，特别是在需要多样化教师专业知识的实际场景中。为解决这些挑战，我们引入了\textit{多教师贝叶斯知识蒸馏}（MT-BKD），其中蒸馏学生模型在贝叶斯框架内从多个教师模型学习。我们的方法利用贝叶斯推断来捕捉蒸馏过程中的固有不确定性。我们引入了一种教师引导的先验，整合来自教师模型和特定任务训练数据的外部知识，提供了更好的泛化性、鲁棒性和可扩展性。此外，一种基于熵的加权机制自适应地调整每个教师的影响，使学生能够有效组合多个专业知识来源。MT-BKD增强了学生模型学习过程的可解释性，提高了预测准确性，并提供了不确定性量化。我们在合成任务和真实任务（包括蛋白质亚细胞定位预测和图像分类）上验证了MT-BKD。实验表明，我们的MT-BKD框架在性能提升和稳健的不确定性量化方面表现出色，突显了其优势。

英文摘要

Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teachers), including large language models. However, its underlying statistical mechanisms remain unclear, and uncertainty evaluation is often overlooked, especially in real-world scenarios requiring diverse teacher expertise. To address these challenges, we introduce \textit{Multi-Teacher Bayesian Knowledge Distillation} (MT-BKD), where a distilled student model learns from multiple teachers within the Bayesian framework. Our approach leverages Bayesian inference to capture inherent uncertainty in the distillation process. We introduce a teacher-informed prior, integrating external knowledge from teacher models and task-specific training data, offering better generalization, robustness, and scalability. Additionally, an entropy-based weighting mechanism adaptively adjusts each teacher's influence, allowing the student to combine multiple sources of expertise effectively. MT-BKD enhances the interpretability of the student model's learning process, improves predictive accuracy, and provides uncertainty quantification. We validate MT-BKD on both synthetic and real-world tasks, including protein subcellular location prediction and image classification. Our experiments show improved performance and robust uncertainty quantification, highlighting the strengths of our MT-BKD framework.
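
One simple way to picture an entropy-based weighting mechanism is to give each teacher a weight that shrinks as the entropy of its predictive distribution grows. The softmax-over-negative-entropy rule below is an assumption chosen for illustration; the abstract does not state the exact weighting used in MT-BKD, and the function names are hypothetical.

```python
import numpy as np


def entropy(probs, eps=1e-12):
    """Shannon entropy of each row of a probability array."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)


def entropy_weighted_ensemble(teacher_probs):
    """Combine teacher predictions for one input.

    teacher_probs has shape (n_teachers, n_classes). Returns (weights, combined).
    """
    # Hypothetical rule: softmax over negative entropies, so a more confident
    # (lower-entropy) teacher receives a larger weight.
    logits = -entropy(teacher_probs)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights, weights @ teacher_probs


teacher_probs = np.array([
    [0.90, 0.05, 0.05],  # confident teacher
    [0.40, 0.30, 0.30],  # uncertain teacher
])
weights, combined = entropy_weighted_ensemble(teacher_probs)
```

The combined vector is a convex combination of the teachers' distributions, so it remains a valid probability distribution that a student could be trained against.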

2605.27955 2026-05-28 cs.PL cs.CL

Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents

技能即伪代码：将技能库重构为面向LLM智能体的伪代码

Xinze Li, Yuhang Zang, Yixin Cao, Aixin Sun

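AI summary: Proposes Skill-as-Pseudocode (SaP), which automatically converts markdown skill libraries into typed pseudocode with deterministic quality checks, addressing the confusion loop that LLM agents fall into when retrieving skills, and outperforms the Graph-of-Skills baseline on ALFWorld.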

Comments Preprint. Code: https://github.com/InternLM/Skill-as-Pseudocode

AI中文摘要

面向LLM智能体的Markdown技能库以自由格式的散文形式提供，迫使智能体在每次检索时重新推导输入模式和具体调用语法。我们观察到，这通常会产生一个“困惑 -> 重新检索 -> 仍然困惑”的循环，智能体发出部分正确的动作，收到无信息的环境反馈，并重新检索相同的散文。我们提出Skill-as-Pseudocode (SaP)，一种将Markdown技能库自动转换为带类型伪代码的方法，并具有确定性质量控制。对于从一个或多个技能中提取的相似过程性段落簇，SaP提取一个带类型契约，并通过四重确定性验证器（覆盖、绑定、替换、风险）进行过滤。通过的契约与恢复的具体动作模板一起内联到重写的技能骨架中，为智能体提供两个互补信号：技能功能的类型签名和如何调用它的具体模板。在包含134个游戏的ALFWorld未见分割上，使用gpt-4o-mini，跨三个种子汇总，SaP在402场配对游戏中赢得82场，而Graph-of-Skills (GoS)基线赢得47场（汇总McNemar检验p = 8.2e-5），每场游戏输入token减少22.8% +/- 6.4%，LLM调用减少14.5% +/- 4.1%。

英文摘要

Markdown skill libraries for LLM agents ship as free-form prose, forcing the agent to re-derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a "confused -> re-retrieve -> still confused" loop in which the agent issues a partially-correct action, receives uninformative environment feedback, and re-retrieves the same prose. We propose Skill-as-Pseudocode (SaP), an automatic conversion of markdown skill libraries into typed pseudocode with deterministic quality control. For each cluster of similar procedural passages drawn from one or more skills, SaP extracts a typed contract and filters it through a four-check deterministic verifier (coverage, binding, replacement, risk). Promoted contracts are inlined into a rewritten skill skeleton together with restored concrete action templates, giving the agent two complementary signals: a typed signature for what the skill does and a concrete template for how to invoke it. On the 134-game ALFWorld unseen split with gpt-4o-mini, pooled across three seeds, SaP wins 82/402 paired games versus 47/402 for the Graph-of-Skills (GoS) baseline (pooled McNemar p = 8.2e-5), at -22.8 +/- 6.4% input tokens and -14.5 +/- 4.1% LLM calls per game.

2605.27946 2026-05-28 stat.ML cs.LG

Is Backpropagation Optimal? When Synthetic Gradients Improve Sample Efficiency

反向传播是最优的吗？合成梯度何时提高样本效率

Yibo Jacky Zhang, Zeyu Tang, Sanmi Koyejo

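AI summary: Through theoretical analysis, this paper studies synthetic gradients as an alternative to backpropagation and shows that, under certain conditions, they achieve lower mean squared error in gradient estimation, which significantly improves sample efficiency.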

2605.27942 2026-05-28 quant-ph cs.DS cs.LG

Quantum principal component analysis without eigenvector recovery

无需特征向量恢复的量子主成分分析

Yewei Yuan, Michele Minervini, Mark M. Wilde, Nana Liu

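AI summary: Proposes a measurement-based soft PCA framework that replaces the hard top-k projection with an entropy-regularized Fermi-Dirac filter and uses a quantum circuit to score principal subspaces without eigenvector recovery.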

AI中文摘要

主成分分析（PCA）传统上通过协方差或核矩阵、主导特征向量提取和硬秩-$k$投影实现。这些步骤在高维和量子数据场景中计算成本高，对小特征间隙敏感，并且当下游任务仅需要主子空间分数时是不必要的。这种基于分数的目标在异常检测、谱能量分析和其他后选择任务等应用中很重要。为了满足这些需求，我们引入了一个基于测量的软PCA框架，用熵正则化的费米-狄拉克滤波器替换硬top-$k$投影器。该滤波器是PCA的熵正则化变分公式的唯一优化器，并在零温度极限下收敛到经典PCA投影器。该滤波器直接解释为量子测量，自然暗示了量子方法。对于由量子特征态表示的中心协方差算子，单个固定电路结合阈值校准，可以在无需秩相关电路更新或特征向量恢复的情况下，访问不同秩预算或保留方差水平的所有最优滤波器。对于新输入，相同的校准量子电路产生软主子空间分数、谱能量分布和后选择滤波态。训练和测试数据所需的中心化在量子协议内部相干执行，这对于没有直接可用的经典特征向量或中心化Gram矩阵的量子数据尤为重要。通过将PCA重新构建为校准测量任务，该框架绕过了迭代特征向量提取的需要，并实现了对于归一化分数秩或保留方差评分的维度无关样本复杂度$O(η^{-2})$，加性精度为$η$。

英文摘要

Principal component analysis (PCA) is traditionally implemented through a covariance or kernel matrix, leading-eigenvector extraction, and hard rank-$k$ projection. These steps can be computationally costly in high-dimensional and quantum-data settings, sensitive to small eigengaps, and unnecessary when downstream tasks only require principal-subspace scores. Such score-based objectives are important in applications such as anomaly detection, spectral-energy profiling, and other postselection tasks. To address these needs, we introduce a measurement-based soft PCA framework replacing the hard top-$k$ projector with an entropy-regularized Fermi--Dirac filter. This filter is the unique optimizer of an entropy-regularized variational formulation of PCA and converges to the classical PCA projector in the zero-temperature limit. This filter has a direct interpretation as a quantum measurement, which naturally suggests a quantum approach. For centered covariance operators represented by quantum feature states, a single fixed circuit, together with threshold calibration, accesses all optimal filters for different rank budgets or retained-variance levels without rank-dependent circuit updates or eigenvector recovery. For new inputs, the same calibrated quantum circuit yields soft principal subspace scores, spectral energy profiles, and postselected filtered states. The required centering of both training and test data is performed coherently inside the quantum protocol, which is particularly important for quantum data where no classical feature vectors or centered Gram matrix are directly available. By reframing PCA as a calibrated measurement task, this framework bypasses the need for iterative eigenvector extraction and achieves a dimension-independent sample complexity $O(η^{-2})$ for normalized fractional-rank or retained variance scoring at additive accuracy $η$.
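
A purely classical sketch can illustrate the soft filter described above. It applies the weight $1/(1+e^{(\mu-\lambda)/T})$ to each covariance eigenvalue $\lambda$, calibrates the threshold $\mu$ by bisection so that the weights sum to a rank budget $k$, and checks that a small temperature $T$ recovers the hard top-$k$ projector. The NumPy eigendecomposition stands in for the quantum circuit, and the function names, synthetic data, and bisection routine are assumptions for illustration, not the paper's protocol.

```python
import numpy as np


def fermi_dirac_weights(eigvals, mu, temperature):
    """Soft spectral filter: weight 1 / (1 + exp((mu - lambda) / T)) per eigenvalue."""
    z = np.clip((mu - eigvals) / temperature, -700.0, 700.0)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(z))


def calibrate_threshold(eigvals, rank_budget, temperature, iters=200):
    """Bisection on mu so that the filter weights sum to the rank budget."""
    lo = eigvals.min() - 50.0 * temperature
    hi = eigvals.max() + 50.0 * temperature
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # The weight sum decreases as mu increases.
        if fermi_dirac_weights(eigvals, mid, temperature).sum() > rank_budget:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def soft_pca_score(x, mean, eigvecs, weights):
    """Soft principal-subspace score: filtered energy of the centered input."""
    coeffs = eigvecs.T @ (x - mean)
    return float(np.sum(weights * coeffs**2))


# Synthetic data with well-separated variances along five axes.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
mean = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))

k = 2
hard = (eigvals >= np.sort(eigvals)[-k]).astype(float)  # hard top-k indicator

cold = fermi_dirac_weights(eigvals, calibrate_threshold(eigvals, k, 1e-3), 1e-3)
warm = fermi_dirac_weights(eigvals, calibrate_threshold(eigvals, k, 1.0), 1.0)
score = soft_pca_score(X[0], mean, eigvecs, warm)
```

Here `cold` matches the hard projector's spectrum, while `warm` spreads weight smoothly around the threshold. Only the threshold changes when the rank budget changes, which is the classical analogue of reusing one fixed circuit with threshold calibration.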

2605.27937 2026-05-28 physics.ins-det cs.LG physics.med-ph

Machine learning enables experimental access to photon-by-photon arrival times in scintillation detectors

机器学习实现闪烁探测器中逐光子到达时间的实验获取

Yuya Onishi, Ryosuke Ota, Fumio Hashimoto, Kibo Ote, Go Akamatsu, Hideaki Tashima, Taiga Yamaya

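AI summary: This study uses deep learning to estimate the arrival times of individual photons directly from detector waveforms without modifying the detector structure. By combining unsupervised learning with a physically informed detector-response model, it improves timing resolution, visualizes depth-of-interaction-dependent photon transport, and classifies Cherenkov and scintillation photons.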

AI中文摘要

具有优异时间分辨率的闪烁探测器能够在正电子发射断层扫描中更精确地定位辐射源，从而显著提高对癌症和痴呆等疾病的诊断能力。在皮秒尺度所需的极端时间精度下，探测器性能由探测器内产生的闪烁光子的微观动力学及其后续探测过程决定。然而，由于光电探测器的结构限制，探测器信号传统上仅被视为多个光子的集体响应。在本研究中，我们利用深度学习克服了这一基本限制，实现了对单个光子时间信息的直接访问。所提出的方法直接从探测器波形估计逐光子到达时间，无需对探测器结构进行任何修改；该方法通过将无监督学习框架与物理信息探测器响应模型相结合，在逐事件基础上运行，无需真实标签。通过结合蒙特卡洛模拟和实验测量在各种探测器配置下的全面验证，我们实验证明了时间分辨率的提高，可视化了依赖于相互作用深度的光子传输，并基于估计的光子级时间信息使用统一的深度学习框架对切伦科夫和闪烁光子进行了分类。这些结果提供了对光子动力学的实验访问，弥合了理论建模与实验观察之间的差距，并为探测器物理和优化开辟了一条新的数据驱动途径。

英文摘要

Scintillation detectors with excellent timing resolution enable more precise localization of radiation sources in positron emission tomography, leading to substantial improvements in diagnostic capability for diseases such as cancer and dementia. At the extreme timing precision required for such applications at the picosecond scale, detector performance is governed by the microscopic dynamics of scintillation photons generated within the detector and their subsequent detection processes. However, detector signals have conventionally been treated only as collective responses of many photons due to structural constraints inherent to photodetectors. In this study, we overcome this fundamental limitation using deep learning, enabling direct access to the timing information of individual photons. The proposed method estimates photon-by-photon arrival times directly from detector waveforms without requiring any modification to the detector structure; the method operates on an event-by-event basis without ground-truth labels by integrating an unsupervised learning framework with a physically informed detector-response model. Through comprehensive validation combining Monte Carlo simulation and experimental measurements across various detector configurations, we experimentally demonstrate improved timing resolution, visualized depth-of-interaction-dependent photon transport, and classified Cherenkov and scintillation photons based on the estimated photon-level timing information using a unified deep learning-based framework. These results provide experimental access to photon dynamics, bridging the gap between theoretical modeling and experimental observation, and they open a new data-driven pathway for discovery in detector physics and optimization.

2605.27929 2026-05-28 q-bio.NC cs.LG

Exploratory Experience Shapes the Geometry of Predictive Representations

探索性经验塑造预测表征的几何结构

Kseniia Shilova, Abdelrahman Sharafeldin, Advay Balakrishnan, Hannah Choi

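AI summary: By building an online learning agent in a tree-like maze, studies how exploratory and exploitative behavioral strategies shape the geometry of predictive-coding-based internal representations, and finds that exploration yields more spatially organized representations, consistent with data from mice.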

AI中文摘要

主动感知通过行动-感知循环将行为与学习联系起来：行动决定了用于更新内部感知预测模型的观测，而该模型随后指导下一步行动。预测编码框架为建模这一过程提供了自然方式，因为内部表征不断更新以预测未来观测。这里，我们探究探索性和利用性行为策略如何塑造这些内部预测表征。我们在树状迷宫中构建了一个在线学习智能体，其可调参数控制探索与利用模式之间的平衡。智能体根据自身行为产生的经验更新基于预测编码的感知模型。该模型预测未来迷宫状态和奖励概率，使智能体能够通过探索期间的预期信息增益或利用期间的预测奖励来选择行动。结果表明，产生的内部预测表征强烈依赖于智能体的行为模式。探索性智能体发展出更具空间组织性的表征，并更好地在潜在空间中保留迷宫转换的结构。相反，利用性智能体学习到组织性较差的表征。然后，我们用水剥夺小鼠在相同迷宫中导航的自然轨迹训练该预测模型，并将结果表征与智能体轨迹学习到的表征进行比较。更具探索性的小鼠表现出与探索性智能体高度匹配的表征几何结构，而访问模式受限的小鼠则类似于奖励驱动的利用性智能体。总之，这些发现表明，在人工智能体和动物中，探索通过围绕空间位置和转换上下文组织潜在空间，使预测模型能够形成泛化的内部表征。

英文摘要

Active sensing links behavior and learning through an action-perception loop: actions determine the observations used to update internal predictive models of perception, which subsequently guide the next actions. Predictive-coding frameworks provide a natural way to model this process, since internal representations are continuously updated to predict future observations. Here, we ask how exploratory and exploitative behavioral strategies shape these internal predictive representations. We build an online learning agent in a tree-like maze with a controllable parameter regulating the balance between exploratory and exploitative regimes. The agent updates a predictive-coding-based perception model from experience generated by its own behavior. The model predicts both future maze states and reward probability, allowing the agent to select actions either by expected information gain during exploration or by predicted reward during exploitation. We show that the resulting internal predictive representations depend strongly on the agent's behavioral regime. Exploratory agents develop representations that are more spatially organized and better preserve the structure of maze transitions in latent space. In contrast, exploitative agents learn less organized representations. We then train this predictive model on natural trajectories of water-deprived mice navigating the same maze and compare the resulting representations with those learned from agent trajectories. More exploratory mice show representational geometries that closely match those of exploratory agents, whereas mice with more restricted visitation patterns resemble reward-driven, exploitative agents. Together, these findings suggest that exploration enables predictive models to form generalized internal representations by organizing latent space around both spatial location and transition context in artificial agents and animals.

2605.27856 2026-05-28 cs.IR cs.AI

Fine-Tuned LLM as a Complementary Predictor Improving Ads System

微调LLM作为改进广告系统的互补预测器

Hui Yang, Daiwei He, Kevin Jiang, Taejin Park, Kungang Li, Jiajun Luo, Yuying Chen, Xinyi Zhang, Sihan Wang, Haoyu He, Yu Liu, Lakshmi Manoharan, David Xue, Shubham Barhate, Runze Su, Duna Zhan, Ling Leng, Siping Ji, Jinfeng Zhuang, Alice Wu, Leo Lu, Han Sun, Zhifang Liu

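AI summary: Proposes using a fine-tuned open-source LLM as an ads-specific ancillary predictor that forecasts likely advertisers from user profiles and histories, augmenting candidate generation and supplying priors to downstream ranking, with offline improvements and online business gains in an industrial advertising system.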

AI中文摘要

推荐系统驱动着信息流、广告和短视频平台的用户参与和变现，但将大型语言模型的最新进展转化为推荐系统的收益仍然罕见，尤其是在广告和工业级生产规模的实际场景中。先前真实世界的LLM成功通常分为三类：(a) 直接预测下一项以生成候选的生成式检索，(b) 使用LLM进行后期重排序，以及(c) 利用LLM进行辅助信号增强。我们为广告引入了一种互补范式：微调的开源LLM不作为排序器，而是作为广告特定的辅助预测器，从用户画像和历史中预测可能的广告主。这种LLM驱动的广告主预测增强了传统候选生成，并为下游排序提供了信息先验。在大规模生产广告系统中开发，我们的方法产生了显著的离线改进和可衡量的在线业务影响，展示了LLM的世界知识和预测能力可以被有效利用。除了验证LLM在广告应用中的有效性，我们的结果表明，有针对性的辅助预测可以在检索和后期排序中解锁端到端的收益，为大规模LLM增强推荐提供了一条实用路径。

英文摘要

Recommendation systems power engagement and monetization across feeds, ads, and short-video platforms, but translating the latest advances in Large Language Models into Recommendation Systems (RecSys) gains remains rare, particularly in advertising and production-scale real-world industry setups. Prior real-world LLM successes typically fall into three buckets: (a) generative retrieval that directly predicts the next items for candidate generation, (b) late-stage re-ranking that uses LLMs, and (c) auxiliary signal enrichment with LLMs. We introduce a complementary paradigm for ads: a fine-tuned open-source LLM used not as a ranker, but as an ads-specific ancillary predictor, forecasting likely advertisers from user profiles and histories. This LLM-driven advertiser prediction augments conventional candidate generation and provides informative priors to downstream ranking. Developed in a large-scale production advertising system, our approach produces substantial offline improvements and measurable online business impact, demonstrating that LLM world knowledge and predictive capacity can be efficiently harnessed. Beyond validating LLMs for ads applications, our results show that targeted ancillary predictions can unlock end-to-end gains across both retrieval and late-stage ranking, offering a practical path to LLM-enhanced recommendation at scale.

2605.27849 2026-05-28 cs.PL cs.AI cs.CL

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

FPMoE：一种用于函数式代码生成的稀疏混合专家方法

Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong

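AI summary: To address the poor performance of LLMs on functional programming languages, proposes FPMoE, a model built on a sparse MoE architecture in which language-specific experts eliminate interference and a shared expert captures cross-language abstractions; with 3B active parameters it substantially outperforms fine-tuned baselines and matches larger models.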

AI中文摘要

尽管基于LLM的代码生成取得了快速进展，但现有模型主要针对命令式语言进行训练，导致函数式编程语言（FPLs）如Haskell、OCaml和Scala长期未被充分探索，即使是前沿模型在FPLs上的表现也明显较差。微调是一种自然的补救措施，但我们的实验表明，每种语言的微调无法捕获共享的函数式抽象，而合并的多语言微调则引入了跨语言干扰。为了解决这个问题，我们引入了FPMoE，这是一个轻量级的开源代码生成模型，基于稀疏混合专家（MoE）架构，包含三个语言特定的路由专家（分别对应Haskell、OCaml和Scala）和一个共享专家，用于捕获跨语言的函数式模式，如单子推理和类型导向编程。这种设计同时解决了两种失败模式：专用专家消除了干扰，而共享专家保留了单语言模型遗漏的抽象。在FPEval上，FPMoE显著优于微调基线，并且仅使用3B活跃参数，即可匹配包括DeepSeek-Coder-6.7B、Qwen2.5-Coder-14B-Instruct和Qwen3-Coder-30B-A3B在内的更大模型的性能。

英文摘要

Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even frontier models performing substantially worse on FPLs. Fine-tuning is a natural remedy, but our experiments show that per-language fine-tuning fails to capture shared functional abstractions, while merged multi-language fine-tuning introduces cross-language interference. To address this, we introduce FPMoE, a lightweight, open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional patterns such as monadic reasoning and type-directed programming. This design resolves both failure modes simultaneously: dedicated experts eliminate interference, while the shared expert preserves abstractions that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models including DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.
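
The layer layout described above, one always-active shared expert plus one routed expert per language, can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: routing here uses a known language tag, while the abstract does not say how FPMoE's router selects experts, and the class name, layer sizes, and random weights are hypothetical.

```python
import numpy as np

LANGUAGES = ("haskell", "ocaml", "scala")


class ToySparseMoELayer:
    """One shared feed-forward expert plus one routed expert per language."""

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)

        def make_expert():
            # Two-layer feed-forward expert with small random weights.
            return (rng.standard_normal((dim, hidden)) * 0.1,
                    rng.standard_normal((hidden, dim)) * 0.1)

        self.shared = make_expert()
        self.routed = {lang: make_expert() for lang in LANGUAGES}

    @staticmethod
    def _ffn(x, expert):
        w_in, w_out = expert
        return np.maximum(x @ w_in, 0.0) @ w_out  # ReLU feed-forward block

    def forward(self, x, language):
        # Sparse activation: only the shared expert and the one routed expert
        # for this language run; the other two routed experts are skipped.
        return x + self._ffn(x, self.shared) + self._ffn(x, self.routed[language])


layer = ToySparseMoELayer(dim=8, hidden=16)
x = np.ones((2, 8))
out_haskell = layer.forward(x, "haskell")
out_ocaml = layer.forward(x, "ocaml")
```

Because only two of the four experts run per input, the active parameter count stays well below the total, which is the sense in which a sparse model can report a small active-parameter figure.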

2605.27845 2026-05-28 cs.SI cs.AI physics.soc-ph

Snippet-Driven Supply Chain Discovery with LLMs: Scaling Visibility in China

基于片段的供应链发现：利用LLMs在中国实现规模化可见性

Hiroto Fukada, Takayuki Mizuno

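AI summary: Proposes a method based on web search snippets that uses large language models to build a supply chain knowledge graph, expanding coverage of inter-firm relationships in China at low cost.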

Comments 8 pages, 4 figures, 3 tables

AI中文摘要

金融和经济研究通常依赖于结构化的供应链披露和商业数据库。在中国，供应商-客户披露通常仅限于上市公司的重大合作伙伴，导致非上市公司和长尾企业间关系在结构化数据中记录不足。公共网络证据可以通过企业、政府和贸易媒体披露部分弥补这一差距；然而，大规模的全文本网络挖掘成本高昂，因为页面通常难以访问或使用大语言模型（LLM）处理成本过高。我们提出了一种基于片段的方法来构建供应链知识图谱（SCKG），以企业为节点，企业间关系为边。网络搜索片段是与查询相关的摘要，随搜索结果返回。我们将其用作基于LLM的关系提取的可扩展第一层证据。我们从提取效率和覆盖范围两方面评估该流程。在提取效率方面，穷举全文本分块发现的独特关系数量是片段的19.8倍，但需要的输入token数量是片段的251.2倍，且冗余度更高。在覆盖范围方面，我们使用130,685家中国企业作为搜索种子，涵盖截至2024年的上海/深圳上市公司和大型非上市公司。在上市公司子集中，生成的SCKG覆盖的企业数量是CSMAR披露基准的7.2倍，关系数量是9.3倍，同时揭示了重尾度分布模式。保留的来源元数据使SCKG成为可审计的披露数据库补充。

英文摘要

Financial and economic research often relies on structured supply-chain disclosures and commercial databases. In China, supplier--customer disclosure is typically limited to major partners of listed firms, leaving unlisted firms and long-tail inter-firm links poorly captured in structured data. Public web evidence can partly complement this gap through corporate, government, and trade-media disclosures; however, full-text web mining at scale is costly because pages are often inaccessible or expensive to process with large language models (LLMs). We propose a snippet-driven method for constructing a supply chain knowledge graph (SCKG), with firms as nodes and inter-firm relationships as edges. Web search snippets are query-biased summaries returned with search results. We use them as a scalable first-pass evidence layer for LLM-based relationship extraction. We evaluate the pipeline in terms of extraction efficiency and coverage. For extraction efficiency, exhaustive full-text chunking discovers 19.8$\times$ more unique relationships than snippets, but requires 251.2$\times$ more input tokens and yields higher redundancy. For coverage, we use 130,685 Chinese firms as search seeds, covering Shanghai/Shenzhen-listed firms and large unlisted firms as of 2024. In the listed-firm subset, the resulting SCKG covers 7.2$\times$ more firms and 9.3$\times$ more relationships than the CSMAR disclosure-based benchmark, while revealing heavy-tailed degree patterns. Retained provenance metadata make the SCKG an auditable complement to disclosure-based databases.
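
The efficiency trade-off in the abstract can be restated as a yield per input token. The short calculation below uses only the two ratios reported above (19.8 times more relationships, 251.2 times more tokens); the variable names are illustrative.

```python
# Ratios reported in the abstract, full-text chunking relative to snippets.
relationship_ratio = 19.8  # unique relationships discovered
token_ratio = 251.2        # input tokens consumed

# Relationships per token for full text, relative to snippets.
full_text_yield_per_token = relationship_ratio / token_ratio
# Equivalent view: how many times more relationships per token snippets yield.
snippet_advantage = token_ratio / relationship_ratio

print(f"{full_text_yield_per_token:.3f}")  # prints 0.079
print(f"{snippet_advantage:.1f}")          # prints 12.7
```

So full text finds more relationships in absolute terms, while snippets extract roughly 12.7 times more unique relationships per input token, which is the basis for treating them as a first-pass evidence layer.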
