arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30738 2026-06-01 cs.AI

MAVEN: Improving Generalization in Agentic Tool Calling

Omkar Ghugarkar, Vishvesh Bhat, Muhammad Ahmed Mohsin, Asad Aali

AI summary: Proposes MAVEN, a lightweight symbolic reasoning scaffold that provides structured decomposition, adaptive tool orchestration, and intermediate verification; it substantially improves model performance on multiple benchmarks at roughly 1/10 the cost of frontier proprietary models.

Abstract

Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored. We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured decomposition, adaptive tool orchestration, and intermediate verification. We evaluate MAVEN across established tool-calling benchmarks, including BFCL v3, TauBench, Tau2Bench, and AceBench, and introduce MAVEN-Bench, a stress-test benchmark for multi-step mathematical and physical reasoning with explicit verification and adversarial task composition. MAVEN-Bench exposes a substantial gap between partial reasoning quality and end-to-end task success; in direct MAVEN-Bench runs, MAVEN improves its GPT-OSS-120b base model from 48% to 71% accuracy without additional training. It also remains competitive with frontier proprietary baselines while using an open-weight backbone with an estimated cost ratio of roughly 1/10, suggesting that lightweight verification-centered scaffolds can strengthen compositional reasoning and motivate more process-aware evaluation of agents in the wild.

2605.30736 2026-06-01 cs.LG cs.AI cs.CL

OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

Zhenghua Bao, Fengya Tian, Chris Zhang, Zhenjun Chen, Xile Ma, Yi Shi

AI summary: Proposes OrcaRouter, a production-grade LLM router that combines a LinUCB contextual bandit with a hybrid offline-online learning protocol, using offline full-information feedback and online bandit learning for low-cost, high-accuracy model selection.

Comments
6 pages, 1 table. Technical report
Abstract

The rapid development of large language models, each with distinct capabilities and inference costs, raises a practical deployment question: given an incoming request, which model should handle it? We present OrcaRouter, a production-oriented LLM router that combines a LinUCB-based contextual bandit over lexical and sentence-embedding features with a hybrid offline-online learning protocol. Offline, OrcaRouter obtains full-information feedback by evaluating each candidate model on a curated set of routing prompts, yielding a reward matrix used to fit one ridge regressor per arm. At deployment time, it initializes from these parameters and can optionally continue learning from bandit feedback, updating only the selected model's arm after observing its reward. At the time of our RouterArena submission (May 20, 2026), OrcaRouter-Adaptive ranked second on the public RouterArena leaderboard with an arena score of 72.08, achieving 75.54% accuracy at a cost of USD 1.00 per 1,000 queries.
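The warm-start-then-bandit protocol described above can be sketched in a few lines. This is a minimal illustration, not OrcaRouter's implementation: the class name `LinUCBArm`, the helper `route`, the exploration weight `alpha`, the ridge penalty `lam`, and the toy two-feature data are all assumptions, and the real system's lexical and sentence-embedding features are replaced by hand-written vectors.

```python
import numpy as np


class LinUCBArm:
    """One arm (candidate model) holding ridge statistics A = lam*I + X^T X and b = X^T r."""

    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)
        self.b = np.zeros(dim)

    def fit_offline(self, X, r):
        # Full-information warm start: every offline prompt has a reward for this arm.
        self.A += X.T @ X
        self.b += X.T @ r

    def ucb(self, x, alpha):
        # Ridge estimate plus an exploration bonus that shrinks as the arm sees more data.
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return float(theta @ x + alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x, reward):
        # Bandit feedback: only the selected arm observes this reward.
        self.A += np.outer(x, x)
        self.b += reward * x


def route(arms, x, alpha=0.5):
    """Return the index of the arm with the highest upper confidence bound."""
    return int(np.argmax([arm.ucb(x, alpha) for arm in arms]))


# Toy reward matrix (rows: prompts, columns: arms). Arm 0 does well on
# feature 0, arm 1 on feature 1. These numbers are made up for illustration.
X_offline = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
rewards = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

arms = [LinUCBArm(dim=2) for _ in range(2)]
for k, arm in enumerate(arms):
    arm.fit_offline(X_offline, rewards[:, k])

query = np.array([1.0, 0.0])
choice = route(arms, query, alpha=0.1)
arms[choice].update(query, reward=1.0)  # online step touches the chosen arm only
```

Keeping one `(A, b)` pair per arm means the offline fit and the online update share the same sufficient statistics, so continuing to learn after deployment needs no separate model.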

2605.30734 2026-06-01 cs.LG cs.CV

Beyond Accuracy: Evaluating Efficiency, Robustness and Explainability in Deep Learning for Malaria Diagnosis

Olivier Kanamugire, Kerol Djoumessi

AI summary: Benchmarks four deep learning models on the NLM-Malaria dataset, jointly evaluating predictive performance, robustness, and post-hoc explainability; lightweight models match heavier ones in performance, but explanations are fragile under image corruption.

Comments
Under review
Abstract

Malaria remains a leading cause of mortality in sub-Saharan Africa, where scarce diagnostic infrastructure makes timely, accurate diagnosis particularly challenging. While deep learning offers a compelling path toward automated malaria screening, clinical adoption is hindered by computational cost and opacity in decision-making. This work benchmarks four deep learning models spanning a wide range of architectural designs and model capacities on the NLM-Malaria dataset, jointly evaluating predictive performance, robustness, and post-hoc explainability. We find that lightweight, efficient-by-design models match their heavier counterparts in predictive performance, and the Friedman test confirms no statistically significant performance differences. CAM-based XAI methods consistently localize diagnostically relevant regions, while fine-grained attribution methods produce less targeted explanations, particularly with heavier backbones. Robustness evaluation under three types of image corruption further reveals that model confidence degrades faster than accuracy, providing a practical signal for human review. However, no XAI method is robust to corruption, with explanation reliability degrading at noise levels plausible in clinical practice, even when predictions remain accurate. These findings support the deployment of lightweight architectures for malaria diagnosis in resource-constrained settings, while highlighting the vulnerability of post-hoc explanations as an important consideration for responsible clinical deployment.

2605.30729 2026-06-01 cs.LG cs.IR

SemStruct: Contextualizing Semantic Embeddings with Structural Information for Schema Matching

Inwon Kang, Kavitha Srinivas, Nandana Mihindukulasooriya, Sola Shirai, Parikshit Ram, Horst Samulowitz, Oshani Seneviratne

AI summary: Proposes SemStruct, a framework that combines a frozen pre-trained language model with the structural inductive bias of graph neural networks, using row-level co-occurrence as structural information, and achieves state-of-the-art performance on schema matching.

Comments
Accepted to KDD 26 Research Track
Abstract

Schema matching is a fundamental step in integrating heterogeneous data sources. While Pre-trained Language Models (PLMs) have revolutionized this task by capturing linguistic semantics, they typically process tabular data as serialized text sequences of standalone column descriptions. This serialization discards critical structural information -- specifically, the row-level co-occurrences, i.e. the relational context -- forcing models to rely solely on column header semantics or standalone distributions. To bridge this gap, we propose SemStruct, a framework that joins the semantic power of frozen PLMs with the structural inductive bias of Graph Neural Networks (GNNs). We model the table as a heterogeneous graph where columns and values are nodes connected by rows, allowing the GNN to propagate disambiguating context across the structure. Unlike other state-of-the-art methods that require proprietary LLM access and fine-tuning of language models, SemStruct keeps the language model frozen and trains only a lightweight structural encoder. Extensive experiments on the Valentine and SOTAB-SM benchmarks demonstrate that SemStruct achieves state-of-the-art performance, outperforming fully fine-tuned baselines on complex, semantically joinable datasets. Furthermore, our ablation studies reveal that row representations serve primarily as topological conduits rather than semantic entities, validating the necessity of explicit structural modeling in schema matching.

2605.30728 2026-06-01 cs.LG cs.DC

Reducing the GPU Memory Bottleneck with Lossless Compression for ML -- Extended

Aditya K Kamath, Arvind Krishnamurthy, Marco Canini, Simon Peter

AI summary: Proposes IBP, a lossless compression algorithm that identifies and eliminates invariant bits across groups of tensors and uses GPU-optimized decompression and asynchronous PCIe transfers to substantially cut data transfer time, speeding up GNN training, DLRM embedding lookup, and LLM inference.

Journal ref
2026. In Proceedings of the 21st European Conference on Computer Systems. Association for Computing Machinery, 899-918
Comments
Extended version of the paper published at the 21st European Conference on Computer Systems (EUROSYS '26), April 27-30, 2026, Edinburgh, Scotland, UK
Abstract

Machine learning (ML) training and inference often process data sets far exceeding GPU memory capacity, forcing them to rely on PCIe for on-demand tensor transfers, causing critical transfer bottlenecks. Lossy compression has been proposed to relieve bottlenecks but introduces workload-dependent accuracy loss, making it complex or even prohibitive to use in existing ML deployments. We explore lossless compression as an alternative that avoids this deployment complexity. We identify where lossless compression can be integrated into ML pipelines while minimizing interference with GPU execution. Based on our findings, we introduce Invariant Bit Packing (IBP), a novel lossless compression algorithm designed to minimize data transfer time for ML. IBP identifies and eliminates invariant bits across groups of tensors, improving throughput through GPU-optimized decompression that leverages warp parallelism, low-overhead bit operations, and asynchronous PCIe transfers. We provide easy-to-use APIs, showcasing them by adding IBP support to GNN training, as well as DLRM and LLM inference frameworks. IBP achieves, on average, 74% faster GNN training, 180% faster DLRM embedding lookup, and 24% faster LLM inference.
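The core idea of removing bits that never change across a group can be illustrated on plain integers. The sketch below is a CPU-only toy under stated assumptions: the function names `invariant_mask`, `pack`, and `unpack`, the 32-bit word width, and the sample values are hypothetical, and nothing here reflects IBP's actual tensor grouping, GPU kernels, or PCIe overlap.

```python
def invariant_mask(words, width=32):
    """Return a mask whose set bits mark positions that are identical in every word."""
    full = (1 << width) - 1
    all_and, all_or = full, 0
    for w in words:
        all_and &= w
        all_or |= w
    # A bit is invariant if it is 1 everywhere (all_and) or 0 everywhere (~all_or).
    return (all_and | ~all_or) & full


def pack(words, width=32):
    """Store the invariant bits once and keep only the varying bits per word.

    Assumes `words` is non-empty.
    """
    mask = invariant_mask(words, width)
    base = words[0] & mask
    var_positions = [i for i in range(width) if not (mask >> i) & 1]
    packed = []
    for w in words:
        v = 0
        for j, pos in enumerate(var_positions):
            v |= ((w >> pos) & 1) << j  # gather varying bits into a dense value
        packed.append(v)
    return base, var_positions, packed


def unpack(base, var_positions, packed):
    """Exactly reconstruct the original words (lossless)."""
    out = []
    for v in packed:
        w = base
        for j, pos in enumerate(var_positions):
            w |= ((v >> j) & 1) << pos  # scatter varying bits back into place
        out.append(w)
    return out


# Four 32-bit patterns that differ only in their two lowest bits.
words = [0x3F800000, 0x3F800001, 0x3F800003, 0x3F800002]
base, var_positions, packed = pack(words)
restored = unpack(base, var_positions, packed)
```

In this toy group each word shrinks from 32 bits to 2 varying bits plus one shared base, which is the kind of saving that makes a transfer-bound pipeline faster when decompression is cheap.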

2605.30727 2026-06-01 cs.CL

MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents

Alexander Gurung, Spandana Gella, Alexandre Drouin, Issam H. Laradji, Perouz Taslakian, Rafael Pardinas

AI summary: Addresses the risk that deep research agents leak sensitive local information when querying external tools; proposes the MosaicLeaks benchmark and the Privacy-Aware Deep Research (PA-DR) framework, which uses reinforcement learning to reduce leakage.

Abstract

Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an agent's external queries may leak sensitive information from its local context. This risk is amplified by the mosaic effect, where individual queries may appear harmless but become revealing in aggregate. We introduce MosaicLeaks, a benchmark of 1,001 multi-hop deep research tasks that chain private enterprise documents and a public web corpus, forcing agents to make external queries that depend on local information. We evaluate leakage with an adversary LLM that observes only the agent's external queries and attempts to infer private information at three levels: the agent's research intent, answers to specific private questions and verifiable claims about the enterprise documents. We find that models across families and sizes frequently leak at all three levels, that zero-shot privacy prompting reduces but does not eliminate leakage and that reinforcement learning for task performance alone worsens leakage. To address this, we propose Privacy-Aware Deep Research (PA-DR), an RL framework that combines situational rewards for task success with a learned privacy classifier to provide dense credit assignment over both per-query and mosaic-level leakage. Training Qwen3-4B-Instruct with PA-DR improves accuracy from 48.7% to 58.7% and reduces answer and full-information leakage from 34.0% to 9.9%.

2605.30723 2026-06-01 cs.CL

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

Jianxiang Yu, Jiapeng Zhu, Bochen Lin, Qier Cui, Zichen Ding, Xiang Li

AI summary: Proposes the MASA framework, which adapts skills to LLM backbones of differing capability through a hierarchical skill evolution pipeline and a lightweight model-conditioned rewriter, substantially improving performance on long-horizon interactive tasks.

Abstract

LLM agents increasingly retrieve externally curated skills (procedural instructions retrieved at decision time) to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as model-agnostic, reusing the same skill formulations across backbones with substantially different capacities and behaviors. However, our controlled experiments across multiple model scales show that skill effectiveness is strongly model-dependent: a skill that benefits one backbone can harm another. Motivated by this observation, we propose MASA (Model-Aware Skill Alignment), a framework that adapts skills to each target backbone without modifying agent weights. MASA operates in two stages: (1) a hierarchical skill evolution pipeline that iteratively rewrites general and task-specific skills using hill climbing and UCB-driven tree search, guided by environment feedback and model capability profiles; and (2) a lightweight model-conditioned skill rewriter trained on evolution trajectories to reproduce the adaptation in a single forward pass. Experiments across three interactive environments and four backbones show that MASA consistently achieves the best overall performance, with gains of up to 25.8 points over the strongest baseline. The learned rewriter further generalizes to unseen tasks and environments without additional search, consistently outperforming a much larger teacher LLM at a fraction of the inference cost.

2605.30720 2026-06-01 cs.LG cs.AI econ.GN q-fin.EC stat.ML

Kalimati Vegetable Price Index Forecasting with a Momentum Corrected Online Stacking Ensemble

Sahaj Raj Malla

AI summary: Targets the high volatility of agricultural prices in emerging economies; proposes a momentum-corrected online stacking ensemble built on an inverse-volatility weighted composite index and 64 causal features, reaching RMSE = 1.771, MAPE = 0.68%, and R² = 0.845 at the 90-day horizon.

Comments
21 pages, 8 figures, 2 tables
Abstract

Forecasting agricultural commodity prices in emerging economies is difficult due to high volatility, frequent supply disruptions, and strong cultural influences on demand. This study introduces the Kalimati Vegetable Price Index (KVPI), a new inverse-volatility weighted composite index that aggregates 135 daily wholesale commodities from Kathmandu over ten years (2013-2023). By creating a stable macro-level signal, the KVPI reduces the noise inherent in modelling individual crops. A rich set of 64 causally valid features was developed, including festival lead-lag effects, rolling statistics, and calendar variables. Fourteen forecasting models spanning statistical, tree-based, deep learning, hybrid, and transformer architectures were rigorously evaluated across short (7-day), medium (14- and 30-day), and long-term (90-day) horizons. Tree-based ensembles proved notably robust, while classical statistical models and complex transformers struggled with the noisy dataset. The proposed Momentum-Corrected Online Stacking Ensemble achieved the strongest performance, yielding a Root Mean Square Error (RMSE) of 1.771, an exceptionally low Mean Absolute Percentage Error (MAPE) of 0.68%, and explaining 84.5% of the variance (R-squared = 0.845) at the 90-day horizon. This open-source pipeline provides policymakers and supply chain actors in Nepal and similar markets with a practical, reliable tool for anticipating price movements and strengthening food security.
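Inverse-volatility weighting, the construction behind the KVPI, gives calmer series more influence on the composite. The following is a rough sketch under assumptions that the abstract does not confirm: volatility is taken as the population standard deviation of simple daily returns over the whole sample, each series is rebased to its first price, and the two commodities and their prices are invented. The function names are hypothetical.

```python
from statistics import pstdev


def inverse_volatility_weights(returns_by_item):
    """Weight each item by 1 / volatility, normalized to sum to 1.

    Assumes every series has non-zero volatility.
    """
    inv = {name: 1.0 / pstdev(r) for name, r in returns_by_item.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}


def composite_index(prices_by_item, weights):
    """Weighted average of each item's price relative to its first day, scaled to 100."""
    n_days = len(next(iter(prices_by_item.values())))
    return [
        100.0 * sum(w * prices_by_item[k][t] / prices_by_item[k][0] for k, w in weights.items())
        for t in range(n_days)
    ]


# Made-up daily wholesale prices: tomato swings widely, potato barely moves.
prices = {
    "tomato": [50.0, 60.0, 45.0, 55.0],
    "potato": [30.0, 31.0, 30.0, 31.0],
}
returns = {k: [p[i + 1] / p[i] - 1.0 for i in range(len(p) - 1)] for k, p in prices.items()}
weights = inverse_volatility_weights(returns)
index = composite_index(prices, weights)
```

Because the volatile series is down-weighted, the composite moves less than a simple average of the two would, which matches the abstract's stated goal of a stable macro-level signal.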

2605.30719 2026-06-01 cs.LG cs.AI

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

Stephane Hatgis-Kessell, Emma Brunskill

AI summary: Proposes PromptPO, in which an LLM is given Python descriptions of the state space, action space, and reward function and iteratively generates and refines executable policies from rollout feedback; it matches or exceeds standard RL baselines in a range of environments but falls short on fine-grained continuous control tasks.

Abstract

We study when large language models (LLMs) can serve as effective black-box policy optimizers for reinforcement learning (RL) tasks, i.e., when can we replace classical RL algorithms with an LLM? We explore this question by introducing Prompted Policy Optimization (PromptPO), an iterative method that prompts an LLM with Python descriptions of the state space, action space, and reward function, then has it generate and refine executable policies based on rollout feedback. Across hard exploration environments, Meta-World robotics tasks, and several real-world control problems, PromptPO often matches or exceeds the performance of standard RL baselines while using substantially fewer environment interactions. To maximize expected return, and without further explicit prompting, the policies PromptPO outputs range from tuned proportional controllers or rule-based plans to policies that run planning algorithms like value iteration. Our results demonstrate that LLM-based policy optimization is sufficient when the LLM can leverage prior knowledge about the environment or optimization strategy. PromptPO underperforms standard RL baselines in MuJoCo domains, which points to possible limitations of LLM-based policy optimization in settings that require fine-grained continuous control.

2605.30717 2026-06-01 cs.CL

Neuron-Level Interventions for Gendered and Gender-Neutral Generation in Language Models

Zhiwen You, Nafiseh Nikeghbal, Jana Diesner

AI summary: Proposes a neuron-level intervention method that identifies neurons associated with feminine, masculine, and gender-neutral forms to control the gender of generated text, and finds that gender neurons concentrate in the earliest layers.

Abstract

Language models (LMs) can produce gendered language and stereotypes even when given neutral prompts. Most prior work on gender bias in LMs primarily examines gender through a binary lens (feminine vs. masculine), with limited attention to gender-neutral forms, such as they/them pronouns or neutrally phrased job titles. How gender-related signals are encoded in the internal representations of LMs remains an open question. In this work, we study gender-specific neurons in LMs across three categories: feminine, masculine, and gender-neutral. We propose a neuron-level intervention method to identify neurons that are strongly tied to each gender category. We then test these neurons through controlled generation, showing that activating or masking gender-related neurons can steer a sentence toward a target gender form while preserving its original meaning. To evaluate the effectiveness of our gender-intervention approach, we curate two datasets with controlled sentences labeled across all three gender categories and validate the data quality through human evaluation. Experiments on two open-source LMs show that gender-specific neurons are not evenly distributed across model layers; instead, they concentrate heavily in the earliest layers with smaller contributions from later layers. Compared to existing methods, our method achieves more precise gender control, with less leakage into non-target gender categories and stable output quality through two evaluation criteria. Overall, our work examines how gender is encoded in LMs and provides a simple yet effective approach toward controlled gender intervention for both neuron intervention evaluation and gender bias mitigation. Code and datasets are available at: https://github.com/zhiwenyou103/Gender-Neuron-Intervention

2605.30716 2026-06-01 cs.CV cs.AI

Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation

Zhiyuan Yang, Jiahao Cheng, Vincent Quoc-Huy Trinh, Mahdi S. Hosseini

AI summary: Proposes a simple token-efficient vision-language model that uses 512×512 patches at 5× magnification and two-stage supervised training to generate case-level, multi-WSI pathology reports under limited GPU memory, sharply reducing sequence length and improving efficiency.

Comments
Accepted by the DeLTA 2026 conference
Abstract

Generating clinically useful pathology reports from whole-slide images (WSIs) is challenging due to gigapixel resolution, long visual-token sequences, and the complexity of case-level reasoning, where a single case may contain multiple WSIs with heterogeneous tissues and ambiguous findings. We present a simple token-efficient vision-language model for case-level synoptic report generation that remains practical under constrained GPU memory. Our architecture follows a minimal three-component design: a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder, with an explicit WSI marker token to separate slides within a case. Training proceeds in two supervised stages: (1) aligner-only WSI captioning using heterogeneous WSI-text pairs, and (2) case-level supervised fine-tuning on case-report pairs for structured report generation. To reduce sequence length, we represent each slide using $512 \times 512$ patches at $5\times$ magnification, which reduces the average sequence length by up to $64\times$ compared to the commonly used $20\times$ patches. Combined with efficient training techniques, we enable practical training with only half of an NVIDIA H100 GPU. Across both training stages, our approach achieves high ROUGE-L/METEOR/BLEU-4 scores while being substantially more efficient in memory and runtime. In AI-based evaluations, our model is consistently preferred over strong baselines. Extensive ablations characterize performance-efficiency trade-offs and identify simple choices that improve robustness in multi-WSI settings. Overall, this work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research under limited compute.
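The sequence-length reduction quoted above can be checked with simple patch counting. The sketch below shows one way a factor of 64 could arise, and it rests on assumptions the abstract does not state: the baseline is taken to be 256×256 patches at 20×, one patch maps to one visual token, every tile is kept (no tissue filtering), and the slide dimensions are invented. The function name `patch_count` is hypothetical.

```python
from math import ceil


def patch_count(width_at_20x, height_at_20x, magnification, patch):
    """Number of patch tiles covering a slide whose pixel size is given at 20x."""
    scale = magnification / 20.0  # lower magnification shrinks each side
    cols = ceil(width_at_20x * scale / patch)
    rows = ceil(height_at_20x * scale / patch)
    return cols * rows


# Hypothetical slide of 40960 x 40960 pixels at 20x.
baseline = patch_count(40960, 40960, magnification=20, patch=256)  # assumed baseline
compact = patch_count(40960, 40960, magnification=5, patch=512)    # setting from the abstract
reduction = baseline // compact
```

Under these assumptions the drop from 20× to 5× contributes a factor of 16 (4 per side) and the larger patch another factor of 4, giving 64 in total; a different baseline patch size would give a different ratio.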

2605.30714 2026-06-01 cs.CV

Vision-Based Localization in Dense Urban Environments: A Case Study of an Urban Village in China

Menglin Wu, Rui Cao

AI summary: Addresses unreliable GPS signals and incomplete map data in dense urban environments; proposes a low-cost vision-based geo-localization approach built on a dual-camera system and evaluates existing models on a dataset collected in Shipai Village, Guangzhou.

Abstract

Urban villages, the widespread informal settlements that have emerged as a result of rapid urbanization, are now major residential hubs for migrant workers in large cities in China. The dense arrangement of buildings in these areas often leads to unreliable GPS signals, while incomplete mapping data further impairs accurate route planning and navigation. These issues not only hinder everyday mobility but also pose significant challenges for emergency response, as confusing road layouts and GPS inaccuracies can complicate evacuation efforts. To address these challenges, we propose a practical vision-based geo-localization solution tailored for dense urban environments. Our approach features a low-cost data collection pipeline utilizing a dual-camera system, comprising a panoramic camera and a smartphone camera, to capture synchronized 360-degree panoramas and query images. Using Shipai Village, a well-known densely populated urban village in Guangzhou, as a case study, we develop a specialized image geo-localization dataset. We then assess and compare the performance of existing models across various scene types to identify their strengths and weaknesses. The findings demonstrate both the potential and limitations of vision-based localization in dense urban-village environments. Our framework aims to enhance pedestrian navigation, last-mile delivery, and emergency management in areas with poor GPS coverage, ultimately supporting the vulnerable populations living within these informal settlements.

2605.30713 2026-06-01 cs.LG cs.CV cs.MM

Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

Yijie Tong, Yifan Hou, Shaobo Cui, Antoine Bosselut, Mrinmaya Sachan

AI summary: Addresses the underuse of test-time compute (TTC) strategies in vision-language models (VLMs); proposes ETTC, a predictive-entropy-based method that exploits confidence differences between models to improve ensemble performance, with theoretical and empirical evidence that it beats majority voting and the best single model.

Comments
ICML 2026
Abstract

Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present a systematic study of TTC across seven VLMs and six benchmarks, specifically analyzing feature-based scoring and majority voting methods. We find that feature heuristics fail and voting yields only modest gains in single-model settings. We theoretically show that this limitation stems from a lack of prediction diversity: when outputs are highly correlated, voting provides little benefit. In contrast, multi-model ensembles offer richer diversity, yet standard majority voting fails to account for varying model capabilities. To address this, we propose Entropy-based TTC (ETTC), which selects the most confident prediction based on predictive entropy. Our method reduces to majority voting in the single-model case, but in model ensembles, it leverages confidence disparities to prioritize stronger models. We prove that ETTC outperforms majority voting under mild assumptions and empirically demonstrate that it consistently surpasses both voting and the best individual model. Crucially, our results show that smaller models can synergistically enhance larger ones, unlocking ensembling gains not achievable with standard strategies.
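The contrast between pooled majority voting and entropy-based selection can be made concrete with a toy ensemble. This sketch assumes that predictive entropy is estimated from the empirical distribution of each model's sampled answers, which the abstract does not specify; the function names, model labels, and answers are all invented for illustration.

```python
from collections import Counter
from math import log


def answer_entropy(samples):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log(c / n) for c in counts.values())


def entropy_select(samples_by_model):
    """Take the answer of the most confident (lowest-entropy) model."""
    best = min(samples_by_model, key=lambda m: answer_entropy(samples_by_model[m]))
    return Counter(samples_by_model[best]).most_common(1)[0][0]


def majority_vote(samples_by_model):
    """Pool every sample from every model and take the most frequent answer."""
    pooled = [a for samples in samples_by_model.values() for a in samples]
    return Counter(pooled).most_common(1)[0][0]


# Two uncertain models lean toward "B"; one model answers "A" every time.
samples = {
    "model_1": ["B", "B", "C", "D"],
    "model_2": ["B", "B", "B", "C"],
    "model_3": ["A", "A", "A", "A"],
}
picked_by_entropy = entropy_select(samples)
picked_by_vote = majority_vote(samples)
single = {"model_1": ["X", "X", "Y"]}
```

With a single model both functions return that model's most frequent answer, consistent with the abstract's remark that the method reduces to majority voting in the single-model case; with several models the two can disagree, as they do here.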

2605.30712 2026-06-01 cs.CL

ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

AI summary: Proposes ExpGraph, a framework that lets frozen LLM executors reuse external experience through graph-structured memory and reinforcement learning, substantially improving performance and reducing interaction steps on question answering, mathematical reasoning, code generation, and multi-step agentic tasks.

Abstract

Large language model (LLM) agents have shown strong capabilities in reasoning, tool use, and multi-step interaction, but they often solve tasks from scratch and fail to reuse successful strategies or failure lessons from prior experience. Fine-tuning on collected experience can improve reuse, but it is inflexible when stronger or more suitable executors emerge. We propose ExpGraph, a model-agnostic experience learning framework that enables frozen and replaceable LLM executors to improve through external experience reuse without parameter updates. ExpGraph summarizes historical trajectories into reusable skills and failure lessons, organizes them as nodes in a self-evolving experience graph, and retrieves useful experiences through graph diffusion and utility-aware ranking. A lightweight retrieval copilot is trained with reinforcement learning using feedback that compares executor performance with and without retrieved experiences, while the graph is updated online from downstream task outcomes. We evaluate ExpGraph on ExpSuite, covering question answering, mathematical reasoning, code generation, and multi-step agentic environments including ALFWorld and AppWorld. ExpGraph improves over the strongest baseline by 12.2% and 4.7% on static tasks with smaller and larger executors, and by 21.4% and 12.7% in agentic environments, while reducing average interaction steps by 12.7% and 21.6%. Ablations show that graph-structured experience, utility-aware ranking, and adaptive retrieval jointly enable effective experience reuse across diverse tasks and executor models.
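Retrieval by graph diffusion followed by utility-aware ranking can be sketched with a random walk with restart. This is a generic stand-in rather than ExpGraph's algorithm: the restart probability, the step count, the rule of multiplying diffusion mass by a stored utility, and the four-node graph are all assumptions, and the function names are hypothetical.

```python
def diffuse(adj, seeds, restart=0.3, steps=50):
    """Random walk with restart over an adjacency-list graph.

    `seeds` maps the nodes matched by the current task to their initial mass.
    """
    p0 = {n: seeds.get(n, 0.0) for n in adj}
    p = dict(p0)
    for _ in range(steps):
        nxt = {n: restart * p0[n] for n in adj}
        for n, neighbors in adj.items():
            walk = (1.0 - restart) * p[n]
            if neighbors:
                for m in neighbors:
                    nxt[m] += walk / len(neighbors)
            else:
                nxt[n] += walk  # an isolated node keeps its own mass
        p = nxt
    return p


def rank(adj, seeds, utility):
    """Order experience nodes by diffusion relevance times stored utility."""
    relevance = diffuse(adj, seeds)
    return sorted(adj, key=lambda n: relevance[n] * utility[n], reverse=True)


# Toy experience graph: a - b - c form a chain, d is unconnected.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
utility = {"a": 0.2, "b": 0.9, "c": 0.9, "d": 1.0}  # made-up past usefulness
scores = diffuse(graph, {"a": 1.0})
order = rank(graph, {"a": 1.0}, utility)
```

Here node `a` is the closest match but has low recorded utility, so its well-connected neighbor `b` is ranked first, while `d` is never reached despite its high utility.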

2605.30711 2026-06-01 cs.CL cs.AI cs.LG stat.ML

SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

Sijia Wang, Dhanajit Brahma, Ricardo Henao

AI summary: Proposes SAGE, a gate based on von Mises-Fisher density estimation and an adaptive threshold that frames memory write control as novelty detection, achieving the best token-F1 on LoCoMo at lower cost.

Abstract

Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4$\times$ and add-phase latency by 2.5$\times$ with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory.
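The three-way routing described above (ADD, NOOP, or an LLM merge step) can be sketched as follows. This is a minimal illustration, not SAGE itself: the score is an unnormalized mean of von Mises-Fisher kernels, the concentration `kappa` and the fixed `low`/`high` thresholds are made-up values standing in for the paper's adaptive threshold, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def vmf_kernel_score(candidate, memory, kappa=20.0):
    """Mean unnormalized vMF kernel exp(kappa * (cos - 1)); lies in [0, 1]."""
    if not memory:
        return 0.0  # an empty store makes every fact novel
    c = normalize(candidate)
    total = 0.0
    for m in memory:
        cos = sum(a * b for a, b in zip(c, normalize(m)))
        total += math.exp(kappa * (cos - 1.0))
    return total / len(memory)


def route(candidate, memory, low=0.05, high=0.4):
    """Route a candidate fact: low density is novel, high density is redundant."""
    score = vmf_kernel_score(candidate, memory)
    if score < low:
        return "ADD"
    if score > high:
        return "NOOP"
    return "MERGE"  # uncertain band: defer to the (expensive) LLM merge step


if __name__ == "__main__":
    store = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-D memory embeddings
    for fact in ([1.0, 0.0], [-1.0, 0.0], [0.95, 0.3122]):
        print(fact, route(fact, store))
```

Because the score is a mean over the whole store, fixed thresholds would drift as the store grows; that is one reason a threshold that tracks memory-store geometry, as the abstract describes, is needed in practice.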

2605.30700 2026-06-01 cs.CV cs.LG

Mathematical Morphology in Machine Learning

机器学习中的数学形态学

Erick Oliveira Rodrigues, Aura Conci

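AI summary: Brings mathematical morphology into machine learning, proposing a fast clustering algorithm based on morphological reconstruction, a new distance metric combining Minkowski and Chebyshev distances, and new morphological classifiers that model shape, density, and fractal information.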

详情
Journal ref
SIBGRAPI 2018
AI中文摘要

本工作将数学形态学——一种成熟的视觉计算理论——引入机器学习,以利用标准技术常忽视的形状和密度方面。我们提出了一种基于形态学重建的快速聚类算法,该算法能精确保留聚类形状和密度。该方案具有独特特性:内在的最大聚类感知、无成本的噪声去除以及由结构元素控制的多样化增长模式。此外,我们提出了一种结合闵可夫斯基距离和切比雪夫距离的新型距离度量,对于形态学膨胀非常高效。在 $Z^2$ 离散邻域迭代中,它比曼哈顿距离快约1.3倍,比欧几里得距离快约329.5倍。当使用k近邻(k-NN)分类器在33个UCI数据集上与其他14种距离度量进行评估时,我们的度量在大多数情况下(33例中的26例)达到了高于平均的准确率,并在9个案例中取得了最佳整体准确率。最后,我们引入了新型形态学分类器。与现有文献不同,本方案独特地对数据集中的形状、密度和分形信息进行建模。

英文摘要

This work introduces mathematical morphology, an established visual computing theory, into machine learning to exploit shape and density aspects often overlooked by standard techniques. We propose a fast clustering algorithm based on morphological reconstruction that accurately preserves cluster shapes and density. This scheme offers unique features: an intrinsic sense of maximal clusters, cost-free noise removal, and diverse growth patterns controlled by structuring elements. Additionally, we propose a novel distance metric combining Minkowski and Chebyshev distances, highly efficient for morphological dilations. In $Z^2$ discrete neighbourhood iterations, it is roughly 1.3 times faster than Manhattan and 329.5 times faster than Euclidean distances. When evaluated using a k-Nearest Neighbours (k-NN) classifier across 33 UCI datasets against 14 other distances, our metric achieved above-average accuracies most frequently (26 of 33 cases) and the best overall accuracy in 9 cases. Finally, we introduce novel morphological classifiers. Unlike current literature, this proposal uniquely models shape, density, and fractal information in datasets.

2605.30699 2026-06-01 cs.LG cs.CV

A Context-Aware Middleware for Medical Image Based Reports: An approach based on image feature extraction and association rules

基于医学图像报告的情境感知中间件:一种基于图像特征提取和关联规则的方法

Erick O. Rodrigues, Jose Viterbo, Aura Conci, Trueman Mac Henry

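AI summary: Proposes a context-aware middleware that uses image feature extraction and association rules to automatically dispatch medical images to the most suitable medical staff, improving the efficiency of medical workflows.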

详情
Journal ref
2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA)
AI中文摘要

本工作提出了一种用于医疗工作流程组织和效率提升的情境感知中间件。在医院、实验室和远程放射学公司中,每位医生或技术人员都专注于特定类型的诊断或分析。因此,某些类型的医学图像通常会被转发给特定的医生或特定群体。这种转发非常耗时。也就是说,反复决定谁是最合适的医生,以及他在特定情境下是否可用,既繁琐又可能非常低效。因此,所提出的中间件能够处理并收集每位医疗人员所分析图像的数据。基于收集的数据和当前临床情境,中间件能够推断出谁是最适合接收特定传入医学图像的人员。

英文摘要

This work proposes a context-aware middleware for medical workflow organization and efficiency improvement. In hospitals, laboratories and teleradiology companies, each physician or technician is specialized in a specific kind of diagnosis or analysis. Therefore, certain types of medical images are often forwarded to a certain physician or a certain group. This forwarding is time-consuming: repeatedly deciding who would be the best physician, and whether that physician is available at a given moment and in a given context, is exhausting and can be very inefficient. The proposed middleware therefore processes and collects data from the images analyzed by each staff member. Based on the collected data and the current clinical context, the middleware infers which staff member is best suited to receive a given incoming medical image.

2605.30698 2026-06-01 cs.CV cs.AI cs.MA

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

先见后议:用视觉证据对齐多智能体共识

Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma, Yuanzi Li, Zhengren Wang, Yinglong Yang, Yufei Chen, Yikang Wang, Shaoxu Sun, Wentao Zhang

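AI summary: Proposes EAGLE, a training-free framework for multi-agent visual question answering that explicitly exposes each agent's visual evidence regions and has the agents verify them mutually, making consensus more reliable.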

详情
AI中文摘要

视觉语言模型(VLM)在视觉问答(VQA)上取得了强劲性能。为了减轻个体幻觉和盲点,通过多智能体协作聚合不同视角已成为一种有前景的范式。虽然这种方法在文本问答中取得了巨大成功,但其在多模态领域的潜力仍未充分探索。现有的多智能体VQA方法主要采用以文本为中心的协议,专注于文本讨论而忽略视觉信息的对齐。在这项工作中,我们揭示了一个关键见解:答案级别的共识对于可靠的多智能体VQA是不够的;\textit{对齐的视觉证据}(智能体所依赖的图像区域的共享支持)对于可信的共识至关重要。为了利用这一见解,我们提出了EAGLE(\textbf{E}vidence-\textbf{A}ligned \textbf{G}rounded mu\textbf{L}ti-agent r\textbf{E}asoning),一个无需训练的以证据为中心的框架,用于协调多个VLM智能体。EAGLE显式暴露每个智能体的定位区域作为视觉证据,允许对证据进行相互验证,并使用证据一致性指导最终决策。在六个VQA基准上的实验表明,EAGLE在跨领域实现了最佳平均性能,同时保持轻量、可解释且易于部署。

英文摘要

Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; \textit{aligned visual evidence} -- shared support from the image regions agents rely on -- is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (\textbf{E}vidence-\textbf{A}ligned \textbf{G}rounded mu\textbf{L}ti-agent r\textbf{E}asoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves best average performance across domains while remaining lightweight, interpretable, and practical for deployment.

2605.30696 2026-06-01 cs.RO cs.SY eess.SY

Geometry-Aware Control Barrier Functions for Collision Avoidance via Bernstein Polynomial Approximations

基于Bernstein多项式近似的几何感知控制障碍函数用于碰撞避免

Siwon Jo, Yanze Zhang, Yupeng Yang, Wenhao Luo

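AI summary: Proposes a geometry-aware control barrier function based on Bernstein-polynomial signed distance fields that represents obstacles and robots in a unified way and uses the differentiability of the polynomials to enforce control constraints in closed loop, targeting safe and efficient collision avoidance for single robots and heterogeneous multi-robot teams.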

详情
Comments
8 pages; Accepted by 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
AI中文摘要

安全导航通常依赖于基于机器人和障碍物形状的明确定义条件,当它们具有不规则几何形状时可能具有挑战性。虽然控制障碍函数(CBF)提供了一种有效机制来强制执行安全集前向不变性,但常见的形状替代(例如球体或超椭球体)要么在非结构化场景中过于保守,要么需要许多局部基元,这会增加约束数量并降低实时性能。在本文中,我们介绍了一种基于Bernstein多项式符号距离场(BP-SDF)的新型几何感知控制障碍函数(CBF)。它提供了一种统一的方式来表示障碍物和机器人,从而用统一的最小距离来表示障碍函数。得益于Bernstein多项式的可微性,可以轻松地在闭环中强制执行控制约束。我们通过不同环境下的仿真验证了该方法在单机器人导航和异构多机器人碰撞避免中保证安全性的效率和性能。

英文摘要

Safe navigation often relies on well-defined conditions based on the shape of robots and obstacles, and can be challenging when they have irregular geometries. While Control Barrier Functions (CBFs) offer an efficient mechanism to enforce safe set forward invariance, common shape surrogates (e.g., spheres or super-ellipsoids) either are overly conservative in unstructured scenes or require many local primitives, which inflates constraint counts and degrades real-time performance. In this paper, we introduce a novel geometry-aware Control Barrier Function (CBF) based on Bernstein-Polynomial Signed Distance Fields (BP-SDFs). It provides a unified way to represent the obstacles and robots, so as to represent the barrier function with a unified minimum distance. Benefiting from the differentiability of the Bernstein polynomials, one can easily enforce the control constraints in a closed loop. We validate the method's efficiency and performance to guarantee safety in single-robot navigation and heterogeneous multi-robot collision avoidance via simulations under different environments.
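To make the role of differentiability concrete, here is a one-dimensional sketch of a barrier function written in the Bernstein basis, its exact derivative, and the resulting safety filter. It is an illustration under strong assumptions rather than the paper's BP-SDF construction: the state is a scalar in [0, 1], the dynamics are a single integrator (x' = u), the barrier coefficients are made up, and `safe_control` is a hypothetical name for the closed-form solution of the scalar CBF condition h'(x) u + alpha h(x) >= 0.

```python
from math import comb


def bernstein(coeffs, t):
    """Evaluate sum_i c_i * C(n, i) * t^i * (1 - t)^(n - i) for t in [0, 1]."""
    n = len(coeffs) - 1
    return sum(c * comb(n, i) * t**i * (1 - t) ** (n - i) for i, c in enumerate(coeffs))


def bernstein_derivative(coeffs, t):
    """Exact derivative: a degree n-1 Bernstein polynomial in the coefficient differences."""
    n = len(coeffs) - 1
    diffs = [n * (coeffs[i + 1] - coeffs[i]) for i in range(n)]
    return bernstein(diffs, t)


def safe_control(coeffs, x, u_nom, alpha=1.0):
    """Control closest to u_nom that satisfies h'(x) * u + alpha * h(x) >= 0."""
    h = bernstein(coeffs, x)
    grad = bernstein_derivative(coeffs, x)
    if grad > 0:
        return max(u_nom, -alpha * h / grad)
    if grad < 0:
        return min(u_nom, -alpha * h / grad)
    return u_nom  # the constraint does not depend on u when the gradient vanishes


if __name__ == "__main__":
    barrier = [-0.2, 0.8]  # hypothetical h(x) = x - 0.2, i.e. stay to the right of 0.2
    print(safe_control(barrier, 0.3, u_nom=-1.0))  # nominal command pushes toward the obstacle
```

In higher dimensions the same condition becomes a linear inequality in the control input, which is typically enforced through a small quadratic program instead of the closed-form clamp used here.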

2605.30695 2026-06-01 cs.RO

Primitive Subspaces Mediate Few-Shot Transfer in VLAs

原始子空间介导VLA中的少样本迁移

Anya Singh, Cabrel Happi, Jai Relan, Varun Nair, Vidyut Baradwaj

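AI summary: Uses primitive-aware training to build a transferable library of sub-skills in vision-language-action (VLA) policies, enabling few-shot transfer from a handful of demonstrations with roughly 3x better sample efficiency than flat training.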

详情
AI中文摘要

在工业环境中部署视觉-语言-动作(VLA)策略需要能够以低成本教授新任务,而当前VLA缺乏这一特性,因为每个新任务都需要微调。我们研究原始感知训练是否会产生一种可迁移的产物:一个学习到的子技能库,可以在推理时根据少量演示进行组合,以执行策略从未训练过的任务。我们在REASSEMBLE接触式装配数据集上,使用匹配的LoRA微调配方和固定超参数,训练了两种具有不同归纳偏置的VLA架构——OpenVLA和$π_{0.5}$,并在平坦轨迹和原始分割的回合(带有原始特定语言提示)之间改变训练方式。我们从训练中保留6个对象-任务组合,并评估少样本迁移:模型接收$m \in \{0, 1, 3, 5, 10\}$个保留任务的演示,并在不更新权重的情况下尝试执行。我们在三个训练种子上重复实验,并在第二个数据集(LIBERO-Long)上进行验证。原始训练模型仅需m=3个演示即可达到微调上限性能的78%,而平坦训练模型需要m=10个演示才能达到相同水平——这是一个3倍的样本效率差距,在种子、架构和数据集上均得到复现。为了建立因果关系,我们消融了隐藏状态的原始可解码子空间,结果显示少样本迁移性能下降32个百分点,而消融相同维度的随机子空间则没有影响,这表明原始表示是因果必要的,而非与迁移偶然相关。我们识别并纠正了评估分块策略时的一个方法论陷阱:单步动作范围门的族系膨胀会导致与真实人类演示相比的假失败率高出数量级。

英文摘要

Deploying vision-language-action (VLA) policies in industrial environments requires the ability to teach new tasks at low cost, a property current VLAs lack, since each new task requires fine-tuning. We investigate whether primitive-aware training produces a transferable artifact: a learned library of sub-skills that can be composed at inference time, conditioned on a small number of demonstrations, to perform tasks the policy was never trained on. We train two VLA architectures with different inductive biases, OpenVLA and $π_{0.5}$, on the REASSEMBLE contact-rich assembly dataset under matched LoRA fine-tuning recipes and locked hyperparameters, varying training between flat trajectories and primitive-segmented episodes with primitive-specific language prompts. We hold out 6 object-task combinations from training and evaluate few-shot transfer: models receive $m \in \{0, 1, 3, 5, 10\}$ demonstrations of a held-out task and attempt execution without weight updates. We replicate across three training seeds and validate on a second dataset (LIBERO-Long). Primitive-trained models reach 78% of fine-tuned upper-bound performance with only m=3 demonstrations, while flat-trained models require m=10 demonstrations to reach the same level -- a $3\times$ sample efficiency gap that replicates across seeds, architectures, and datasets. To establish causation, we ablate the primitive-decodable subspace of hidden states and show few-shot transfer degrades by 32 percentage points while ablating a random subspace of equal dimensionality has no effect, indicating primitive representations are causally necessary rather than incidentally correlated with transfer. We identify and correct a methodological pitfall in evaluating chunked policies: family-wise inflation of single-step action-range gates produces order-of-magnitude higher false-failure rates against ground-truth human demonstrations.
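The subspace ablation mentioned above amounts to projecting hidden states onto the orthogonal complement of a set of directions. The sketch below shows that projection in plain Python; how the primitive-decodable directions are obtained (for example, from a linear probe) is not specified in the abstract and is assumed here, and the function names are hypothetical.

```python
def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of the input vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(x * y for x, y in zip(w, b))
            w = [x - dot * y for x, y in zip(w, b)]
        norm = sum(x * x for x in w) ** 0.5
        if norm > tol:  # skip directions already contained in the span
            basis.append([x / norm for x in w])
    return basis


def ablate_subspace(hidden, directions):
    """Remove from `hidden` every component that lies in span(directions)."""
    out = list(hidden)
    for b in orthonormalize(directions):
        dot = sum(x * y for x, y in zip(out, b))
        out = [x - dot * y for x, y in zip(out, b)]
    return out


if __name__ == "__main__":
    # Toy 3-D hidden state with one hypothetical "primitive" direction.
    print(ablate_subspace([2.0, 0.0, 1.0], [[1.0, 1.0, 0.0]]))
```

The random-subspace control in the abstract corresponds to calling the same function with randomly drawn directions of equal count, so that any performance drop can be attributed to the specific subspace and not to the loss of dimensions alone.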

2605.30694 2026-06-01 cs.LG

Universal Decision Learners

通用决策学习器

Sridhar Mahadevan

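AI summary: Proposes a category-theoretic framework for Universal Decision Learners (UDL) that canonically extends local decision behavior to globally coherent behavior via left and right Kan extensions, unifying planning, reinforcement learning, causal intervention, online learning, and game-theoretic equilibrium.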

详情
Comments
15 pages
AI中文摘要

许多决策理论——规划、强化学习、因果干预、在线学习和博弈均衡——将局部信息转化为全局一致的行为。本文提出一个共同的范畴论形式化:通用决策学习器(UDL)通过一对通用构造将部分指定的决策函子从观测上下文扩展到新上下文。左Kan扩展表达展开、聚合和候选生成;右Kan扩展表达一致性、约束满足和不动点语义。核心主张并非每个决策问题都有相同的算法,而是许多决策形式化实例化同一个通用问题:规范地扩展局部行为数据,然后刻画全局一致的扩展。我们给出抽象的UDL构造,证明其通用比较性质,定义Kan不变的行为等价性和最小抽象,并展示贝尔曼方程、规划递归、因果干预、在线遗憾和均衡如何作为特例出现。补充材料更详细地发展了强化学习特例。

英文摘要

Many theories of decision making -- planning, reinforcement learning, causal intervention, online learning, and game-theoretic equilibrium -- turn local information into globally coherent behavior. This paper proposes a common categorical formulation: a Universal Decision Learner (UDL) extends a partially specified decision functor from observed contexts to new contexts by a pair of universal constructions. Left Kan extensions express rollout, aggregation, and candidate generation; right Kan extensions express consistency, constraint satisfaction, and fixed-point semantics. The central claim is not that every decision problem has the same algorithm, but that many decision formalisms instantiate the same universal problem: extend local behavioral data canonically, then characterize the globally coherent extensions. We give the abstract UDL construction, prove its universal comparison property, define Kan-invariant behavioral equivalence and minimal abstractions, and show how Bellman equations, planning recursions, causal interventions, online regret, and equilibria arise as special cases. The supplementary material develops the reinforcement-learning specialization in more detail.

2605.30693 2026-06-01 cs.CR cs.CL

Triaging Threats to Specialized Guardrails

针对专用护栏的威胁分类

Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu, Sicong Jiang, Zihan Wang, Minglai Yang, Zhe Zhao, Muhao Chen

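AI summary: Proposes the unified GuardZoo benchmark and the RouteGuard router-expert framework, which triages conversations to specialized guardrails to address the task interference a single guardrail suffers across threat domains, improving fine-grained detection and generalization.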

详情
AI中文摘要

构建稳健的安全护栏对于在多样化的实际应用中部署大型语言模型至关重要。然而,这一目标仍然具有挑战性,因为安全风险涵盖异质的威胁领域,而现有数据集仅覆盖零散的风险子集并依赖不一致的分类法。因此,目前尚不清楚现有护栏能否在狭窄的评估设置之外泛化。为了更好地理解护栏模型的稳健性,我们首先引入GuardZoo,一个统一的人工标注基准,包含32,460个样本,覆盖15个不同的不安全类别。在GuardZoo上的评估揭示,单一护栏遭受任务干扰:不同的威胁领域需要不同的决策边界,这些边界难以压缩到单个模型中。因此,我们提出RouteGuard,一个路由-专家框架,将每个对话分流到专门的专家护栏进行威胁特定检测。实验表明,RouteGuard在强护栏基线上提高了细粒度威胁检测,在域外评估下泛化更好,并支持灵活的模块化扩展以应对新兴威胁。

英文摘要

Building robust safety guardrails is essential for deploying Large Language Models across diverse real-world applications. However, this goal remains challenging because safety risks span heterogeneous threat domains, while existing datasets cover only fragmented risk subsets and rely on inconsistent taxonomies. Consequently, it remains unclear whether current guardrails can generalize beyond narrow evaluation settings. To better understand the robustness of guardrail models, we first introduce GuardZoo, a unified human-annotated benchmark with 32,460 samples covering 15 distinct unsafe categories. Evaluation on GuardZoo reveals that monolithic guardrails suffer from task interference: different threat domains require distinct decision boundaries that are difficult to compress into a single model. We therefore propose RouteGuard, a router-expert framework that triages each conversation to specialized expert guardrails for threat-specific detection. Experiments show that RouteGuard improves fine-grained threat detection over strong guardrail baselines, generalizes better under out-of-domain evaluation, and supports flexible modular expansion to emerging threats.

2605.30690 2026-06-01 cs.CL

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

ElasticMem: 将潜在记忆作为LLM智能体的可学习资源

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Ge Liu, Jiaxuan You

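AI summary: Proposes ElasticMem, a framework with a learnable, elastic latent-memory allocation policy that substantially improves accuracy and lowers token overhead on memory-intensive QA and embodied agent control tasks.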

详情
AI中文摘要

长期记忆对于LLM智能体在长时间交互中连贯推理、个性化响应和复用过往经验至关重要。然而,现有的记忆增强方法通常将记忆视为固定资源:文本空间方法将检索到的记忆拼接进上下文窗口,导致大量token开销和对噪声证据的敏感性;而潜在空间方法减少了文本成本,但仍依赖刚性检索或固定容量的记忆接口。这造成了查询相关的记忆效用与固定记忆分配之间的不匹配。我们提出ElasticMem,一种记忆增强的LLM框架,学习将记忆用作弹性潜在资源。ElasticMem构建了一个离线潜在记忆库,包含检索键和内容缓存,从推理器的隐藏状态自适应地检索记忆,通过学得的策略为每个检索到的记忆分配可变潜在预算,并将选中的潜在状态作为软记忆token注入生成过程。整个记忆使用过程通过组相对策略优化与下游任务奖励进行联合优化。我们在MemorySuite上评估ElasticMem,涵盖记忆密集型QA和具身智能体控制。在Qwen2.5-3B-Instruct和Qwen2.5-7B-Instruct骨干网络上,ElasticMem在加权平均QA准确率上分别比最强基线提升26.2%和24.6%,在ALFWorld成功率上分别提升66.3%和27.2%,同时实现了最低的ALFWorld token成本。消融和定性分析进一步表明,自适应检索和弹性预算分配帮助ElasticMem优先考虑有用证据和可迁移计划,超越了刚性余弦相似度。ElasticMem的代码将在https://github.com/ulab-uiuc/ElasticMem发布。

英文摘要

Long-term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory-augmented methods typically treat memory as a fixed resource: text-space approaches concatenate retrieved memories into the context window, causing substantial token overhead and sensitivity to noisy evidence, while latent-space approaches reduce textual cost but still rely on rigid retrieval or fixed-capacity memory interfaces. This creates a mismatch between query-dependent memory utility and fixed memory allocation. We propose ElasticMem, a memory-augmented LLM framework that learns to use memory as an elastic latent resource. ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory-use process is optimized with downstream task rewards through group-relative policy optimization. We evaluate ElasticMem on MemorySuite, covering memory-intensive QA and embodied agent control. Across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, ElasticMem improves weighted average QA accuracy by 26.2% and 24.6%, and improves ALFWorld success rate by 66.3% and 27.2%, respectively, over the strongest baselines, while achieving the lowest ALFWorld token cost. Ablations and qualitative analyses further show that adaptive retrieval and elastic budget allocation help ElasticMem prioritize useful evidence and transferable plans beyond rigid cosine similarity. Our code for ElasticMem will be released at https://github.com/ulab-uiuc/ElasticMem.

2605.30689 2026-06-01 cs.CV cs.AI

ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization

ConTrans:学习文本增强的局部-全局时间表示用于零样本时间动作定位

Kanchan Keisham, Thenukan Pathmanathan, Thangarajah Akilan

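AI summary: To address neglected local correlations and limited feature representation in zero-shot temporal action localization, proposes ConTrans, a multi-scale encoder that combines convolutional inductive biases with transformer self-attention to jointly capture fine-grained local dependencies and long-range global context, significantly outperforming existing methods on ActivityNet-1.3 and THUMOS14.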

详情
Comments
4 figures, 8 tables
AI中文摘要

零样本时间动作定位(ZS-TAL)旨在检测和定位未修剪视频中未见过的动作。然而,现有方法主要关注建模长程上下文信息,常常忽略了视频帧之间基于相对偏移的关键局部相关性。此外,由于网络架构的浅层性,其特征表示能力受限,阻碍了性能提升。在本文中,我们通过引入一种新颖的局部-全局多尺度特征表示模块来解决这些局限性。我们提出了一种新颖的多尺度编码器架构,称为ConTrans,它将卷积(Conv)归纳偏置与Transformer自注意力相结合,以共同捕获细粒度的局部依赖和长程全局上下文,从而比现有方法获得更全面的特征表示。在ActivityNet-1.3和THUMOS14数据集上的实验评估表明,ConTrans显著优于现有方法,为ZS-TAL建立了新的基准。

英文摘要

Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with transformer Self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.

2605.30686 2026-06-01 cs.CR cs.AI cs.LG

Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity

工具调用ReAct代理中深度相关的间接提示注入:注入深度、载荷框架和轮次预算敏感性

Mohammadreza Rashidi

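AI summary: Through four controlled experiments (460 trials in total), studies how injection depth, payload framing, and turn budget affect the success rate of indirect prompt injection attacks on tool-calling ReAct agents, finding that injection depth is the dominant variable and that sanitising only the first tool observation captures 67% of measured injection successes.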

详情
Comments
17 pages, 16 figures
AI中文摘要

将链式推理与工具调用交错的ReAct代理越来越多地用于实际任务,如调度、文件检索和数据访问。它们的工具观察循环创建了一个直接攻击面:控制任何工具返回值的攻击者可以嵌入指令,将代理从用户目标引开,这种威胁称为间接提示注入。现有基准在固定条件下评估固定注入位置的攻击成功率(ASR),留下了三个未探索的风险维度:载荷在工具序列中出现的位置(注入深度)、使用的修辞风格(框架)以及代理允许的轮次数(轮次上限)。我们在五个攻击类别的20个场景中进行了四项对照研究,总共对GPT-4o-mini和Claude Haiku进行了460次试验,总API成本低于0.36美元。研究1显示,GPT-4o-mini的ASR从深度1的60%衰减到深度4和5的0%(Cramer's V = 0.58,p < 0.001;限制在序列内深度1-3:V = 0.47,p = 0.0013),这是由于深度1的模型抵抗和更深位置在遇到载荷前任务完成所致。研究2在Claude Haiku上重复了深度实验,通过保守的工具调用和真正的指令抵抗,在每个深度均实现了0%的ASR。研究3显示,在深度1,框架将ASR调节在25%(中性)到75%(角色)之间,范围达50个百分点,但在每个条件下N=20时未达到统计显著性。研究4确认ASR在3、5和7的轮次上限下稳定,表明轮次预算在此设置中不是风险因素。我们的结果确立了注入深度为主导变量,并表明仅清理第一个工具观察可捕获67%的测量注入成功。

英文摘要

ReAct agents that interleave chain-of-thought reasoning with tool calls are increasingly deployed for real tasks such as scheduling, file retrieval, and data access. Their tool observation loop creates a direct attack surface: an adversary who controls any tool's return value can embed instructions that redirect the agent away from the user's goal, a threat known as indirect prompt injection. Existing benchmarks evaluate attack success rate (ASR) at a fixed injection position under fixed conditions, leaving three risk dimensions unexplored: where in the tool sequence the payload appears (injection depth), what rhetorical register it uses (framing), and how many turns the agent is permitted (turn cap). We conduct four controlled studies on 20 scenarios spanning five attack categories, totalling 460 trials against GPT-4o-mini and Claude Haiku at a combined API cost under 0.36 USD. Study 1 shows that ASR against GPT-4o-mini decays from 60% at depth 1 to 0% at depths 4 and 5 (Cramer's V = 0.58, p < 0.001; restricted to within-sequence depths 1-3: V = 0.47, p = 0.0013), driven by model resistance at depth 1 and task completion before payload encounter at deeper positions. Study 2 replicates the depth experiment on Claude Haiku, which achieves 0% ASR at every depth through a combination of conservative tool invocation and genuine instruction resistance. Study 3 shows that framing modulates ASR between 25% (neutral) and 75% (persona) at depth 1, a 50-percentage-point range that does not reach statistical significance at N = 20 per condition. Study 4 confirms that ASR is stable across turn caps of 3, 5, and 7, indicating the turn budget is not a risk factor in this setting. Our results establish injection depth as the dominant variable and show that sanitising only the first tool observation captures 67% of measured injection successes.
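The effect sizes quoted above are Cramér's V values, which are derived from a chi-square statistic on a contingency table (here, injection depth by attack outcome). The sketch below shows the standard uncorrected formula, V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The counts in the usage example are invented for illustration and are not the paper's data; whether the authors applied a bias correction is not stated in the abstract.

```python
def cramers_v(table):
    """Cramér's V for an r x c contingency table of non-negative counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals))
    if k < 2 or 0 in row_totals or 0 in col_totals:
        raise ValueError("need at least a 2 x 2 table with no empty row or column")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return (chi2 / (n * (k - 1))) ** 0.5


if __name__ == "__main__":
    # Hypothetical counts: rows are injection depths, columns are (success, failure).
    made_up = [[12, 8], [6, 14], [2, 18]]
    print(round(cramers_v(made_up), 3))
```

V ranges from 0 (no association) to 1 (outcome fully determined by depth), which makes it comparable across tables of different sizes.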

2605.30685 2026-06-01 cs.CY cs.AI cs.CL cs.HC

How Early Adopters Used Generative AI Worldwide: Variation by Country Income and Language

早期采用者如何在全球范围内使用生成式AI:按国家收入和语言的差异

Madeleine I. G. Daepp, Isaac Slaughter

AI总结 基于大规模匿名化AI聊天机器人交互数据,实证分析了不同国家早期采用者在使用生成式AI上的差异,发现教育用途在低收入国家更普遍,休闲用途与收入正相关,且英语交互在非英语主导国家中过度代表,表明语言性能改进可能影响数字鸿沟或跨越式发展。

详情
AI中文摘要

全球范围内人们正在使用AI,但并非所有人都以相同的方式使用。利用一个广泛可用的免费AI聊天机器人的大规模匿名化、去标识化和隐私清洗的交互数据集,我们实证描述了不同国家早期采用者使用情况的差异。在大多数国家,尤其是低收入国家,教育是最常见的使用领域,教育使用与国家GDP之间存在明显的负相关。相比之下,休闲相关使用与国家收入水平正相关。我们发现,语言也塑造了使用模式:在研究期间,英语交互在那些主要语言未被现有模型很好服务的地区过度代表。我们的工作表明,改善跨语言性能可能是决定这项技术扩大数字鸿沟还是实现跨越式发展的关键因素。

英文摘要

AI is being used by people globally, but not everyone is using it in the same ways. Using a large-scale dataset of anonymized, de-identified, and privacy-scrubbed interactions with a widely available and free AI chatbot, we empirically characterize differences in early adopters' usage across countries. Schooling is the most common domain of use in most countries, particularly low-income countries, with a strong inverse association evident between schooling and country-level GDP. Leisure-related use, by contrast, is positively associated with country-level income. Language, we find, also shapes use: English-language interactions are overrepresented in places where the predominant languages were not well-served by existing models during the period of the study. Improving performance across languages may be a key factor, our work suggests, in whether this technology expands digital divides or enables leapfrogging.

2605.30680 2026-06-01 cs.AI cs.MA

Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response

战略提供者响应下的策略即代码搜索中的医疗机制

Zihan Wang, Xiang Xu, Hongyuan Zha, Wenhao Li

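AI summary: Recasts healthcare mechanism design as program synthesis for language models, evaluates the equilibria induced by strategic provider responses with the multi-agent simulator Medi-Sim, and uses LLM-guided evolutionary code search to synthesize an inspectable mixed-objective program.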

详情
Comments
32 pages, 18 figures, 4 tables
AI中文摘要

医疗机制与它们所引发的战略提供者响应密不可分:现有的医疗AI基准固定了这种响应,因此无法通过它们产生的均衡来评估机制。我们将医院机制设计重新定义为语言模型的程序合成:类型化、可检查的规则程序由Medi-Sim执行和评分,Medi-Sim是一个具有五个战略提供者渠道(编码、选择、延迟、努力、分诊)的多智能体模拟器。激励扫描恢复了经典的健康经济学发现作为相邻制度——在利润压力下的过度编码和低复杂度患者选择,以及古德哈特式漂移,其中测量绩效与真实结果呈负相关——而单个审计杠杆暴露了压力迁移:关闭编码渠道使低复杂度选择增加一倍以上。LLM引导的进化代码搜索在相同的规则程序空间上合成一个可检查的混合目标程序,该程序消除了过度编码,将拒绝率减半,并保留了大部分以利润为导向的基线的资金。

英文摘要

Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes -- and a single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline's funds.

2605.30677 2026-06-01 cs.CR cs.AI cs.SE

Investigating Detection and Obfuscation of Prompt Injection Attacks Against Software Reverse Engineering AI Agents

针对软件逆向工程AI代理的提示注入攻击的检测与混淆研究

Brian Crawford, Patrick McClure

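AI summary: Addresses prompt injection attacks on AI agents for software reverse engineering, presenting defensive tactics for detecting prompt injection strings in decompiler output and exploring attack obfuscation and the corresponding defenses.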

详情
AI中文摘要

代理型软件逆向工程系统容易受到嵌入可执行二进制文件源代码中的提示注入攻击。本研究展示了检测对抗性示例程序的反编译器输出中提示注入字符串存在的防御策略。还探讨了混淆这些攻击的方法以及随后防御这些混淆的方法。本研究推进了对代理型软件分析系统风险和安全性的理解,这对于将其部署到生产级网络工作流中是必要的。

英文摘要

Agentic software reverse engineering systems are vulnerable to prompt injection attacks placed into the source code of executable binary files. This research demonstrates defensive tactics for detecting the presence of prompt injection strings in the decompiler output of adversarial example programs. Methods for obfuscating these attacks and subsequent methods for defending against these obfuscations are also explored. This research advances the understanding of the risk and security of agentic software analysis systems, which is necessary for their deployment into production-level cyber workflows.

2605.30675 2026-06-01 cs.CL cs.AI

Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty

大型语言模型不确定性中的人类对齐、校准与激活模式

Kyle Moore, Jesse Roberts, Daryl Watson, William Ward, Grayson Heyboer

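AI summary: Studies how similar large language model uncertainty is to human uncertainty by analyzing overt behavior and internal activation patterns, examines whether models show alignment and calibration simultaneously on multiple-choice and open-ended factual recall datasets, and characterizes the effect of instruction fine-tuning.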

详情
AI中文摘要

不确定性量化是大型语言模型行为分析中一个庞大且不断发展的子领域。为了识别和对抗幻觉,该领域主要关注测量和改进校准,即不确定性判断对任务效能的准确性。在这项工作中,我们探讨了一个相对未被充分探索的问题:大型语言模型的不确定性与人类不确定性有多相似。我们研究了大型语言模型的外部行为和内部激活模式中是否存在类似人类的不确定性信号,即不确定性对齐。我们识别了模型在涵盖多项选择和开放式事实回忆的多种数据集上是否同时表现出对齐和校准的证据。并且我们描述了指令微调对这些方面的影响。

英文摘要

Uncertainty quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat hallucination, the field has largely focused on measuring and improving calibration, that is, how accurately uncertainty judgments track task performance. In this work, we investigate the relatively underexplored question of how similar large language model uncertainty is to human uncertainty. We investigate the presence and strength of human-similar uncertainty signals, termed uncertainty alignment, in large language model overt behavior and internal activation patterns. We identify whether the models show evidence of simultaneous alignment and calibration on a variety of datasets covering both multiple-choice and open-ended factual recall. We also characterize the effect of instruction fine-tuning on each of these facets.

2605.30673 2026-06-01 cs.CL

TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation

TeachObs:多模态教学观察与模型评估的人工验证基准

Yeil Jeong, Youngjin Yoo, Seobin Sohn, Hyejin Han, Jinseo Lee, Scott Howard, Unggi Lee

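AI summary: Presents the TeachObs benchmark, comprising 5,158 fifteen-second scenes from 30 public lesson videos annotated by seven researchers with 39 binary observation codes, and evaluates five vision-capable frontier LLMs across three tasks.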

详情
AI中文摘要

课堂视频包含可观察的教学实践,但其教学和视觉信号很少以适合模型评估的形式组织。我们提出了\textit{TeachObs},一个用于课堂视频中多模态教学观察的人工验证基准。\textit{TeachObs}包含来自八个国家的30节公开课视频,分为5158个固定的15秒场景。七名研究者用39个二值观察码标注每个场景,涵盖20个视觉码(如手势、板书、指向和视觉材料)和19个非视觉码(如指导、监控、提问、反馈和反思)。基于Krippendorff's alpha,使用可靠性和流行度感知规则构建黄金片段标签。除了片段级标签,三名专家评分员对30节课进行了课程级评分和定性评估,涵盖教学设计、教学实施、学生反应、学习材料和课程收尾,评分员覆盖范围详见正文。利用这两个人工参考层,我们在三个任务上评估了五个具备视觉能力的前沿大语言模型(纯文本片段编码、文本+帧片段编码,以及在LLM作为评判协议下的课程级覆盖评分),发现没有单一模型在所有三个任务中持续优于其他模型,添加中间帧会同时增加每个场景的真实和虚假归因,并且模型评估相对于专家评分员高估了程序清晰的课程。因此,\textit{TeachObs}支持细粒度标注基准测试和整课评估,展示了AI系统在哪些方面可以辅助课堂视频分析,以及在哪些方面专家判断仍然必要,涵盖不同学科、课堂形式和标注难度水平。

英文摘要

Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present \textit{TeachObs}, a human-validated benchmark for multimodal teaching observation in classroom videos. \textit{TeachObs} includes 30 public lesson videos from eight countries divided into 5,158 fixed 15-second scenes. Seven researchers annotated each scene with 39 binary observation codes, covering 20 visual codes, such as gesture, board work, pointing, and visual materials, and 19 nonvisual codes, such as instruction, monitoring, questioning, feedback, and reflection. Gold segment labels are constructed using reliability- and prevalence-aware rules based on Krippendorff's alpha. In addition to segment-level labels, three expert raters produced lesson-level ratings and qualitative evaluations of instructional design, instructional delivery, learner response, learning materials, and lesson closure across the 30 lessons, with rater coverage detailed in the body. Using these two human reference layers, we evaluate five vision-capable frontier LLMs across three tracks - text-only segment coding, text + frame segment coding, and lesson-level coverage scored under an LLM-as-judge protocol - and find that no single model consistently outperforms others across all three tracks, that adding a mid-frame inflates both true and false attributions per scene, and that model evaluations over-rate procedurally clear lessons relative to expert raters. \textit{TeachObs} therefore supports both fine-grained annotation benchmarking and whole-lesson evaluation, showing where AI systems can assist classroom video analysis and where expert judgment remains necessary across varied subjects, classroom formats, and annotation difficulty levels.
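Since the gold labels rest on Krippendorff's alpha, a compact sketch of the nominal-data version of the statistic may help when reading the reliability rules. It follows the standard coincidence-matrix definition (alpha = 1 - D_o / D_e) and supports missing ratings via `None`; the function name and the toy ratings are illustrative assumptions, and the benchmark's actual reliability- and prevalence-aware thresholds are not reproduced here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal codes.

    units: iterable of per-scene rating tuples, one entry per rater,
           with None marking a missing rating.
    """
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with a single rating carries no agreement information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is ever used")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Two hypothetical raters coding four scenes with one binary code.
    print(krippendorff_alpha_nominal([(1, 1), (0, 0), (1, 0), (0, 1)]))
```

An alpha of 1 means perfect agreement and values near 0 mean agreement no better than chance, which is why rare codes with low prevalence can score poorly even when raters mostly agree.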