arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06787 2026-06-08 cs.AI New submission

AdMem: Advanced Memory for Task-solving Agents

Runzhe Wang, Huilin Lu, Shengjie Liu, Li Dong, Jason Zhu

Affiliations: Princeton University; Amazon; Arm

AI summary: Proposes a unified, automatic memory framework that integrates semantic, episodic, and procedural memory. A bi-level design and a multi-agent architecture provide automatic memory generation, reward annotation, and adaptive retrieval, improving robustness and success rates on long-horizon multi-turn tasks.

Abstract

Large Language Models (LLMs) show promise as tool-using agents but remain limited in long-horizon tasks that require remembering, organizing, and reusing knowledge. Prior memory approaches aim to address this limitation but mainly focus on storing factual information. Recent work on procedural memory improves task reuse, yet often reduces to replaying past successes without addressing failure cases or online scalability. We introduce a unified and automatic memory framework that integrates semantic, episodic, and procedural memory in a bi-level design combining short-term and long-term stores. A multi-agent architecture with actor, memory, and critic agents enables automatic memory generation, reward annotation, and adaptive retrieval. Long-term memory is managed through reward-based evaluation, merging, and pruning, ensuring scalability and continual improvement. Experiments across various environments show that our approach improves robustness and success on long multi-turn tasks compared to existing baselines. This work highlights the importance of comprehensive, adaptive memory for advancing LLM-based agents.

2606.06781 2026-06-08 cs.CL New submission

When Better Codebooks Are Not Enough: Predictive Performance and Behavioral Reliability in LLM Political Event Coding

Zixian He, Bharath Raahul Murugesan, Patrick Brandt, Yibo Hu

Affiliations: Independent Researcher; Illinois Institute of Technology; The University of Texas at Dallas

AI summary: Studies political event coding and finds that operationalizing expert codebooks into LLM-friendly forms substantially improves classification performance, but the predictive gains do not fully translate into behavioral reliability: models can still fail under controlled codebook changes.

Comments
14 pages, 3 figures, 11 tables

Abstract

High accuracy does not necessarily make an LLM a faithful coder. This issue matters because many social-science studies rely on expert-written codebooks to turn text into structured data. We study this problem in political event coding, a challenging source-target relation classification task beyond ordinary sentence-level classification, where models must determine what one actor did to another using detailed coding rules. We test whether expert codebooks become more effective when operationalized into LLM-friendly forms with clearer definitions, examples, retrieved context, and rules for difficult cases. We then evaluate behavioral reliability under controlled changes to label names, codebook order, and label-definition mappings. Clearer codebooks substantially improve classification performance, especially for fine-grained event classification. However, these predictive gains do not fully translate into behavioral reliability. Models may produce valid labels and recover definitions while still failing behavioral reliability tests under controlled codebook changes. These findings suggest that codebook-guided LLM systems should be evaluated not only by accuracy, but also by whether they preserve the coding logic that makes coded outputs meaningful for social-science research.

2606.06776 2026-06-08 cs.LG New submission

A Rolling-Window Framework for Churn Prediction and Behavioral Driver Identification

Muhammad Jawad Mufti, Omar Hammad, Haitham Saleh, Muqaddas Gull

Affiliations: Information and Computer Science Department, King Fahd University of Petroleum and Minerals; Interdisciplinary Research Center for Smart Mobility and Logistics (IRC-SML), King Fahd University of Petroleum and Minerals; SDAIA–KFUPM Joint Research Center for Artificial Intelligence, King Fahd University of Petroleum and Minerals

AI summary: Proposes a churn prediction framework based on rolling behavioral windows that supports continuous risk assessment in non-contractual service settings. The feature-based model reaches 87.6% accuracy and 0.94 ROC-AUC, and the sequence-based model reaches 96.1% recall.

Abstract

Customer churn prediction is a central task in customer analytics, particularly in non-contractual, pay-per-use service environments where disengagement is not explicitly observed and must be inferred from behavioral inactivity. Existing churn prediction approaches often rely on simplified temporal assumptions or single-point representations of customer behavior, which limit their ability to support continuous risk assessment, interpretability, and realistic deployment over time. This study proposes a temporally explicit churn prediction framework that models customer behavior using rolling behavioral windows, enabling repeated and instance-level churn risk estimation as customer activity evolves. Customer behavior is summarized within a fixed 30-day observation window, followed by a 30-day future churn evaluation window, ensuring a clear temporal separation between behavioral evidence and churn outcomes. The framework integrates feature-based and sequence-based learning approaches within a unified temporal design. The proposed approach is evaluated on a large-scale, real-world dataset from a non-contractual service platform. Empirical results demonstrate strong and stable predictive performance, with accuracy reaching 87.6% and ROC-AUC of 0.94 for the feature-based model, while the sequence-based model achieves recall as high as 96.1% by capturing temporal disengagement patterns. Evaluation on future unseen data confirms meaningful robustness under temporal shift, with accuracy remaining above 83% and ROC-AUC exceeding 0.91 without model retraining. Overall, the findings highlight that carefully designed temporal framing, rather than model complexity alone, is critical for achieving robust, interpretable, and deployment-ready churn prediction. The study provides a practical foundation for churn-oriented decision support in dynamic service environments.
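
The windowing described in the abstract can be illustrated with a minimal sketch. The 30-day window lengths come from the abstract; the labeling rule (churn means no activity at all in the evaluation window), the single feature, the 30-day step between cutoffs, and all names are assumptions made for illustration, not the authors' implementation.

```python
from datetime import date, timedelta

OBSERVATION_DAYS = 30  # observation window length stated in the abstract
EVALUATION_DAYS = 30   # future churn evaluation window stated in the abstract


def rolling_instances(activity_dates, first_cutoff, last_cutoff, step_days=30):
    """Yield (cutoff, n_active_days, churned) for each rolling cutoff date.

    Assumed labeling rule: a customer counts as churned when no activity
    falls inside the evaluation window that follows the cutoff.
    """
    days = sorted(set(activity_dates))
    cutoff = first_cutoff
    while cutoff <= last_cutoff:
        obs_start = cutoff - timedelta(days=OBSERVATION_DAYS)
        eval_end = cutoff + timedelta(days=EVALUATION_DAYS)
        # Behavioral evidence: strictly before the cutoff.
        observed = [d for d in days if obs_start <= d < cutoff]
        # Outcome: on or after the cutoff, so the two windows never overlap.
        future = [d for d in days if cutoff <= d < eval_end]
        if observed:  # only score customers who were active while observed
            yield cutoff, len(observed), len(future) == 0
        cutoff += timedelta(days=step_days)


activity = [date(2025, 1, 5), date(2025, 1, 20), date(2025, 2, 10)]
instances = list(rolling_instances(activity, date(2025, 2, 1), date(2025, 4, 1)))
for cutoff, n_active, churned in instances:
    print(cutoff, n_active, churned)
```

Because each cutoff produces its own instance, the same customer is scored repeatedly as activity evolves, and no feature can see data from the window that defines its label.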

2606.06762 2026-06-08 cs.RO New submission

Multi-Robot Planning and Control from CCTV Camera Networks in a Real Warehouse

Luke Robinson, Benjamin Ramtoula, Anas Izaaryene, Paul Newman, Daniele De Martini

Affiliations: Oxford Robotics Institute, University of Oxford, UK; Robot Systems Group, Technical University of Munich, Germany

AI summary: Proposes coordinated multi-robot planning and control using only a distributed CCTV network and edge compute, validated in a real warehouse with four robots and 30 cameras. To the authors' knowledge, it is the first field demonstration of multi-robot coordination that relies only on an external camera network.

Abstract

Off-board control of mobile robots from cameras embedded in the environment offers a practical path to scalable autonomy, moving sensing and compute off the robots. We extend this idea from the single-robot case to coordinated fleets in a real warehouse, driving multiple robots with only a distributed CCTV network and edge compute. The system operates entirely in image space over an uncalibrated, pixel-wise topological camera graph, enabling wide-area operation with flexible camera placement. A hierarchical planner selects a camera sequence per robot and plans its image-space motion through each view, coordinating robots with a prioritised-then-joint strategy and treating overlapping camera regions as shared resources held by one robot at a time to prevent collisions and deadlocks. We validate the approach in a real warehouse with four robots and 30 cameras across six 27 m aisles, reporting mission times and coordination statistics. To our knowledge, this is the first field demonstration of multi-robot planning and coordination using only an external camera network and off-board compute, with robots carrying no task-specific navigation hardware.
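
The abstract treats overlapping camera regions as shared resources held by one robot at a time. A minimal sketch of that mutual-exclusion idea follows. The class name, region identifiers, and robot names are hypothetical, and the paper's planner (the prioritised-then-joint strategy that also addresses deadlocks) is not reproduced here.

```python
class RegionTable:
    """Tracks which robot currently holds each overlapping camera region."""

    def __init__(self):
        self._holder = {}  # region id -> robot id

    def try_acquire(self, region, robot):
        """Grant the region if it is free or already held by this robot."""
        holder = self._holder.setdefault(region, robot)
        return holder == robot

    def release(self, region, robot):
        """Free the region, but only when called by its current holder."""
        if self._holder.get(region) == robot:
            del self._holder[region]


table = RegionTable()
print(table.try_acquire("cam3|cam4", "robot_a"))  # first requester gets the region
print(table.try_acquire("cam3|cam4", "robot_b"))  # second requester must wait
table.release("cam3|cam4", "robot_a")
print(table.try_acquire("cam3|cam4", "robot_b"))  # now free to enter
```

A table like this only prevents two robots from occupying the same overlap; avoiding deadlock additionally depends on the order in which a planner requests regions.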

2606.06761 2026-06-08 cs.RO cs.AI New submission

AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation

Jiyun Jang, Yujin Sung, Woosung Joung, Daewon Chae, Sangwon Lee, Sohwi Kim, Jinkyu Kim, Jungbeom Lee

Affiliations: Korea University; University of Michigan; KT R&D Center; Kakao Mobility

AI summary: Targets action-execution failures of visuomotor policies under distribution shift. AxisGuide renders the robot base-frame axes and overlays them as cue channels to strengthen action-coordinate understanding, substantially improving generalization.

Comments
Accepted to Robotics: Science and Systems (RSS) 2026

Abstract

Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene understanding, yet often fail to reliably execute correct low-level actions under distribution shifts. For example, even in a simple pickup task with identical scene layouts, camera viewpoints, and illumination, performance can degrade substantially when the object is placed at unseen locations. We argue that this gap arises from insufficient action understanding, namely the inability to interpret the robot's base-frame action coordinate system in image space. To address this issue, we introduce AxisGuide, a lightweight guidance method that bridges semantic scene understanding and action-coordinate interpretation. Using camera parameters and end-effector poses, AxisGuide renders the robot base-frame axes in each camera view and augments RGB observations with a small set of cue channels that explicitly visualize the meaning of the +x, +y, and +z motions in image space. Extensive evaluations in both the LIBERO simulation and real-world environments demonstrate that AxisGuide yields substantial performance gains and improved generalization, highlighting the effectiveness of explicit action-coordinate cues for learning reliable and transferable generalist visuomotor policies.
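
To make the rendering step concrete, here is a minimal sketch of projecting the base-frame +x, +y, and +z directions into one camera view with a standard pinhole model. The function names, the axis length, and the example camera are assumptions; how AxisGuide rasterizes the segments into cue channels is not specified in the abstract and is not shown.

```python
def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point in the camera frame to pixels."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def base_to_camera(point_base, rotation, translation):
    """Apply the rigid transform p_cam = R * p_base + t using plain lists."""
    return tuple(
        sum(rotation[i][j] * point_base[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def axis_segments(ee_pos_base, rotation, translation, intrinsics, length=0.1):
    """Pixel segments for the +x, +y, +z base-frame directions,
    each starting at the end-effector position."""
    fx, fy, cx, cy = intrinsics
    origin_px = project(base_to_camera(ee_pos_base, rotation, translation), fx, fy, cx, cy)
    segments = {}
    for name, axis in (("+x", (1, 0, 0)), ("+y", (0, 1, 0)), ("+z", (0, 0, 1))):
        tip = tuple(p + length * a for p, a in zip(ee_pos_base, axis))
        tip_px = project(base_to_camera(tip, rotation, translation), fx, fy, cx, cy)
        segments[name] = (origin_px, tip_px)
    return segments


# Hypothetical camera: aligned with the base frame, end effector 1 m in front of it.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
segs = axis_segments((0.0, 0.0, 0.0), identity, (0.0, 0.0, 1.0), (500.0, 500.0, 320.0, 240.0))
for name, (start, end) in segs.items():
    print(name, start, end)
```

In this example the +z segment collapses to a single pixel because the axis points along the optical axis, which shows why the drawn axes depend on the camera viewpoint.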

2606.06755 2026-06-08 cs.CL cs.ET New submission

PromptPrint: Behavioral Biometrics Through Natural Language Prompting in LLMs

Shaiv Patel, Kartik Narayan, Vishal Patel

Affiliations: Johns Hopkins University

AI summary: Introduces PromptPrint, a study of whether short user prompts to LLMs carry an identifiable behavioral biometric. Analysis of lexical, syntactic, and discourse patterns supports the lexical stability hypothesis, reveals a uniqueness-consistency paradox, and shows that identity signals are fragile under semantic paraphrasing.

Comments
10 pages, 6 figures

Abstract

Authorship attribution research has traditionally focused on long-form, expressive texts; however, interactions with large language models (LLMs) are typically brief and task-driven prompts. This raises a fundamental question: do such prompts contain a stable, author-identifiable, and distinctive signal? We introduce PromptPrint, a systematic study of prompt-based identity, the hypothesis that a user's habitual vocabulary, syntax, and discourse patterns form a learnable behavioral biometric. Using 20,680 real prompts from 1,034 users, we establish three key findings. First, lexical representations significantly outperform semantic encoders, supporting the "lexical stability hypothesis": identity is primarily encoded in surface-level word choice rather than abstract intent. Second, stylometric features exhibit a "uniqueness-consistency paradox": users are highly distinctive across the population, yet behaviorally inconsistent across contexts. Third, adversarial analysis reveals a clear vulnerability spectrum: identity signals are robust to minor lexical perturbations but degrade substantially under semantic paraphrasing. Overall, our results demonstrate strong identification performance at scale, establishing prompt-based identity as a viable behavioral biometric. This work introduces a new perspective on user modeling in LLM interactions, with important implications for security and privacy. Data and code will be released upon the acceptance of our work.

2606.06746 2026-06-08 cs.LG New submission

Performance Variation in Deep Reinforcement Learning

Haruto Tanaka, A. Rupam Mahmood

Affiliations: Department of Computing Science, University of Alberta; Alberta Machine Intelligence Institute (Amii); CIFAR AI Chair

AI summary: Addresses the low run-to-run robustness of deep reinforcement learning algorithms by proposing a percentile-based statistic (min-max IPR) and a visualization method (run-wise percentile highlighting) for evaluating performance variation, demonstrated through three case studies.

Abstract

Deep reinforcement learning (RL) algorithms often suffer from low run-to-run robustness, manifesting as significant performance variation across independent runs of identically configured agents. Although this issue poses a spectrum of challenges across research and practice, relatively few studies develop methods to evaluate it; RL research instead often reports uncertainty in the estimated mean performance. In this paper, we outline the limitations of conventional uncertainty and variation estimates, particularly their misalignment with purpose and the risk of underreporting. We then propose an alternative percentile-based statistic and visualization method, min-max IPR and run-wise percentile highlighting, respectively. These percentile-based tools are easy to interpret and rely on standard properties of sample percentiles, providing rich information about run-to-run performance variation. We demonstrate this through three case studies. First, we show that LayerNorm and penultimate-layer normalizations narrow performance variation in PPO, whereas the variation is mostly unchanged in SAC. Second, we compare PPO, SAC, TD-MPC, and TD-MPC2, and show TD-MPC exhibits the least variation while being the most data efficient among the four. Finally, in a comparison of DQN and Rainbow on five Atari environments, we show that both algorithms exhibit similar levels of performance variation.

2606.06745 2026-06-08 cs.CL New submission

When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

Zhixuan He, Yue Feng

Affiliations: University of Birmingham, United Kingdom

AI summary: Proposes the IDPR framework, in which an inhibition controller decides from the fast answer whether to invoke slow reasoning. On a mathematical reasoning test set it invokes slow reasoning on only 8.20% of examples and raises accuracy from 47.90% to 48.92%.

Abstract

Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework for response-conditioned inhibitory deliberation. IDPR first generates a concise intuitive answer and then uses an inhibition controller to decide whether that specific response should be released or suppressed in favor of slow reasoning. Unlike input-only routers, the inhibition controller conditions on the fast answer and fast-side evidence, including confidence, logit margin, parseability, and generation cost. We train the controller from paired fast-slow outcomes and select the inhibition threshold on a held-out validation set under an accuracy-first slow-call budget. On a held-out 5,000-example mathematical reasoning test set, IDPR invokes slow reasoning on only 8.20% of examples and improves accuracy from 47.90% to 48.92%. Under the same slow-call budget, random routing decreases accuracy to 46.76%, while the strongest confidence-based baseline reaches 48.22%. IDPR also achieves the highest corrective precision, showing that response-conditioned inhibition better identifies fast answers that benefit from slow reasoning.
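
The release-or-suppress decision can be sketched as a scored gate with a threshold chosen on validation data. The logistic form, the weights, and the simplified budget rule (cap the fraction of slow calls) are assumptions for illustration; the abstract names the input signals but does not specify the controller or how the accuracy-first budget is applied.

```python
import math

# Hypothetical weights over the fast-side signals listed in the abstract.
HYPOTHETICAL_WEIGHTS = {
    "confidence": -3.0,
    "logit_margin": -1.0,
    "parseable": -2.0,
    "generation_cost": 0.5,
}


def inhibition_score(features, weights, bias):
    """Assumed logistic controller: probability that the fast answer
    should be suppressed in favor of slow reasoning."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def threshold_for_budget(validation_scores, budget):
    """Pick a threshold whose slow-call rate on validation stays within budget."""
    ranked = sorted(validation_scores, reverse=True)
    allowed = int(budget * len(ranked))  # number of slow calls we can afford
    if allowed >= len(ranked):
        return 0.0
    # Only scores strictly above the first score outside the budget are suppressed.
    return ranked[allowed]


def route(score, threshold):
    """Suppress the fast answer (call slow reasoning) or release it."""
    return "slow" if score > threshold else "release"


validation_scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4, 0.05, 0.6, 0.7, 0.15]
threshold = threshold_for_budget(validation_scores, budget=0.2)
decisions = [route(s, threshold) for s in validation_scores]
print(threshold, decisions.count("slow"))
```

Unlike an input-only router, the score here is computed after the fast answer exists, so signals such as parseability and generation cost are available to the gate.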

2606.06743 2026-06-08 cs.SD cs.AI cs.CL New submission

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

Arjun Gangwar, S Umesh

Affiliations: Indian Institute of Technology, Madras

AI summary: Proposes HybridCodec, a unified neural audio codec that combines semantic distillation with a dual-stream architecture, achieving strong disentanglement, cross-lingual robustness, and a 3x speedup.

Comments
5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

Abstract

The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main approaches to introduce semantic information into codec models: one distills semantic information from SSL representations into the first RVQ layer, while the other maintains separate streams for semantic and acoustic features. We propose HybridCodec, a unified architecture that combines both paradigms. It employs separate semantic and acoustic branches while distilling SSL representations into the semantic stream. This design ensures strong disentanglement without requiring an SSL model during inference. HybridCodec shows superior semantic specialization (RVQ-1) on the in-domain test set and competitive reconstruction (RVQ-all). We demonstrate its robustness in out-of-domain and zero-shot cross-lingual settings, achieving a 3x speedup over existing dual-stream models.

2606.06741 2026-06-08 cs.AI cs.CL cs.LG New submission

OpenSkill: Open-World Self-Evolution for LLM Agents

Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun

Affiliations: Lehigh University; University of Illinois Chicago; University of British Columbia; Vector Institute; Salesforce AI Research; Massachusetts General Hospital and Harvard Medical School

AI summary: Proposes OpenSkill, a framework in which agents, without target-task supervision, bootstrap skills and verification signals from open-world resources in order to self-evolve, attaining the best automated pass rate across multiple benchmarks.

Comments
20 pages, 4 figures and 8 tables. Code is available at https://github.com/OpenLAIR/OpenSkill

Abstract

Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work, we study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.

2606.06740 2026-06-08 cs.SD cs.AI cs.CL New submission

Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

Naman Kothari, Arjun Gangwar, Adarsh Arigala, S Umesh

Affiliations: National Institute of Technology, Trichy; Indian Institute of Technology, Madras

AI summary: Analyzes a BigVGAN-based unit vocoder for multilingual multi-speaker speech generation and finds that cluster size governs intelligibility, explicit speaker conditioning prevents identity collapse, and language supervision helps at lower cluster sizes.

Comments
5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

Abstract

Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. Despite growing use in Audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results show that cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision yields further gains mainly at lower cluster sizes where units remain ambiguous. Our analysis shows that similar phonemes across languages collapse to the same cluster IDs at smaller inventories, with larger inventories progressively separating them.

2606.06738 2026-06-08 cs.CL New submission

Modular Monolingual Adaptation using Pretrained Language Models

Nalin Kumar, Ondřej Dušek

Affiliations: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

AI summary: Proposes a modular approach that replaces the tokens, freezes the corresponding embeddings, and tunes the rest of the model, improving NLU task performance on low-resource languages relative to full-model fine-tuning.

Comments
Accepted to ACL 2026 Industry Track

Abstract

Building monolingual language models (LMs) for low-resource languages typically relies on adapting pretrained language models (PLMs) by finetuning the whole model on the target language. This approach is widely favored over training from scratch, as it enables effective knowledge transfer. Additionally, prior work has shown that using a language-specific tokenizer can enhance the adaptability. In this work, we hypothesize that full model tuning is often unnecessary and propose a more modular approach. Specifically, we replace the tokens, freeze the corresponding embeddings, and tune the rest of the model. We use Scottish Gaelic, Irish, and Quechua for our experiments, with Quechua being a very low-resource language (8.5k training instances). Evaluation on natural language understanding (NLU) tasks -- mask filling, NER, and POS -- shows that our proposed approach improves performance when adapting models to low-resource languages. Additionally, we provide a comprehensive analysis of the effectiveness of training strategies, the choice of pretrained embeddings, and models.

2606.06727 2026-06-08 cs.RO cs.SY eess.SY New submission

IDDMBSE: Integrating Data-Driven and Model-Based Systems Engineering for Trusted Autonomous Cyber-Physical Systems

John S. Baras, Sai Sandeep Damera, Ryan Matheu, Clinton Enwerem, Praveen M. S. Kumar

Affiliations: Institute for Systems Research, University of Maryland, College Park

AI summary: Proposes IDDMBSE, a methodology that combines the MBSE V-process with data-driven loops, implemented through the open-source tool chain PERFECT, TRADES-X, and VERITAS and demonstrated across the development lifecycle of an autonomous ground robot.

Comments
9 pages, 11 figures. This work has been submitted to the IEEE for possible publication

Abstract

Autonomous cyber-physical systems (CPS) sit at the intersection of Model-Based Systems Engineering (MBSE) and data-driven Machine Learning and Artificial Intelligence (ML/AI), yet no integrated Systems Engineering (SE) methodology natively spans both. We address this gap with IDDMBSE, an Integrated Data-Driven and Model-Based Systems Engineering methodology that extends the rigorous MBSE V-process with a data-driven loop at every step, anchored in SysML, the autonomy stack, and a hybrid model-based plus data-driven trade-off architecture. We instantiate IDDMBSE as an interoperable, open-source tool chain: PERFECT, which maps SysML system architectures to executable ROS autonomy stacks for scalable performance evaluation; TRADES-X, which decomposes design-space exploration into a model-based optimization stage followed by a data-driven evaluation stage; and VERITAS, which combines formal, data-driven, and runtime verification into a single assurance workflow. We demonstrate IDDMBSE on a Trusted Autonomous Ground Robot across its development lifecycle, spanning sensor-suite selection, risk-sensitive path planning, behavior-tree task verification, conformal-prediction-based robust perception, and assured multi-robot coordination, all exercised in a contested-terrain Isaac Sim test range that we release with the tool chain. We close by sketching how IDDMBSE is being re-formulated on SysML v2 / KerML foundations to enable language-native composability and tighter ML/AI integration.

2606.06724 2026-06-08 cs.LG New submission

Synthics: Synthetic Physics-like Datasets for Machine Learning

Jari Vepsäläinen

Affiliations: Aalto University

AI summary: Generates structurally similar synthetic regression datasets from an equation corpus using a Bayesian probabilistic context-free grammar, with non-intrusive probing to determine physically valid domains. Statistical validation shows higher structural fidelity than a purely probabilistic grammar, and downstream hyperparameter tuning on the synthetic data performs close to tuning on real data.

Abstract

Representative data is fundamental in machine learning, as limited data hinders generalisation. Collecting sufficient real-world samples is often infeasible. Synthetic data generation offers a practical solution, but only if the generated data faithfully reflects the structure of real observations. In this paper, a method for generating synthetic regression datasets that structurally resemble physics equations from a given equation corpus is presented. The approach uses a Bayesian Probabilistic Context-Free Grammar to capture the underlying algebraic structure of the corpus, from which novel equations are sampled. To ensure the generated inputs lie within a physically meaningful domain, the applicability domain is characterised for each equation through non-intrusive probing, also recovering inter-variable constraints. Input sampling further mimics realistic experimental conditions by drawing from random sub-ranges of the valid domain with mixed uniform and truncated normal distributions. The generated data is statistically validated against the Feynman equation corpus using Kolmogorov-Smirnov tests. The generated equations match the corpus on all of the eight studied structural features, compared to only two for an unsmoothed purely probabilistic grammar, demonstrating that the Bayesian prior is essential for structural fidelity given the size of the corpus. In a downstream hyperparameter-tuning task, a gradient-boosted regressor tuned on the synthetic data picks, on average, the 6th-best configuration out of 20 on real data, matching the result of tuning on real data itself and substantially outperforming random expression trees (10th) and noise (19th).
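
The contrast between an unsmoothed grammar and a Bayesian one can be illustrated with a toy example. The grammar, the rule counts, and the symmetric Dirichlet prior below are all assumptions chosen for illustration; the abstract does not state which prior or grammar the paper uses.

```python
import random

# Hypothetical toy grammar: the nonterminal "E" and its candidate expansions.
RULES = {
    "E": [("E", "+", "E"), ("E", "*", "E"), ("sqrt(", "E", ")"), ("x",), ("c",)],
}


def posterior_rule_probs(counts, alpha=1.0):
    """Posterior-mean rule probabilities under a symmetric Dirichlet(alpha) prior.
    With alpha=0 this reduces to the unsmoothed relative frequencies."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def sample(symbol, probs, rng, depth=0, max_depth=4):
    """Expand `symbol` into a flat list of terminal tokens."""
    if symbol not in RULES:
        return [symbol]
    options = RULES[symbol]
    weights = probs[symbol]
    if depth >= max_depth:
        # Force termination: zero out rules that contain a nonterminal.
        weights = [
            0.0 if any(s in RULES for s in rhs) else w
            for rhs, w in zip(options, weights)
        ]
    rhs = rng.choices(options, weights=weights)[0]
    tokens = []
    for s in rhs:
        tokens.extend(sample(s, probs, rng, depth + 1, max_depth))
    return tokens


# Made-up corpus counts per rule; the sqrt rule was never observed.
counts = [6, 3, 0, 8, 3]
unsmoothed = posterior_rule_probs(counts, alpha=0.0)
smoothed = posterior_rule_probs(counts, alpha=1.0)
print(unsmoothed[2], smoothed[2])
print("".join(sample("E", {"E": smoothed}, random.Random(0))))
```

With a small corpus, the unsmoothed grammar can never generate a rule it has not seen, while the prior keeps a small probability on it; this is one way a prior can matter for structural fidelity at limited corpus size.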

2606.06721 2026-06-08 cs.RO cs.AI New submission

SCOUT: Semantic scene COverage via Uncertainty-guided Traversal

Junyu Mao, Sara Ayoubi, Vishnu D. Sharma, Ilija Hadžić, Matthew Andrews

Affiliations: Nokia Bell Labs, France; Nokia Bell Labs, Murray Hill, NJ, USA; Imperial College London; Locus Robotics

AI summary: Proposes SCOUT, a framework that closes the loop between uncertainty-guided traversal planning and probabilistic scene graph construction, so that a robot actively explores and progressively understands its environment, treating semantic scene completeness as an operational objective.

Comments
2026 ICRA Workshop on Uncertainty in Open World Robotics

Abstract

Robots that operate over extended periods should not merely visit space; they should progressively understand it. Yet most 3D scene graph pipelines treat perception as a post-processing stage over a fixed dataset, decoupling scene representation from the decisions that determine what is observed in the first place. We present SCOUT, an online semantic exploration framework that closes this loop by coupling active traversal with probabilistic scene graph construction. Given a prior 2D occupancy map and posed RGB-D observations, SCOUT incrementally builds an uncertainty-aware 3D scene graph whose nodes maintain fused geometry and posterior beliefs over open-vocabulary object labels, while edges encode structural relations such as on, inside, belong, and next to. These beliefs are fed back to an uncertainty-guided traversal planner, which selects viewpoints by balancing expected semantic certainty gain, geometric coverage gain, and travel cost. In this way, the robot revisits ambiguous objects when additional evidence matters and expands into unseen free space when the scene remains incomplete. The resulting system treats semantic scene completeness as an operational objective rather than a passive by-product of semantic mapping, moving toward autonomous agents that can patrol, update, and reason about evolving indoor environments with minimal human intervention.
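
The viewpoint selection described in the abstract balances three terms. A minimal sketch under stated assumptions: the linear utility, the weights, the use of current label entropy as a stand-in for expected semantic certainty gain, and all field names are hypothetical, not the authors' planner.

```python
import math


def label_entropy(posterior):
    """Shannon entropy (in nats) of a node's posterior over object labels."""
    return -sum(p * math.log(p) for p in posterior.values() if p > 0)


def viewpoint_utility(view, w_sem=1.0, w_geo=0.5, w_cost=0.1):
    """Assumed linear trade-off between semantic gain, coverage gain, and travel cost."""
    # Proxy: entropy of currently visible nodes bounds how much certainty a look can add.
    semantic_gain = sum(label_entropy(p) for p in view["visible_node_posteriors"])
    return (
        w_sem * semantic_gain
        + w_geo * view["new_free_cells"]
        - w_cost * view["travel_cost"]
    )


def select_viewpoint(candidates):
    """Return the candidate with the highest utility."""
    return max(candidates, key=viewpoint_utility)


ambiguous = {
    "name": "revisit_ambiguous_object",
    "visible_node_posteriors": [{"chair": 0.5, "stool": 0.5}],
    "new_free_cells": 0,
    "travel_cost": 2.0,
}
confident = {
    "name": "revisit_confident_object",
    "visible_node_posteriors": [{"table": 0.99, "desk": 0.01}],
    "new_free_cells": 0,
    "travel_cost": 1.0,
}
frontier = {
    "name": "explore_unseen_space",
    "visible_node_posteriors": [],
    "new_free_cells": 3,
    "travel_cost": 5.0,
}
print(select_viewpoint([ambiguous, confident])["name"])
print(select_viewpoint([ambiguous, confident, frontier])["name"])
```

The two calls mirror the behavior the abstract describes: the robot prefers an ambiguous object over a confident one, and prefers unseen free space when enough of the scene remains uncovered.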

2606.06718 2026-06-08 cs.LG cs.AI cs.SY eess.SY 新提交

MSAIC-Net: A Multi-Scale Attention and Imbalance-Aware Contrastive Network for ECG-Based Myocardial Substrate Abnormality Detection

MSAIC-Net:用于基于心电图的心肌基质异常检测的多尺度注意力和不平衡感知对比网络

Canyu Lei, Fenglin Zhang, Derek Bivona, Cristiane Singulane, Jonathan Pan, Kenneth Bilchick, Amit R. Patel, Jianxin Xie

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 提出多尺度注意力增强卷积网络MSAIC-Net,通过并行空洞卷积提取多尺度特征、通道注意力重加权、不平衡感知对比学习及导联置换重要性分析,在低数据量UVA队列和大规模PTB-XL数据集上实现心肌瘢痕和心肌梗死检测的准确性和可解释性提升。

详情
AI中文摘要

心肌基质异常,如心肌瘢痕和心肌梗死(MI),与不良心血管结局相关。心电图(ECG)为检测这些异常提供了一种低成本且广泛可用的工具,但由于异质性导联依赖性表现、高维多导联信号、类别不平衡以及深度学习模型的可解释性有限,基于ECG的检测仍然具有挑战性。我们提出了一种多尺度注意力增强卷积网络(MSAIC-Net)用于基于ECG的心肌基质异常检测。MSAIC-Net采用并行空洞卷积分支,在多个时间感受野上提取ECG特征,使模型能够捕捉局部和更长时间范围的时间模式。然后使用通道注意力自适应地重新加权信息性导联和特征通道表示。为了解决类别不平衡并提高特征可分性,我们引入了一种新颖的不平衡感知监督对比学习策略,鼓励同一类别的样本形成紧凑表示,同时增加异常和正常样本之间的分离。进一步引入导联置换重要性来量化每个ECG导联的贡献并提高模型可解释性。该方法在两个互补数据集上进行了评估:来自弗吉尼亚大学(UVA)健康系统的低数据量机构队列用于心肌瘢痕分类,以及来自PhysioNet的大规模公共PTB-XL数据集用于MI识别。实验结果表明,MSAIC-Net优于基线模型,在低数据量的UVA队列中改进尤为显著。总体而言,所提出的框架为基于ECG的心肌基质异常检测提供了一种有效且可解释的方法。

英文摘要

Myocardial substrate abnormalities, such as myocardial scar and myocardial infarction (MI), are associated with adverse cardiovascular outcomes. Electrocardiography (ECG) provides a low-cost and widely available tool for detecting these abnormalities, but ECG-based detection remains challenging due to heterogeneous lead-dependent manifestations, high-dimensional multi-lead signals, class imbalance, and the limited interpretability of deep learning models. We propose a multi-scale attention-enhanced convolutional network (MSAIC-Net) for ECG-based myocardial substrate abnormality detection. MSAIC-Net employs parallel atrous convolutional branches to extract ECG features across multiple temporal receptive fields, enabling the model to capture both local and longer-range temporal patterns. Channel attention is then used to adaptively reweight informative lead-wise and feature-channel representations. To address class imbalance and improve feature separability, we introduce a novel imbalance-aware supervised contrastive learning strategy that encourages samples from the same class to form compact representations while increasing separation between abnormal and normal samples. Lead-wise permutation importance is further incorporated to quantify the contribution of each ECG lead and improve model interpretability. The proposed method was evaluated on two complementary datasets: a low-data institutional cohort from the University of Virginia (UVA) Health System for myocardial scar classification and the large-scale public PTB-XL dataset from PhysioNet for MI identification. Experimental results show that MSAIC-Net outperforms baseline models, with particularly pronounced improvements in the low-data UVA cohort. Overall, the proposed framework provides an effective and interpretable approach for ECG-based detection of myocardial substrate abnormalities.
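
Lead-wise permutation importance, as mentioned in the abstract, can be sketched generically: shuffle one lead across recordings and measure how much accuracy drops. The toy classifier, the two-lead data, and the function names below are assumptions for illustration, not MSAIC-Net itself.

```python
import random

def accuracy(model, X, y):
    """Fraction of recordings the model labels correctly."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def lead_permutation_importance(model, X, y, n_leads, n_repeats=5, seed=0):
    """Mean accuracy drop when a single lead is shuffled across recordings.

    X: list of recordings, each a list with one feature value per lead.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for lead in range(n_leads):
        drops = []
        for _ in range(n_repeats):
            column = [x[lead] for x in X]
            rng.shuffle(column)  # break the link between this lead and the label
            X_perm = [x[:lead] + [v] + x[lead + 1:] for x, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(x):
    """Stand-in for a trained classifier: it only looks at lead 0."""
    return int(x[0] > 0.0)

data_rng = random.Random(1)
# Lead 0 alternates in sign (20 positive, 20 negative); lead 1 is pure noise.
X = [[(-1) ** i * (i + 1) / 40.0, data_rng.uniform(-1.0, 1.0)] for i in range(40)]
y = [toy_model(x) for x in X]
importances = lead_permutation_importance(toy_model, X, y, n_leads=2)
```

Because the toy model ignores lead 1, shuffling it leaves accuracy unchanged, while shuffling lead 0 degrades it; a real 12-lead model would be scored the same way on held-out recordings.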

2606.06717 2026-06-08 cs.LG cs.AI q-bio.BM q-bio.QM 新提交

ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets

ShallowBench: 浅口袋靶标上的生成式药物设计模型基准测试

Saket Reddy, Shiwei Liu

发表机构 * University of Illinois - Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出ShallowBench基准,包含5780个浅口袋靶标,用于评估生成式药物设计模型在低凹度界面上的性能,揭示现有模型预测结合亲和力较弱的问题。

详情
AI中文摘要

虽然生成式AI模型在基于结构的药物设计中已展现出显著成功,但它们主要依赖深结合口袋,难以对具有挑战性的低口袋性靶标(如历史上“不可成药”的肿瘤靶标KRAS和MYC)采样有效配体。为弥补这一空白,我们引入了ShallowBench,这是一个从CrossDocked2020中提取的包含5780个浅口袋靶标的严格精选基准。通过计算Alpha Shape“盖子”体积与底层蛋白质原子体素体积之间的差异,我们成功分离出低凹度靶标,同时确保足够的结合表面积。评估多种最先进的生成模型显示,在这些低凹度界面上预测的结合亲和力较弱。因此,ShallowBench为生成生物学模型提供了一个严格的基准,并强调了需要能够应对这些具有挑战性靶标的新型架构创新或损失函数。

英文摘要

While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets, such as the historically "undruggable" oncology targets KRAS and MYC. To address this gap, we introduce ShallowBench, a strictly curated benchmark of 5,780 shallow-pocket targets extracted from CrossDocked2020. By computing the difference between an Alpha Shape "lid" volume and the underlying protein atom voxel volume, we successfully isolated targets with low concavity while ensuring sufficient surface area for binding. Evaluating various state-of-the-art generative models reveals weaker predicted binding affinity on these low-concavity interfaces. ShallowBench therefore provides a rigorous benchmark for generative biology models and highlights the necessity of new architectural innovations or loss functions capable of navigating these challenging targets.

2606.06715 2026-06-08 cs.CL cs.AI cs.LG 新提交

Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles

主题情感是否导致感知意识形态?比较政治新闻文章中人类与LLM的标注

Upasana Chatterjee

发表机构 * Columbia University(哥伦比亚大学)

AI总结 研究主题情感对感知政治意识形态的因果效应,通过比较人类与LLM标注,发现微调GPT-4o-mini产生显著因果效应,归因于捷径学习。

详情
Comments
Accepted to ACL SRW 2026
AI中文摘要

我们探究主题情感是否对感知政治意识形态具有因果效应,以及答案是否取决于意识形态标签的分配者。使用来自AllSides的文章,结合Llama-3.3-70b-versatile的共享情感标注,我们比较了来自专家人类标注者、GPT-4o-mini(基线和微调)以及Llama-3.3-70B的意识形态标签。我们应用双重机器学习(DML)和社区级中介分析于所有四种标注范式。人类标注在社区水平未产生显著因果效应。微调后的GPT-4o-mini达到了最高的分类准确率(F1=72.48),并且是唯一在社区水平产生显著处理效应和中介中显著自然直接效应(NDE)的标注范式。我们将此解释为捷径学习的证据:对意识形态标签数据进行微调导致模型内化了一种虚假的情感-意识形态耦合,而这种耦合在人类判断中对此任务并不起作用。这种耦合在基于F1的评估中结构上不可见,对LLM标注作为银标签以及在下游因果分析中作为人类判断的代理的使用具有影响。

英文摘要

We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b-versatile, we compare ideology labels from expert human annotators, GPT-4o-mini (baseline and fine-tuned), and Llama-3.3-70B. We apply Double Machine Learning (DML) and community-level mediation analysis across all four annotation paradigms. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only annotation paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs) in mediation. We interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalise a spurious sentiment--ideology coupling not operative in human judgment for this task. This coupling is structurally invisible to F1-based evaluation, with implications for the use of LLM annotations as silver labels and as proxies for human judgment in downstream causal analyses.
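
For readers unfamiliar with Double Machine Learning, the core partialling-out step can be sketched as follows. This is a generic illustration, not the paper's pipeline: the nuisance models here are plain linear fits, the data are synthetic, and `dml_effect` and `fit_line` are hypothetical helper names.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def dml_effect(X, T, Y, n_folds=2):
    """Cross-fitted partialling-out estimate of the effect of T on Y given X."""
    n = len(Y)
    res_t, res_y = [0.0] * n, [0.0] * n
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        x_tr = [X[i] for i in train]
        # Nuisance models are fitted on the other fold(s) only (cross-fitting).
        a_t, b_t = fit_line(x_tr, [T[i] for i in train])
        a_y, b_y = fit_line(x_tr, [Y[i] for i in train])
        for i in test:
            res_t[i] = T[i] - (a_t + b_t * X[i])
            res_y[i] = Y[i] - (a_y + b_y * X[i])
    # Regress outcome residuals on treatment residuals.
    return sum(rt * ry for rt, ry in zip(res_t, res_y)) / sum(rt * rt for rt in res_t)

rng = random.Random(0)
X = [rng.gauss(0.0, 1.0) for _ in range(200)]      # confounder
T = [x + rng.gauss(0.0, 1.0) for x in X]           # treatment depends on X
Y = [2.0 * t + 3.0 * x for t, x in zip(T, X)]      # true effect of T is 2.0
theta = dml_effect(X, T, Y)
```

On this noise-free linear example the estimate recovers the true coefficient 2.0 up to floating-point error; in practice the linear fits would be replaced by flexible learners, which is where cross-fitting matters.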

2606.06714 2026-06-08 cs.CV 新提交

Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

锚定而非分级:视觉-语言模型在纹理倾斜感知中失败

Qian Zhang, Michal Golovanevsky, Fulvio Domini, James Tompkin

发表机构 * Brown University(布朗大学) Harvard University(哈佛大学)

AI总结 研究视觉-语言模型(VLM)在纹理倾斜感知任务中的表现,发现零样本和上下文提示均产生锚定失败,仅预测少数离散角度,监督微调部分缓解但残留锚定,表明问题在于表示到输出的语言接口无法分级表达。

详情
AI中文摘要

人类从纹理感知表面倾斜时,会表现出系统性的、分级的偏差,这些偏差在心理物理实验中可靠地出现。先前的研究表明,无监督CNN再现了几种类人偏差,而有监督CNN则没有。视觉-语言模型(VLM)是否表现出类似的能力?在多个VLM家族和模型规模中,零样本和上下文提示都产生了独特的失败:倾斜仅在少量锚点(例如0°、±25°、±45°)处被预测,且几乎不依赖于刺激视场、光学倾斜或表面曲率。监督微调部分弥补了这种失败,但残留的锚定仍然存在。虽然高级视觉-语言基准测试的成功可能不需要对低级几何线索的敏感性,但我们将锚定解释为表示到输出语言接口的失败:不一定缺乏几何编码,而是无法以分级形式表达它。

英文摘要

Human perception of surface slant from texture exhibits systematic, graded biases that emerge reliably in psychophysical experiments. Prior work showed that unsupervised CNNs reproduce several human-like biases, while supervised CNNs do not. Do Vision-Language Models (VLMs) exhibit similar competences? Across multiple VLM families and model scales, zero-shot and in-context prompting both produce distinctive failures: slant is predicted at only a small set of anchors (e.g., 0°, ±25°, ±45°) with little dependence on stimulus field of view, optical slant, or surface curvature. Supervised fine-tuning partially remediates the failure, but residual anchoring persists. While success in high-level vision-language benchmarks might not require sensitivity to low-level geometric cues, we interpret anchoring as a failure at the representation-to-output language interface: Not necessarily an absence of geometric encoding, but a failure to express it in a graded form.

2606.06704 2026-06-08 cs.RO 新提交

Optimal Control Approach for Non-prehensile Ball Juggling Using a 7-DoF Manipulator

使用7自由度机械臂进行非抓取式抛球的最优控制方法

Joel Ramadani, Vasilije Rakčević, Riddhiman Laha, Arne Sachtler, Valentin Le Mesle, Achim J. Lilienthal, Sami Haddadin

发表机构 * Technical University of Munich(慕尼黑工业大学) Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) German Aerospace Center (DLR), Institute of Robotics and Mechatronics(德国航空航天中心(DLR)机器人与机电一体化研究所)

AI总结 提出一种基于模型的两阶段最优控制框架,用于7自由度机械臂使用工具进行非抓取式抛球,生成周期性抛球轨迹并通过离线计算实现实时误差校正。

详情
Comments
8 pages, accepted at ICRA 2026
AI中文摘要

非抓取式物体操作技能对于现实世界的机器人交互至关重要,能够实现高度动态的任务,例如在托盘上平衡玻璃杯或控制物体在桌子上滑动。其中,以高速操作要求和由此产生的混合动力学的普遍敏感性为特征的任务尤其难以完成。在这些任务中,抛球可以被视为一个极具挑战性的动作。机器人抛球的关键在于实现欠驱动物体的动态稳定。由于物体不具备自我校正能力,其稳定性完全依赖于施加在其上的力。这创建了一个对控制输入敏感的系统,其中时机对于持续抵消偏差并维持期望行为至关重要。我们开发了一种系统方法,用于控制一个7自由度机械臂使用工具进行非抓取式抛球。我们的主要贡献是一个基于模型的框架,用于生成抛球轨迹并稳定该混合系统的周期性抛球运动。该框架包含一个两阶段最优控制方法,用于计算稳定抛球所需的底层可行运动模式。然后,离线计算的轨迹被组织起来,以便在不在线求解最优控制问题的情况下实现实时误差校正。我们首先在仿真环境中评估所提出控制器的性能,然后使用Franka Emika Panda机器人进行实验,以证明其有效性。

英文摘要

Non-prehensile object manipulation skills are important for real-world robot interactions, enabling highly dynamic tasks such as balancing a glass on a tray or the controlled sliding of items on a table. Among such tasks, those characterised by high-speed manipulation requirements and general sensitivity of the resulting hybrid dynamics are particularly hard to accomplish. Within these, juggling can be seen as a highly challenging maneuver to be solved. The key to robotic juggling is achieving dynamic stabilisation of an underactuated object. Since the object does not possess the ability of self-correction, its stability is entirely dependent on the forces applied to it. This creates a system that is sensitive to control inputs, where timing is critical to continuously counteract deviations and maintain the desired behavior. We develop a systematic method to control a 7-degree-of-freedom manipulator performing non-prehensile ball juggling with a tool. Our primary contribution is a model-based framework for generating juggling trajectories and stabilizing a periodic juggling motion for this hybrid system. The framework incorporates a two-stage optimal control approach to compute the underlying feasible motion patterns required for stable juggling. Offline-computed trajectories are then organised to enable real-time error correction without solving optimal control problems online. We demonstrate the effectiveness of the resulting controller by first evaluating its performance in a simulation environment and performing an experiment using a Franka Emika Panda robot.

2606.06696 2026-06-08 cs.CV cs.AI 新提交

MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

MMBU: 大规模多模态生物医学理解基准,用于探测视觉语言模型的感知能力

Ryan D'Cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun, Josiah Aklilu, James Burgess, Yuhui Zhang, Ryan Nayebi, Paola Avila Robayo, Jin Ye, Ming Hu, Zhongying Deng, Junjun He, Xin Chen, Yue Yao, Robert Tibshirani, Jeffrey J. Nirschl, Serena Yeung-Levy

发表机构 * Stanford University(斯坦福大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Instituto Tecnológico de Monterrey(蒙特雷理工学院) Monash University(莫纳什大学) University of Cambridge(剑桥大学) Shanghai Jiao Tong University(上海交通大学) Shandong University(山东大学)

AI总结 提出MMBU基准,涵盖35个子模态,通过分类、定位和检测任务系统评估VLM在生物医学领域的视觉感知和泛化能力,发现高准确率可能掩盖感知缺陷。

详情
AI中文摘要

视觉语言模型(VLM)在变革生物医学成像工作流程方面具有巨大潜力,从检测胸部X光片中的病变到刻画显微图像中的细胞特征。然而,实现这一潜力需要稳健且细粒度的视觉感知。模型需要正确解释图像中的细微特征,并且必须在不同的生物医学模态、尺度和上下文中做到这一点。尽管如此,当前的基准仍然有限。为了解决这些差距,我们引入了大规模多模态生物医学理解(MMBU)基准。它是迄今为止最大的生物医学视觉语言基准,涵盖35个子模态,具有丰富的结构化元数据。它包括开放式和封闭式版本的无定位分类、有定位分类和目标检测,从而能够系统地评估模型在生物尺度、临床环境和成像模态上的性能。通过评估15个开放权重VLM和2个前沿VLM,我们发现虽然医学适配为某些模型带来了可衡量的提升,但在既有基准上经常报告的高准确率可能掩盖视觉感知和领域泛化方面的缺陷。

英文摘要

Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.

2606.06695 2026-06-08 cs.CV 新提交

S23DR 2026 Winning Solution

S23DR 2026 获胜方案

Jan Skvrna, Miroslav Purkrabek, Lukas Neumann

发表机构 * Visual Recognition Group(视觉识别组) Czech Technical University in Prague(布拉格捷克技术大学)

AI总结 提出一种基于条件集和流匹配DiT的3D线框重建方法,通过全局粗预测、局部细化及多采样一致性步骤,在S23DR 2026挑战中取得HSS=0.654的领先成绩。

详情
AI中文摘要

本文介绍了S23DR 2026挑战的获胜方案,该挑战要求从稀疏SfM、拟合深度和语义分割中重建结构化3D线框。该方法将顶点视为条件集,并使用以Perceiver风格场景令牌为条件的流匹配DiT对64个顶点令牌进行去噪。第一遍全局推理预测粗略结构,按外壳(hull)裁剪的第二遍推理对其进行细化,小规模的多采样一致性步骤使随机采样器保持稳定。最终系统在私有排行榜上排名第一,达到HSS = 0.654。

英文摘要

This text presents the winning solution to the S23DR 2026 challenge for structured 3D wireframe reconstruction from sparse SfM, fitted depth, and semantic segmentations. The method treats vertices as a conditional set and denoises 64 vertex tokens with a flow-matching DiT conditioned on Perceiver-style scene tokens. A global pass predicts the coarse structure, a hull-cropped second pass refines it, and a small multi-sample consensus step keeps the stochastic sampler well behaved. The final system ranked first on the private leaderboard, achieving HSS = 0.654.
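
The sampling side of a flow-matching model, plus a consensus over several stochastic samples, can be sketched as below. Everything here is illustrative: `toy_velocity` stands in for the learned DiT velocity field, and taking a coordinate-wise median as the consensus rule is an assumption, since the abstract does not specify it.

```python
import random
import statistics

def toy_velocity(x, t, target):
    """Stand-in for a learned velocity field (hypothetical): a straight-line
    flow that carries any point to `target` as t goes from 0 to 1."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample(target, n_steps=16, rng=random):
    """Euler integration of dx/dt = v(x, t), starting from Gaussian noise."""
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = toy_velocity(x, k * dt, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def consensus(samples):
    """Coordinate-wise median over several stochastic samples."""
    return [statistics.median(coord) for coord in zip(*samples)]

rng = random.Random(0)
vertex = [1.0, 2.0, 3.0]  # one toy 3D wireframe vertex
samples = [sample(vertex, rng=rng) for _ in range(5)]
estimate = consensus(samples)
```

In this toy setting every sample lands on the target, so the consensus is trivial; with a real learned field the samples differ, and a robust aggregate is one way to damp outliers from the stochastic sampler.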

2606.06690 2026-06-08 cs.CV 新提交

RPC-GS: Gaussian Splatting with native RPC Rendering for Satellite Imagery

RPC-GS:基于原生RPC渲染的卫星图像高斯泼溅

Valentin Wagner, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens

发表机构 * Fraunhofer Institute of Optronics, System Technologies and Image Exploitation(弗劳恩霍夫光电、系统技术与图像处理研究所)

AI总结 提出首个原生使用RPC模型的高斯泼溅框架RPC-GS,通过直接投影高斯均值和协方差避免近似误差,在卫星基准数据集上重建误差最低。

详情
AI中文摘要

我们提出了RPC-GS,这是首个原生使用有理多项式相机(RPC)模型的卫星图像高斯泼溅框架。RPC模型是表示现代推扫式卫星传感器复杂成像几何的事实标准。为了简化渲染,先前的卫星高斯泼溅方法用透视或仿射相机近似替代RPC模型,导致重建过程中的几何误差。RPC-GS通过在泼溅过程中直接通过RPC模型投影高斯均值和协方差,避免了这些近似。我们将RPC模型嵌入一系列精心选择的地理坐标变换链中,该变换表示从适合泼溅的场景坐标到图像坐标的映射。为了映射高斯协方差矩阵,我们推导了基于数值稳健的雅可比协方差投影,用于(部分非线性的)坐标变换。由于RPC缺乏明确的相机深度概念,我们集成了基于度量射线的深度公式。我们在统一框架中对RPC、透视和仿射相机模型进行了基准测试,我们的原生RPC渲染器在领先的卫星基准数据集上始终实现最低的重建误差,在DFC2019上,平均高程误差比透视和仿射近似分别提高了29.6%和63.8%,在IARPA2016上分别提高了9.9%和37.9%。我们公开代码以支持卫星成像领域高斯泼溅的未来研究。

英文摘要

We present RPC-GS, the first Gaussian Splatting framework for satellite imagery that operates natively with Rational Polynomial Camera (RPC) models. The RPC model is the de facto standard for representing the complex imaging geometry of modern pushbroom satellite sensors. To simplify rendering, prior satellite Gaussian Splatting methods replace the RPC model with perspective or affine camera approximations, leading to geometric errors during reconstruction. RPC-GS avoids these approximations by projecting Gaussian means and covariances directly through the RPC model during the splatting process. We embed the RPC model in a chain of carefully selected geo-coordinate transformations representing a mapping from splatting-suitable scene coordinates to image coordinates. To map the Gaussian covariance matrices, we derive a numerically robust Jacobian-based covariance projection for the (partially nonlinear) coordinate transformations. Since RPCs lack an explicit notion of camera depth, we integrate a metric ray-based depth formulation. We benchmark RPC, perspective, and affine camera models in a unified framework, with our native RPC renderer consistently achieving the lowest reconstruction error on leading satellite benchmark datasets, improving mean altitude error over perspective and affine approximations by 29.6% and 63.8% on DFC2019, and by 9.9% and 37.9% on IARPA2016. We release our code to support future research of Gaussian Splatting in the satellite imaging domain.
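
The Jacobian-based covariance projection can be illustrated with first-order propagation, Σ_2D = J Σ_3D Jᵀ. The sketch below is a simplification under stated assumptions: `toy_rpc` is a made-up low-order rational function rather than a real 20-term cubic RPC, and the Jacobian is obtained by finite differences, whereas the paper derives its own numerically robust formulation.

```python
def toy_rpc(p):
    """Hypothetical rational projection (x, y, z) -> (row, col): each image
    coordinate is a ratio of two polynomials, with invented coefficients."""
    x, y, z = p
    row = (0.1 + 1.0 * y + 0.05 * z + 0.01 * x * y) / (1.0 + 0.02 * z)
    col = (0.2 + 1.0 * x - 0.03 * z + 0.01 * y * z) / (1.0 + 0.01 * z)
    return [row, col]

def jacobian(f, p, eps=1e-6):
    """Central-difference Jacobian of f at p (rows: outputs, columns: inputs)."""
    cols = []
    for j in range(len(p)):
        hi = list(p)
        lo = list(p)
        hi[j] += eps
        lo[j] -= eps
        cols.append([(a - b) / (2.0 * eps) for a, b in zip(f(hi), f(lo))])
    return [list(r) for r in zip(*cols)]

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def project_covariance(f, mean, cov):
    """First-order propagation of a 3D covariance into image space."""
    J = jacobian(f, mean)
    Jt = [list(r) for r in zip(*J)]
    return matmul(matmul(J, cov), Jt)

mean = [0.3, -0.2, 0.1]
cov3d = [[0.04, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.09]]
cov2d = project_covariance(toy_rpc, mean, cov3d)
```

The result is a 2x2 image-space covariance that a splatting rasterizer could consume; the linearization is only valid locally, which is why the projection is evaluated at each Gaussian's mean.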

2606.06687 2026-06-08 cs.LG cs.DC cs.NI cs.SY eess.SY 新提交

Towards Serverless Semi-Decentralized Federated Learning with Heterogeneous Optimizers

面向异构优化器的无服务器半去中心化联邦学习

Su Wang, Mung Chiang, H. Vincent Poor

发表机构 * Department of Electrical and Computer Engineering, Princeton University(普林斯顿大学电气与计算机工程系) Department of Electrical and Computer Engineering, Purdue University(普渡大学电气与计算机工程系)

AI总结 提出无服务器半去中心化联邦学习(SSD-FL),通过轻量级D2D初始化实现聚类,利用有效损失函数和Cheeger不等式优化聚类,提升收敛速度和通信效率。

详情
Comments
Under review at IEEE/ACM Transactions on Networking
AI中文摘要

我们研究了在具有异构机器学习优化器的去中心化联邦学习中的聚类形成,包括聚类的数量和组成。虽然集中式联邦学习中的聚类已经实现了可扩展性和资源节省,但其在完全去中心化环境中的价值和开发仍有待探索。在此类环境中优化聚类形成具有挑战性,尤其是由于网络图结构、本地数据异构性和不同本地ML模型优化器之间的复杂耦合。为了解决这些挑战,我们提出了无服务器半去中心化联邦学习(SSD-FL),一种不需要持久服务器基础设施的方法。在SSD-FL中,聚类通过轻量级、一次性的设备到设备(D2D)初始化阶段形成,之后实际的ML模型训练(以及共识和收敛过程)完全是无服务器的。在功能上,SSD-FL将全局轮次分割为簇内和簇间机制,通过新颖的“有效损失函数”确保全局收敛和共识,该函数将设备特定的ML优化器与基于网络图的正则化相结合。接下来,SSD-FL借助Cheeger不等式利用共识差距,开发了一种迭代聚类算法,该算法根据我们推导的收敛和共识界限进行评估,其中包含一个独特的评分指标,用于量化各设备间数据和优化器的异构性。最后,针对三类去中心化联邦学习方法的实验评估验证了SSD-FL在各种网络图、数据集和本地优化器机制下提高了收敛速度和通信效率。

英文摘要

We investigate cluster formation, involving the number and composition of clusters, in decentralized federated learning (FL) with heterogeneous machine learning (ML) optimizers. While clustering in centralized FL has enabled scalability and resource savings, its value and development in fully decentralized environments have yet to be explored. Optimizing cluster formation in such environments is challenging, especially due to the complex coupling between network graph structures, local data heterogeneity, and different local ML model optimizers. To address these challenges, we propose serverless semi-decentralized FL (SSD-FL), a methodology requiring no persistent server infrastructure. In SSD-FL, cluster formation occurs via a lightweight, one-time device-to-device (D2D) initialization phase, after which actual ML model training (alongside consensus and convergence processes) is fully serverless. Functionally, SSD-FL segments global rounds into intra-cluster and inter-cluster regimes, ensuring global convergence and consensus through novel "effective loss functions" that integrate device-specific ML optimizers with network graph-based regularization. Next, SSD-FL leverages the consensus gap via the Cheeger inequality to develop an iterative clustering algorithm evaluated against our derived convergence and consensus bounds, which incorporate a unique scoring metric to quantify data and optimizer heterogeneity across devices. Finally, experimental evaluation against three categories of decentralized FL methodologies validates that SSD-FL improves both convergence speeds and communication efficiency across various network graphs, datasets, and local optimizer regimes.

2606.06686 2026-06-08 cs.RO cs.DS 新提交

On the Hardness of Optimal Motion on Trees

关于树上最优运动的难度

Tzvika Geft

发表机构 * Rutgers University(罗格斯大学)

AI总结 本文证明,在树上,带标签和2色变体的多智能体路径寻找(MAPF)问题在距离、makespan和flowtime三个目标下均为NP难,解决了长期未决的经典Pebble Motion问题。

详情
AI中文摘要

本文提出了一个简单框架,解决了树上多智能体路径寻找(MAPF)在标准目标(距离、makespan和flowtime)下对于带标签和带颜色变体的复杂度。在MAPF中,智能体占据图的顶点,必须移动到目标顶点而不发生碰撞,同时优化给定目标。在带标签情况下,智能体是不同的,各自有目标;在带颜色情况下,相同颜色的智能体可互换。虽然许多MAPF变体已知是难解的,但树上几个基本情况仍然开放。我们证明了在树上,对于所有三个目标,带标签和2色MAPF都是NP难的。特别地,我们解决了经典的Pebble Motion问题,其中一次一个石子移动到相邻的空顶点,目标是最小化总移动次数。尽管这是最基本的离散运动模型之一,其在树上的复杂度几十年来一直未解决。此外,对于带颜色的Pebble Motion,我们给出了在任何图类上的第一个难度结果,仅用两种颜色,这是紧的。所有这些结果都是通过Stack Rearrangement的难度建立的,该问题本身是一个开放问题,要求最优地重新排列存储在栈中的物品,我们也证明了它是NP难的。值得注意的是,与栈的联系在所有问题上已经产生了在非常简单的树(细分星形)上的难度。总之,这些结果揭示了一个共同的易处理性障碍,它渗透了几个基本运动模型,从而统一并加强了先前的难度结果。

英文摘要

This paper presents a simple framework that settles the complexity of Multi-Agent Path Finding (MAPF) on trees across standard objectives--distance, makespan, and flowtime--for both labeled and colored variants. In MAPF, agents occupy the vertices of a graph and must move to target vertices without collisions while optimizing a given objective. In the labeled case, the agents are distinct and have respective targets; in the colored case, agents of the same color are interchangeable. While many MAPF variants are known to be intractable, several basic cases on trees have remained open. We prove NP-hardness on trees for both labeled and 2-colored MAPF under all three objectives. In particular, we resolve the classical Pebble Motion problem, where one pebble moves at a time to an adjacent empty vertex and the goal is to minimize the total number of moves. Despite being one of the most basic discrete motion models, its complexity on trees had remained open for several decades. Moreover, for colored Pebble Motion, we give the first hardness result on any graph class, already with two colors, which is tight. All of these results are established through the hardness of Stack Rearrangement, itself posed as an open problem, which asks to optimally rearrange items stored in stacks, and which we also prove to be NP-hard. Notably, the connection to stacks yields hardness already on very simple trees--subdivided stars--across all problems. Together, these results reveal a common tractability barrier that permeates several fundamental motion models, thereby unifying and strengthening prior hardness results.
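
To make the Pebble Motion objective concrete, here is a brute-force sketch that finds the minimum number of moves on a tiny tree by breadth-first search over configurations. It is only an illustration of the problem definition (one pebble moves at a time to an adjacent empty vertex); the function name and the example instance are assumptions, and the search is exponential in general, consistent with the hardness result.

```python
from collections import deque

def min_moves(adj, start, goal):
    """Fewest single-pebble moves from `start` to `goal`, by BFS over configurations.

    adj: dict mapping each vertex to its list of neighbours (a tree here).
    start, goal: tuples giving the vertex of each labeled pebble.
    Returns None if the goal configuration is unreachable.
    """
    start, goal = tuple(start), tuple(goal)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return dist[state]
        occupied = set(state)
        for i, v in enumerate(state):
            for w in adj[v]:
                if w not in occupied:  # a pebble may only move to an empty vertex
                    nxt = state[:i] + (w,) + state[i + 1:]
                    if nxt not in dist:
                        dist[nxt] = dist[state] + 1
                        queue.append(nxt)
    return None

# Star with centre 0 and leaves 1, 2, 3; two pebbles on leaves 1 and 2 swap places.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
swap_cost = min_moves(star, start=(1, 2), goal=(2, 1))
```

Here the swap takes 6 moves rather than the 4 suggested by summing shortest-path distances, because one pebble must detour through the spare leaf to let the other pass through the centre.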

2606.06685 2026-06-08 cs.CV cs.GR 新提交

RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video

RigPAPR:基于固定视角视频的静态神经点云绑定动画

Shichong Peng, Yanshu Zhang, Ke Li

发表机构 * APEX Lab(APEX实验室) School of Computing Science(计算科学学院) Simon Fraser University(西蒙弗雷泽大学)

AI总结 提出RigPAPR方法,通过直接线性混合蒙皮驱动静态神经点云,无需网格代理或姿态依赖校正,在合成和真实数据上减少关节边界伪影,新视角PSNR提升3+dB。

详情
Comments
An overview video is available at https://youtu.be/up3BwRHYWG8
AI中文摘要

静态神经点云重建从姿态图像中高保真地捕捉主体。给定这样的重建,我们的目标是使其动画化,以跟随主体的单目固定视角驱动视频(无论是捕获的还是由图像到视频生成产生的),并恢复一个绑定的、可重新姿态的3D资产。现有方法通过直接线性混合蒙皮或网格代理来变形高斯溅射,两者在关节连接处都容易出现伪影,即使有逐基元的校正。我们将伪影追溯到表示:每个溅射携带一个在规范姿态中校准的个体形状,以与其邻居拼接。在刚性LBS下,每个溅射随其骨骼移动但不能弯曲,因此规范拼接在关节边界处断裂成间隙和尖峰。邻近注意力点渲染则没有逐基元的形状;每个像素在渲染时从变形基元的位置重新组合,因此表面自然地随关节运动重新形成。我们提出RigPAPR,它自动绑定静态PAPR点云,并通过单个固定视角视频在直接LBS下驱动它,无需网格代理、姿态依赖校正或类别模板。在合成主体上,RigPAPR在有监督视角下匹配最强基线,在新视角下超过基于网格和高斯溅射的基线3+dB PSNR,并在合成和真实主体上生成更干净的关节边界渲染。

英文摘要

Static neural point reconstructions capture a subject at high fidelity from posed images. Given such a reconstruction, we aim to animate it to follow a monocular fixed-viewpoint driving video of the subject, whether captured or produced by image-to-video (I2V) generation, and to recover a rigged, re-posable 3D asset. Existing methods deform Gaussian splats through direct linear blend skinning (LBS) or mesh proxies, both of which are prone to joint-boundary artifacts under articulation, even with per-primitive corrections. We trace the artifact to the representation: each splat carries an individual shape calibrated in the canonical pose to tile with its neighbours. Under rigid LBS, each splat moves with its bone but cannot bend, so the canonical tiling breaks at joint boundaries into gaps and spikes. Proximity attention point rendering (PAPR) instead carries no per-primitive shape; each pixel is recomposed at render time from the deformed primitives' positions, so the surface re-forms naturally with the articulation. We present RigPAPR, which auto-rigs a static PAPR cloud and drives it under direct LBS from a single fixed-viewpoint video, without mesh proxy, pose-dependent correction, or category template. On synthetic subjects, RigPAPR matches the strongest baseline at the supervised view and exceeds mesh-based and Gaussian-splatting baselines at novel views by 3+dB PSNR, with cleaner joint-boundary renderings of both synthetic and real subjects.
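
Linear blend skinning, the deformation rule the abstract builds on, moves each point by a weighted sum of per-bone rigid transforms: p' = Σ_i w_i T_i(p). The 2D two-bone example below is a minimal sketch with hypothetical names, not the RigPAPR implementation.

```python
import math

def bone_transform(angle, pivot):
    """2D rigid transform: rotate by `angle` (radians) about `pivot`."""
    c, s = math.cos(angle), math.sin(angle)
    px, py = pivot

    def apply(p):
        x, y = p[0] - px, p[1] - py
        return (c * x - s * y + px, s * x + c * y + py)

    return apply

def skin_point(p, weights, bones):
    """Linear blend skinning: weighted sum of the bone-transformed positions."""
    moved = [bone(p) for bone in bones]
    x = sum(w * m[0] for w, m in zip(weights, moved))
    y = sum(w * m[1] for w, m in zip(weights, moved))
    return (x, y)

upper = bone_transform(0.0, pivot=(0.0, 0.0))          # stays in place
lower = bone_transform(math.pi / 2, pivot=(1.0, 0.0))  # bends 90 degrees at the joint

tip = skin_point((2.0, 0.0), weights=[0.0, 1.0], bones=[upper, lower])
near_joint = skin_point((1.2, 0.0), weights=[0.5, 0.5], bones=[upper, lower])
```

The tip follows the lower bone rigidly to roughly (1, 1), while the point near the joint is blended to roughly (1.1, 0.1). Only positions are deformed here, which matches the abstract's argument that a primitive carrying its own fixed shape cannot bend with the joint.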

2606.06684 2026-06-08 cs.CV 新提交

Adaptive Band Selection for Hyperspectral Classification with Spatially Disjoint Evaluation

面向空间分离评估的高光谱分类自适应波段选择

Ikram El-Hajri, Ouassim Karrakchou, Alejandro Mousist

发表机构 * International University of Rabat, Rabat, Morocco(拉巴特国际大学) Thales Alenia Space, Spain(西班牙泰勒斯阿莱尼亚空间公司)

AI总结 提出SGBR-HC方法,通过监督光谱排序初始化可训练稀疏门,自适应确定波段数,在空间分离评估下以约20个波段取得最高平均总体精度和Kappa系数。

详情
Comments
6 pages, 2 figures, 3 tables
AI中文摘要

基于可微选择器的高光谱波段选择方法可能对初始化和提取最终离散子集敏感,而预设的波段数量限制了灵活性。我们提出SGBR-HC(光谱组波段排序与Hard-Concrete初始化),一种两阶段方法,使用监督光谱排序来初始化可训练稀疏门,而不是将排序视为固定选择规则,让所选波段的数量由训练决定。第一阶段基于训练像素,按类别可分性和光谱多样性对候选波段进行评分;该排序为第二阶段的门控逻辑值提供种子,第二阶段将稀疏门与空间分类器联合训练。在帕维亚大学和休斯顿2013数据集上进行空间分离评估,并通过在所选波段上重新训练新分类器进行验证,SGBR-HC以大约20个波段实现了最高的平均总体精度和Cohen's kappa。跳过第一阶段导致帕维亚大学的OA下降8.84个百分点,休斯顿2013下降22.15个百分点,证实了排序先验的作用。随机像素分割使帕维亚大学的OA膨胀30.56个百分点,强调了空间泄漏作为关键评估混淆因素。

英文摘要

Hyperspectral band selection methods based on differentiable selectors can be sensitive to initialization and to extracting a final discrete subset, while prescribed band counts limit flexibility. We propose SGBR-HC (Spectral-Group Band Ranking with Hard-Concrete initialization), a two-stage method that uses a supervised spectral ranking to initialize trainable sparse gates rather than treating ranking as a fixed selection rule, letting the number of selected bands be determined by training. Stage-1 scores candidate bands from training pixels by class discriminability and spectral diversity; this ranking seeds the gate logits for Stage-2, which trains the sparse gates jointly with a spatial classifier. Under spatially disjoint evaluation on Pavia University and Houston 2013, verified by retraining a fresh classifier on the selected bands, SGBR-HC achieves the highest mean overall accuracy and Cohen's kappa with approximately twenty bands. Bypassing Stage-1 degrades OA by 8.84 pp on Pavia University and 22.15 pp on Houston 2013, confirming the ranking prior's role. Random pixel splits inflate OA on Pavia University by 30.56 pp, underscoring spatial leakage as a critical evaluation confound.
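
The idea of seeding sparse gates from a ranking can be sketched with the standard hard-concrete gate (a stretched, clamped sigmoid). The gate formula follows the commonly used hard-concrete parameterization; the linear score-to-logit mapping, the `scale` value, and the example scores are assumptions, not the paper's Stage-1 rule.

```python
import math
import random

GAMMA, ZETA = -0.1, 1.1  # stretch interval commonly used for hard-concrete gates

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hard_concrete_gate(log_alpha, beta=2.0 / 3.0, rng=None):
    """Gate value in [0, 1]; stochastic if `rng` is given, deterministic otherwise."""
    if rng is None:
        s = sigmoid(log_alpha)
    else:
        u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep the logs finite
        s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / beta)
    # Stretch to (GAMMA, ZETA), then clamp so exact 0 and 1 are reachable.
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))

def logits_from_ranking(scores, scale=5.0):
    """Hypothetical seeding rule: map ranking scores linearly onto [-scale, scale]."""
    lo, hi = min(scores), max(scores)
    return [scale * (2.0 * (s - lo) / (hi - lo) - 1.0) for s in scores]

band_scores = [0.9, 0.1, 0.3, 0.8]  # made-up Stage-1 ranking scores for four bands
logits = logits_from_ranking(band_scores)
gates = [hard_concrete_gate(a) for a in logits]
selected = [i for i, g in enumerate(gates) if g > 0.5]
```

Highly ranked bands start with gates at 1 and poorly ranked ones at 0, so training begins from an informed subset, and the number of open gates can still change as the logits are updated.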

2606.06679 2026-06-08 cs.CL cs.AI cs.CY 新提交

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

HKJudge:用于解释法院认定事实、推理过程和裁决结果的法律话语标注语料库

Xi Xuan, Wenxin Zhang, Yufei Zhou, King-kui Sin, Chunyu Kit

发表机构 * City University of Hong Kong(香港城市大学) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出首个句子级专家标注的法律话语数据集HKJudge,包含香港各级法院刑事判决,设计双层话语模式(26种修辞角色和3种判刑要素),并基于BERT和LLM进行基准评估。

详情
AI中文摘要

法院判决是法律实践和法理学的核心,然而香港判决的话语分析由于缺乏专家标注语料库而受到限制。我们引入了香港判决话语数据集(HKJudge),这是首个句子级专家标注的法律话语语料库。HKJudge包含香港法院层级所有五个级别的刑事判决,共计约29万句子和650万词元,由法律语言学专家完全标注。我们设计了一个双层话语模式,捕捉法院认定的事实、推理过程以及裁决结果。在句子层面,每个句子被分配26种修辞角色之一。在跨度层面,句子进一步标注了三个判刑要素(指控、监禁刑期、罚款)。十位法律语言学标注者进行了标注,标注者间一致性为κ=0.8。我们在HKJudge上定义了两个任务,称为修辞角色分类和法律要素提取,并提供了四种基于BERT的模型、两种开源LLM(在零样本和微调设置下)以及四种商业LLM在这两个任务上的首次基准评估。我们的工作展示了句子级话语标注对于建模香港判决结构的价值,并为未来法律判决预测研究提供了丰富的数据基础。HKJudge数据集和代码可在以下网址获取:https://github.com/xuanxixi/HKJudge。

英文摘要

Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising ~290k sentences and ~6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of κ = 0.8. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at https://github.com/xuanxixi/HKJudge.

2606.06674 2026-06-08 cs.CL cs.CY 新提交

What Do People Actually Want From AI? Mapping Preference Plurality

人们真正希望从AI中得到什么?偏好多元性映射

Julia Sepúlveda Coelho, Scott A. Hale

发表机构 * Oxford Internet Institute, University of Oxford(牛津大学互联网研究所) Meedan

AI总结 通过分析75个国家1500份开放式回答,发现不同人对AI的期望各异,多数价值观仅被少数人要求,且同一词语(如“真实性”)含义分歧,某些能力存在争议,揭示当前RLHF偏好聚合方法的根本缺陷。

详情
Comments
Accepted at the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)
AI中文摘要

大型语言模型(LLMs)通常通过基于人类反馈的强化学习(RLHF)进行微调,以与人们的偏好和价值观对齐。然而,这种方法存在已知局限性:它聚合了冲突的偏好,通常依赖于不具有代表性的样本,并且仅使用二元比较。通过分析来自PRISM数据集跨越75个国家的1500份开放式回答,我们考察了人们真正希望从AI系统中得到什么,并揭示了当前方法的具体失败。我们发现不同的人想要不同的东西:大多数价值观被不到四分之一的受访者要求,真实性是唯一的例外,占49%。此外,相同的词语隐藏着不同的含义:当人们描述他们所说的“真实性”时,他们揭示了不同的、可能不相容的认识论基础,因为有些人要求有来源的主张,有些人要求专家意见,甚至有些人要求不受欢迎的观点。某些能力,即模型的行为有多像人类,以及某些特征,如AI护栏,是完全有争议的,有些人渴望它们,而另一些人则拒绝它们。我们还发现,人们经常使用上下文区分(AI“默认”应该做什么与“如果被要求”应该做什么),这是二元比较无法捕捉的。这些发现暴露了当前对齐实践中的根本问题。当49%的人要求真实性但以不同方式定义时,这不太可能被单个奖励模型捕捉到。尽管用户明确要求准确性,但在资金充足的模型中持续存在高幻觉率,这表明当前方法未能识别实际偏好。本文揭示了当前被扁平化为通用偏好模型的情境化、有争议、不完美的信号,这种做法被其他人描述为认识论暴力。

英文摘要

Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.

2606.06673 2026-06-08 cs.LG 新提交

Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning

不确定性感知的LLM引导策略塑形用于稀疏奖励强化学习

Ujjwal Bhatta, Utsabi Dangol, Sumaly Bajracharya, Rodrigue Rizk, KC Santosh

发表机构 * USD AI Research Lab(USD人工智能研究实验室)

AI总结 提出ULPS框架,结合校准的大语言模型与不确定性估计,通过A*轨迹微调BERT模型提供动作建议,并用熵机制平衡LLM引导与PPO策略,在MiniGridUnlockPickup基准上显著提升成功率、奖励效率和样本复杂度。

Details
Comments
Accepted to the 2026 IEEE Conference on Artificial Intelligence (IEEE CAI). 6 pages, 3 figures. Code available at: https://github.com/USD-AI-ResearchLab/uncertainty-aware-llm-rl
English abstract

Sparse rewards and heterogeneous task sequences remain persistent challenges in Reinforcement Learning (RL), often resulting in slow convergence, weak generalization, and inefficient exploration. We propose Uncertainty-Aware LLM-Guided Policy Shaping (ULPS), a novel framework that integrates a calibrated Large Language Model (LLM) into the RL training loop to provide structured, uncertainty-modulated behavioral guidance. ULPS employs an A*-based oracle to synthesize optimal symbolic trajectories, which are used to fine-tune a BERT-based language model. During training, this model supplies action suggestions whose influence is conditioned on epistemic uncertainty estimated via Monte Carlo (MC) dropout. An entropy-based blending mechanism adaptively balances LLM guidance and the learned policy (via Proximal Policy Optimization, PPO), allowing the agent to prioritize reliable priors while preserving adaptability. We evaluate ULPS on the MiniGridUnlockPickup benchmark and observe consistent improvements in success rate, reward efficiency, and sample complexity over unguided, uncalibrated, and standard RL baselines. ULPS achieves more than 9% improvement in execution accuracy after fine-tuning, requires fewer environment interactions, and yields higher reward AUC. Our results demonstrate that integrating symbolic A* trajectories, pretrained language priors, and uncertainty-aware control offers a principled and effective approach to multi-task reinforcement learning in sparse-reward domains, with potential extensibility to partially observable and multi-agent settings.
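
The abstract names two mechanisms without giving formulas: an uncertainty estimate from Monte Carlo dropout, and an entropy-based blend of the guide's suggestions with the PPO policy. The sketch below shows one plausible reading, not the authors' implementation. It assumes the guide's stochastic forward passes are averaged into one action distribution, and that the normalized entropy of that average sets how much weight the guide receives. Predictive entropy is only a simple proxy for epistemic uncertainty, and the paper may use a different estimator. All function names, the toy noisy predictor standing in for a dropout-enabled BERT model, and the linear mixing rule are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def make_toy_guide(logits: Sequence[float], noise: float,
                   seed: int) -> Callable[[], List[float]]:
    """Hypothetical stand-in for one dropout-enabled forward pass of the guide.

    Gaussian noise on the logits imitates the randomness that dropout
    would inject at inference time.
    """
    rng = random.Random(seed)

    def predict() -> List[float]:
        return softmax([z + rng.gauss(0.0, noise) for z in logits])

    return predict


def mc_dropout_mean(predict: Callable[[], List[float]],
                    n_samples: int) -> List[float]:
    """Average n_samples stochastic passes into one action distribution."""
    samples = [predict() for _ in range(n_samples)]
    n_actions = len(samples[0])
    return [sum(s[a] for s in samples) / n_samples for a in range(n_actions)]


def blend(policy_probs: Sequence[float],
          guide_probs: Sequence[float]) -> Tuple[List[float], float]:
    """Mix guide and policy; the guide's weight is 1 - normalized entropy.

    Assumed rule: a uniform (maximally uncertain) guide gets weight 0,
    a one-hot (fully confident) guide gets weight 1.
    """
    max_entropy = math.log(len(guide_probs))
    trust = 1.0 - entropy(guide_probs) / max_entropy
    trust = min(1.0, max(0.0, trust))  # guard against rounding error
    mixed = [trust * g + (1.0 - trust) * p
             for g, p in zip(guide_probs, policy_probs)]
    total = sum(mixed)
    return [m / total for m in mixed], trust


if __name__ == "__main__":
    policy = [0.7, 0.1, 0.1, 0.1]  # current PPO action probabilities
    guide = mc_dropout_mean(make_toy_guide([2.0, 0.0, 0.0, 0.0], 0.5, seed=0), 30)
    mixed, trust = blend(policy, guide)
    print("guide weight:", round(trust, 3))
    print("blended action probabilities:", [round(m, 3) for m in mixed])
```

Under this rule the agent falls back to its own policy whenever the guide's averaged prediction is close to uniform, which matches the abstract's stated aim of prioritizing reliable priors while preserving adaptability.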