arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26884 2026-05-27 cs.CV

Small Object Detection in Industrial Recycling: A New Dataset and YOLO Performance Evaluation

Oussama Messai, Abbass Zein-Eddine, Abdelouahid Bentamou, Mickael Picq, Nicolas Duquesne, Stéphane Puydarrieux, Yann Gavet

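AI summary: To address the difficulty of detecting small, dense, and overlapping objects in industrial recycling, this paper introduces a new dataset and compares supervised deep-learning methods, evaluating the performance, accuracy, and computational efficiency of YOLO and other systems, and explores the benefits of data augmentation and synthetic images.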

Journal ref
Journal of Electronic Imaging 2026
Abstract

In this paper, we address the problem of detecting small, dense, and overlapping objects, a major challenge in computer vision. Our focus is on reviewing proposed methods based on deep learning supervised approaches. We provide a detailed comparison of these systems on a new dataset of more than 10k images and 120k instances, highlighting their performance, accuracy, and computational efficiency in the industrial recycling process use case. Through this comparative analysis, we identify the most reliable systems currently available and the specific challenges they are designed to tackle. Furthermore, we explore the benefits of data augmentation and synthetic images. Based on our analysis, we also propose potential future directions and innovative solutions that could enhance the effectiveness of small, dense and overlapped object detection systems. The scope of our investigations encompasses object detection, length measurement, and anomaly detection within the context of the recycling process. The anomaly detection strategy is robust against variations in image resolution and zoom levels, ensuring reliable performance in industrial applications. The repository of the proposed dataset, methods and evaluation codes can be found at: https://github.com/o-messai/SDOOD

2605.26601 2026-05-27 cs.CV

FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

Guixian Xu, Yide Liang, Zeli Su, Xuexian Song, Ziyin Zhang, Yushuang Dong, Ting Zhang, Xu Han

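AI summary: To address the lack of reproducible training and evaluation infrastructure for Tibetan vision-language modeling, the paper proposes the FTibSuite resource suite, comprising the FTibData dataset, the FTibBench benchmark, and the FTibVLM baseline model, and reports significant performance gains on multiple tasks.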

Abstract

Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.

2605.25046 2026-05-27 cs.CV cs.AI

TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors

Jun-Wei Hsieh, Meng-Yu Kao, Ghufron Wahyu Kurniawan, Kuan-Chuan Peng

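AI summary: Proposes TinyFormer, a hybrid detector that preserves shallow high-resolution features with a Parallel Bi-fusion Module (PBM) and uses a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization, improving small-object detection accuracy on MS COCO.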

Abstract

YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but their large-stride backbones may suppress tiny instances in deep feature maps and make grid assignment ambiguous. DETR-based models remove hand-crafted post-processing through set prediction, yet they reason over coarse token grids, where tiny objects occupy only a few weak tokens and are easily overlooked during matching. To address these limitations, we propose TinyFormer, a unified YOLO-DETR hybrid real-time detector that combines ViT representations, NMS-free set prediction, and a YOLO-style pyramid neck for accurate small-object detection. TinyFormer introduces a Parallel Bi-fusion Module (PBM), which builds high-resolution shortcuts from shallow stages to the feature pyramid, preserving fine spatial details during multi-scale fusion. We further design a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization. SSA extracts high-resolution cues from early stages and injects them into transformer token embeddings, improving tiny-object localization without sacrificing the global modeling ability of DETR. Experiments on MS COCO show that TinyFormer consistently outperforms recent YOLO-series detectors and the strong DEIMv2 baseline. TinyFormer-X achieves 58.4% AP even without PBM, while adding PBM improves the overall AP to 58.5% and brings a 1.6% AP gain on small objects. With Objects365 pre-training, TinyFormer-X-PBM reaches 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computation. These results demonstrate that TinyFormer bridges dense YOLO-style feature fusion and DETR-style set prediction, providing a strong accuracy-efficiency trade-off for real-time tiny-object detection. Code is available at https://github.com/mmpmmpmmpjosh/TinyFormer.

2605.27372 2026-05-27 cs.CV

G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing

Bharath Raj Nagoor Kani, Noah Snavely

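AI summary: Proposes G3T, a model that predicts gravity-aligned pointmaps instead of camera-centric ones, using scene-structure priors to reduce rotational degrees of freedom and improve 3D reconstruction accuracy.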

Comments
Project Page: https://g3t-paper.github.io/
Abstract

Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal. We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another. To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.
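
The claim that gravity-aligned frames reduce the rotational degrees of freedom can be checked numerically. The sketch below is not G3T's code. It assumes each camera knows the gravity direction in its own frame, and the helper names (`rotation_between`, `align_to_gravity`) and the choice of z as the vertical axis are illustrative. Once both camera frames are rotated so that gravity points along -z, the rotation relating them leaves the vertical axis fixed, so only a single yaw angle remains.

```python
import numpy as np

DOWN = np.array([0.0, 0.0, -1.0])  # assumed gravity direction in the upright frame

def rotation_between(a, b):
    """Smallest rotation matrix taking direction a onto direction b (Rodrigues)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):
        # Antiparallel vectors: rotate by pi about any axis perpendicular to a.
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis = axis / np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    K = np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

def align_to_gravity(gravity_in_cam):
    """Rotation from a camera frame to an upright frame where gravity is -z."""
    return rotation_between(gravity_in_cam, DOWN)

def random_rotation(rng):
    """Random world-to-camera rotation (illustrative stand-in for a real pose)."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0
    return q

rng = np.random.default_rng(0)
R1, R2 = random_rotation(rng), random_rotation(rng)

# World-to-upright rotations for the two views.
U1 = align_to_gravity(R1 @ DOWN) @ R1
U2 = align_to_gravity(R2 @ DOWN) @ R2

# Rotation relating the two upright frames: it fixes the vertical axis.
R_rel = U2 @ U1.T
yaw = np.arctan2(R_rel[1, 0], R_rel[0, 0])
```

Here `R_rel` is fully described by `yaw`, whereas the camera-centric relative rotation `R2 @ R1.T` generally needs three parameters.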

2605.27371 2026-05-27 cs.CY cs.AI

Algorithmic Monocultures in Hiring

Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel, Dan Jurafsky, Percy Liang

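AI summary: Studies how algorithmic monoculture in hiring leads to the same individuals and racial groups being rejected; an analysis of 4 million applications from 3 million applicants finds clear racial disparities and homogeneous outcomes.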

Comments
Published at FAccT 2026. Website: https://algorithmichiring.github.io/
Abstract

Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and Black applicants, 14.74% and 25.87% are submitted to positions that adversely impact Asian and Black applicants, respectively, according to U.S. employment discrimination standards. Individuals also receive homogeneous outcomes: 4% of all applicants who apply to 10 positions are recommended for rejection from all positions, a rate higher than expected by chance. To better understand this homogeneity, we leverage the deterministic replicability of hiring algorithms to generate the outcomes applicants would have received if they applied to all positions. We show that applicants would need to apply widely in order to ensure their applications are considered by a human.
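
The abstract does not say which U.S. standard it applies. A common one is the "four-fifths rule" from the EEOC Uniform Guidelines, under which a group whose selection rate is below 80% of the highest group's rate is treated as evidence of adverse impact. The sketch below assumes that rule. The group names and counts are made up, and the independence baseline in `all_rejected_by_chance` is one simple reading of "expected by chance", not necessarily the paper's.

```python
def selection_rates(outcomes):
    """outcomes: {group: (num_selected, num_applicants)} -> {group: selection rate}."""
    return {group: sel / total for group, (sel, total) in outcomes.items()}

def adversely_impacted(outcomes, threshold=0.8):
    """Flag groups whose rate is below `threshold` times the best group's rate."""
    rates = selection_rates(outcomes)
    best = max(rates.values())
    if best == 0:
        return {group: False for group in rates}  # nobody selected, no ratio to compare
    return {group: (rate / best) < threshold for group, rate in rates.items()}

def all_rejected_by_chance(reject_rate, num_positions):
    """Probability of rejection from every position if decisions were independent."""
    return reject_rate ** num_positions

# Hypothetical counts for one position: (recommended, applied).
position = {"group_a": (50, 100), "group_b": (30, 100), "group_c": (45, 100)}
flags = adversely_impacted(position)
baseline = all_rejected_by_chance(0.5, 10)
```

In this example only `group_b` is flagged, because its rate (0.30) is 60% of the best rate (0.50), while `group_c` at 90% passes.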

2605.27366 2026-05-27 cs.AI cs.CL cs.LG cs.MA

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang

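AI summary: Proposes the MUSE-Autoskill framework, in which a unified skill lifecycle (creation, memory, management, evaluation, and refinement) lets LLM agents continuously improve their task-solving ability; experiments suggest that lifecycle-managed skills improve task success, efficiency, reuse, and cross-agent transfer.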

Comments
30 pages, 8 figures, 13 tables, work in progress
Abstract

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

2605.27361 2026-05-27 cs.AI cs.SY eess.SY

Natural Language Query to Configuration for Retrieval Agents

Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob, Fiodar Kazhamiaka, Esha Choukse, Matei Zaharia

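AI summary: Proposes BRANE, which uses an LLM to convert queries into workload characteristics and trains lightweight predictors to select the best configuration, advancing the cost-quality Pareto frontier on several benchmarks.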

Abstract

Modern retrieval agents expose many configuration choices (LLM, retriever, number of documents, number of hops, and synthesis strategy), each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
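
The inference-time rule described above, picking the configuration that maximizes predicted correctness penalized by cost, can be sketched as a one-line argmax. This is an illustration rather than BRANE's implementation. The linear penalty `p - lam * cost`, the `Config` type, the catalog entries, and the predicted probabilities are all assumptions; in the paper the probabilities come from per-configuration predictors.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    name: str
    cost: float  # serving cost per query, arbitrary units (hypothetical)

def select_config(p_correct, configs, lam):
    """Return the config maximizing predicted correctness minus lam * cost."""
    return max(configs, key=lambda c: p_correct[c.name] - lam * c.cost)

# Hypothetical pipeline catalog and predictor outputs for a single query.
catalog = [
    Config("small-1hop", 1.0),
    Config("large-1hop", 4.0),
    Config("large-3hop", 10.0),
]
predicted = {"small-1hop": 0.55, "large-1hop": 0.80, "large-3hop": 0.90}

# Sweeping lam moves along the cost-quality tradeoff without retraining anything.
cheap = select_config(predicted, catalog, lam=0.2)
balanced = select_config(predicted, catalog, lam=0.05)
accurate = select_config(predicted, catalog, lam=0.0)
```

With these numbers, `lam=0.2` picks the cheapest pipeline, `lam=0.05` picks the middle one, and `lam=0.0` ignores cost and picks the most accurate one.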

2605.27360 2026-05-27 cs.NI cs.AI

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

Tamerlan Aghayev, Maxime Elkael, Michele Polese, Minh Dat Nguyen, Gabriele Gemmi, Andrea Lacava, Ali Saeizadeh, Reshma Prasad, Paolo Testolina, Angelo Feraudo, Soumendra Nanda, Pedram Johari, Salvatore D'Oro, Tommaso Melodia

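AI summary: Proposes the GENESIS framework, which uses three composable primitives (agents, skills, hooks) and the SYNAPSE knowledge layer to turn intents into solutions validated by over-the-air experiments, with the aim of accelerating 6G radio access network R&D.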

Comments
18 pages, 16 figures
Abstract

Cellular research and development (R&D) is throttled by six structural processes that each consume months of manual engineering work per iteration: (i) synthesizing new features from standards or research papers into production code; (ii) conformance and interoperability testing; (iii) hardening against field anomalies and diverse deployment environments; (iv) data-driven optimization of network functionalities; (v) discovering and prototyping novel waveforms, functionalities, and capabilities for future standards; and (vi) securing the stack against vulnerabilities. Although Large Language Models (LLMs) have compressed comparable R&D work in general software engineering from days to minutes, their known pitfalls worsen on Radio Access Network (RAN) use cases: they hallucinate Application Programming Interfaces (APIs) and mis-read specifications, which kills interoperability of RAN components at the first mistake, and they heavily rely on simulations for designing algorithms, which is notorious for breaking when transferred to real hardware. To address these challenges, we present GENESIS, an agentic Artificial Intelligence (AI) framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base. GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.

2605.27358 2026-05-27 cs.LG cs.AI cs.CL

MobileMoE: Scaling On-Device Mixture of Experts

Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, Zechun Liu, Vikas Chandra, Raghuraman Krishnamoorthi

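AI summary: For on-device deployment, proposes MobileMoE, a family of MoE language models with sub-billion active parameters; through joint architecture optimization and four-stage training, it matches or exceeds leading dense and MoE models on 14 benchmarks and runs efficiently on smartphones.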

Abstract

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.

2605.27354 2026-05-27 cs.LG cs.AI cs.CL

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

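AI summary: Proposes SAERL, a framework that uses model internals extracted with sparse autoencoders to model data diversity, difficulty, and quality for reinforcement-learning data engineering, improving accuracy and reducing training steps.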

Abstract

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.

2605.27352 2026-05-27 cs.LG stat.ML

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models

Yuchen Liang, Ness Shroff, Yingbin Liang

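AI summary: Proposes Gibbs-Accelerated Discrete Diffusion (GADD), which uses the concrete score function to construct Gibbs posterior likelihoods and accelerates sampling for uniform-rate discrete diffusion models without additional training, achieving a sampling complexity of $\mathcal{O}(\mathrm{polylog} (\varepsilon^{-1}))$.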

Abstract

Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, but, especially for uniform-rate models, they often require many steps to generate a single sample. Existing acceleration methods either rely on training additional quantities or suffer from slow mixing. In this work, we propose a novel Gibbs-based corrector for discrete diffusion models, termed Gibbs-Accelerated Discrete Diffusion (GADD). GADD leverages the structure of the concrete score function to construct Gibbs posterior likelihoods directly, without requiring any additional training beyond standard score estimation. We show that GADD achieves an overall sampling complexity of $\mathcal{O}(\mathrm{polylog} (\varepsilon^{-1}))$, yielding the first such rate for diffusion-based samplers for uniform-rate discrete diffusion models. We also conduct numerical experiments demonstrating the practical advantages of GADD across synthetic data, zero-shot text sampling, and zero-shot conditional music generation. These results corroborate the theory and show that GADD consistently improves sample quality and wall-clock efficiency over standard baselines, including vanilla Euler methods and CTMC correctors. Beyond this, our theoretical analysis introduces a novel framework for analyzing predictor-corrector methods in discrete diffusion models, which may be of independent interest. Unlike existing approaches that rely on the Girsanov change-of-measure technique, our method is based on an induction argument that tracks error propagation across predictor iterations while accounting for inaccuracies in the corrector updates.

2605.27346 2026-05-27 cs.SD

MERIT: Learning Disentangled Music Representations for Audio Similarity

Abhinaba Roy, Junyi Liang, Dorien Herremans

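AI summary: Existing music similarity models entangle dimensions such as melody, rhythm, and timbre. MERIT learns disentangled, factor-specific representations using a training strategy based on conditional audio generation and source-separated stems, so that each dimension responds independently.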

Abstract

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we use a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in training data. Our evaluations demonstrate strong factor-wise disentanglement. Each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.

2605.27345 2026-05-27 cs.CL

MATCHA: Matching Text via Contrastive Semantic Alignment

Siran Li, Ece Sena Etoglu, Carsten Eickhoff, Seyed Ali Bahrainian

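AI summary: Because existing evaluation metrics fail to detect semantic contradictions, the paper proposes MATCHA, a metric based on dual-view contrastive learning that rewards semantic agreement and penalizes contradiction, outperforming ROUGE and BERTScore on multiple benchmarks.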

Abstract

Reliable evaluation is essential for understanding large language model (LLM) performance, yet today's go-to metrics, namely token-overlap scores (e.g., ROUGE) and embedding-based measures (e.g., BERTScore), often misjudge semantic similarity of documents. Our study shows that both token-overlap metrics and embedding-based metrics routinely assign nearly identical scores to texts that directly contradict each other, thereby potentially masking fundamental errors. We introduce MATCHA, an automatic metric that jointly rewards semantic agreement with a reference and penalizes contradictions. MATCHA employs a dual-view perspective that measures (i) proximity to the gold text and (ii) distance from an adversarially generated counterfactual contradiction. In eight public benchmarks, MATCHA outperforms popular metrics, compared with human annotations on question-answering, image caption generation, natural language inference, summarization, and semantic textual similarity tasks. On the TruthfulQA dataset (i.e., a dataset without a training set, where no embedding-based metrics could locally train on), this improvement in terms of matching texts with a reference reaches 18.38% over ROUGE-L and 20.82% over BERTScore. Both quantitative comparison and qualitative human assessments confirm the efficacy and validity of MATCHA and uncover fundamental weaknesses in pre-existing metrics. Compared with 23 embedding models, including top state-of-the-art ones, used as a metric similar to BERTScore, MATCHA remains the most accurate in distinguishing correct from incorrect statements solely based on a reference. Our code and metric are publicly available (https://github.com/Siran-Li/MATCHA).
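
A toy example makes the dual-view idea concrete. It is not the MATCHA metric, which is learned and uses adversarially generated counterfactuals. Here the similarity is a plain token-overlap (Jaccard) score, the counterfactual is written by hand, and combining the two views by subtraction is an assumption made for illustration.

```python
def jaccard(a, b):
    """Token-overlap similarity, a crude stand-in for ROUGE-style scoring."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def dual_view_score(candidate, reference, counterfactual, sim=jaccard):
    """Reward closeness to the reference, penalize closeness to its contradiction."""
    return sim(candidate, reference) - sim(candidate, counterfactual)

reference = "the drug is safe for children"
counterfactual = "the drug is not safe for children"  # hand-written contradiction

good = "the drug is safe for children"
bad = "the drug is not safe for children"

# Single-view overlap: the contradicting candidate still scores about 0.86.
overlap_bad = jaccard(bad, reference)

# Dual view: the sign separates agreement from contradiction.
score_good = dual_view_score(good, reference, counterfactual)
score_bad = dual_view_score(bad, reference, counterfactual)
```

Even with this weak similarity function, the second view turns a high overlap score for a contradiction into a negative score, which is the failure mode the abstract attributes to single-view metrics.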

2605.27343 2026-05-27 cs.CV cs.LG

Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

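AI summary: Proposes conditioning diffusion models on representations from a pre-trained self-supervised model to enable controllable image generation without extensive annotation, and explores the smoothness and disentanglement of the representation space.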

Abstract

Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a challenge. Conventional approaches rely on conditioning mechanisms, such as text prompts or semantic maps, which require extensively annotated datasets. In this preliminary work, we explore diffusion models conditioned on representations from a pre-trained self-supervised model. The self-conditioning mechanism not only improves the quality of unconditional image generation, but also provides a representation space that can be used to control the generation. We explore this conditioning space by identifying directions of variations, and demonstrate promising properties in terms of smoothness and disentanglement.

2605.27338 2026-05-27 cs.AI cs.CC cs.CL cs.LO

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

Andrea Cuteri, Giuseppe Mazzotta, Francesco Ricca

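AI summary: Studies the complexity of ASP(Q) programs with two quantifiers and weak constraints (2-ASP(Q)^w) and proposes CEGAR-based implementation strategies in the Casper system, which experiments show to be effective.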

Abstract

ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with two quantifiers and weak constraints, denoted as 2-ASP(Q)^w. 2-ASP(Q)^w is a practically relevant fragment of ASP(Q) that is expressive enough to capture optimization problems up to the class Delta_3^P. On the theoretical side, we provide a complete complexity characterization of the main computational tasks for 2-ASP(Q)^w programs, including tight completeness results and the analysis of nontrivial cases that have not been addressed in previous works. On the practical side, we introduce novel strategies for computing (optimal) quantified answer sets in the Casper system, that rely on a Counterexample-Guided Abstraction Refinement (CEGAR) technique tailored to ASP(Q). An experimental evaluation on hard benchmarks from different application domains shows that the proposed techniques are effective in practice.

2605.27336 2026-05-27 cs.CV

PARE: Pruning and Adaptive Routing for Efficient Video Generation

Yutong Wang, Yunke Wang, Tianfan Xue, Yu Qiao, Yaohui Wang, Xinyuan Chen, Chang Xu

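AI summary: Proposes PARE, which compresses width with structure-aware pruning and depth with input-adaptive routing to jointly reduce the computation of video diffusion Transformers, substantially lowering per-step compute on Wan2.1-14B while preserving quality.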

Abstract

Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but typically commit to a fixed architecture that cannot adapt to individual inputs or denoising stages. We propose PARE (Pruning and Adaptive Routing for Efficient video generation), which jointly compresses width and depth with structure-aware pruning and input-adaptive routing. For width, we observe that attention heads specialize into spatial and temporal roles, and design importance scoring that accounts for this distinction to prevent motion-critical temporal heads from being pruned prematurely. For depth, we train a lightweight router conditioned on denoising timestep and visual content to dynamically select which blocks to execute at each step, enabling per-input compute adaptation rather than static block removal. A progressive pipeline first recovers width-pruned quality via distillation, then jointly optimizes the student and router to decouple the two learning objectives. Experiments on Wan2.1-14B for both image-to-video and text-to-video generation show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and composes with step distillation for further acceleration.

2605.27333 2026-05-27 cs.CL

FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents

FinHarness:面向金融LLM代理的内联生命周期安全约束框架

Haoxuan Jia, Yang Liu, Bin Chong, Yingguang Yang, Yancheng Chen, Jiayu Liang, Qian Li, Hanning Lu, Kefu Xu, Hao Zheng, Chongyang Zhang, Hao Peng, Philip S. Yu

AI总结 针对金融LLM代理在阻止提示诱导的未授权操作与批准合法多步骤业务流程之间的冲突,提出FinHarness内联安全约束框架,通过查询监控、工具监控和级联模块实现逐步骤风险评估与自适应验证,显著降低攻击成功率并保持良性批准率。

详情
AI中文摘要

金融LLM代理必须同时阻止提示诱导的未授权操作并批准合法的多步骤业务流程。然而,边界过滤器常常遗漏不可逆的中间轨迹工具调用,而事后LLM判断仅在终止后执行审计——对于干预来说为时已晚,且计算成本随轨迹长度线性增长。我们提出FinHarness,一个内联安全约束框架,通过三个组件端到端地封装金融代理:查询监控器融合单轮意图与跨轮漂移,工具监控器评估每个潜在工具调用,以及级联模块整合每步风险并在轻量级和高级LLM判断之间自适应路由验证。触发的风险因素作为事前证据重新注入代理输入,使代理能够自行拒绝、重新规划或批准。在FinVault上,路由的FinHarness将攻击成功率从38.3%降至15.0%,同时基本保持良性批准率(41.1%→39.3%),并且高级判断调用次数比始终使用高级判断的消融实验减少4.7倍。

英文摘要

Finance LLM agents must simultaneously block prompt-induced unauthorized actions and approve legitimate multi-step business workflows. However, boundary filters often miss irreversible mid-trajectory tool calls, while post-hoc LLM judges perform auditing only after termination, which is too late for intervention and incurs a computational cost that scales linearly with trace length. We present FinHarness, an inline safety harness that wraps a finance agent end-to-end with three components: a Query Monitor that fuses single-turn intent with cross-turn drift, a Tool Monitor that evaluates each prospective tool call, and a Cascade module that integrates per-step risk and adaptively routes verification between a lightweight and an advanced-tier LLM judge. Fired risk factors are re-injected into the agent input as ex-ante evidence, enabling the agent to refuse, re-plan, or approve on its own. On FinVault, routed FinHarness cuts the attack success rate (ASR) from 38.3% to 15.0% while largely preserving benign approval (41.1% to 39.3%), and uses 4.7× fewer advanced-judge calls than an always-advanced ablation.

2605.27332 2026-05-27 cs.SE cs.AI cs.CV

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

EdgeFlow: 基于边缘图增强的VLM流程图处理用于工业需求工程

Zhifei Dou, Shabnam Hassani, Ou Wei

AI总结 提出EdgeFlow方法,通过向视觉语言模型(VLM)输入添加Canny边缘图作为结构先验,无需训练数据或微调即可提升流程图到Mermaid代码的转换精度,在工业数据集上节点F1提升17.39%,边F1提升16.94%。

详情
Comments
10 pages
AI中文摘要

流程图广泛应用于工业需求中,但通常以静态图像形式嵌入。视觉语言模型(VLM)在将这些流程图转换为机器可读模型以支持需求工程活动方面显示出潜力,然而,当直接应用于流程图转换时,它们常常在拓扑关键视觉细节上失败。为了解决这个问题,我们提出了EdgeFlow,它通过向VLM的原始输入添加确定性提取的Canny边缘图——作为结构先验——来改进流程图到Mermaid的转换,无需标注训练数据或领域特定的模型微调。我们在IndusReqFlow(一个来自真实世界需求的数据集)上评估了EdgeFlow。与现成的VLM相比,EdgeFlow将节点级F1提高了17.39个百分点,边级F1提高了16.94个百分点。在路径级别,EdgeFlow将路径F1提高了11.06个百分点,从而更好地支持基于模型的测试。这些结果表明,EdgeFlow提供了一种实用的、无需训练的方法,用于改进工业需求工程中保持拓扑结构的流程图到Mermaid转换。在公共合成基准上的跨数据集评估结果显示没有显著改进;这凸显了需要包含工业数据的多样化基准,以全面评估未来基于VLM的需求工程工具。

英文摘要

Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in converting these flowcharts into machine-readable models for requirements engineering (RE) activities, yet, when directly applied to flowchart conversion, they often fail on topology-critical visual details. To address this, we propose EdgeFlow, which augments a VLM's original input with a deterministically extracted Canny edge map, acting as a structural prior, to improve flowchart-to-Mermaid conversion without requiring annotated training data or domain-specific model fine-tuning. We evaluate EdgeFlow on IndusReqFlow, a dataset sourced from real-world requirements. Compared with off-the-shelf VLMs, EdgeFlow improves node-level F1 by 17.39 percentage points and edge-level F1 by 16.94 percentage points. At the path level, EdgeFlow improves path F1 by 11.06 percentage points, enabling better support for model-based testing. These results demonstrate that EdgeFlow provides a practical, training-free means to improve topology-preserving flowchart-to-Mermaid conversion for industrial RE. Cross-dataset evaluation results on a public synthetic benchmark show no significant improvement; this highlights the need for diverse benchmarks incorporating industrial data for the comprehensive evaluation of future VLM-based RE tools.

2605.27331 2026-05-27 cs.AI

Maat: The Agentic Legal Research Assistant for Competition Protection

Maat: 面向竞争保护的法律研究智能助手

Basant Mounir, Farida Madkour, Amira Abdelaziz, Asmaa Sami

AI总结 提出Maat,一种基于ReAct框架的智能法律研究助手,通过RAG、网络搜索和用户澄清机制,在竞争法案例检索中显著优于现有通用和专用法律助手。

详情
Comments
5 pages, 1 figure
AI中文摘要

进行法律研究的竞争法专家必须查阅大量案例、决定和司法报告,以识别先例并评估竞争和合并案件中的关键要素。尽管通用研究助手(如Claude和ChatGPT)和法律助手(如SaulLM-7B和LegalGPT)越来越多地被用于辅助法律研究,但它们在竞争法分析方面仍然不足:缺乏专门的领域知识,提供不充分的官方引用,或虚构竞争法案例。我们提出Maat,一个ReAct智能体,它协调与研究过程不同任务对应的工具。Maat与竞争法专家迭代设计,使用RAG将案例和发现基于官方来源以确保可靠性,提供丰富的行内引用,在数据库覆盖不足时回退到网络搜索,并在查询模糊时提示用户澄清。Maat在案例特定任务上显著优于所有基线助手,在理论问题任务上表现与最佳基线相当。所使用的数据集可在GitHub上获取。

英文摘要

Competition law experts conducting legal research must review extensive volumes of cases, decisions, and judicial reports to identify precedents and assess key elements in competition and merger cases. Although general research assistants such as Claude and ChatGPT and legal assistants such as SaulLM-7B and LegalGPT are increasingly used to assist legal research, they remain inadequate for competition law analysis: they lack specialized domain expertise, provide insufficient official citations, or hallucinate competition law cases. We propose Maat, a ReAct agent that orchestrates tools corresponding to different tasks of the research process. Designed iteratively with competition law experts, Maat grounds cases and findings in official sources using RAG for reliability, provides rich in-line citations, falls back to web search when database coverage is insufficient, and prompts the user for clarification when queries are ambiguous. Maat significantly outperforms all baseline assistants on case-specific tasks and performs within range of the top baseline on theoretical question tasks. The dataset used is available on GitHub.

2605.27328 2026-05-27 cs.SE cs.AI cs.MA

Governed Evolution of Agent Runtimes through Executable Operational Cognition

通过可执行操作认知实现代理运行时的受控演化

Mariano Garralda-Barrio

AI总结 本文提出一个框架,通过可执行操作认知实现多智能体系统中代理生成工件的受控运行时演化,引入HarnessMutation机制在验证、可追溯、评估和回滚约束下进行生命周期感知的运行时适应。

详情
Comments
14 pages, 4 figures, 1 table. Reference implementation and associated source code available at: https://github.com/mgarralda/governed-runtime
AI中文摘要

近期智能体系统的进展越来越将代码视为可执行的操作基底,而非可丢弃的输出工件。先前的工作如“Code as Agent Harness”将经过验证的智能体生成工件视为运行时实体,可以在长时间运行的认知循环中创建、执行、修订、持久化和重用。然而,这些工件的治理、生命周期管理和操作演化仍未被充分定义。本文提出了一个通过可执行操作认知实现多智能体系统中受控运行时演化的框架。我们将智能体生成工件形式化为持久的运行时能力,这些能力逐渐成为操作基底的一部分,而非瞬时的中间输出。基于这一视角,我们引入了HarnessMutation作为一种受控机制,用于在明确的验证、可追溯性、评估和回滚约束下进行生命周期感知的运行时适应。该框架不将运行时适应视为无限制的自我修改,而是将演化建模为在持久操作记忆上的有界且可观察的过程。它进一步展示了这些思想如何在现代智能体运行时和面向治理的编排系统上实现,为适应性基础设施提供了概念基础,使其演化保持明确、可审计且受约束。

英文摘要

Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifact. Prior work such as "Code as Agent Harness" frames validated agent-generated artifacts as runtime entities that can be created, executed, revised, persisted, and reused within long-running cognitive loops. However, the governance, lifecycle management, and operational evolution of such artifacts remain under-specified. This paper proposes a framework for governed runtime evolution in multi-agent systems through executable operational cognition. We formalize agent-generated artifacts as persistent runtime capabilities that progressively become part of the operational substrate rather than transient intermediate outputs. Building on this perspective, we introduce HarnessMutation as a governed mechanism for lifecycle-aware runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. Rather than treating runtime adaptation as unrestricted self-modification, the proposed framework models evolution as a bounded and observable process over persistent operational memory. It further shows how these ideas can be operationalized over modern agent runtimes and governance-oriented orchestration systems, providing a conceptual foundation for adaptive infrastructures whose evolution remains explicit, auditable, and constrained.

2605.27322 2026-05-27 cs.CL

Semantic Gradients Interactions in SSD: A Case Study in Racial Identity and Hate Speech

SSD中的语义梯度交互:种族身份与仇恨言论的案例研究

Felix Ostrowicki, Hubert Plisiecki

AI总结 本文提出交互式SSD方法,通过语义梯度交互模型研究调节变量对语义含义的影响,并在UC Berkeley仇恨言论语料库上验证了种族身份对仇恨言论判断的调节作用。

详情
AI中文摘要

我们引入了交互式SSD,这是监督语义微分的一种扩展,用于建模语义含义如何在调节变量(如群体、特征或条件)之间变化,使这种变化可检验和可解释。该方法估计一个主语义梯度、一个交互梯度和条件梯度,所有这些都可以通过标准SSD工具进行解释。我们在UC Berkeley仇恨言论测量语料库上进行了说明,测试了注释者的种族身份是否调节了对针对有色人种评论的仇恨言论判断。交互模型检测到显著的调节效应:共享梯度将非人化的敌意与反言论形成对比,而交互梯度则揭示了在哪些语义线索预测仇恨言论评分方面存在较小的群体相关差异。交互式SSD使调节后的含义-结果关系在统计上可检验和可解释。

英文摘要

We introduce interaction SSD, an extension of Supervised Semantic Differential that models how semantic meaning varies across moderators such as groups, traits, or conditions, making this variation testable and interpretable. The method estimates a main semantic gradient, an interaction gradient, and conditional gradients, all interpretable through standard SSD tools. We illustrate it on the UC Berkeley Measuring Hate Speech corpus, testing whether annotator racial identity moderates hate-speech judgments of comments targeting people of color. The interaction model detects a significant moderation effect: the shared gradient contrasts dehumanizing hostility with counter-speech, while the interaction gradient reveals smaller group-linked differences in which semantic cues predict hate-speech ratings. Interaction SSD makes moderated meaning-outcome relationships statistically testable and interpretable.

2605.27320 2026-05-27 cs.AI cs.CY econ.GN q-fin.EC

Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding

建模代理技术债务与随机税:一个用于测量、模拟和仪表盘展示的独立框架

Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu

AI总结 本文提出一个形式化且可管理的框架,区分代理技术债务(累积的设计与治理负债存量)与随机税(使用随机代理时产生的运营负担流),并通过应付账款模拟和电子表格说明其应用。

详情
AI中文摘要

代理AI系统将概率推理与通过工具、上下文、记忆、编排和外部工作流集成进行的委托行动相结合。本文开发了一个形式化且可管理的模型,区分了代理技术债务与随机税。代理技术债务是累积的设计与治理负债存量。随机税是在业务流程中使用随机代理时产生的重复性运营负担流。这两个概念相关但不同:债务可能放大税收,而即使债务最小化,税收仍可能为正。本文从一个紧凑的仪表盘表达式开始,将其扩展为更完整的结构模型,定义所有变量和参数,展示如何从运营数据中估算每个成本类别,并通过应付账款模拟和配套电子表格说明该框架。

英文摘要

Agentic AI systems combine probabilistic reasoning with delegated action through tools, context, memory, orchestration, and external workflow integration. This note develops a formal and managerially usable model that distinguishes Agentic Technical Debt from Stochastic Tax. Agentic Technical Debt is a stock of accumulated design and governance liability. Stochastic Tax is a recurring flow of operating burden that arises when stochastic agents are used in business workflows. The two constructs are related, but they are not the same: debt can amplify the tax, while the tax can remain positive even when debt is minimized. The note starts from a compact dashboard expression, expands it into a fuller structural model, defines all variables and parameters, shows how each cost category can be estimated from operational data, and illustrates the framework with an accounts-payable simulation and companion spreadsheet.

2605.27318 2026-05-27 cs.CV

Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

Q-GeoMem:面向视频空间推理的问题引导几何记忆

Xianqiang Gao, Qizhi Chen, Delin Qu, Haoming Song, Zhigang Wang, Bin Zhao, Dong Wang, Xuelong Li

AI总结 提出Q-GeoMem框架,通过问题引导的几何记忆机制,结合细粒度上下文库和语义几何证据库,在视频空间推理任务中实现最先进性能。

详情
AI中文摘要

视频空间推理需要在时间上累积依赖于视角的证据,同时保留对回答问题有用的信息。现有的空间视频语言模型改进了几何感知和长程上下文建模,但通常将记忆视为通用时间缓存,这可能引入冗余或无关的几何信息,削弱长程推理能力。我们提出Q-GeoMem,一种用于视频空间推理的问题引导几何记忆框架。Q-GeoMem将相机条件几何注入视觉标记,并维护两种互补记忆:用于近期密集特征和相机状态的细粒度上下文库,以及用于紧凑长程证据的语义几何证据库。每个候选帧通过基于Q-Former的问题相关性与相对于已保留库的新颖性的乘积进行评分;该分数被存储并在读取时重用,同时基于容量的替换规则保持库紧凑。在推理过程中,两种记忆在更新前被读取,并与当前帧表示自适应融合。在VSI-Bench和VSTI-Bench上的实验表明,Q-GeoMem在评估的空间推理模型中达到了最先进的性能,验证了问题引导几何记忆的有效性。消融实验进一步验证了所提出的证据评分机制的贡献。

英文摘要

Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose Q-GeoMem, a question-guided geometric memory framework for video spatial reasoning. Q-GeoMem injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of Q-Former-based question relevance and novelty with respect to the retained bank; this score is stored and reused during reading, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that Q-GeoMem achieves state-of-the-art performance among evaluated spatial reasoning models, validating the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.
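The scoring rule described in the Q-GeoMem abstract (relevance times novelty, with capacity-based replacement) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: novelty is taken as one minus the maximum cosine similarity to the retained entries, the relevance value is passed in as a plain number rather than computed by a Q-Former, and the names `cosine`, `novelty`, and `update_bank` are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def novelty(feature: List[float], bank: List[Dict]) -> float:
    """Assumed novelty: 1 minus the highest similarity to any retained entry."""
    if not bank:
        return 1.0
    return 1.0 - max(cosine(feature, entry["feature"]) for entry in bank)


def update_bank(bank: List[Dict], feature: List[float],
                relevance: float, capacity: int) -> float:
    """Score a candidate frame and apply a capacity-based replacement rule.

    The score (relevance * novelty) is stored with the entry so it can be
    reused later when the bank is read.
    """
    score = relevance * novelty(feature, bank)
    entry = {"feature": feature, "score": score}
    if len(bank) < capacity:
        bank.append(entry)
        return score
    # Bank is full: replace the lowest-scoring entry only if the candidate beats it.
    weakest = min(range(len(bank)), key=lambda i: bank[i]["score"])
    if score > bank[weakest]["score"]:
        bank[weakest] = entry
    return score
```

In this sketch, a frame that duplicates an already retained feature gets a novelty of zero and therefore a score of zero, so it is the first entry to be evicted once the bank is full.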

2605.27316 2026-05-27 cs.LG math.OC

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

基于比率单调变换的概率平滑用于全局优化

Kukyoung Jang, Taehyun Cho, Junrui Zhang, Ping Xu, Kyungjae Lee

AI总结 提出一种结合灵活对称单峰核与单调比率变换的通用概率平滑框架,在温和条件下保持全局最优解并保证收敛性,实验证明鲁棒性和竞争力提升。

详情
AI中文摘要

概率平滑是全局优化的标准工具,但现有方法依赖高斯核和特定变换,通常导致强超参数敏感性和有限的鲁棒性。我们提出一个通用平滑框架,将灵活的对称单峰核与基于单调比率的变换相结合。在温和条件下,我们证明平滑后的目标函数保持全局最大值点,并且在放大系数足够大时,所有驻点都集中在真实最优值附近,无需递减的平滑调度。我们进一步为随机梯度上升提供了显式的复杂度界,并证明留一法基线可证明地减少方差。在高维基准测试和黑盒对抗攻击上的实验表明,该方法具有改进的鲁棒性和竞争性能。

英文摘要

Probabilistic smoothing is a standard tool for global optimization, but existing methods rely on Gaussian kernels and specific transforms, often resulting in strong hyperparameter sensitivity and limited robustness. We propose a general smoothing framework that combines flexible symmetric unimodal kernels with monotonic ratio-based transformations. Under mild conditions, we show that the smoothed objective preserves the global maximizer and that all stationary points concentrate near the true optimum for sufficiently large amplification, without requiring a decreasing smoothing schedule. We further provide explicit complexity bounds for stochastic gradient ascent and show that a leave-one-out baseline provably reduces variance. Experiments on high-dimensional benchmarks and black-box adversarial attacks demonstrate improved robustness and competitive performance.
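To make the leave-one-out baseline mentioned in the abstract concrete, here is a minimal sketch of how such a baseline is usually computed for a batch of sampled objective values. It shows the generic construction only; the function name `leave_one_out_advantages` is hypothetical, and the paper's estimator (kernel, transform, and gradient weighting) is not reproduced here.

```python
from typing import List


def leave_one_out_advantages(values: List[float]) -> List[float]:
    """Subtract from each sample the mean of all *other* samples.

    For sample i the baseline is b_i = (sum_j f_j - f_i) / (n - 1), so the
    baseline never depends on the sample it is applied to.
    """
    n = len(values)
    if n < 2:
        raise ValueError("a leave-one-out baseline needs at least two samples")
    total = sum(values)
    return [v - (total - v) / (n - 1) for v in values]
```

Because each baseline excludes its own sample, subtracting it does not bias a score-function gradient estimate, which is the usual reason this construction is preferred over a shared batch mean.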

2605.27315 2026-05-27 cs.CL

Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

真实图像,更差判断:评估视觉-语言模型在具体性和意象性上的表现

Yifan Jiang, Ruoxi Ning, Sheng Yao, Freda Shi

AI总结 本文通过具体性和意象性评分任务,发现真实图像上下文不仅未提升视觉-语言模型与人类判断的一致性,反而在视觉证据相关性低时加剧偏差,并表明推理时聚焦文本指令可缓解退化。

详情
AI中文摘要

视觉输入通常被认为能改善多模态模型的语言理解。我们通过询问视觉-语言模型(VLM)在词汇判断中能否区分有用的视觉证据与偶然的图像上下文来检验这一假设。我们使用人类的具体性和意象性评分,因为它们涵盖从抽象低意象词到具体高意象词等不同视觉相关性的词汇。我们发现,真实图像上下文并未带来一致的增益,反而常常损害与人类评分的一致性,尤其是在视觉证据最不相关时。通过探针分析和典型相关分析,辅以归因案例研究,我们发现真实图像上下文与表征偏移和对虚假视觉线索的更高敏感性相关,同时目标词汇属性的可恢复性减弱。我们进一步表明,在推理时指示模型仅关注文本内容可以减少这种退化,尤其是在这些脆弱子集上效果最明显。我们的发现表明,当前指令微调的VLM需要更好地校准视觉上下文何时应影响词汇判断。

英文摘要

Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context in lexical judgments. We use human concreteness and imagery ratings because they span words with varying expected visual relevance, from abstract and low-imagery words to concrete and high-imagery words. We find that real-image contexts do not yield consistent gains and often hurt alignment with human ratings, most sharply when visual evidence is least relevant. Through probing and canonical correlation analysis, complemented by an attribution case study, we find that real-image contexts are associated with representational shifts and greater sensitivity to spurious visual cues, coinciding with weaker recoverability of the targeted lexical properties. We further show that instructing models to focus solely on textual content at inference time can reduce this degradation, with the clearest gains on these vulnerable subsets. Our findings suggest that current instruction-tuned VLMs need better calibration of when visual context should inform lexical judgments.

2605.27314 2026-05-27 cs.RO cs.SY eess.SY

Riding the Shifting Potential: When Reactive Control Suffices for Multi-Goal Behavior

驾驭变化势场:何时反应控制足以实现多目标行为

Vito Mengers, Oliver Brock

AI总结 本文提出通过零空间投影扩展图基世界模型中的交互结构,动态调整优先级以解决多目标冲突,在非凸障碍导航和非凸物体平面推动任务中实现100%成功率,无需演示或重新训练。

详情
AI中文摘要

反应控制通常被认为不足以处理多目标任务,因为冲突目标会导致局部极小值。我们认为这一限制并非固有,而是源于无法反映目标当前交互方式的静态编码。我们利用图基世界模型中编码的交互结构,通过零空间投影对其进行扩展:冲突在产生处通过将低优先级梯度投影到高优先级梯度的零空间来解决,优先级根据当前状态连续确定。我们在两个目标冲突为核心问题的领域中进行了演示:非凸障碍导航(静态势场在此根本失败)和非凸物体的平面推动(我们的方法在一百个配置中达到100%成功率,而最速下降基线为0%,扩散策略约为55%,无需演示或重新训练)。相同的公式直接迁移到具有额外感知和运动学约束的真实机器人上,通过相同机制适应这些约束。

英文摘要

Reactive control is often considered insufficient for multi-objective tasks because conflicting objectives give rise to local minima. We argue this limitation is not inherent but arises from static encodings that fail to reflect how objectives currently interact. We exploit the interaction structure encoded in a graph-based world model by extending it with nullspace projections: conflicts are resolved where they arise by projecting lower-priority gradients into the nullspace of higher-priority ones, with priorities determined continuously from the current state. We demonstrate this in two domains where conflicts between objectives are central: navigation around non-convex obstacles, where static potential fields fundamentally fail, and planar pushing of non-convex objects, where our method achieves 100% success across 100 configurations versus 0% for the steepest-descent baseline and about 55% for diffusion policy, without demonstrations or retraining. The same formulation transfers directly to a real robot with additional perceptual and kinematic constraints, accommodating them through the same mechanism.
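The nullspace projection named in the abstract can be sketched for the simplest case of two objectives. For a single higher-priority gradient g_hi, projecting g_lo into its nullspace means removing the component of g_lo along g_hi. The helper names below are hypothetical, the priority order is fixed by the caller, and the graph-based world model and state-dependent priorities from the paper are not modelled.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_into_nullspace(low: List[float], high: List[float],
                           eps: float = 1e-12) -> List[float]:
    """Return `low` with its component along `high` removed.

    Equivalent to applying (I - g g^T / (g^T g)) with g = high.
    """
    denom = dot(high, high)
    if denom < eps:
        # A vanishing high-priority gradient imposes no constraint.
        return list(low)
    scale = dot(low, high) / denom
    return [l - scale * h for l, h in zip(low, high)]


def combined_step(high: List[float], low: List[float]) -> List[float]:
    """High-priority gradient plus the non-conflicting part of the low-priority one."""
    low_proj = project_into_nullspace(low, high)
    return [h + l for h, l in zip(high, low_proj)]
```

With `high = [1.0, 0.0]` and `low = [-1.0, 1.0]`, the conflicting x-component of `low` is removed, so the lower-priority objective can only act in directions that leave the higher-priority one unaffected.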

2605.27313 2026-05-27 cs.CL

When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection

人口统计信息何时有帮助?面向观点的仇恨言论检测的数据与建模机制

Weibin Cai, Reza Zafarani

AI总结 本文通过分析数据划分属性和建模框架,研究了人口统计特征在仇恨言论检测中的增益条件,并提出了一种门控人口统计残差模型,在低训练分歧、高测试分歧等机制下有效提升性能。

详情
AI中文摘要

人口统计信息通常用于在主观任务(如仇恨言论检测)中对标注者观点进行建模,但其益处并不一致:在某些设置下提升性能,在其他设置下则表现为噪声。本文探讨了人口统计特征何时有帮助。我们分析了人口统计增益作为数据划分属性和建模框架的函数。对于数据划分,我们测量了标注者分歧,即标注者给同一示例分配不同标签的频率,以及训练规模和训练-测试人口统计覆盖率。我们发现人口统计增益集中在低训练分歧、高测试分歧、细粒度模糊性测量、充足训练数据和更大人口统计重叠的机制中。受这些机制的启发,我们引入了一种门控人口统计残差模型,将人口统计视为对纯文本预测的选择性调整。在MHS和POPQUORN上的实验表明,这种设计是有效的,尤其是在高分歧或低置信度的示例上。总体而言,我们的结果表明,人口统计信息不应默认假定有用;其价值取决于数据机制和建模框架的共同作用。

英文摘要

Demographic information is often used to model annotator perspectives in subjective tasks such as hate speech detection, but its benefit is inconsistent: it improves performance in some settings and behaves as noise in others. This paper asks when demographic features help. We analyze demographic gain as a function of both data split properties and modeling frameworks. For data splits, we measure annotator disagreement, namely how often annotators assign different labels to the same example, along with training size and train-test demographic coverage. We find that demographic gains concentrate in regimes with low training disagreement, high test disagreement, fine-grained ambiguity measurement, sufficient training data, and greater demographic overlap. Motivated by these regimes, we introduce a gated demographic residual model that treats demographics as a selective adjustment to text-only predictions. Experiments on MHS and POPQUORN show that this design is effective, especially on high disagreement or low confidence examples. Overall, our results suggest that demographics should not be assumed useful by default; their value depends jointly on the data regime and the modeling framework.
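One way to read "a selective adjustment to text-only predictions" is as a gated residual on the text-only logit. The sketch below is an assumption made for illustration, not the paper's architecture: the function `gated_prediction` and its three scalar inputs are hypothetical, and in a real model each would be produced by a learned network.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_prediction(text_logit: float, demo_residual: float,
                     gate_logit: float) -> float:
    """Probability from a text-only logit plus a gated demographic correction.

    gate = sigmoid(gate_logit) lies in (0, 1). When it is near 0 the
    prediction falls back to the text-only model; when it is near 1 the
    full demographic residual is applied.
    """
    gate = sigmoid(gate_logit)
    return sigmoid(text_logit + gate * demo_residual)
```

Under this reading, the gate lets the model ignore demographics on examples where they act as noise and apply them where annotators are likely to disagree.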

2605.27311 2026-05-27 cs.CL cs.CV

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Chartographer: 用于评估视觉语言模型的反事实图表生成

Yifan Jiang, Dae Yon Hwang, Jesse C. Cresswell, Freda Shi

AI总结 提出 Chartographer 框架,通过将图表逆向工程为可执行代码并生成反事实变体,揭示视觉语言模型在图表问答中的视觉推理缺陷。

详情
AI中文摘要

图表问答(QA)基准旨在提出需要视觉推理才能正确回答的问题,但模型通常可以通过捷径或基于自身背景知识对图表的先验熟悉度来达到解决方案。为了严格评估视觉推理,我们提出了反事实图表,其中图表问答任务保持不变,但底层图表和相应答案发生变化。我们引入了 Chartographer,一个将图表逆向工程为可执行代码、验证重构保真度、生成种子控制的反事实变体以及从可执行问答逻辑中推导新答案的框架。我们将该框架应用于现有的图表 QA 数据集,并评估了专有和开源视觉语言模型(VLM),测量了变化敏感性和泛化能力。反事实图表揭示了单一图表性能所隐藏的失败:VLM 在正确回答原始图表后通常无法泛化。我们发现,当更新后的图表需要新的视觉推理路径时,失败最为普遍。

英文摘要

Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to answer correctly, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts, where the chart-question task remains fixed but the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find failures are most prevalent when updated charts require novel visual reasoning pathways.

2605.27310 2026-05-27 cs.CV

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

如何以及想象什么?统一多模态模型中的视觉思维用于跨视角空间推理

Qian Yang, Ankur Sikarwar, Huy Le, Le Zhang, Zhuan Shi, Perouz Taslakian, Aishwarya Agrawal

AI总结 提出View Dropout训练策略使模型利用中间思维图像进行跨视角空间推理,并发现全景视觉思维在信息性和可学习性上最优。

详情
Comments
Preprint
AI中文摘要

跨视角空间推理仍然是视觉语言模型(VLM)的薄弱环节:它们通常用语言推理,丢失了任务所需的细粒度几何信息。用图像思考旨在通过生成中间思维图像来解决这一问题,但近期工作表明模型常常忽略这些轨迹中的视觉证据。因此,我们提出如何让视觉思维起作用,以及哪种视觉思维效果最好。我们在统一多模态模型(UMMs)中研究这些问题,这类模型原生支持交错的图像-文本生成。对于第一个问题,我们提出视图丢弃(VDrop),一种训练时干预方法,将输入视图的部分内容从答案跨度中隐藏,同时使其对思维图像令牌可见。这鼓励模型在回答时使用思维图像,而不是仅依赖输入视图。一旦思维图像用于答案预测,我们研究哪种类型的视觉思维最有效。我们将其表述为可学习性-信息性权衡,并比较三种思维图像变体:俯视图、全景图和点匹配渲染图。在合成场景上训练,并在五个真实世界域外基准上评估,采用VDrop的全景视觉思维是唯一既信息丰富又可学习的配置,并实现了最佳的域外泛化。

英文摘要

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.

2605.27309 2026-05-27 cs.LG cs.OH

Greening AI Inference with Accuracy and Latency-aware User Incentives

通过准确性和延迟感知的用户激励实现绿色AI推理

Vasilios A. Siris, Adamantia Stamou, George D. Stamoulis, Konstantinos Varsos, Ramin Khalili

AI总结 提出一种基于用户对推理质量和延迟的估值以及环境意识的激励框架,通过双层级服务订阅平衡碳排放与QoE参数。

详情
Journal ref
IEEE Internet Computing, 2026
AI中文摘要

AI服务的广泛使用引发了对其环境可持续性的担忧,最近的研究表明AI推理的碳排放是主要贡献者。本文介绍了一个框架,基于用户对推理质量和延迟的估值以及他们的环境意识,同时考虑碳排放与这两个QoE参数之间的权衡,来设计AI推理激励。我们的方法可以适应不同的权衡,这取决于AI模型的大小和复杂性以及用于服务推理请求的资源分配。这些激励可以通过一个实用的双层级服务订阅来提供,该订阅为用户提供折扣以换取减少的碳排放。折扣服务选项使AI提供商能够在高碳强度期间以较低的质量和较高的延迟服务一定比例的推理请求。

英文摘要

The widespread use of AI services has raised concerns about their environmental sustainability, and recent studies have identified the carbon emissions of AI inference as the major contributor. This paper introduces a framework for designing AI inference incentives based on users' valuation of inference quality and latency, together with their environmental consciousness, while accounting for the tradeoff between carbon emissions and the two QoE parameters. Our approach can accommodate different tradeoffs that depend on the size and complexity of the AI models and the allocation of resources to serve inference requests. The incentives can be offered through a practical two-tier service subscription that offers users a discount in exchange for reduced carbon emissions. The discounted service option gives the AI provider the flexibility to serve some percentage of inference requests at a lower quality and higher latency during periods of high carbon intensity.