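arXivDaily: a daily academic digest that syncs the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.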

2605.26884 2026-05-27 cs.CV

Small Object Detection in Industrial Recycling: A New Dataset and YOLO Performance Evaluation

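Oussama Messai, Abbass Zein-Eddine, Abdelouahid Bentamou, Mickael Picq, Nicolas Duquesne, Stéphane Puydarrieux, Yann Gavet

AI summary: To address the difficulty of detecting small, dense, and overlapping objects in industrial recycling, this paper introduces a new dataset, compares supervised deep-learning methods, evaluates the performance, accuracy, and computational efficiency of YOLO and other systems, and explores the benefits of data augmentation and synthetic images.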

DOI: 10.1117/1.jei.35.3.031203
Journal ref: Journal of Electronic Imaging 2026

Abstract

In this paper, we address the problem of detecting small, dense, and overlapping objects, a major challenge in computer vision. Our focus is on reviewing proposed methods based on deep learning supervised approaches. We provide a detailed comparison of these systems on a new dataset of more than 10k images and 120k instances, highlighting their performance, accuracy, and computational efficiency in the industrial recycling process use case. Through this comparative analysis, we identify the most reliable systems currently available and the specific challenges they are designed to tackle. Furthermore, we explore the benefits of data augmentation and synthetic images. Based on our analysis, we also propose potential future directions and innovative solutions that could enhance the effectiveness of small, dense and overlapped object detection systems. The scope of our investigations encompasses object detection, length measurement, and anomaly detection within the context of the recycling process. The anomaly detection strategy is robust against variations in image resolution and zoom levels, ensuring reliable performance in industrial applications. The repository of the proposed dataset, methods and evaluation codes can be found at: https://github.com/o-messai/SDOOD

2605.26601 2026-05-27 cs.CV

FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

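Guixian Xu, Yide Liang, Zeli Su, Xuexian Song, Ziyin Zhang, Yushuang Dong, Ting Zhang, Xu Han

AI summary: To address the lack of reproducible training and evaluation infrastructure for Tibetan vision-language modeling, the paper proposes the FTibSuite resource suite, comprising the FTibData dataset, the FTibBench benchmark, and the FTibVLM baseline model, and reports significant performance gains on multiple tasks.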

Abstract

Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.

2605.25046 2026-05-27 cs.CV cs.AI

TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors

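Jun-Wei Hsieh, Meng-Yu Kao, Ghufron Wahyu Kurniawan, Kuan-Chuan Peng

AI summary: Proposes TinyFormer, a hybrid detector that preserves shallow high-resolution features through a Parallel Bi-fusion Module (PBM) and uses a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization, improving small-object detection accuracy on MS COCO.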

Abstract

YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but their large-stride backbones may suppress tiny instances in deep feature maps and make grid assignment ambiguous. DETR-based models remove hand-crafted post-processing through set prediction, yet they reason over coarse token grids, where tiny objects occupy only a few weak tokens and are easily overlooked during matching. To address these limitations, we propose TinyFormer, a unified YOLO-DETR hybrid real-time detector that combines ViT representations, NMS-free set prediction, and a YOLO-style pyramid neck for accurate small-object detection. TinyFormer introduces a Parallel Bi-fusion Module (PBM), which builds high-resolution shortcuts from shallow stages to the feature pyramid, preserving fine spatial details during multi-scale fusion. We further design a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization. SSA extracts high-resolution cues from early stages and injects them into transformer token embeddings, improving tiny-object localization without sacrificing the global modeling ability of DETR. Experiments on MS COCO show that TinyFormer consistently outperforms recent YOLO-series detectors and the strong DEIMv2 baseline. TinyFormer-X achieves 58.4% AP even without PBM, while adding PBM improves the overall AP to 58.5% and brings a 1.6% AP gain on small objects. With Objects365 pre-training, TinyFormer-X-PBM reaches 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computation. These results demonstrate that TinyFormer bridges dense YOLO-style feature fusion and DETR-style set prediction, providing a strong accuracy-efficiency trade-off for real-time tiny-object detection. Code is available at https://github.com/mmpmmpmmpjosh/TinyFormer.
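
The stride argument in this abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the 16-pixel object and the strides 8, 16, and 32 are assumed values, not figures from the paper, and the helper names are hypothetical.

```python
def cells_spanned(object_px: float, stride: int) -> float:
    """Side length of a square object, measured in feature-map cells at this stride."""
    return object_px / stride


def cell_area_fraction(object_px: float, stride: int) -> float:
    """Fraction of a single cell's area that the object covers (capped at 1.0)."""
    return min(1.0, (object_px / stride) ** 2)


if __name__ == "__main__":
    # Assumed example: a 16x16-pixel object at three common strides.
    for stride in (8, 16, 32):
        print(stride, cells_spanned(16, stride), cell_area_fraction(16, stride))
```

At stride 32, the assumed 16-pixel object covers half a cell along each side and a quarter of a cell by area, which is one way to see why high-resolution shortcuts from shallow stages can help.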

2605.27372 2026-05-27 cs.CV

G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing

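Bharath Raj Nagoor Kani, Noah Snavely

AI summary: Proposes the G3T model, which predicts gravity-aligned pointmaps instead of camera-centric ones, using scene-structure priors to reduce rotational degrees of freedom and improve 3D reconstruction accuracy.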

Comments: Project Page: https://g3t-paper.github.io/

Abstract

Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal. We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another. To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.
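
One way to see the reduced rotational freedom: if two pointmaps are already gravity-aligned, the rotation relating them is a single yaw angle about the shared vertical axis, and that angle has a closed-form least-squares estimate. The sketch below assumes a z-up convention, known point correspondences, and no translation between the frames. It is not the G3T-Long alignment procedure, and the function names are hypothetical.

```python
import math


def rotate_about_vertical(point, theta):
    """Rotate a 3D point by theta radians about the z (vertical) axis."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)


def estimate_yaw(src, dst):
    """Least-squares yaw (radians) that rotates src onto dst.

    Only the horizontal (x, y) coordinates matter, because both point sets
    are assumed to share the same vertical axis.
    """
    num = sum(px * qy - py * qx for (px, py, _), (qx, qy, _) in zip(src, dst))
    den = sum(px * qx + py * qy for (px, py, _), (qx, qy, _) in zip(src, dst))
    return math.atan2(num, den)


if __name__ == "__main__":
    src = [(1.0, 0.0, 0.0), (0.0, 2.0, 1.0), (1.5, -0.5, 2.0)]
    dst = [rotate_about_vertical(p, 0.3) for p in src]
    print(estimate_yaw(src, dst))
```

A general camera-centric alignment would need a full 3D rotation (three degrees of freedom); here one angle is enough.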

2605.27371 2026-05-27 cs.CY cs.AI

Algorithmic Monocultures in Hiring

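Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel, Dan Jurafsky, Percy Liang

AI summary: Studies how algorithmic monoculture in hiring leads to the same individuals and racial groups being rejected; analyzing 4 million applications from 3 million applicants, it finds clear racial disparities and homogeneous outcomes.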

Comments: Published at FAccT 2026. Website: https://algorithmichiring.github.io/

Abstract

Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and Black applicants, 14.74% and 25.87% are submitted to positions that adversely impact Asian and Black applicants, respectively, according to U.S. employment discrimination standards. Individuals also receive homogeneous outcomes: 4% of all applicants who apply to 10 positions are recommended for rejection from all positions, a rate higher than expected by chance. To better understand this homogeneity, we leverage the deterministic replicability of hiring algorithms to generate the outcomes applicants would have received if they applied to all positions. We show that applicants would need to apply widely in order to ensure their applications are considered by a human.
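
Two quantities in this abstract can be illustrated with simple arithmetic. The abstract does not say which U.S. standard or which chance baseline the authors use, so both choices below are assumptions: the four-fifths rule from the EEOC Uniform Guidelines as the adverse-impact test, and independent decisions as the chance baseline. All rates are made-up examples and the function names are hypothetical.

```python
def chance_all_rejected(rejection_rate: float, n_positions: int) -> float:
    """Probability of being rejected everywhere if every decision were independent."""
    return rejection_rate ** n_positions


def adverse_impact(group_rate: float, reference_rate: float, threshold: float = 0.8) -> bool:
    """Four-fifths rule: flag a group whose selection rate is below 80% of the reference rate."""
    return group_rate / reference_rate < threshold


if __name__ == "__main__":
    # Assumed example: a 50% per-position rejection rate across 10 positions.
    print(chance_all_rejected(0.5, 10))
    # Assumed example: selection rates of 30% versus a 50% reference rate.
    print(adverse_impact(0.30, 0.50))
```

Under the independence assumption, rejection from all 10 positions is rare unless the per-position rejection rate is high, so a noticeably larger observed share would point toward correlated decisions.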

2605.27366 2026-05-27 cs.AI cs.CL cs.LG cs.MA

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

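Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang

AI summary: Proposes the MUSE-Autoskill framework, which lets LLM agents continuously improve their task-solving capability through a unified skill lifecycle (creation, memory, management, evaluation, and refinement); experiments indicate that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer.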

Comments: 30 pages, 8 figures, 13 tables, work in progress

Abstract

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

2605.27361 2026-05-27 cs.AI cs.SY eess.SY

Natural Language Query to Configuration for Retrieval Agents

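Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob, Fiodar Kazhamiaka, Esha Choukse, Matei Zaharia

AI summary: Proposes BRANE, which uses an LLM to convert queries into workload characteristics and trains lightweight predictors to select the best configuration, improving the cost-quality Pareto frontier on multiple benchmarks.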

Abstract

Modern retrieval agents expose many configuration choices (LLM, retriever, number of documents, number of hops, and synthesis strategy), each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
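
The selection rule described in this abstract, predicted correctness penalized by cost, can be sketched in a few lines. The linear penalty with weight `lam`, the configuration names, and all numbers are assumptions made for illustration; the abstract does not specify the exact form of the penalty, and this is not the authors' implementation.

```python
from typing import Dict


def select_config(p_correct: Dict[str, float], cost: Dict[str, float], lam: float) -> str:
    """Return the configuration with the highest cost-penalized score.

    p_correct: predicted probability that each configuration answers correctly.
    cost:      serving cost of each configuration (arbitrary units).
    lam:       assumed trade-off weight; 0 ignores cost entirely.
    """
    return max(p_correct, key=lambda name: p_correct[name] - lam * cost[name])


if __name__ == "__main__":
    # Hypothetical two-entry pipeline catalog.
    p = {"small-1hop": 0.6, "large-3hop": 0.9}
    c = {"small-1hop": 1.0, "large-3hop": 10.0}
    print(select_config(p, c, lam=0.0))
    print(select_config(p, c, lam=0.05))
```

Sweeping `lam` moves the choice along a cost-quality curve without retraining the predictors, which matches the tunable tradeoff the abstract describes.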

2605.27360 2026-05-27 cs.NI cs.AI

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

Tamerlan Aghayev, Maxime Elkael, Michele Polese, Minh Dat Nguyen, Gabriele Gemmi, Andrea Lacava, Ali Saeizadeh, Reshma Prasad, Paolo Testolina, Angelo Feraudo, Soumendra Nanda, Pedram Johari, Salvatore D'Oro, Tommaso Melodia

AI summary: Proposes the GENESIS framework, which uses three composable primitives (agents, skills, and hooks) and a knowledge layer (SYNAPSE) to convert intents into solutions validated with over-the-air experiments, with the aim of accelerating 6G radio access network R&D.

Comments: 18 pages, 16 figures

Abstract

Cellular research and development (R&D) is throttled by six structural processes that each consume months of manual engineering work per iteration: (i) synthesizing new features from standards or research papers into production code; (ii) conformance and interoperability testing; (iii) hardening against field anomalies and diverse deployment environments; (iv) data-driven optimization of network functionalities; (v) discovering and prototyping novel waveforms, functionalities, and capabilities for future standards; and (vi) securing the stack against vulnerabilities. Although Large Language Models (LLMs) have compressed comparable R&D work in general software engineering from days to minutes, their known pitfalls worsen on Radio Access Network (RAN) use cases: they hallucinate Application Programming Interfaces (APIs) and misread specifications, which kills interoperability of RAN components at the first mistake, and they rely heavily on simulation to design algorithms, an approach notorious for breaking when transferred to real hardware. To address these challenges, we present GENESIS, an agentic Artificial Intelligence (AI) framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base. GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.

2605.27358 2026-05-27 cs.LG cs.AI cs.CL

MobileMoE: Scaling On-Device Mixture of Experts

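Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, Zechun Liu, Vikas Chandra, Raghuraman Krishnamoorthi

AI summary: For on-device deployment, proposes MobileMoE, a family of MoE language models with sub-billion active parameters; through joint architecture optimization and four-stage training, it matches or exceeds leading dense and MoE models on 14 benchmarks and achieves efficient inference on smartphones.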

Abstract

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.

2605.27354 2026-05-27 cs.LG cs.AI cs.CL

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

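Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

AI summary: Proposes SAERL, a framework that uses model internals extracted with sparse autoencoders to model data diversity, difficulty, and quality for reinforcement-learning data engineering, improving accuracy and reducing training steps.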

Abstract

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.

2605.27352 2026-05-27 cs.LG stat.ML

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models

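Yuchen Liang, Ness Shroff, Yingbin Liang

AI summary: Proposes Gibbs-Accelerated Discrete Diffusion (GADD), which uses the concrete score function to construct Gibbs posterior likelihoods, accelerating sampling for uniform-rate discrete diffusion models without additional training and achieving $\mathcal{O}(\mathrm{polylog} (\varepsilon^{-1}))$ sampling complexity.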

Abstract

Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, but, especially for uniform-rate models, they often require many steps to generate a single sample. Existing acceleration methods either rely on training additional quantities or suffer from slow mixing. In this work, we propose a novel Gibbs-based corrector for discrete diffusion models, termed Gibbs-Accelerated Discrete Diffusion (GADD). GADD leverages the structure of the concrete score function to construct Gibbs posterior likelihoods directly, without requiring any additional training beyond standard score estimation. We show that GADD achieves an overall sampling complexity of $\mathcal{O}(\mathrm{polylog} (\varepsilon^{-1}))$, yielding the first such rate for diffusion-based samplers for uniform-rate discrete diffusion models. We also conduct numerical experiments demonstrating the practical advantages of GADD across synthetic data, zero-shot text sampling, and zero-shot conditional music generation. These results corroborate the theory and show that GADD consistently improves sample quality and wall-clock efficiency over standard baselines, including vanilla Euler methods and CTMC correctors. Beyond this, our theoretical analysis introduces a novel framework for analyzing predictor-corrector methods in discrete diffusion models, which may be of independent interest. Unlike existing approaches that rely on the Girsanov change-of-measure technique, our method is based on an induction argument that tracks error propagation across predictor iterations while accounting for inaccuracies in the corrector updates.

2605.27346 2026-05-27 cs.SD

MERIT: Learning Disentangled Music Representations for Audio Similarity

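Abhinaba Roy, Junyi Liang, Dorien Herremans

AI summary: Existing music similarity models entangle dimensions such as melody, rhythm, and timbre; MERIT addresses this by learning disentangled, factor-specific representations through a training strategy based on conditional audio generation and source-separated stems, so that each dimension responds independently.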

2605.27345 2026-05-27 cs.CL

MATCHA: Matching Text via Contrastive Semantic Alignment

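Siran Li, Ece Sena Etoglu, Carsten Eickhoff, Seyed Ali Bahrainian

AI summary: Existing evaluation metrics fail to detect semantic contradictions; the paper proposes MATCHA, a metric that uses dual-view contrastive learning to reward semantic agreement and penalize contradictions, outperforming ROUGE and BERTScore on multiple benchmarks.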

Abstract

Reliable evaluation is essential for understanding large language model (LLM) performance, yet today's go-to metrics, namely token-overlap scores (e.g., ROUGE) and embedding-based measures (e.g., BERTScore), often misjudge the semantic similarity of documents. Our study shows that both token-overlap metrics and embedding-based metrics routinely assign nearly identical scores to texts that directly contradict each other, thereby potentially masking fundamental errors. We introduce MATCHA, an automatic metric that jointly rewards semantic agreement with a reference and penalizes contradictions. MATCHA employs a dual-view perspective that measures (i) proximity to the gold text and (ii) distance from an adversarially generated counterfactual contradiction. On eight public benchmarks, MATCHA outperforms popular metrics when compared against human annotations on question-answering, image caption generation, natural language inference, summarization, and semantic textual similarity tasks. On the TruthfulQA dataset (i.e., a dataset without a training set, on which no embedding-based metric could be trained locally), this improvement in matching texts with a reference reaches 18.38% over ROUGE-L and 20.82% over BERTScore. Both quantitative comparison and qualitative human assessments confirm the efficacy and validity of MATCHA and uncover fundamental weaknesses in pre-existing metrics. Compared with 23 embedding models, including top state-of-the-art ones, used as a metric similar to BERTScore, MATCHA remains the most accurate in distinguishing correct from incorrect statements solely based on a reference. Our code and metric are publicly available (https://github.com/Siran-Li/MATCHA).
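
The dual-view idea can be pictured as a score that goes up with closeness to the gold text and down with closeness to a contradiction. The sketch below is a toy under stated assumptions: the score is taken to be a plain difference of cosine similarities, and the two-dimensional vectors stand in for text embeddings. The actual MATCHA formulation is in the linked repository and may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dual_view_score(candidate, gold, contradiction):
    """Assumed combination: similarity to gold minus similarity to the contradiction."""
    return cosine(candidate, gold) - cosine(candidate, contradiction)


if __name__ == "__main__":
    gold, contradiction = (1.0, 0.0), (0.0, 1.0)  # toy embeddings
    print(dual_view_score((1.0, 0.0), gold, contradiction))  # agrees with gold
    print(dual_view_score((0.0, 1.0), gold, contradiction))  # agrees with contradiction
```

A similarity-only metric would score the second candidate on proximity to the gold text alone; the second term is what separates a contradiction from a merely unrelated text.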

2605.27343 2026-05-27 cs.CV cs.LG

Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

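Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

AI summary: Proposes conditioning diffusion models on representations from pretrained self-supervised models to enable controllable image generation without extensive annotation, and explores the smoothness and separation properties of the representation space.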

2605.27338 2026-05-27 cs.AI cs.CC cs.CL cs.LO

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

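Andrea Cuteri, Giuseppe Mazzotta, Francesco Ricca

AI summary: Studies the complexity of ASP(Q) programs with two quantifiers and weak constraints (2-ASP(Q)^w) and proposes an implementation strategy, realized in the Casper system, based on CEGAR techniques; experiments demonstrate its effectiveness.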

2605.27336 2026-05-27 cs.CV

PARE: Pruning and Adaptive Routing for Efficient Video Generation

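Yutong Wang, Yunke Wang, Tianfan Xue, Yu Qiao, Yaohui Wang, Xinyuan Chen, Chang Xu

AI summary: Proposes PARE, which compresses width through structure-aware pruning and depth through input-adaptive routing, jointly reducing the computation of video diffusion Transformers; on Wan2.1-14B it substantially lowers per-step computation while preserving quality.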

Abstract

Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but typically commit to a fixed architecture that cannot adapt to individual inputs or denoising stages. We propose PARE (Pruning and Adaptive Routing for Efficient video generation), which jointly compresses width and depth with structure-aware pruning and input-adaptive routing. For width, we observe that attention heads specialize into spatial and temporal roles, and design importance scoring that accounts for this distinction to prevent motion-critical temporal heads from being pruned prematurely. For depth, we train a lightweight router conditioned on denoising timestep and visual content to dynamically select which blocks to execute at each step, enabling per-input compute adaptation rather than static block removal. A progressive pipeline first recovers width-pruned quality via distillation, then jointly optimizes the student and router to decouple the two learning objectives. Experiments on Wan2.1-14B for both image-to-video and text-to-video generation show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and composes with step distillation for further acceleration.

2605.27333 2026-05-27 cs.CL

FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents

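Haoxuan Jia, Yang Liu, Bin Chong, Yingguang Yang, Yancheng Chen, Jiayu Liang, Qian Li, Hanning Lu, Kefu Xu, Hao Zheng, Chongyang Zhang, Hao Peng, Philip S. Yu

AI summary: Finance LLM agents face a tension between blocking prompt-induced unauthorized actions and approving legitimate multi-step business workflows; FinHarness is an inline safety harness that uses a query monitor, a tool monitor, and a cascade module for per-step risk assessment and adaptive verification, significantly reducing attack success rate while preserving the benign approval rate.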

Abstract

Finance LLM agents must simultaneously block prompt-induced unauthorized actions and approve legitimate multi-step business workflows. However, boundary filters often miss irreversible mid-trajectory tool calls, while post-hoc LLM judges perform auditing only after termination -- too late for intervention and at a computational cost that scales linearly with trace length. We present FinHarness, an inline safety harness that wraps a finance agent end-to-end with three components: a Query Monitor that fuses single-turn intent with cross-turn drift, a Tool Monitor that evaluates each prospective tool call, and a Cascade module that integrates per-step risk and adaptively routes verification between a lightweight and an advanced-tier LLM judge. Fired risk factors are re-injected into the agent input as ex-ante evidence, enabling the agent to refuse, re-plan, or approve on its own. On FinVault, routed FinHarness cuts ASR from 38.3% to 15.0% while largely preserving benign approval ($41.1\% \to 39.3\%$), and uses $4.7\times$ fewer advanced-judge calls than an always-advanced ablation.
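
The cascade idea, integrating per-step risk and sending only risky steps to the more expensive judge, might look roughly like the sketch below. Everything specific here is an assumption for illustration: the decayed running-maximum aggregation, the threshold, the decay value, and the tier names are hypothetical and are not taken from FinHarness.

```python
def route_steps(step_risks, threshold=0.5, decay=0.8):
    """Pick a judge tier for each step from an accumulated risk signal.

    step_risks: per-step risk scores in [0, 1] (e.g., from the monitors).
    threshold:  accumulated risk at or above this goes to the advanced judge.
    decay:      how much of the earlier risk carries over to the next step.
    """
    running = 0.0
    routes = []
    for risk in step_risks:
        # Keep the larger of the new risk and the decayed history.
        running = max(risk, decay * running)
        routes.append("advanced" if running >= threshold else "lightweight")
    return routes


if __name__ == "__main__":
    print(route_steps([0.1, 0.8, 0.2, 0.1, 0.1]))
```

In this toy, one risky step keeps the following steps on the advanced judge until the carried-over risk decays below the threshold, so the costly judge is used only around suspicious parts of a trajectory.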

2605.27332 2026-05-27 cs.SE cs.AI cs.CV

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

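Zhifei Dou, Shabnam Hassani, Ou Wei

AI summary: Proposes EdgeFlow, which adds a Canny edge map to a vision language model's (VLM) input as a structural prior, improving flowchart-to-Mermaid conversion without training data or fine-tuning; on an industrial dataset, node F1 improves by 17.39 percentage points and edge F1 by 16.94 percentage points.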

Comments: 10 pages

Abstract

Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in converting these flowcharts into machine-readable models for requirements engineering (RE) activities, yet, when directly applied to flowchart conversion, they often fail on topology-critical visual details. To address this, we propose EdgeFlow, which augments a VLM's original input with a deterministically extracted Canny edge map, acting as a structural prior, to improve flowchart-to-Mermaid conversion without requiring annotated training data or domain-specific model fine-tuning. We evaluate EdgeFlow on IndusReqFlow, a dataset sourced from real-world requirements. Compared with off-the-shelf VLMs, EdgeFlow improves node-level F1 by 17.39 percentage points and edge-level F1 by 16.94 percentage points. At the path level, EdgeFlow improves path F1 by 11.06 percentage points, enabling better support for model-based testing. These results demonstrate that EdgeFlow provides a practical, training-free means to improve topology-preserving flowchart-to-Mermaid conversion for industrial RE. Cross-dataset evaluation results on a public synthetic benchmark show no significant improvement; this highlights the need for diverse benchmarks incorporating industrial data for the comprehensive evaluation of future VLM-based RE tools.
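
As an illustration of what node-level and edge-level F1 can mean for Mermaid output, the sketch below scores a predicted flowchart against a reference using exact set matching. The matching rule (nodes compared by ID, edges as directed ID pairs, labels ignored) and the helper names `parse_edges`, `nodes_of`, and `f1` are assumptions for this example; the paper's evaluation protocol may differ.

```python
import re

# Matches simple Mermaid edges such as "A --> B" or "B -->|yes| C".
EDGE_RE = re.compile(r"^\s*(\w+)\s*-->\s*(?:\|[^|]*\|\s*)?(\w+)\s*$")


def parse_edges(mermaid: str) -> set:
    """Collect directed (source, target) pairs from simple edge lines."""
    edges = set()
    for line in mermaid.splitlines():
        match = EDGE_RE.match(line)
        if match:
            edges.add((match.group(1), match.group(2)))
    return edges


def nodes_of(edges: set) -> set:
    """Derive the node set from the endpoints of all edges."""
    return {node for edge in edges for node in edge}


def f1(pred: set, gold: set) -> float:
    """Set-based F1: harmonic mean of precision and recall over exact matches."""
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


gold_edges = parse_edges("flowchart TD\n  A --> B\n  B -->|yes| C\n  B -->|no| D")
pred_edges = parse_edges("flowchart TD\n  A --> B\n  B -->|yes| C\n  C --> D")

edge_f1 = f1(pred_edges, gold_edges)
node_f1 = f1(nodes_of(pred_edges), nodes_of(gold_edges))
print(f"edge F1 = {edge_f1:.3f}, node F1 = {node_f1:.3f}")
```

The example prediction recovers every node but misroutes one edge, so node F1 stays perfect while edge F1 drops, which is the sort of topology error that edge-level scoring is meant to expose.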


2605.27331 2026-05-27 cs.AI

Maat: The Agentic Legal Research Assistant for Competition Protection

Maat: 面向竞争保护的法律研究智能助手

Basant Mounir, Farida Madkour, Amira Abdelaziz, Asmaa Sami

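AI summary: Proposes Maat, a ReAct-based agentic legal research assistant that uses RAG, web search, and user clarification, and significantly outperforms existing general-purpose and legal-specific assistants on case-specific competition law tasks.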

详情

Comments: 5 pages, 1 figure

AI中文摘要

进行法律研究的竞争法专家必须查阅大量案例、决定和司法报告，以识别先例并评估竞争和合并案件中的关键要素。尽管通用研究助手（如Claude和ChatGPT）和法律助手（如SaulLM-7B和LegalGPT）越来越多地被用于辅助法律研究，但它们在竞争法分析方面仍然不足：缺乏专门的领域知识，提供不充分的官方引用，或虚构竞争法案例。我们提出Maat，一个ReAct智能体，它协调与研究过程不同任务对应的工具。Maat与竞争法专家迭代设计，使用RAG将案例和发现基于官方来源以确保可靠性，提供丰富的行内引用，在数据库覆盖不足时回退到网络搜索，并在查询模糊时提示用户澄清。Maat在案例特定任务上显著优于所有基线助手，在理论问题任务上表现与最佳基线相当。所使用的数据集可在GitHub上获取。

英文摘要

Competition law experts conducting legal research must review extensive volumes of cases, decisions, and judicial reports to identify precedents and assess key elements in competition and merger cases. Although general research assistants such as Claude and ChatGPT and legal assistants such as SaulLM-7B and LegalGPT are increasingly used to assist legal research, they remain inadequate for competition law analysis: they lack specialized domain expertise, provide insufficient official citations, or hallucinate competition law cases. We propose Maat, a ReAct agent that orchestrates tools corresponding to different tasks of the research process. Designed iteratively with competition law experts, Maat grounds cases and findings in official sources using RAG for reliability, provides rich in-line citations, falls back to web search when database coverage is insufficient, and prompts the user for clarification when queries are ambiguous. Maat significantly outperforms all baseline assistants on case-specific tasks and performs within range of the top baseline on theoretical question tasks. The dataset used is available on GitHub.


2605.27328 2026-05-27 cs.SE cs.AI cs.MA

Governed Evolution of Agent Runtimes through Executable Operational Cognition

通过可执行操作认知实现代理运行时的受控演化

Mariano Garralda-Barrio

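AI summary: Proposes a framework for governed runtime evolution of agent-generated artifacts in multi-agent systems through executable operational cognition, introducing HarnessMutation, a mechanism for lifecycle-aware runtime adaptation under validation, traceability, evaluation, and rollback constraints.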

详情

Comments: 14 pages, 4 figures, 1 table. Reference implementation and associated source code available at: https://github.com/mgarralda/governed-runtime

AI中文摘要

近期智能体系统的进展越来越将代码视为可执行的操作基底，而非可丢弃的输出工件。先前的工作如Code as Agent Harness将经过验证的智能体生成工件视为运行时实体，可以在长时间运行的认知循环中创建、执行、修订、持久化和重用。然而，这些工件的治理、生命周期管理和操作演化仍未被充分定义。本文提出了一个通过可执行操作认知实现多智能体系统中受控运行时演化的框架。我们将智能体生成工件形式化为持久的运行时能力，这些能力逐渐成为操作基底的一部分，而非瞬时的中间输出。基于这一视角，我们引入了HarnessMutation作为一种受控机制，用于在明确的验证、可追溯性、评估和回滚约束下进行生命周期感知的运行时适应。该框架不将运行时适应视为无限制的自我修改，而是将演化建模为在持久操作记忆上的有界且可观察的过程。它进一步展示了这些思想如何在现代智能体运行时和面向治理的编排系统上实现，为适应性基础设施提供了概念基础，使其演化保持明确、可审计且受约束。

英文摘要

Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifact. Prior work such as "Code as Agent Harness" frames validated agent-generated artifacts as runtime entities that can be created, executed, revised, persisted, and reused within long-running cognitive loops. However, the governance, lifecycle management, and operational evolution of such artifacts remain under-specified. This paper proposes a framework for governed runtime evolution in multi-agent systems through executable operational cognition. We formalize agent-generated artifacts as persistent runtime capabilities that progressively become part of the operational substrate rather than transient intermediate outputs. Building on this perspective, we introduce HarnessMutation as a governed mechanism for lifecycle-aware runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. Rather than treating runtime adaptation as unrestricted self-modification, the proposed framework models evolution as a bounded and observable process over persistent operational memory. It further shows how these ideas can be operationalized over modern agent runtimes and governance-oriented orchestration systems, providing a conceptual foundation for adaptive infrastructures whose evolution remains explicit, auditable, and constrained.


2605.27322 2026-05-27 cs.CL

Semantic Gradients Interactions in SSD: A Case Study in Racial Identity and Hate Speech

SSD中的语义梯度交互：种族身份与仇恨言论的案例研究

Felix Ostrowicki, Hubert Plisiecki

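AI summary: Proposes an interaction-based SSD method that uses a semantic gradient interaction model to study how moderator variables affect semantic meaning, and validates the moderating effect of racial identity on hate speech judgments using the UC Berkeley hate speech corpus.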

2605.27320 2026-05-27 cs.AI cs.CY econ.GN q-fin.EC

Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding

建模代理技术债务与随机税：一个用于测量、模拟和仪表盘展示的独立框架

Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu

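AI summary: Proposes a formal and manageable framework that distinguishes agentic technical debt (an accumulated stock of design and governance liabilities) from stochastic tax (the flow of operational burden incurred when using stochastic agents), and illustrates its use with an accounts-payable simulation and a spreadsheet.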

2605.27318 2026-05-27 cs.CV

Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

Q-GeoMem：面向视频空间推理的问题引导几何记忆

Xianqiang Gao, Qizhi Chen, Delin Qu, Haoming Song, Zhigang Wang, Bin Zhao, Dong Wang, Xuelong Li

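AI summary: Proposes the Q-GeoMem framework, which uses a question-guided geometric memory mechanism combining a fine-grained context bank and a semantic-geometric evidence bank to achieve state-of-the-art performance on video spatial reasoning tasks.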

详情

AI中文摘要

如何以及想象什么？统一多模态模型中的视觉思维用于跨视角空间推理

Qian Yang, Ankur Sikarwar, Huy Le, Le Zhang, Zhuan Shi, Perouz Taslakian, Aishwarya Agrawal

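AI summary: Proposes the View Dropout training strategy so that models use intermediate thinking images for cross-view spatial reasoning, and finds that panoramic visual thinking is best in both informativeness and learnability.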

详情

Comments: Preprint

AI中文摘要

跨视角空间推理仍然是视觉语言模型（VLM）的薄弱环节：它们通常用语言推理，丢失了任务所需的细粒度几何信息。用图像思考旨在通过生成中间思维图像来解决这一问题，但近期工作表明模型常常忽略这些轨迹中的视觉证据。因此，我们提出如何让视觉思维起作用，以及哪种视觉思维效果最好。我们在统一多模态模型（UMMs）中研究这些问题，这类模型原生支持交错的图像-文本生成。对于第一个问题，我们提出视图丢弃（VDrop），一种训练时干预方法，将输入视图的部分内容从答案跨度中隐藏，同时使其对思维图像令牌可见。这鼓励模型在回答时使用思维图像，而不是仅依赖输入视图。一旦思维图像用于答案预测，我们研究哪种类型的视觉思维最有效。我们将其表述为可学习性-信息性权衡，并比较三种思维图像变体：俯视图、全景图和点匹配渲染图。在合成场景上训练，并在五个真实世界域外基准上评估，采用VDrop的全景视觉思维是唯一既信息丰富又可学习的配置，并实现了最佳的域外泛化。

英文摘要

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.


2605.27309 2026-05-27 cs.LG cs.OH

Greening AI Inference with Accuracy and Latency-aware User Incentives

通过准确性和延迟感知的用户激励实现绿色AI推理

Vasilios A. Siris, Adamantia Stamou, George D. Stamoulis, Konstantinos Varsos, Ramin Khalili

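AI summary: Proposes an incentive framework based on users' valuation of inference quality and latency and their environmental consciousness, balancing carbon emissions against QoE parameters through a two-tier service subscription.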

详情

DOI: 10.1109/MIC.2026.3695352
Journal ref: IEEE Internet Computing, 2026

AI中文摘要

AI服务的广泛使用引发了对其环境可持续性的担忧，最近的研究表明AI推理的碳排放是主要贡献者。本文介绍了一个框架，基于用户对推理质量和延迟的估值以及他们的环境意识，同时考虑碳排放与这两个QoE参数之间的权衡，来设计AI推理激励。我们的方法可以适应不同的权衡，这取决于AI模型的大小和复杂性以及用于服务推理请求的资源分配。这些激励可以通过一个实用的双层级服务订阅来提供，该订阅为用户提供折扣以换取减少的碳排放。折扣服务选项使AI提供商能够在高碳强度期间以较低的质量和较高的延迟服务一定比例的推理请求。

英文摘要

The widespread use of AI services has raised concerns about their environmental sustainability, and recent studies have identified the carbon emissions of AI inference as the major contributor. This paper introduces a framework for designing AI inference incentives based on users' valuation of inference quality and latency, together with their environmental consciousness, while accounting for the tradeoff between carbon emissions and the two QoE parameters. Our approach can accommodate different tradeoffs that depend on the size and complexity of the AI models and the allocation of resources to serve inference requests. The incentives can be offered through a practical two-tier service subscription that offers users a discount in exchange for reduced carbon emissions. The discounted service option gives the AI provider the flexibility to serve some percentage of inference requests at a lower quality and higher latency during periods of high carbon intensity.
