2605.19219 2026-05-20 cs.AI

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

Han Li, Vibhor Malik, Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ailin Fan, Keat Yang Koay, Yuanzheng Zhu, Meysam Feghhi, Ronie Uliana, Zhaoyu Zhang, Angelo Ocana Martins, Mingyu Zhao, Francis Pelland, Jonathan Faerman, Nikolas LeBlanc, Aaron Glazer, Andrew McNamara, Zhong Wu, Lingyun Wang

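AI summary: This paper proposes SimGym, a framework that simulates e-commerce A/B tests with traffic-grounded VLM agents. It targets the long cycles and high risk of live tests, and the validation results indicate that it predicts shifts in user behavior quickly and accurately.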

English abstract

A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and intents from production clickstream data; (b) a live-browser agent architecture that combines multimodal perception over visual and browser-structured observations with episodic memory and guardrails to conduct coherent shopping sessions across control and treatment storefronts; and (c) an evaluation protocol that compares simulated outcome shifts with observed shifts in real buyer behavior. We validate SimGym on A/B tests of visually driven UI theme changes from a major e-commerce platform across diverse storefronts and product categories. Empirical results show that SimGym agents achieve strong agreement with observed outcome shifts, attaining 77% directional alignment with add-to-cart shifts observed across interface variants in real-buyer traffic. It reduces experimental cycles from weeks to under an hour, enabling rapid experimentation without exposing real buyers to candidate variants.
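
One way to read the 77% directional-alignment figure is as the fraction of experiments in which the simulated add-to-cart shift has the same sign as the observed one. The abstract does not spell out the exact definition, so the sketch below assumes that reading; the function name and the numbers are made up for illustration.

```python
def directional_alignment(simulated_shifts, observed_shifts):
    """Fraction of experiments whose simulated and observed shifts share a sign.

    Assumed definition (not taken from the paper); a shift of exactly zero
    is treated as non-positive.
    """
    if len(simulated_shifts) != len(observed_shifts):
        raise ValueError("both lists must have one entry per experiment")
    if not simulated_shifts:
        raise ValueError("need at least one experiment")
    matches = sum(
        1
        for sim, obs in zip(simulated_shifts, observed_shifts)
        if (sim > 0) == (obs > 0)
    )
    return matches / len(simulated_shifts)


# Hypothetical treatment-minus-control add-to-cart rate differences.
simulated = [0.012, -0.004, 0.020, -0.010]
observed = [0.008, -0.001, -0.003, -0.006]
alignment = directional_alignment(simulated, observed)
print(f"directional alignment: {alignment:.0%}")
```

Three of the four hypothetical experiments agree in sign here; only the third (simulated up, observed down) disagrees.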

2605.19218 2026-05-20 cs.CV cs.AI

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

Beomseok Kang, Dongwon Jo, Jiwon Song, Donghwee Son, Jae-Joon Kim

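AI summary: This paper proposes rotation-aligned Key channel pruning, which compresses the channel dimension so that more visual tokens are retained under a fixed KV cache budget. It addresses the degradation that conventional token pruning causes on fine-grained perception tasks while also improving decoding efficiency.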

English abstract

Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.
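
The general idea behind rotating before pruning can be illustrated in two dimensions: an orthogonal rotation leaves query-key dot products unchanged, and a PCA rotation concentrates Key energy in the leading channel, so one shared mask can drop the trailing channel with little error. The sketch below is a toy, offline illustration of that idea only; it is not RotateK's online procedure or its Triton kernel, and all names and numbers are hypothetical.

```python
import math


def pca_rotation_2d(keys):
    """2x2 orthogonal matrix whose first row is the top principal direction.

    Uses the uncentered second moment of the keys (an assumption made here
    so that the rotation targets dot-product energy).
    """
    n = len(keys)
    sxx = sum(k[0] * k[0] for k in keys) / n
    syy = sum(k[1] * k[1] for k in keys) / n
    sxy = sum(k[0] * k[1] for k in keys) / n
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]


def rotate(rotation, vec):
    return [sum(r * v for r, v in zip(row, vec)) for row in rotation]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy keys whose two channels are strongly correlated.
keys = [[1.0, 0.9], [2.0, 2.1], [-1.0, -1.1], [3.0, 2.9]]
query = [0.5, 1.0]

rotation = pca_rotation_2d(keys)
rot_query = rotate(rotation, query)
rot_keys = [rotate(rotation, k) for k in keys]

exact = [dot(query, k) for k in keys]
full_rotated = [dot(rot_query, rk) for rk in rot_keys]
# Keep only channel 0 after rotation vs. keep only channel 0 without rotation.
pruned_rotated = [rot_query[0] * rk[0] for rk in rot_keys]
pruned_naive = [query[0] * k[0] for k in keys]

rotated_error = max(abs(e - p) for e, p in zip(exact, pruned_rotated))
naive_error = max(abs(e - p) for e, p in zip(exact, pruned_naive))
print(f"max score error, rotated then pruned: {rotated_error:.3f}")
print(f"max score error, pruned directly:     {naive_error:.3f}")
```

With these toy keys, dropping one of two channels after rotation changes the attention scores far less than dropping the same channel in the original basis.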

2605.19215 2026-05-20 cs.AI

Not all uncertainty is alike: volatility, stochasticity, and exploration

Payam Piray

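AI summary: This paper studies how volatility and stochasticity differ in their effects on exploration during adaptive decision-making in biological and artificial intelligence, and proposes CAUSE to improve exploration efficiency.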

English abstract

Adaptive decision-making in biological and artificial intelligence requires balancing the exploitation of known outcomes with the exploration of uncertain alternatives. Although prior work suggests that uncertainty generally promotes exploration, it has typically treated distinct sources of environmental uncertainty as equivalent. We consider environments with latent reward states that drift over time (volatility) and are observed through noisy outcomes (stochasticity). Both increase posterior uncertainty, yet we show they drive optimal exploration in opposite directions: volatility enhances it, stochasticity suppresses it. We establish this asymmetry formally by extending the Gittins index framework to Gaussian state-space bandits with latent dynamics. We further derive Cause-Aware Uncertainty-Sensitive Exploration (CAUSE), a closed-form exploration bonus obtained via control-as-inference that inherits the same monotonicities. CAUSE outperforms standard exploration strategies in environments with heterogeneous noise structure, and also improves on a Gittins-per-arm policy whose rested-bandit optimality does not transfer to restless settings. Learning and exploration are governed by the same noise-inference asymmetry, and the framework predicts that pathological noise inference produces reversed rather than merely impaired exploration, with implications for computational accounts of psychiatric conditions.
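
The learning side of this asymmetry can be seen in a scalar Kalman filter for a drifting latent reward observed with noise: both noise sources raise the steady-state posterior variance, but volatility raises the learning rate (Kalman gain) while stochasticity lowers it. The sketch below illustrates only that textbook filter; it is not the paper's Gittins-index extension or the CAUSE bonus, and the parameter values are arbitrary.

```python
def steady_state_kalman(volatility, stochasticity, steps=500):
    """Iterate a scalar Kalman filter to its steady state.

    volatility:    variance of the latent random-walk drift per step
    stochasticity: variance of the observation noise
    Returns (gain, posterior_variance).
    """
    posterior_var = 1.0
    gain = 0.0
    for _ in range(steps):
        prior_var = posterior_var + volatility          # latent state drifts
        gain = prior_var / (prior_var + stochasticity)  # learning rate
        posterior_var = (1.0 - gain) * prior_var        # after observing
    return gain, posterior_var


base_gain, base_var = steady_state_kalman(volatility=0.1, stochasticity=1.0)
vol_gain, vol_var = steady_state_kalman(volatility=0.4, stochasticity=1.0)
sto_gain, sto_var = steady_state_kalman(volatility=0.1, stochasticity=4.0)

print(f"baseline:           gain={base_gain:.3f}, variance={base_var:.3f}")
print(f"more volatility:    gain={vol_gain:.3f}, variance={vol_var:.3f}")
print(f"more stochasticity: gain={sto_gain:.3f}, variance={sto_var:.3f}")
```

Posterior variance alone therefore cannot say which way behavior should move; the source of the uncertainty matters.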

2605.19214 2026-05-20 cs.LG cs.CV

Worst-Group Equalized Odds Regularization for Multi-Attribute Fair Medical Image Classification

Nikhil Cherian Kurian, Victor Caquilpan Parra, Abin Shoby, Luke Whitbread, Lauren Oakden-Rayner, Robert Vandersluis, Jessica Schrouff, Lyle J. Palmer, Mark Jenkinson

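AI summary: This paper proposes a worst-group equalized-odds regularizer for assessing and mitigating systematic disparities in medical image classification across multiple demographic attributes simultaneously. By targeting subgroup-level deviations in true positive and false positive rates at the inference-time operating point, it reduces Equalized Odds and Equalized Opportunity disparities with minimal impact on AUC.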

Comments 11 Pages, 2 Figures

English abstract

Diagnostic performance in medical AI varies systematically across demographic groups, yet subgroup AUC can mask clinically important disparities. At a fixed inference-time operating point, some groups may exhibit over-diagnostic behaviour, characterized by elevated true and false positive rates, while others show under-diagnostic patterns with reduced true and false positive rates. These opposing tendencies can cancel in aggregate AUCs while producing meaningful inequities in clinical decision-making. Motivated by the need to assess and mitigate such disparities at the operating point and across multiple demographic attributes simultaneously, we propose a worst-group equalized-odds margin regularizer. The proposed regularizer explicitly targets subgroup-level deviations on both the true positive and false positive sides at inference. At each update, the method identifies subgroups defined by explicit demographic attributes (e.g., age, sex, and race) that exhibit the most extreme margin deviations and applies a unified penalty, enabling fairness optimization across multiple demographic axes without requiring explicit intersectional constraints. Across two medical imaging datasets in realistic multi-label settings, our method consistently reduces disparities in Equalized Odds and Equalized Opportunity with minimal impact on AUC, preserving diagnostic performance while improving fairness.
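
The diagnostic that motivates the regularizer can be sketched as follows: at a fixed operating point, compute each subgroup's true and false positive rates and pick the subgroup that deviates most. This is an evaluation-side illustration on hard predictions only; the paper's regularizer works on margins during training, and measuring deviation against the pooled rates is an assumption made here. The data are made up.

```python
def rates(labels, preds):
    """Return (TPR, FPR); assumes at least one positive and one negative."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return tp / pos, fp / neg


def worst_group_gap(labels, preds, groups):
    """Subgroup with the largest TPR or FPR deviation from the pooled rates."""
    overall_tpr, overall_fpr = rates(labels, preds)
    worst = None
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        tpr, fpr = rates([labels[i] for i in idx], [preds[i] for i in idx])
        gap = max(abs(tpr - overall_tpr), abs(fpr - overall_fpr))
        if worst is None or gap > worst[1]:
            worst = (group, gap)
    return worst


# Hypothetical thresholded predictions for three subgroups.
groups = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
labels = [1, 1, 0, 0] * 3
preds = [1, 1, 1, 0,   # A: over-diagnosed (TPR 1.0, FPR 0.5)
         1, 1, 0, 0,   # B: TPR 1.0, FPR 0.0
         0, 0, 0, 0]   # C: under-diagnosed (TPR 0.0, FPR 0.0)
worst_group, worst_gap = worst_group_gap(labels, preds, groups)
print(worst_group, round(worst_gap, 3))
```

In this toy data the under-diagnosed group C is selected, because its true positive rate is furthest from the pooled rate.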

2605.19213 2026-05-20 cs.CV

Quantized Machine Learning Models for Low-Resource Healthcare Settings: Medical Imaging

Sumanth Meenan Kanneti, Aryan Shah

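AI summary: This paper presents a multi-strategy compression framework for brain tumor classification from MRI, combining quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student, and Float16 post-training quantization on a lightweight MobileNetV2 backbone, aimed at efficient and accurate brain tumor screening in low-resource healthcare settings.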

English abstract

Deep learning models have shown strong performance in medical image analysis, but deploying them in low-resource clinical environments remains difficult due to computational, memory, and power constraints. This paper presents a multi-strategy compression framework for brain tumor classification from MRI, encompassing quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student with low-bit post-training quantization, and Float16 post-training quantization on a lightweight MobileNetV2 backbone. Using a multi-class brain tumor MRI dataset containing glioma, meningioma, pituitary tumors, and healthy controls, we provide full experimental validation of the MobileNetV2-based pipeline, training the classifier through a three-stage transfer learning process and applying Float16 quantization via TensorFlow Lite. The DenseNet-based distillation and quantization-aware training strategies are described as complementary compression approaches within the framework, with their complete empirical evaluation reserved for future work. Experimental results on the MobileNetV2 pipeline show that the quantized model achieves 82.37 percent validation accuracy compared to the 82.20 percent full-precision baseline, reducing model size from 35.34 MB to 5.76 MB, a 6.14x compression ratio with no meaningful accuracy loss. Per-class evaluation confirms that quantization preserves diagnostic performance uniformly across all four tumor categories. These findings demonstrate that lightweight quantized models can deliver clinically viable brain tumor screening in resource-constrained healthcare settings.
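
The reported compression figures can be checked with a few lines of arithmetic. The sketch below only recomputes numbers already stated in the abstract; the variable names are chosen here for illustration.

```python
# Figures quoted in the abstract.
full_precision_mb = 35.34
quantized_mb = 5.76
baseline_accuracy_pct = 82.20
quantized_accuracy_pct = 82.37

compression_ratio = full_precision_mb / quantized_mb
size_reduction_pct = 100.0 * (1.0 - quantized_mb / full_precision_mb)
accuracy_delta_points = quantized_accuracy_pct - baseline_accuracy_pct

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"size reduction:    {size_reduction_pct:.1f}%")
print(f"accuracy change:   {accuracy_delta_points:+.2f} percentage points")
```

The ratio rounds to the stated 6.14x, which corresponds to a size reduction of roughly 84%, with validation accuracy differing by 0.17 percentage points.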

2605.19206 2026-05-20 cs.RO

CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation

Taeyun Kim, Alvin Jinsung Choi, Dasol Hong, Hyun Myung

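AI summary: CLUE adaptively prioritizes contextual cues using a unified semantic map to tackle zero-shot object-goal navigation, improving navigation robustness and efficiency.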

Comments 8 pages, 5 figures

English abstract

Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration. We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target's association with room types using an LLM, the agent prioritizes room cues for predictable objects and object cues for those with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target's ambiguity to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.
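
The adaptive weighting can be pictured as blending a room-based score and an object-based score for each candidate frontier, with the blend weight set by how strongly the target is tied to a room type. The abstract does not give the actual fusion rule, so the linear blend, the score values, and all names below are assumptions for illustration only.

```python
def fused_value(room_score, object_score, room_association):
    """Blend the two cues.

    room_association in [0, 1] is assumed to come from an offline LLM prior:
    near 1 for targets tied to a room type, near 0 for ambiguous targets.
    The linear blend is a hypothetical choice, not the paper's formula.
    """
    return room_association * room_score + (1.0 - room_association) * object_score


# Hypothetical frontier candidates with per-cue scores.
frontiers = {
    "kitchen_door": {"room": 0.9, "object": 0.2},
    "near_sofa": {"room": 0.3, "object": 0.8},
}


def best_frontier(room_association):
    return max(
        frontiers,
        key=lambda name: fused_value(
            frontiers[name]["room"], frontiers[name]["object"], room_association
        ),
    )


strongly_tied_choice = best_frontier(room_association=0.9)
weakly_tied_choice = best_frontier(room_association=0.1)
print(strongly_tied_choice, weakly_tied_choice)
```

A target with a strong room association follows the room cue, while an ambiguous target follows the co-located-object cue.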

2605.19202 2026-05-20 cs.RO cs.AI math.OC

Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, Viswa Narayanan Sankaranarayanan, George Nikolakopoulos

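AI summary: This paper proposes a deep-reinforcement-learning-based quadrotor controller for autonomous inspection missions in under-canopy forest environments. An end-to-end control policy tracks inspection-viewpoint poses, and a Traveling Salesman Problem planner combined with an RRT* planner supports safe and reliable deployment on long-range missions.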

Comments Submitted to 2026 IEEE 22nd International Conference on Automation Science and Engineering

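Abstract (translated from the AI-generated Chinese abstract)

This article studies aerial inspection missions in under-canopy forest environments using a low-level quadrotor controller based on deep reinforcement learning (RL). Specifically, it proposes an end-to-end quadrotor control policy (mapping states to RPMs) that achieves inspection-viewpoint pose tracking (simultaneous position and yaw reference tracking), which is essential for a range of targeted inspection behaviors and for point-to-point navigation in forests. To ensure safe and reliable deployment of the end-to-end RL controller on long-range missions, the article adds a higher-level navigation guidance layer consisting of a Traveling Salesman Problem (TSP) planner and a Rapidly-exploring Random Tree Star (RRT*) planner. Given a known forest map and a set of user-specified inspection regions, the TSP planner finds the optimal visiting sequence. Between two target regions, the RRT* planner generates collision-free paths that respect the tracking limits of the underlying end-to-end RL policy. Across five targeted inspection scenarios, the article shows that an RL-based motor-level stabilizing controller, combined with the navigation guidance layer, can serve effectively as the low-level inspection execution module for under-canopy forest inspection missions.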

A Heuristic Approach to Tunable Performance in RL-Based Quadrotor Control via Reward Design and Termination Conditions

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, George Nikolakopoulos

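AI summary: This paper proposes a new heuristic approach to tunable performance in RL-based quadrotor control through reward design and termination conditions. A reward structure with dual-bandwidth exponentials yields a critically damped setpoint-tracking response with low steady-state error. Trained with Proximal Policy Optimization (PPO) together with episode truncation conditions, the policy reaches the desired performance in 6 million time steps in a sample-efficient manner. Intuitive heuristic rules for adjusting the reward weights and exponential coefficients produce faster (acrobatic-like) and slower (inspection-like) settling times while retaining the baseline critically damped response and approximately 2% steady-state error.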

Comments Accepted in the 34th Mediterranean Conference on Control and Automation

English abstract

Reinforcement learning (RL)-based quadrotor control policies have achieved impressive performance in tasks such as fast navigation in cluttered environments and drone racing, where the focus is on speed and agility. However, in several applications, such as infrastructure inspection, it is critical to achieve precise, controlled maneuvers with tunable performance. In this article, we present a novel heuristic approach to achieve tunable performance in RL-based quadrotor control through reward design and termination conditions. We present a novel reward structure containing dual bandwidth exponentials that achieves a baseline critically damped response in setpoint tracking, with low steady-state errors. When trained with a Proximal Policy Optimization (PPO) algorithm, in conjunction with episode truncation conditions, the desired performance is achieved in 6 million time steps in a sample-efficient manner. In order to tune the performance about the baseline behavior, we present intuitive heuristic rules to adjust the reward weights and exponential coefficients to achieve faster (acrobatic-like) and slower (inspection-like) settling time performance, while retaining the baseline critically damped response and approximately 2% steady-state error. We evaluate the three RL policies (baseline, acrobatic, and inspection) across 100 trials and show accurate and tunable performance in position and yaw tracking from random initial conditions, thereby demonstrating the effectiveness of the proposed heuristic approach.
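
A reward built from two exponentials of different bandwidth might look like the sketch below: a wide term that still gives signal far from the setpoint and a narrow term that sharpens the incentive close to it. The abstract does not state the functional form, so the formula, the weights, and the coefficients here are all hypothetical.

```python
import math


def dual_bandwidth_reward(error, w_wide=0.5, k_wide=1.0, w_narrow=0.5, k_narrow=20.0):
    """Hypothetical tracking reward with a wide and a narrow exponential.

    error:    signed tracking error (e.g. position in metres or yaw in radians)
    w_*, k_*: reward weights and exponential coefficients; these are the kind
              of knobs the abstract describes tuning, with made-up values.
    """
    e = abs(error)
    return w_wide * math.exp(-k_wide * e) + w_narrow * math.exp(-k_narrow * e)


at_setpoint = dual_bandwidth_reward(0.0)
small_error = dual_bandwidth_reward(0.05)
medium_error = dual_bandwidth_reward(0.5)
large_error = dual_bandwidth_reward(2.0)

# Reward gained by closing the last 5 cm, with and without the narrow term.
gain_with_narrow = at_setpoint - small_error
gain_wide_only = (
    dual_bandwidth_reward(0.0, w_wide=1.0, w_narrow=0.0)
    - dual_bandwidth_reward(0.05, w_wide=1.0, w_narrow=0.0)
)
print(round(gain_with_narrow, 3), round(gain_wide_only, 3))
```

In this toy form the narrow term makes the final few centimetres of error worth several times more reward than a single wide exponential would, which is one plausible route to low steady-state error.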

2605.19156 2026-05-20 cs.AI cs.CY cs.LG cs.MA

How Far Are We From True Auto-Research?

Zhengxin Zhang, Ning Wang, Sainyam Galhotra, Claire Cardie

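AI summary: Using ResearchArena, this paper evaluates the quality of papers generated by different agents. Although the agents produce papers that look competitive, their experimental rigor falls short, with fabricated results, underpowered experiments, and mismatches between plan and execution, indicating that auto-research still needs further development.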

English abstract

Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and a human-conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals this picture is overstated: SAR scores are poorly aligned with its actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR, scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex shows 5%/8% paper-vs-artifact mismatch / fabricated references versus 77%/72% for Kimi Code, a roughly 15× spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that we are still far from true auto-research.
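
The headline counts follow from simple arithmetic on the figures in the abstract. The sketch below recomputes them; reading the roughly 15-fold spread as the ratio of the two paper-vs-artifact mismatch rates is an assumption, and the variable names are illustrative.

```python
seeds = 13
agents = ["Claude Code", "Codex", "Kimi Code"]
trials_per_agent_domain_pair = 3

total_papers = seeds * len(agents) * trials_per_agent_domain_pair

# Paper-vs-artifact mismatch rates quoted in the abstract, in percent.
codex_mismatch_pct = 5
kimi_mismatch_pct = 77
mismatch_spread = kimi_mismatch_pct / codex_mismatch_pct

print(total_papers)               # 13 seeds x 3 agents x 3 trials
print(round(mismatch_spread, 1))  # ratio of the two mismatch rates
```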

2605.19155 2026-05-20 cs.CV

Efficient coding along the visual hierarchy

Ananya Passi, Brian S. Robinson, Michael F. Bonner

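AI summary: This paper studies how efficient-coding principles can build a human-aligned hierarchy of visual features from limited data. It proposes an unsupervised learning method that compresses inputs onto the principal modes of variation in natural images, yielding features that progress from edges and colors to textures and shapes; combined with supervised fine-tuning, it improves alignment with brain regions and speeds up category learning.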

Comments 34 pages, 6 figures

GRASP: Deterministic Argument Ranking in Interaction Graphs

Diganta Misra, Antonio Orvieto, Rediet Abebe, Volkan Cevher

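AI summary: This paper proposes GRASP, a framework that aggregates stable local interaction judgments into a global ranking to address the inconsistency of holistic verdicts when large language models act as judges. It emphasizes structural sufficiency rather than persuasiveness or rhetorical appeal.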

Comments Preprint

English abstract

Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging - a common LLM-as-a-Judge practice where a model provides a global verdict on a debate - suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack-defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency - a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.
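
The flavor of a deterministic propagation operator over an interaction graph can be shown with a classic gradual semantics, the h-categorizer, in which an argument's score shrinks with the total score of its attackers. This is a swapped-in, attack-only technique used purely for illustration; it is not GRASP's operator, and the graph below is made up.

```python
def h_categorizer(arguments, attacks, iterations=100):
    """Iterate score(a) = 1 / (1 + sum of scores of a's attackers).

    arguments: iterable of argument ids
    attacks:   list of (attacker, target) pairs, e.g. local judgments
    """
    attackers = {a: [] for a in arguments}
    for source, target in attacks:
        attackers[target].append(source)
    scores = {a: 1.0 for a in arguments}
    for _ in range(iterations):
        scores = {
            a: 1.0 / (1.0 + sum(scores[b] for b in attackers[a]))
            for a in arguments
        }
    return scores


# "a" attacks "b", and "b" attacks "c", so "a" indirectly defends "c".
arguments = ["a", "b", "c"]
attacks = [("a", "b"), ("b", "c")]
scores = h_categorizer(arguments, attacks)
ranking = sorted(arguments, key=lambda arg: -scores[arg])
print(ranking, {k: round(v, 3) for k, v in scores.items()})
```

The defended argument "c" ends up ranked above its attacker "b", which is the kind of defense-aware, structure-only ranking the abstract describes, with no appeal to how persuasive the text of each argument sounds.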

2605.19140 2026-05-20 cs.AI

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

Jiayu Li, Enpei Zhang, Dawei Zhou, Elynn Chen, Yujun Yan

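AI summary: This work studies workflow learning under interface constraints, proposes IC-Q, an asynchronous decentralized Q-learning algorithm, and derives a finite-sample bound for neural IC-Q, which the authors describe as the first finite-sample guarantee for neural Q-learning under decentralized partial observability.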

English abstract

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments: a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming, show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.
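
A tabular caricature of a handoff-time update may help fix ideas: when an agent hands off after holding control for a random number of steps, it updates its own Q-value using the reward it accumulated, a discount raised to that duration, and a single scalar received from the next agent. Treating that scalar as the next agent's value estimate is an assumption made here, as are the function name, the state and action labels, and the hyperparameters; the paper's algorithm is neural and asynchronous.

```python
def handoff_update(q_table, state, action, reward, duration, next_value,
                   alpha=0.5, gamma=0.9):
    """One semi-Markov Q-learning step at a handoff (hypothetical sketch).

    reward:     reward accumulated while this agent held control
    duration:   number of primitive steps the agent held control
    next_value: the one scalar passed back by the receiving agent
                (assumed here to be its value estimate)
    """
    target = reward + (gamma ** duration) * next_value
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (target - old)
    return q_table[(state, action)]


drafting_agent_q = {}
updated = handoff_update(
    drafting_agent_q,
    state="draft_ready",
    action="send_to_reviewer",
    reward=1.0,
    duration=2,
    next_value=10.0,
)
print(round(updated, 2))
```

The agent never sees the other agent's state or trajectory; only the scalar crosses the interface, which matches the one-scalar coordination the abstract emphasizes.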

2605.19137 2026-05-20 cs.CV

Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

Svetlana Orlova, Niccolò Cavagnero, Gijs Dubbelman

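AI summary: This paper explores data-efficient video pre-training by freezing a pre-trained image foundation model and training only a temporal module, reducing the need for large-scale video data and compute.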

Comments Accepted to CVPR 2026 Workshops CV4Smalls

English abstract

Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen .
