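arXivDaily: a daily academic digest that syncs the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.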

2606.14879 2026-06-16 cs.RO cs.CV cs.LG 新提交

VANDERER: Map-Free Exploration using Future-Aware and Visual-Curiosity-Guided Diffusion Policy

VANDERER: 基于未来感知与视觉好奇心引导扩散策略的无地图探索

Venkata Naren Devarakonda, Raktim Gautam Goswami, Prashanth Krishnamurthy, Farshad Khorrami

发表机构 * Control/Robotics Research Laboratory (CRRL), Department of Electrical and Computer Engineering, NYU Tandon School of Engineering（纽约大学坦登工程学院电气与计算机工程系控制/机器人研究实验室（CRRL））； New York University Abu Dhabi (NYUAD) Center for Artificial Intelligence and Robotics (CAIR)（纽约大学阿布扎比分校人工智能与机器人中心（CAIR））

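AI summary: Proposes VANDERER, a framework that uses a visual curiosity module to guide a pre-trained diffusion policy, achieving efficient map-free exploration from monocular images alone; across diverse simulated environments it explores 13.4% more area on average than NoMaD.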

AI中文摘要

移动智能体需要高效的探索策略来绘制未知环境并自主规划任务。传统方法依赖于生成占据地图并优化未探索区域的访问顺序。然而，在传感器受限的设置中，例如仅使用单目相机，生成准确的占据地图具有挑战性。为了解决这一问题，我们提出了VANDERER，一个探索框架，它利用视觉好奇心模块（VCM）仅使用单目图像数据来引导预训练的扩散策略。该好奇心模块通过导航世界模型预测所提议动作的结果，并通过好奇心成本对其进行评估。然后，该成本引导扩散过程生成最大化探索的动作。在多种模拟环境中进行评估，VANDERER始终优于现有基线，平均探索面积比NoMaD多13.4%。我们的结果揭示了室外环境中视觉好奇心与几何好奇心之间的直接相关性，表明VANDERER能够有效利用这种关系，在传感器受限的智能体上实现高效探索。

英文摘要

Mobile agents require efficient exploration strategies to map unseen environments and autonomously plan tasks. Traditional methods rely on generating occupancy maps and optimizing the sequence in which unexplored regions are visited. However, in sensor-constrained settings, such as those limited to monocular cameras, generating accurate occupancy maps is challenging. To address this, we propose VANDERER, an exploration framework that leverages a Visual Curiosity Module (VCM) to guide pre-trained diffusion policies using only monocular image data. This curiosity module predicts the outcomes of proposed actions via a navigation world model and evaluates them through a curiosity cost. The cost then guides the diffusion process toward generating actions that maximize exploration. Evaluated across diverse simulated environments, VANDERER consistently outperforms established baselines, exploring an average of 13.4% more area than NoMaD. Our results reveal a direct correlation between visual and geometric curiosity in outdoor environments, demonstrating that VANDERER can effectively leverage this relationship for efficient exploration using sensor-constrained agents.

2606.14875 2026-06-16 cs.CL 新提交

Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

上下文压缩并非单一事物：在匹配预算下可读的符号化重新表达与连贯摘要的比较

Sisong Bei, Mikhail L. Arbuzov, Ziwei Dong, Dmitri Kalaev, Alexey Shvets

发表机构 * Independent Researcher（独立研究员）； Palo Alto Networks

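AI summary: Proposes Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements. At matched budget it beats three compression baselines by 13 to 20 F1 percentage points on every dataset, and beats a coherent prose summary on the hardest dataset.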

AI中文摘要

我们研究了使用小型语言模型进行多跳问答时的上下文压缩。我们提出Telegraph English，一种可读的符号化格式，将检索到的段落重写为结构化的实体关系陈述，以更低的token成本保留推理证据。在MuSiQue、TwoWiki和HotpotQA上的受控实验中，Telegraph English在每个数据集上都优于三种匹配预算的压缩基线（字符级删除、截断和随机子采样），F1得分提升13至20个百分点。它还在最难的数据集上优于由同一编码器生成的连贯散文摘要。一个预先注册的深度交互假设被证伪：在数据集内，优势并未随推理深度增加而增长。我们将这些结果解释为证据，表明在匹配的token预算下，可读的符号化重新表达比自然语言或连贯摘要更密集地保留了实体内容。

英文摘要

We study context compression for multi-hop question answering with small language models. We propose Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements, preserving reasoning evidence at lower token cost. In controlled experiments on MuSiQue, TwoWiki, and HotpotQA, Telegraph English outperforms three matched-budget compression baselines (character-level deletion, truncation, and random sub-sampling) on every dataset, with gains of 13 to 20 F1 percentage points. It also outperforms a coherent prose summary produced by the same encoder on the hardest dataset. A pre-registered depth-interaction hypothesis is null: the advantage does not grow with reasoning depth within datasets. We interpret these results as evidence that readable symbolic re-expression preserves entity content more densely than either natural language or coherent summarization at matched token budget.

2606.14871 2026-06-16 cs.CV cs.AI 新提交

An Ensemble Deep Learning Approach for Reliable and Scalable Lemon Leaf Disease Classification

一种可靠且可扩展的柠檬叶病害分类集成深度学习方法

Shayan Abrar, Sudeepta Mandal, Abdul Awal Yasir, Sonjoy Bhattacharjee, Sadman Haque Bhuiyan, Samanta Ghosh, Rafi Ahamed

发表机构 * Dept. of CSE（计算机科学与工程系）； American International University-Bangladesh（美国国际大学-孟加拉国）； East West University（东-西大学）； North South University（北南大学）

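AI summary: Proposes a deep learning ensemble of InceptionV3 and MobileNetV2, combined with adversarial training and Grad-CAM visualization, that reaches 99.27% accuracy on a 9-class lemon leaf disease dataset for reliable classification.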

Comments 5 pages, 12 figures, 3 Tables, Presented at 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026

2606.14867 2026-06-16 cs.CL cs.AI cs.LG 新提交

Evaluating the Robustness of Proof Autoformalization in Lean 4

评估 Lean 4 中证明自动形式化的鲁棒性

Zhengtao Gui, Sheng Yang, Zhouxing Shi

发表机构 * University of California, Irvine（加州大学尔湾分校）； University of California, Riverside（加州大学河滨分校）

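AI summary: Studies the robustness of proof autoformalization models under global and local perturbations, finding that existing models are sensitive to global perturbations and that most fail to faithfully reflect local perturbations.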

Comments Preprint

AI中文摘要

证明自动形式化旨在将用自然语言编写的数学非正式证明翻译成形式语言（如 Lean 4）中的形式证明。已有几项工作开发了基于 LLM 的证明自动形式化模型。然而，现有评估通常侧重于翻译来自精选数据集的规范非正式证明。我们认为，一个鲁棒的证明自动形式化器必须即使对于偏离这些理想化形式的非正式证明也能保持忠实，并提出了首个关于证明自动形式化模型鲁棒性的研究。我们制定了两类扰动并评估每种扰动下的鲁棒性：全局扰动以不同风格改写非正式证明，在此情况下形式化应保持一致；局部扰动改变一个值、符号或证明步骤，可能是反事实的方式，鲁棒的形式化应忠实地反映扰动，而不是自行恢复为原始形式或推断出不同的形式。我们在 miniF2F 和 MATH-500 上构建了包含两种扰动的基准，并自动衡量证明自动形式化在全局扰动下正确性的稳定程度，以及其输出在局部扰动下的忠实程度。我们评估了七个最新模型，所有模型都对全局扰动敏感，且大多数在局部扰动下无法保持忠实。代码和数据可通过 https://github.com/ucr-rai/robust-proof-autoformalization 获取。

英文摘要

Proof autoformalization aims to translate a mathematical informal proof written in natural language into a formal proof in a formal language such as Lean 4. Several works have developed LLM-based models for proof autoformalization. However, existing evaluations have typically focused on translating well-formed informal proofs from curated datasets. We argue that a robust proof autoformalizer must remain faithful even for informal proofs that diverge from these idealized ones, and we present the first study on the robustness of proof autoformalization models. We formulate two categories of perturbations and evaluate robustness under each: a global perturbation paraphrases the informal proof in a different style, under which the formalization should remain consistent; a local perturbation alters a value, symbol, or proof step, possibly in a counterfactual way, and a robust formalization should faithfully reflect the perturbation rather than reverting to the original one or inferring a different one on its own. We build a benchmark with both perturbations on miniF2F and MATH-500, and automatically measure how stable a proof autoformalization's correctness is under global perturbations and how faithfully its output reflects local perturbations. We evaluate seven recent models, all of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations. Code and data are available via https://github.com/ucr-rai/robust-proof-autoformalization.

2606.14865 2026-06-16 cs.LG cs.AI 新提交

GRAPE: Guided Parameter-Space Evolution for Compact Adversarial Robustness

GRAPE: 面向紧凑对抗鲁棒性的引导式参数空间演化

Zhiyuan Ye, Xiangyu Zhou, Ji Qi, Hao Zhang, Yi Zhou

发表机构 * University of Science and Technology of China（中国科学技术大学）； China Mobile (Suzhou) Software Technology Co., Ltd.（中移（苏州）软件技术有限公司）

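AI summary: Proposes GRAPE, a framework that progressively exposes the parameter space and uses an adversarial spectral utilization score to guide capacity allocation, improving the adversarial robustness of compact models under a fixed computation budget. On CIFAR-10 it raises PGD-20 robust accuracy from 51.70% to 56.94% at a FLOPs ratio of 1.009x with about 21.4% fewer parameters.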

AI中文摘要

对抗训练（AT）提高了神经网络的鲁棒性，但大多数方法从一开始就训练固定的参数空间。本文探讨了参数变得可优化的顺序是否会影响最终的鲁棒解，即使最终架构或计算预算被控制。我们提出了GRAPE（引导式参数空间演化），一种面向紧凑对抗鲁棒性的训练框架。GRAPE结合了参数空间稳定化与渐进式隐藏扩展：它在当前暴露空间中稳定鲁棒优化，逐步释放新的可优化维度，并使用对抗谱利用分数引导新释放的容量流向高压模块。与固定结构的AT相比，GRAPE将鲁棒模型学习视为一个渐进式参数空间暴露和演化的过程。在CIFAR-10上的标准$\ell_\infty$威胁模型下，以固定结构ResNet-18 AT作为对照参考，GRAPE在几乎匹配的计算预算下（FLOPs比率为1.009倍）将PGD-20鲁棒准确率从51.70%提升至56.94%，同时参数数量减少约21.4%。一个具有相同最终ResNet-18架构的序列增长变体达到了56.52%的PGD-20鲁棒准确率，表明增益不仅来自最终架构差异，还来自参数空间暴露路径。这些结果表明，引导式参数空间演化可以在匹配计算条件下产生紧凑且鲁棒的参数配置。

英文摘要

Adversarial Training (AT) improves neural network robustness, but most methods train a fixed parameter space from the start. This paper asks whether the order in which parameters become optimizable can affect the final robust solution, even when the final architecture or computation budget is controlled. We propose GRAPE, Guided Parameter-Space Evolution, a training framework for compact adversarial robustness. GRAPE combines parameter-space stabilization with progressive hidden expansion: it stabilizes robust optimization in the currently exposed space, gradually releases new optimizable dimensions, and uses an adversarial spectral utilization score to guide newly released capacity toward high-pressure modules. In contrast to fixed-structure AT, GRAPE treats robust model learning as a process of progressive parameter-space exposure and evolution. Under the standard $\ell_\infty$ threat model on CIFAR-10, with fixed-structure ResNet-18 AT as a controlled reference, GRAPE improves PGD-20 robust accuracy from 51.70% to 56.94% at a nearly matched computation budget with a FLOPs ratio of 1.009x, while reducing parameter count by about 21.4%. A sequential grow variant with the same final ResNet-18 architecture reaches 56.52% PGD-20 robust accuracy, indicating that the gain is not only due to final architecture differences but also to the parameter-space exposure path. These results suggest that guided parameter-space evolution can yield compact and robust parameter configurations under matched computation.

2606.14862 2026-06-16 cs.RO 新提交

TacStyle: Personalizing Tactile Robot Policies using Structured Behavior Representations

TacStyle: 使用结构化行为表示个性化触觉机器人策略

Kevin Robledo, Matías I. Torres Galaz, Kumar Dixhant Rai, Shelly Sara Ulman, Tasmia Tasrin, Heramb Nemlekar

发表机构 * Department of Computer Science, California State University, Northridge（加州州立大学北岭分校计算机科学系）； Department of Mechanical Engineering, California State University, Northridge（加州州立大学北岭分校机械工程系）

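AI summary: Proposes organizing user preferences in a structured latent representation and using a foundation model to interpret language instructions, enabling fine-grained adjustment of robot behavior with fewer preference labels.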

Comments 14 pages, 5 figures

AI中文摘要

辅助人类的机器人系统应能根据个人用户偏好调整其行为。例如，用户可能希望机器人手臂在折叠衣物或清洁家具时调整施加的力。自然语言为人类传达此类偏好提供了直观方式。语言条件机器人策略的最新进展表明，机器人可以成功使用语言提示来确定要执行的任务。然而，将相同方法扩展到实现任务应如何执行，需要描述任务数据中轨迹偏好或风格的详细标签。收集此类注释具有挑战性，而且直接以这些标签为条件可能无法提供对连续行为范围的精细控制。例如，通过“施加比之前稍大一点的压力”这样的抽象指令来传达机器人必须施加的确切力是困难的。因此，在这项工作中，我们提出使用语言来推理偏好行为，而不是直接生成它们。我们首先学习一个结构化的潜在表示，根据相应轨迹的差异来组织用户偏好。然后，给定一个偏好提示，我们使用基础模型来解释这个潜在空间，并选择一个能产生所需行为的值。通过仿真和真实世界实验，我们表明从直观结构化的潜在空间中选择机器人行为能够更精确地适应用户偏好，同时所需的偏好标签显著少于语言条件策略。

英文摘要

Robotic systems that assist humans should be capable of adapting their behaviors to individual user preferences. For instance, users may want a robot arm to adjust the amount of force it applies while folding their laundry or cleaning furniture. Natural language provides an intuitive way for humans to communicate such preferences. Recent progress in language-conditioned robot policies has shown that robots can successfully use language prompts to determine what task to perform. However, extending the same approach to realize how the task should be performed requires detailed labels describing the preferences or styles of trajectories in the task data. Not only is collecting such annotations challenging, but conditioning directly on these labels may also fail to provide fine-grained control over a continuous range of behaviors. For example, it can be difficult to convey the exact force that a robot must apply through abstract instructions like "apply a bit more pressure than before". Therefore, in this work, we propose using language to reason over preferred behaviors instead of directly generating them. We first learn a structured latent representation that organizes user preferences according to differences in the corresponding trajectories. Then, given a preference prompt, we use a foundation model to interpret this latent space and choose a value that produces the desired behavior. Through both simulation and real-world experiments, we show that selecting robot behaviors from an intuitively structured latent space enables more precise adaptation to user preferences while requiring significantly fewer preference labels than language-conditioned policies.

2606.14841 2026-06-16 cs.CV 新提交

Multi-HMR 2: Multi-Person Camera-Centric Human Detection, Mesh Recovery and Tracking

Multi-HMR 2：多人相机中心人体检测、网格恢复与跟踪

Guénolé Fiche, Philippe Weinzaepfel, Romain Brégier, Fabien Baradel

发表机构 * NAVER LABS Europe（NAVER LABS欧洲）

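AI summary: Proposes Multi-HMR 2, a DETR-based framework that jointly predicts a scene-consistent camera and human meshes, enabling metric 3D localization and tracking without ground-truth intrinsics or video supervision; it substantially improves detection and localization accuracy while maintaining pelvis-centered performance.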

AI中文摘要

人体网格恢复（HMR）的大多数进展集中在骨盆中心恢复，忽视了相机坐标系中的度量3D定位和检测精度——这两个因素对于人机交互和社交场景理解等实际应用至关重要。当前的评估协议通常忽略这些方面，强调每人的根中心恢复而非相机空间感知。因此，现有方法依赖于固定的相机假设或手工后处理，限制了其鲁棒性和实际部署。我们提出了Multi-HMR 2，一个简单而鲁棒的基于DETR的框架，用于多人相机中心的人体检测、网格恢复和跟踪。Multi-HMR 2预测一个场景一致的相机以及人体网格，无需真实内参即可实现度量3D定位。此外，通过从SAM2中提取基于图像的记忆特征，Multi-HMR 2扩展到跟踪，无需视频监督即可实现一致的同一性关联。尽管概念简单——无手工组件、无视频输入、无真实相机——Multi-HMR 2在保持最先进的骨盆中心性能的同时，显著提高了检测精度和度量3D定位。

英文摘要

Most advances in human mesh recovery (HMR) have focused on pelvis-centered recovery, overlooking metric 3D localization and detection accuracy in the camera coordinate system - two key factors for real-world applications such as human-robot interaction and social scene understanding. Current evaluation protocols often ignore these aspects, emphasizing per-person, root-centered recovery rather than camera-space perception. As a result, existing approaches rely on fixed camera assumptions or handcrafted post-processing, limiting their robustness and practical deployment. We introduce Multi-HMR 2, a simple yet robust DETR-based framework for Multi-person Camera-centric Human detection, mesh Recovery, and tracking. Multi-HMR 2 predicts a scene-consistent camera together with human meshes, enabling metric 3D localization without ground-truth intrinsics. Moreover, by distilling image-based memory features from SAM2, Multi-HMR 2 extends to tracking, achieving consistent identity association without video supervision. Despite its conceptual simplicity - no handcrafted components, no video input, and no ground-truth cameras - Multi-HMR 2 achieves state-of-the-art pelvis-centered performance while substantially improving detection accuracy and metric 3D localization.

2606.14838 2026-06-16 cs.AI 新提交

A Definition of Good Explanations and the Challenges Explaining LLM Outputs

好解释的定义及解释LLM输出的挑战

Louis Mahon, Elliot Ford, Callum Hackett

发表机构 * arXiv

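AI summary: Proposes a definition of a good explanation that builds on counterfactual explanation and takes the interlocutor's prior beliefs into account, and discusses its implications for AI explainability, in particular why good explanations of LLM outputs are hard to produce.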

2606.14832 2026-06-16 cs.CL 新提交

PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

PhoneHarness: 通过混合GUI、CLI和工具操作利用手机使用代理

Chenxin Li, Zhengyao Fang, Zhengyang Tang, Pengyuan Lyu, Xingran Zhou, Xin Lai, Fei Tang, Liang Wu, Yiduo Guo, Weinong Wang, Junyi Li, Yi Zhang, Yang Ding, Huawen Shen, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Chengquan Zhang, Han Hu

发表机构 * Tencent Hunyuan（腾讯混元）； The Chinese University of Hong Kong（香港中文大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Tsinghua University（清华大学）

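AI summary: Proposes PhoneHarness, a mixed-action benchmark and execution harness for evaluating phone agents on verifiable mobile workflows. Combining GUI, CLI, and tool actions, it reaches a 75.0% pass rate, 12.9 percentage points above the strongest baseline.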

Comments Project Page: https://phoneharness.github.io/

AI中文摘要

手机代理越来越被期望完成真实的移动工作流，而不仅仅是预测下一个屏幕动作。然而，当前许多移动代理文献仍然主要将代理评估为GUI控制器，它们观察屏幕、发出点击和滑动，并根据目标应用状态评分。真实的手机使用任务更为广泛：它们需要决定何时使用应用GUI、设备端命令或结构化工具，同时留下预期副作用实际发生的证据。我们引入了PhoneHarness，一个混合动作基准和执行框架，用于研究在可验证移动工作流上的手机使用代理。PhoneHarness在GUI、CLI和主机端工具动作上运行设备端代理循环，结合确定性动作路由与有界GUI委托和可审计执行轨迹。其基准PhoneHarness Bench评估代理是否完成具有可观察副作用的任务，而不仅仅是产生合理的最终答案。在注释评估集上，PhoneHarness达到75.0%的通过率，比最强的非PhoneHarness设置高出12.9个百分点。因此，PhoneHarness和PhoneHarness Bench扮演着不同但相互依赖的角色：框架使混合手机工作流可执行，而基准衡量代理是否能够可靠且安全地使用该框架。我们的发现表明，可靠的手机自动化依赖于动作表面路由和可验证执行，而不仅仅是视觉GUI控制。

英文摘要

Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.

2606.14820 2026-06-16 cs.SD cs.AI cs.CL eess.AS 新提交

Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models

频谱-时间干扰混淆空间音频基础模型中的相位编码

Yuxuan Chen, Haoyuan Yu, Peize He

发表机构 * The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Jilin University（吉林大学）； Hunan University（湖南大学）； University of Electronic Science and Technology of China（电子科技大学）

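AI summary: Proposes a psychoacoustic benchmark based on the binaural masking level difference to evaluate how well spatial self-supervised audio models encode microsecond-level interaural phase fine structure, and finds that general-purpose binaural SSL models rely on spectro-temporal interference textures rather than genuine phase computation.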

Comments Accepted to INTERSPEECH 2026; 6 pages, 3 figures

AI中文摘要

最近的空间自监督音频模型在定位任务上取得了高性能，引发了对它们编码微秒级耳间相位精细结构能力的疑问。我们提出了一个基于双耳掩蔽级差的心理声学基准来评估这一点。使用均衡抵消基线和GCC-PHAT阳性对照，我们评估了九个冻结的音频模型，涵盖双耳SSL、单耳SSL和神经音频编解码器。四个单耳阴性对照产生零BMLD，确认了双耳特异性。两个通用双耳SSL模型表现出最小的相位敏感性，而专用双耳空间SSL模型实现了与分析基线相当的BMLD。渐进式物理消融实验表明，通用双耳SSL模型依赖于频谱-时间干扰纹理而非跨通道相位计算。语音中的高检测率反映了对宽带包络而非真实相位编码的混淆依赖。

英文摘要

Recent spatial self-supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference to evaluate this. Using an equalization-cancellation baseline and a GCC-PHAT positive control, we evaluate nine frozen audio models spanning binaural SSL, monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD, confirming binaural specificity. Two general-purpose binaural SSL models exhibit minimal phase sensitivity, while dedicated binaural spatial SSL models achieve BMLD comparable to the analytical baseline. Progressive physical ablations show that general-purpose binaural SSL models rely on spectro-temporal interference textures rather than cross-channel phase computation. High detection rates in speech reflect a confounding reliance on broadband envelopes rather than genuine phase encoding.
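
For readers unfamiliar with the GCC-PHAT positive control named in this abstract, the following is a minimal NumPy sketch of the textbook estimator: it whitens the cross-spectrum of two channels so that only phase remains, then picks the lag with the largest correlation. It is not the benchmark's code. The function name `gcc_phat` and the parameters `fs` and `max_tau` are illustrative assumptions, and this version resolves delays only to a whole sample (62.5 microseconds at 16 kHz), which is coarser than the microsecond-level structure the paper studies.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the delay of y relative to x, in seconds (illustrative sketch).

    A positive result means y lags x. Resolution is one sample (1 / fs).
    """
    n = len(x) + len(y)                 # zero-pad so the correlation is linear
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = Y * np.conj(X)              # cross-power spectrum
    cross /= np.abs(cross) + 1e-12      # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)       # generalized cross-correlation over lags

    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index i corresponds to lag (i - max_shift).
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

Calling `gcc_phat(left, right, fs=16000, max_tau=0.001)` on two equal-rate channels restricts the search to lags within one millisecond, which comfortably covers the interaural delays of a human-sized head.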

2606.14811 2026-06-16 cs.CV 新提交

S23DR 2026: End-to-End 3D Wireframe Prediction via DETR-Style Set Prediction with Contrastive Denoising

S23DR 2026：基于对比去噪的DETR风格集合预测实现端到端3D线框预测

Nitiz Khanal

发表机构 * University of California, Berkeley（加州大学伯克利分校）

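AI summary: Proposes WireframeDETR, which performs DETR-style set prediction directly on 3D point clouds without intermediate vertex detection. Using contrastive denoising training, a multi-scale encoder, and progressive auxiliary loss weighting, it predicts 3D wireframes end to end and scores 0.575 HSS in the S23DR 2026 Challenge.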

Comments Technical report; S23DR 2026 Challenge submission

2606.14803 2026-06-16 cs.CV 新提交

HSQ-VLM: A Novel Spatially-Constrained Quadrant Segmentation VLM Model for Explainability in Diabetic Retinopathy

HSQ-VLM: 一种用于糖尿病视网膜病变可解释性的新型空间约束象限分割VLM模型

Shivum Telang

发表机构 * Pittsburgh, Pennsylvania（宾夕法尼亚州匹兹堡）

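AI summary: Proposes HSQ-VLM, which uses a Landmark-Anchored Cartesian Cross-Attention mechanism and 4-quadrant Topological Latent Partitioning to quantify lesions in fundus images with anatomical precision and generate natural language reports, reaching sensitivities of 99.6% for hemorrhages and 96.4% for microaneurysms.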

AI中文摘要

糖尿病视网膜病变（DR）是一种侵袭性视网膜疾病，也是全球失明的主要原因，但其临床管理目前受到诊断AI黑箱性质的阻碍。虽然深度学习模型实现了高分类准确率，但严重缺乏能够详细描述导致DR临床决策的确切解剖标志和病变分布的可解释性方法。因此，我们提出了HSQ-VLM，一种新颖的眼底图像象限分割流水线，利用地标锚定笛卡尔交叉注意力机制将视觉特征提取与结构化临床推理统一起来。与依赖任意图像分割的传统方法不同，我们的流水线实现了四象限拓扑潜在分割（TLP），以动态地将视网膜特征与以中央凹为中心的坐标系对齐。这使得视觉语言模型能够生成以解剖精度量化病理的自然语言报告。在包含3,500张高分辨率眼底图像的数据集上，这种创新方法实现了出血检测灵敏度99.6%和微动脉瘤检测灵敏度96.4%，同时与标准分割基线相比，边界模糊误差显著减少。

英文摘要

Diabetic Retinopathy (DR) is an aggressive retinal disease and a leading cause of global blindness, yet its clinical management is currently hindered by the black-box nature of diagnostic AI. While deep learning models achieve high classification accuracy, there is a critical lack of explainability methods capable of detailing the exact anatomical landmarks and lesion distributions that lead to a clinical decision for DR. Therefore, we propose HSQ-VLM, a novel quadrant segmentation pipeline on fundus images that utilizes a Landmark-Anchored Cartesian Cross-Attention mechanism to unify visual feature extraction with structured clinical reasoning. Unlike traditional methods that rely on arbitrary image partitioning, our pipeline implements 4-quadrant Topological Latent Partitioning (TLP) to dynamically align retinal features with a fovea-centered coordinate system. This allows the Vision-Language Model to generate natural language reports that quantify pathology with anatomical precision. On a dataset of 3,500 high-resolution fundus images, this innovative methodology achieved a lesion detection sensitivity of 99.6% for hemorrhages and 96.4% for microaneurysms, while demonstrating a significant reduction in boundary-ambiguity errors compared to standard segmentation baselines.

2606.14801 2026-06-16 cs.LG cs.AI cs.RO 新提交

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

QPILOTS：面向流策略的高效测试时Q引导

Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart

发表机构 * University of Toronto（多伦多大学）； Vector Institute（向量研究所）； LG Electronics（LG电子）

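AI summary: Proposes QPILOTS, which steers flow-matching and diffusion policies at inference time by projecting intermediate denoising states to an estimate of the final action and computing the critic gradient there, leaving the original policy unmodified; it reaches a 90% average success rate on an offline-to-online RL benchmark.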

Comments 10 pages, 7 figures

AI中文摘要

流匹配和扩散策略是表达力强的动作生成器，但使用时序差分强化学习（RL）优化它们仍然困难。有效的策略提取需要利用评论家的动作梯度，但通过多步去噪过程直接反向传播该信号可能数值不稳定。现有方法要么丢弃梯度信息，将策略蒸馏为更简单的单步动作器，要么随着评论家改进而重复微调去噪策略。我们提出QPILOTS，一种保持原策略不变并在推理时引导去噪过程的方法。在每个去噪步骤中，我们不是评估评论家对噪声中间动作（其中评论家预测不可靠），而是首先将该中间状态投影到最终干净动作的估计，并在那里计算评论家梯度。我们引入两种变体：QPILOTS-U使用快速单点近似，而QPILOTS-M通过学习的辅助网络绘制可微后验样本。在标准的离线到在线RL基准测试中，QPILOTS实现了最佳整体性能，在50个任务中达到平均90%的成功率。我们还应用QPILOTS引导一个大型、冻结的预训练视觉-语言动作（VLA）基础模型，在模拟的六个操作任务中优于或匹配先前的推理时方法。

英文摘要

Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.
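
The core idea in this abstract (evaluate the critic at an estimate of the final clean action rather than at the noisy intermediate one) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the frozen "policy" is a closed-form flow toward a Gaussian `N(M, S^2 I)`, the critic is a hand-written quadratic, and the update rule simply adds the critic gradient, scaled by a hypothetical `guidance` weight, to the velocity at each Euler step. The clean-action estimate uses the identity `x = a_t + (1 - t) * (x - z)` for the linear path `a_t = (1 - t) z + t x`.

```python
import numpy as np

M = np.array([0.0, 0.0])        # mean of the frozen toy policy's actions (assumed)
S = 0.2                         # standard deviation of the toy policy (assumed)
A_STAR = np.array([1.0, 1.0])   # action the toy critic prefers (assumed)

def velocity(a, t):
    """Exact flow-matching velocity for a N(M, S^2 I) target on a linear path."""
    coef = (t * S**2 - (1.0 - t)) / ((1.0 - t) ** 2 + (t * S) ** 2)
    return M + coef * (a - t * M)

def critic_grad(a):
    """Gradient of the toy critic Q(a) = -||a - A_STAR||^2."""
    return -2.0 * (a - A_STAR)

def sample(z, steps=50, guidance=0.0):
    """Euler-integrate the frozen flow from noise z, optionally steered by the critic."""
    a, dt = z.copy(), 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(a, t)
        a_clean = a + (1.0 - t) * v          # single-point estimate of the final action
        # Steer with the critic gradient taken at the clean estimate,
        # not at the noisy intermediate action a. The policy itself is untouched.
        a = a + dt * (v + guidance * critic_grad(a_clean))
    return a
```

With `guidance=0.0` the function reproduces the unmodified policy, so the same frozen model serves both the baseline and the steered sampler; only the inference loop changes.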

2606.14794 2026-06-16 cs.RO 新提交

Computing Smooth Geodesics under Two-Sided Curvature Bounds with Applications to Robotics and Image Analysis

计算双侧曲率约束下的光滑测地线及其在机器人和图像分析中的应用

Da Chen, Zhenjiang Li, Jean-Marie Mirebeau, Xuecheng Tai, Jinglin Zhang, Wei Zhang, Laurent D. Cohen

发表机构 * CEREMADE, University Paris Dauphine, University-PSL, CNRS, UMR 7534（巴黎多芬纳大学CEREMADE实验室，巴黎文理研究大学，法国国家科学研究中心，UMR 7534）； Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University, Shandong Academy of Medical Sciences（山东省肿瘤医院放射肿瘤科，山东第一医科大学，山东省医学科学院）； Department of Mathematics, Centre Borelli, ENS Paris-Saclay, CNRS, University Paris-Saclay（巴黎萨克雷大学数学系，博雷利中心，巴黎萨克雷高等师范学校，法国国家科学研究中心）； Norce（挪威研究中心）； School of Control Science and Engineering, Shandong University（山东大学控制科学与工程学院）

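AI summary: Proposes a curvature-bounded geodesic model built on the Hamilton-Jacobi-Bellman PDE framework that constrains curvature between lower and upper bounds to obtain smooth, geometrically controlled paths, gives a discretization scheme for solving it, and applies it to robot path planning and tracking curvilinear structures in images.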

AI中文摘要

平面曲线的曲率由于与光滑性、刚性和弹性等理想几何特性密切相关，因此作为计算二阶最小路径的关键正则化项。本文解决计算物理和几何中一个更具挑战性的问题：跟踪曲率受任意上下界约束的最小路径。为此，我们提出了一种新的曲率有界测地线模型，该模型在Hamilton-Jacobi-Bellman (HJB) 偏微分方程 (PDE) 框架下开发。它通过强制曲率范围约束，对最小路径提供强大的几何控制，使得路径光滑且具有有界曲率限制。我们还提出了一种包含曲率约束的哈密顿量和HJB PDE的离散化方案，使得能够高效求解模型的数值解。最后，我们展示了所提出的曲率有界测地线模型在机器人路径规划和图像曲线结构跟踪中的应用能力。数值实验表明，所提出的曲率有界测地线模型是寻找满意路径的强大且鲁棒的工具。

英文摘要

Curvature of planar curves serves as a key regularization term for computing second-order minimal paths, owing to its close connection to desirable geometric properties such as smoothness, rigidity, and elasticity. In this paper, we tackle a more challenging problem in computational physics and geometry: tracking minimal paths whose curvature is constrained by arbitrary upper and lower bounds. For that purpose, we propose a new curvature-bounded geodesic model, developed under the Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE) framework. By enforcing curvature range constraints, it provides strong geometric control over minimal paths, which are smooth and have bounded curvature. We also present a discretization scheme for the Hamiltonian and the HJB PDE incorporating curvature bounds, allowing efficient computation of numerical solutions to the model. Finally, we illustrate the capability of the proposed curvature-bounded geodesic model in robot path planning and in tracking curvilinear structures in images. Numerical experiments demonstrate that the proposed curvature-bounded geodesic model serves as a powerful and robust tool for finding satisfactory paths.

2606.14792 2026-06-16 cs.CV cs.AI 新提交

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

基于离散扩散模型的视觉-文本思维高效强化学习

Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji

发表机构 * KAIST（韩国科学技术院）； Sony AI（索尼AI）； AITRICS ； Sony Group Corporation（索尼集团公司）

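AI summary: Proposes replacing autoregressive models with discrete diffusion models for multimodal reinforcement learning, cutting computation through localized visual editing, and designs a factorized reward assignment strategy to address cross-modal interference.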

AI中文摘要

基于强化学习的后训练已被广泛采用，以在能够同时进行文本和图像生成的统一多模态模型中实现交错视觉和文本推理。然而，大多数现有方法建立在自回归统一模型上，在视觉推理过程中需要完整的图像再生。在这项工作中，我们证明多模态离散扩散模型是自回归模型在交错推理中进行强化学习的有效替代方案，因为它们能够通过局部视觉编辑而非完整的图像令牌再生来执行高效的视觉展开。与自回归基线相比，这使GRPO期间的展开计算减少了26.9%，且性能下降极小。尽管效率提高，我们发现联合奖励分配（在模态间使用共享奖励信号）在RL更新期间会在不相关的图像和文本令牌序列之间引入跨模态干扰。为解决此问题，我们提出分解奖励分配策略，该策略独立地为文本和视觉片段分配奖励。采用分解奖励分配后，我们的RL方法相比联合奖励分配提高了11.2%，相比基础模型提高了38.04%。

英文摘要

RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.
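
The contrast between joint and factorized reward assignment can be made concrete with a small sketch. It assumes the standard GRPO-style group-normalized advantage (reward minus group mean, divided by group standard deviation) and a boolean mask marking which tokens are image tokens. The function names, the additive way the joint reward is formed, and the mask layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def group_advantage(rewards, eps=1e-8):
    """Group-normalized advantage over one group of rollouts (GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def joint_token_advantages(r_text, r_image, is_image):
    """Joint assignment: every token of a rollout shares one advantage
    computed from the summed text and image rewards."""
    adv = group_advantage(np.asarray(r_text) + np.asarray(r_image))
    return np.broadcast_to(adv[:, None], is_image.shape).copy()

def factorized_token_advantages(r_text, r_image, is_image):
    """Factorized assignment: text tokens use the text-reward advantage,
    image tokens use the image-reward advantage."""
    adv_text = group_advantage(r_text)[:, None]
    adv_image = group_advantage(r_image)[:, None]
    return np.where(is_image, adv_image, adv_text)
```

Consider a group of four rollouts with text rewards `[1, 0, 1, 0]` and image rewards `[0, 0, 1, 1]`. The first rollout has good text and a poor image, so its summed reward equals the group mean and joint assignment gives all of its tokens zero advantage; factorized assignment instead reinforces its text tokens and penalizes its image tokens.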

2606.14788 2026-06-16 cs.SD cs.AI cs.LG eess.AS 新提交

Unifying Acoustic Features and Text with Multimodal LLMs for Neurodegenerative Screening

统一声学特征与文本的多模态大语言模型用于神经退行性疾病筛查

Qingfeng Zhang, Yuanxiong Guo, Yanmin Gong

发表机构 * University of Science and Technology of China（中国科学技术大学）

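AI summary: Proposes NeurMLLM, a framework in which a multimodal large language model fuses spectrograms, MFCCs, and text for fine-grained staging of Alzheimer's disease and Parkinson's disease, outperforming classical methods and existing LLM-based approaches.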

Comments IEEE International Conference on Healthcare Informatics, 2026

AI中文摘要

基于语音的筛查为评估阿尔茨海默病（AD）和帕金森病（PD）等神经退行性疾病提供了一种可扩展且非侵入性的方式，但由于整合异质数据的困难，其分期仍然具有挑战性。本文提出了NeurMLLM，一种用于神经退行性疾病分期的高效多模态生成框架。NeurMLLM首先使用视觉变换器对音频数据的声谱图和梅尔频率倒谱系数进行编码，并将其表示投影到大语言模型（LLM）的嵌入空间中，在那里它们与转录文本和人口统计指令标记连接成一个统一的序列。然后，通过低秩适应使用任务提示对LLM进行指令微调，以自回归方式预测受限的标签标记，从而实现生成式分类。通过在Bridge2AI-Voice数据集上对AD和PD进行细粒度分期评估，我们观察到NeurMLLM取得了强劲的性能，持续优于经典机器学习方法和现有的基于LLM的方法。结果表明，多模态LLM在神经退行性疾病分期中具有巨大潜力，提高了分期准确性并支持可访问的部署。

英文摘要

Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD), but their staging remains challenging due to the difficulty of integrating heterogeneous data. This paper presents NeurMLLM, an efficient multimodal generative framework for neurodegenerative disease staging. NeurMLLM first encodes the spectrograms and Mel-frequency cepstral coefficients of audio data with vision transformers and projects their representations into the embedding space of a large language model (LLM), where they are concatenated with transcript and demographic instruction tokens as a single unified sequence. The LLM is then instruction-tuned via Low-Rank Adaptation using task prompts to autoregressively predict a constrained label token, enabling a generative classification. By evaluating on the Bridge2AI-Voice dataset for fine-grained staging of AD and PD, we observe that NeurMLLM achieves strong performance, consistently outperforming classical machine learning methods and existing LLM-based approaches. The results show the high potential of multimodal LLMs in neurodegenerative disease staging, improving staging accuracy and supporting accessible deployment.

2606.14787 2026-06-16 cs.CV cs.CR 新提交

Vision-Encoder Behavioral Fingerprints of Image-to-Image Generative Models: A Training-Paradigm-Driven Taxonomy of Six Commercial APIs

图像到图像生成模型的视觉编码器行为指纹：基于训练范式的六个商业API分类

Hunter Hill

发表机构 * H. Hill

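AI summary: Probes six commercial image-to-image AI systems with a content-adaptive, sub-JND adversarial perturbation pipeline and, using DINOv2 ViT-B/14 token distances, finds that edit-trained models and T2I base models adapted at sampling time form two distinct behavioral bands on a 2D plane.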

2606.14783 2026-06-16 cs.CV cs.CR 新提交

The Vision Encoder as a Privacy Boundary: Visual-Token Side Channels in Encoder-Free Vision-Language Models

视觉编码器作为隐私边界：无编码器视觉-语言模型中的视觉令牌侧信道

Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

发表机构 * School of Engineering, Institute of Science Tokyo（东京科学大学工学院）； College of Control Science and Engineering, Zhejiang University（浙江大学控制科学与工程学院）； Department of Electrical and Computer Engineering, National University of Singapore（新加坡国立大学电气与计算机工程系）

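AI summary: Studies privacy leakage through visual-token side channels in encoder-free vision-language models. Decoder attacks recover images and text from intermediate visual tokens; spatial sampling fidelity is identified as the key factor, and KV caches are shown to leak as well.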

AI中文摘要

视觉编码器将图像像素压缩为语义嵌入，通过保留语义内容同时衰减精确文本恢复所需的像素局部细节，隐式地充当隐私边界。无编码器视觉-语言模型（VLM）通过将图像块直接路由到语言模型令牌流中移除了这一边界，从而暴露了一个架构上的隐私攻击面：中间视觉令牌成为输出前的侧信道。在令牌访问攻击者下，解码器从两个无编码器VLM（Gemma4和Fuyu）中反转视觉令牌流，恢复可识别的图像结构和可读的保留访问码，而匹配的基于编码器的控制模型仅能定位目标区域但无法恢复精确字符串。模型内消融实验表明，操作因素是视觉令牌网格的空间采样保真度，尤其是字符方向采样密度，而非令牌或值的数量。泄露不仅限于导出的令牌：Gemma4第0层键值缓存张量可直接反转，将侧信道置于生产服务栈通常为解码效率而持久化的KV缓存中。该攻击在杂乱场景、真实文档退化以及零样本迁移到公共文档图像中依然有效，并抵抗加性噪声和量化等值级防御。因此，有效的缓解措施必须降低空间采样，使得移除视觉编码器成为VLM部署中的一级隐私决策。

英文摘要

A vision encoder compresses image pixels into semantic embeddings, implicitly acting as a privacy boundary by preserving semantic content while attenuating pixel-local detail required for exact text recovery. Encoder-free vision-language models (VLMs) remove this boundary by routing image patches directly into the language-model token stream, thereby exposing an architectural privacy attack surface: intermediate visual tokens become a pre-output side channel. Under a token-access adversary, decoders invert visual-token streams from two encoder-free VLMs, Gemma4 and Fuyu, recovering recognizable image structure and readable held-out access codes, whereas matched encoder-based controls localize target regions but recover no exact strings. Within-model ablations show that the operative factor is spatial sampling fidelity of the visual-token grid, especially character-direction sampling density, rather than token or value count. The leakage is not limited to exported tokens: Gemma4 layer-0 key-value cache tensors are directly invertible, placing the side channel within KV caches commonly persisted by production serving stacks for decoding efficiency. The attack survives clutter, realistic document degradation, and zero-shot transfer to public document images, and it resists value-level defenses such as additive noise and quantization. Effective mitigation must therefore reduce spatial sampling, making removal of the vision encoder a first-class privacy decision in VLM deployment.


2606.14781 2026-06-16 cs.CV New submission

Variational Deep Unfolding with Mamba-Based Nonlocal Modeling for Underwater Image Enhancement

Daniel Torres, Julia Navarro, Catalina Sbert, Joan Duran

Affiliations: Institute of Applied Computing and Community Code (IAC3); Dept. of Mathematics and Computer Science, Universitat de les Illes Balears

AI summary: Proposes a deep unfolding network that integrates variational modeling into a learnable architecture, uses Mamba layers to capture self-similarity, and constrains the unfolding stages with a proximal trajectory loss for underwater image enhancement.

Abstract

Underwater imaging plays a crucial role in ocean engineering, although captured data often suffer from poor visibility and color distortion. To address these challenges, we propose a model-based deep unfolding network for underwater image enhancement that integrates variational modeling into a learnable architecture. The framework is guided by a variational formulation based on a dehazing decomposition, incorporating a multiplicative residual component to absorb remaining artifacts and a nonlocal gradient-type constraint to preserve structural details and enhance edge sharpness. We provide a theoretical analysis establishing the existence of a solution for the associated minimization problem. The proposed unfolding method incorporates Mamba layers to efficiently capture self-similarities in the scene. In addition, we introduce a proximal trajectory loss that enforces consistency between the unfolding stages and the iterations of an ideal restoration regularizer. Experimental results demonstrate that the proposed unfolding approach achieves improved visual quality and competitive quantitative performance compared with recent state-of-the-art methods. The source code will be available at https://github.com/MIA-UIB/Variational-Unfolding-Mamba-Underwater-Enhancement.


2606.14780 2026-06-16 cs.CV cs.LG New submission

YTClickbait21K: Human-Annotated Multimodal Dataset for YouTube Clickbait Detection Across Diverse Channels and Content Categories

Md. Minhazul Islam, Md. Tanbeer Jubaer, Amith Khandakar, Shovon Sarker, Sumaiya Rahman, Md. Masum Mia, Mohamed Arselene Ayari, Hamed Noori

Affiliations: Department of Computer Science and Engineering, Rajshahi University of Engineering & Technology; Department of Electrical Engineering, Qatar University; Department of Civil and Environmental Engineering, Qatar University; SenseNet Inc.

AI summary: To address the lack of large-scale, high-quality multimodal data for clickbait detection on video platforms, the authors build YTClickbait21K, a human-annotated dataset of 21,238 videos from 40 channels in 29 countries spanning news, entertainment, education, gaming, and other categories. Quality is ensured through independent labeling by three annotators with majority voting, providing a benchmark for multimodal semantic understanding and automated content moderation.

Abstract

Clickbait content on video-sharing platforms poses a significant challenge to information reliability, yet progress in automated detection has been constrained by the lack of large-scale, high-quality multimodal datasets. We present YTClickbait21K, a human-annotated YouTube clickbait dataset comprising 21,238 videos collected from 40 channels across 29 countries, covering diverse content categories such as news, entertainment, education, and gaming. Each sample includes structured metadata (title, description, engagement statistics) along with associated thumbnail images, enabling comprehensive multimodal analysis. To ensure annotation quality, every video was independently labeled by three annotators using a standardized decision framework that incorporates textual, visual, and cross-modal consistency cues, with final labels determined through majority voting. The dataset exhibits substantial inter-annotator agreement (k=0.65), confirming reliable labeling despite the inherent subjectivity of clickbait detection. By combining scale, annotation rigor, and multimodal richness, this dataset provides a robust benchmark for developing and evaluating machine learning models, facilitating research in cross-modal semantic understanding, and advancing automated content moderation systems.
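The labeling protocol described above (three independent annotators, majority vote, an agreement statistic) can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract reports k=0.65 without naming the statistic, so the use of Fleiss' kappa is an assumption, and the function names and toy ratings are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; with three binary votes there is never a tie."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for items that are each rated by the same number of annotators.

    ratings: list of per-item label lists, e.g. [[1, 1, 0], [0, 0, 0], ...].
    Undefined (division by zero) if every rating falls into a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of annotators who assigned category j to item i
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement: fraction of agreeing annotator pairs, averaged over items
    per_item = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (hypothetical): 1 = clickbait, 0 = not clickbait
toy_ratings = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
final_labels = [majority_vote(r) for r in toy_ratings]
agreement = fleiss_kappa(toy_ratings)
```

Using an odd number of annotators on a binary label guarantees that the majority vote is always defined, which is presumably why three were used.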


2606.14778 2026-06-16 cs.CV cs.AI New submission

FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

Rui Cao, Jiannong Cao, Bo Yuan, Zhiyuan Wen, Mingjin Zhang

Affiliations: The Hong Kong Polytechnic University; China Mobile

AI summary: Proposes FactCheck, a multi-agent framework with a closed-loop "Observe-Plan-Verify" mechanism that verifies feasibility against a History Action Graph, outperforming existing methods on EPIC-Kitchens-55 and EGTEA Gaze+.

Abstract

Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop "Observe-Plan-Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.


2606.14777 2026-06-16 cs.CV cs.AI New submission

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie, Shuhuan Gu, Haoyang Huang, Qingyi Si, Nan Duan, Jiaqi Wang

Affiliation: JD.com

AI summary: Proposes a vision-language interaction model that watches continuously and decides on its own whether to respond, and open-sources an 8B-scale model together with a complete deployment system; it outperforms existing solutions in six real-world scenarios.

Abstract

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.


2606.14773 2026-06-16 cs.CV cs.AI New submission

Double-Helix Vision (DH-V2): A Geometry-Based Visual Sampler for Bandwidth-Constrained Perception

Jinwen Wen

Affiliation: Independent Researcher

AI summary: Proposes Double-Helix Vision (DH), a geometric sampler based on golden-ratio spiral trajectories that compresses 2D images into 1D signals, achieving a 1,433x compression ratio, completing perception in 0.52 ms on CPU, and improving accuracy on CIFAR-10 by 6.03%.

Comments 5 pages, 3 figures, 5 tables. Code and benchmarks: https://github.com/JackJ-C/double-helix-vision-tool

Abstract

We present Double-Helix Vision (DH), a geometry-based visual sampler that compresses 2D images into compact 1D signals using paired golden-ratio-inspired spiral trajectories. Rather than processing every pixel uniformly, DH employs two phase-shifted helices (Alpha and Beta, offset by 180 degrees) to sample the image with biologically-inspired foveation: high density at the center, sparse coverage at the periphery. At 4K resolution, DH achieves a 1,433x compression ratio (99.93% reduction) while preserving the geometric structure of the scene. The full perception pipeline -- including spatial mapping, temporal collision detection, and intra-frame structural disparity estimation -- runs in 0.52 ms at 1080p on CPU-only hardware, with no neural network dependencies. On CIFAR-10 at extreme sampling budgets (K=128 points per helix), DH achieves a +6.03% accuracy gain over uniform random sampling. A JSON-serializable Robotics API is provided, delivering sub-millisecond spatial perception reports in 2.7 KB packets. Code and benchmarks are available under the MIT License.
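The sampling idea described above (two spirals offset by 180 degrees, dense at the center and sparse at the periphery) can be sketched roughly as below. This is not the released DH implementation: the golden-angle spiral, the `falloff` exponent, and all function names are assumptions chosen only to illustrate foveated spiral sampling and the compression arithmetic.

```python
import math

# Golden angle in radians (about 2.39996), a common choice for spiral sampling
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def helix_points(width, height, k, phase=0.0, falloff=1.0):
    """Return k (x, y) pixel coordinates along a golden-angle spiral.

    falloff=0.5 would give roughly uniform area density; values above 0.5
    concentrate samples near the image center (a foveation-like layout).
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    r_max = min(cx, cy)
    points = []
    for i in range(k):
        radius = r_max * ((i + 0.5) / k) ** falloff
        theta = i * GOLDEN_ANGLE + phase
        x = int(round(cx + radius * math.cos(theta)))
        y = int(round(cy + radius * math.sin(theta)))
        points.append((x, y))
    return points


def double_helix_sample(image, k):
    """Flatten a 2D image (list of rows) into a 1D signal of 2*k samples."""
    height, width = len(image), len(image[0])
    alpha = helix_points(width, height, k, phase=0.0)
    beta = helix_points(width, height, k, phase=math.pi)  # 180-degree offset
    return [image[y][x] for x, y in alpha + beta]


def reduction_percent(compression_ratio):
    """Data reduction implied by a compression ratio, in percent."""
    return 100.0 * (1.0 - 1.0 / compression_ratio)


# Toy 64x48 grayscale image whose pixel value encodes its position
toy_image = [[row * 64 + col for col in range(64)] for row in range(48)]
signal = double_helix_sample(toy_image, k=128)
```

As a consistency check on the reported figures, `reduction_percent(1433)` rounds to 99.93, matching the abstract. If 4K is taken to mean 3840x2160 (an assumption), a 1,433x ratio corresponds to roughly 5,800 retained values per frame.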


2606.14772 2026-06-16 cs.CV cs.AI New submission

ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering

Wenhao Lu, Zhengqiu Zhu, Xiaofeng Wang, Xiaoran Zhang, Yatai Ji, Yong Zhao, Yue Hu, Yingzhen Nie, Jinlong Zhu, Zheng Zhu

Affiliations: National Key Laboratory of Digital Intelligent Modeling and Simulation, National University of Defense Technology; GigaAI

AI summary: To address insufficient fine-grained viewpoint adjustment in outdoor embodied question answering with UAVs, the paper proposes ScoutVLA, which uses a decoupled dual-expert architecture (a vision-language expert infers semantic intent; an action expert generates continuous viewpoint-adjustment trajectories) and a knowledge insulation mechanism to balance continuous control against semantic reasoning. It significantly outperforms baselines in simulated and real-world experiments.

Abstract

Aerial Embodied Question Answering (EQA) requires Unmanned Aerial Vehicles (UAVs) to actively perceive the environment and answer natural language questions. Existing outdoor EQA systems usually stop once the target enters the UAV's field of view, leaving the fine-grained viewpoint adjustment needed for evidence-seeking questions largely unresolved. To address this issue, we introduce FG-EQA, a fine-grained active perception EQA benchmark with more than 40K simulated trajectories and 1K real-world trajectories. Drawing inspiration from the "waggle dance" of scout bees, which iteratively adjust their flight paths to verify target information, we propose ScoutVLA, an evidence-driven Vision-Language-Action model for outdoor EQA. To emulate this active exploration behavior, ScoutVLA features a decoupled dual-expert architecture: a vision-language expert infers the semantic intent to identify missing evidence, while an independent action expert employs high-DoF flow matching to generate continuous viewpoint-refinement trajectories. To balance the competing demands of continuous control and semantic reasoning, we devise a decoupled training strategy with a knowledge insulation mechanism that prevents the action gradients from erasing the model's multimodal reasoning ability. Extensive simulated experiments and a qualitative real-world field study both verify the superiority of ScoutVLA over the state-of-the-art baselines, demonstrating a 10.48× higher average strict success rate and a 7.72× higher average QA correctness.


2606.14770 2026-06-16 cs.CV cs.AI cs.IR cs.LG New submission

An Empirical Analysis of Optimization Dynamics and Sparsity Boundaries in Large-Scale Pedestrian Attribute Recognition

Houssam El Mir

Affiliation: College of Computer Science and Technology, Zhejiang University of Technology

AI summary: Targeting extreme class imbalance in pedestrian attribute recognition, the paper calibrates a multi-label focal loss configuration (alpha=0.50, gamma=2.0) that matches the BCE baseline at zero computational overhead while improving hard-example mining, and identifies a Sparsity Wall boundary at a 0.1% positive sample rate.

Abstract

Pedestrian Attribute Recognition (PAR) is critical for video surveillance, enabling forensic search and re-identification systems. Extreme class imbalance remains a fundamental obstacle when merging PETA and PA-100K into a 109,000-image composite corpus, where minority attributes have positive sample fractions below 1%. This causes standard BCE optimization to suppress rare traits, a phenomenon we term the majority negative class cheating trap. We present a systematic ablation of Multi-Label Focal Loss hyperparameters (alpha and gamma) on a ResNet-18 backbone. A calibrated configuration (alpha=0.50, gamma=2.0) achieves a Macro F1-score of 62.32%, matching the BCE baseline while preserving superior hard-example mining and convergence dynamics. Our approach uses pure loss-function engineering with zero computational overhead for edge deployment. We identify the Sparsity Wall, a hard boundary where positive sample fractions below 0.1% make global loss reweighting ineffective, requiring instance-level intervention.
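For readers unfamiliar with the loss being tuned, a minimal sketch of a per-attribute focal loss with the reported setting (alpha=0.50, gamma=2.0) follows. It assumes the standard focal loss form applied independently to each attribute; the abstract does not spell out the paper's exact formulation, and the function names are hypothetical.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over per-attribute probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(probs)


def multilabel_focal_loss(probs, targets, alpha=0.50, gamma=2.0, eps=1e-7):
    """Mean focal loss, applied independently to each attribute.

    alpha weights positives (1 - alpha weights negatives); gamma shrinks the
    contribution of examples the model already classifies confidently.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -alpha * (1.0 - p) ** gamma * math.log(p)
        else:
            total += -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
    return total / len(probs)


# Toy predictions (hypothetical): one rare positive attribute, three easy negatives
probs = [0.2, 0.05, 0.1, 0.03]
targets = [1, 0, 0, 0]
focal = multilabel_focal_loss(probs, targets)
bce = bce_loss(probs, targets)
```

With gamma=0 and alpha=0.5 this expression reduces to half the BCE loss, so the gamma term is what shifts weight from the many easy negatives toward hard examples.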


2606.14767 2026-06-16 cs.RO New submission

Synthetic-to-Real Pipeline for Safe Landing Zone Detection

Shrikant Banerjee, Reza Faieghi

Affiliation: University of California, Berkeley

AI summary: Proposes a synthetic data generation and perception pipeline that generates photorealistic urban environments through domain randomization and fine-tunes a Transformer architecture, combined with a Euclidean distance transform to detect collision-free landing zones, eliminating the need for manual annotation.

Comments Proceedings of Conference on Robots and Vision (CRV) 2026, Vancouver, British Columbia, Canada

Abstract

As Uncrewed Aerial Vehicles (UAVs) transition toward higher levels of autonomy, the ability to perform unassisted recovery in non-cooperative, unstructured environments becomes critical. Achieving safe autonomous landing requires high-fidelity semantic resolution to distinguish navigable terrain from hazardous obstacles, yet development is often hindered by the scarcity of annotated aerial datasets. This work proposes a comprehensive perception and data generation pipeline designed to bridge the sim-to-real gap for autonomous landing tasks. We introduce a procedural synthetic data engine that generates photorealistic urban environments with automated semantic annotations through domain randomization. A Transformer-based OneFormer architecture is fine-tuned exclusively on this synthetic data, leveraging multi-head self-attention mechanisms for global context resolution. To ensure operational safety, a deterministic landing module utilizes a Euclidean Distance Transform (EDT) and dynamic inference logic to identify the largest inscribed safe landing zones while maintaining strict clearance buffers around obstacles. Quantitative benchmarking against the UAVid dataset demonstrates robust semantic segmentation performance, while qualitative validation on real-world UAV footage confirms the system's ability to identify collision-free landing sites in unseen environments. Our results highlight the potential of high-fidelity procedural simulation to eliminate the need for manual annotation while providing robust, edge-deployable situational awareness for autonomous UAV recovery.
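The landing step described above, finding the largest inscribed safe zone with a clearance buffer via a Euclidean Distance Transform, can be illustrated with a small sketch. It is an assumption-laden toy, not the paper's module: the EDT is computed by brute force on a tiny grid, image borders are not treated as obstacles, and the function names and clearance parameter are hypothetical.

```python
import math


def distance_to_nearest_obstacle(safe_mask):
    """Brute-force Euclidean distance transform.

    safe_mask[r][c] is True for landable cells and False for obstacles.
    Returns, for each safe cell, the distance in cells to the nearest obstacle
    (0.0 for obstacle cells, infinity if the grid has no obstacles).
    """
    height, width = len(safe_mask), len(safe_mask[0])
    obstacles = [
        (r, c) for r in range(height) for c in range(width) if not safe_mask[r][c]
    ]
    dist = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if safe_mask[r][c]:
                dist[r][c] = min(
                    (math.hypot(r - orow, c - ocol) for orow, ocol in obstacles),
                    default=math.inf,
                )
    return dist


def best_landing_site(safe_mask, clearance):
    """Return (row, col, radius) of the largest inscribed safe disc, or None
    if even the best cell is closer than `clearance` to an obstacle."""
    dist = distance_to_nearest_obstacle(safe_mask)
    radius, row, col = max(
        (dist[r][c], r, c)
        for r in range(len(dist))
        for c in range(len(dist[0]))
    )
    if radius < clearance:
        return None
    return (row, col, radius)


# Toy 7x7 segmentation mask (hypothetical): obstacles on the border, safe inside
toy_mask = [
    [0 < r < 6 and 0 < c < 6 for c in range(7)]
    for r in range(7)
]
site = best_landing_site(toy_mask, clearance=2.0)
```

On full-resolution masks the brute-force loop is far too slow; a library routine such as `scipy.ndimage.distance_transform_edt` computes the same distance map efficiently.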


2606.14766 2026-06-16 cs.CV cs.AI cs.MA New submission

XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems

Hamza Riaz, Arham Haroon, Maha Baig, Muhammad Dawood Rizwan, Muhammad Naseer Bajwa, Muhammad Moazam Fraz

Affiliations: National University of Sciences and Technology (NUST); University of Oxford

AI summary: Proposes XMedFusion, a modular AI framework in which agents for visual perception, knowledge graph construction, and retrieval-guided generation work together to strengthen visual grounding and the capture of clinical findings in radiology report generation, significantly outperforming baseline models on a public dataset.

Comments Accepted at the 2026 International Conference on Robotics and Automation in Industry (ICRAI)

Abstract

Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision-language models, achieving gains from 0.0493 to 0.3359 in BLEU-1, 0.0863 to 0.2440 in ROUGE-L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.


2606.14765 2026-06-16 cs.CV cs.AI cs.LG cs.MM New submission

Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

Qinwu Xu

Affiliation: Qinwu Xu, PhD

AI summary: Proposes the MoFore framework, which performs self-supervised video representation learning by forecasting future latent embeddings, combined with contrastive regularization to prevent representation collapse; temporal consistency and semantic structure are validated on UCF101.

Comments 13 pages, 5 Figures, and 2 Tables

Abstract

Self-supervised video representation learning has recently advanced through contrastive learning, masked reconstruction, and predictive representation learning. Reconstruction-based approaches such as MAE and VideoMAE learn representations by recovering masked visual content, while contrastive methods such as CLIP learn semantically meaningful embedding spaces through representation alignment. In this work, we introduce a Momentum-Guided Semantic Forecasting framework (MoFore) for self-supervised video representation learning. Instead of optimizing for pixel-level reconstruction or task-specific semantic alignment, the proposed method learns temporally predictive video representations by forecasting future latent embeddings from temporally distant context clips. To improve robustness across temporal scales, we further introduce randomized temporal-gap forecasting during training. The framework combines predictive latent forecasting with contrastive regularization to encourage temporal consistency while preventing representation collapse. Experiments on the UCF101 dataset demonstrate that the proposed framework learns temporally consistent and semantically meaningful video representations without using action labels during training. Quantitative analysis shows strong temporal stability and emergent category-level structure in the learned embedding space, while qualitative retrieval experiments reveal motion-aware organization across related activities. Overall, the results suggest that long-range latent forecasting provides an effective and computationally efficient approach for self-supervised video representation learning without relying on reconstruction-based objectives.


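2606.14764 2026-06-16 cs.CV cs.DM New submission

Avoiding Exponential Blow-Up in Distributive Lattice Submodular Minimization

Ishant Shanu

Affiliation: Ishant Shanu

AI summary: To address the exponential blow-up in space caused by transforming to a Boolean lattice when minimizing submodular functions over distributive lattices, the paper proposes a general framework that works only within the distributive lattice, significantly improving running time.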

2606.14763 2026-06-16 cs.RO cs.LG math.OC New submission

Bayesian Optimization for Learning Nonlinear MPC in Autonomous Agent Navigation

Lorenzo Ortolani, Gabriel Voss, Gabriele Beltrami, Francesco Dorati, Tommaso Felice Banfi

Affiliation: Talos Robotics AI

AI summary: Proposes a map-free framework that combines rolling-horizon planning with nonlinear MPC and uses Bayesian optimization for automatic parameter tuning, achieving efficient navigation in simulation and on a physical quadruped robot.

Comments Published at the IEEE ICRA 2026 Xplore Workshop (Oral), Cross-Disciplinary aspects of Exploration in Robotics, Reinforcement Learning, and Search

Abstract

Real-time autonomous navigation in dynamic, unknown environments remains a fundamental challenge for mobile robotics. We propose a map-free framework that tightly integrates reactive rolling-horizon planning with nonlinear Model Predictive Control (MPC). At each control cycle, a LiDAR-based Gaussian occupancy representation is constructed and used to generate collision-free trajectories via A* search, which are then tracked by a CasADi/IPOPT MPC formulation incorporating a smooth sigmoid obstacle barrier. To improve robustness to parameter sensitivity, we adopt an offline Bayesian optimization scheme based on Tree-structured Parzen Estimators (TPE), which identifies near-optimal controller parameters with respect to a composite navigation objective. In addition, a Gaussian Process surrogate is used to analyze parameter sensitivity and provide insight into the optimization landscape. The proposed framework is robot-agnostic and is evaluated on the Unitree Go2 quadruped in simulation using Gazebo, followed by deployment on the physical robot. Experimental results show that parameters tuned in simulation transfer effectively to hardware, maintaining comparable performance without additional tuning. The full system achieves up to a 90.0% navigation success rate when deployed, along with a 38.9% average improvement in the evaluation metrics across simulated environments.
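The abstract mentions a smooth sigmoid obstacle barrier in the MPC cost without giving its form. The sketch below shows one plausible shape, purely as an assumption: a logistic penalty on the distance to each obstacle, with hypothetical parameters `safe_distance`, `sharpness`, and `weight` of the kind an offline tuner could search over.

```python
import math


def sigmoid_barrier(distance, safe_distance=0.5, sharpness=10.0, weight=1.0):
    """Smooth penalty near `weight` inside safe_distance and near 0 far away.

    Equals weight / (1 + exp(sharpness * (distance - safe_distance))),
    evaluated in a numerically stable way for large arguments.
    """
    z = sharpness * (distance - safe_distance)
    if z >= 0:
        e = math.exp(-z)
        return weight * e / (1.0 + e)
    return weight / (1.0 + math.exp(z))


def trajectory_obstacle_cost(trajectory, obstacles, **barrier_params):
    """Sum the barrier penalty over all (trajectory point, obstacle) pairs."""
    return sum(
        sigmoid_barrier(math.hypot(px - ox, py - oy), **barrier_params)
        for px, py in trajectory
        for ox, oy in obstacles
    )


# Toy example (hypothetical): two candidate paths past a single obstacle
obstacle_list = [(1.0, 0.0)]
near_path = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.0)]
far_path = [(0.0, 0.0), (1.0, 1.5), (2.0, 0.0)]
near_cost = trajectory_obstacle_cost(near_path, obstacle_list)
far_cost = trajectory_obstacle_cost(far_path, obstacle_list)
```

Unlike a hard distance constraint, a penalty of this kind is differentiable everywhere, which is generally convenient for gradient-based solvers such as IPOPT.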
