arXivDaily: arXiv daily academic digest, updated Monday through Friday
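2606.09313 2026-06-09 cs.LG stat.AP New submission

Machine-Learning Emulation of Satellite Greenhouse Gas Retrievals: Stability over Time

Nugzar Gognadze, Motonobu Kanagawa, Yu Someya, Hisashi Yashiro

Affiliations: EURECOM; National Institute for Environmental Studies

AI summary: Studies the stability over time of machine-learning emulators of satellite greenhouse gas retrieval algorithms. Prediction accuracy degrades as the test period moves away from the training period, adding time as an input feature improves XCH4 prediction for Lasso and neural-network models, and a simple Lasso model performs as well as or better than more complex methods while being more stable.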

Details
Comments
48 pages, 9 figures, 15 tables

Abstract

Retrieval algorithms are used to estimate atmospheric concentrations of greenhouse gases (GHGs), such as carbon dioxide (CO2) and methane (CH4), by solving inverse problems from high-spectral-resolution satellite radiance measurements. However, these algorithms are computationally expensive, which makes real-time estimation at scale difficult. Machine-learning models have therefore been proposed as fast emulators of retrieval algorithms. Most existing studies, however, evaluate them only on test data from the same period as the training data. We study the stability over time of such emulators using data from the Greenhouse Gases Observing SATellite (GOSAT). We show that prediction accuracy generally deteriorates when the test period moves away from the training period. We also show that including time as an input feature substantially improves XCH4 prediction for Lasso and neural-network models. Among the methods considered, a simple Lasso model performs as well as or better than more complex methods such as neural networks, and yields more stable predictions over time. We further validate the results using the Total Carbon Column Observing Network (TCCON), a ground-based observation network. On the TCCON-matched dataset, the time-augmented Lasso achieves errors against TCCON that are comparable to the disagreement between GOSAT and TCCON for both XCO2 and XCH4.
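The evaluation protocol described in this abstract, training on one period and testing on progressively later ones, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the series is synthetic (a linear drift of 8 units per year plus a small seasonal cycle, not GOSAT data), and a constant predictor and an ordinary least-squares line stand in for the time-blind and time-augmented models (no Lasso or neural network is fitted here).

```python
import math
from statistics import mean

def make_series(n_years=5, per_year=12):
    """Synthetic monthly target (hypothetical numbers): drift plus seasonal cycle."""
    ts = [m / per_year for m in range(n_years * per_year)]
    ys = [1800.0 + 8.0 * t + 2.0 * math.sin(2 * math.pi * t) for t in ts]
    return ts, ys

def fit_line(ts, ys):
    """Ordinary least squares for y = slope * t + intercept."""
    t_bar, y_bar = mean(ts), mean(ys)
    cov = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    var = sum((t - t_bar) ** 2 for t in ts)
    slope = cov / var
    return slope, y_bar - slope * t_bar

def rmse(preds, ys):
    return math.sqrt(mean((p - y) ** 2 for p, y in zip(preds, ys)))

def evaluate_by_year(train_years=2, n_years=5, per_year=12):
    """Train on the first years, then report RMSE separately for each later year."""
    ts, ys = make_series(n_years, per_year)
    n_train = train_years * per_year
    train_t, train_y = ts[:n_train], ys[:n_train]
    const = mean(train_y)                          # time-blind stand-in model
    slope, intercept = fit_line(train_t, train_y)  # stand-in model with time as input
    results = {}
    for year in range(train_years, n_years):
        lo, hi = year * per_year, (year + 1) * per_year
        test_t, test_y = ts[lo:hi], ys[lo:hi]
        blind = rmse([const] * len(test_y), test_y)
        aware = rmse([slope * t + intercept for t in test_t], test_y)
        results[year] = (blind, aware)
    return results

if __name__ == "__main__":
    for year, (blind, aware) in evaluate_by_year().items():
        print(f"test year {year}: time-blind RMSE={blind:.2f}, time-aware RMSE={aware:.2f}")
```

Under these assumptions, the time-blind error grows as the test year moves away from the training period, while the fit that sees time as an input degrades more slowly.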

2606.09312 2026-06-09 cs.LG cs.PL New submission

Toward Compiler World Models: Learning Latent Dynamics for Efficient Tensor Program Search

Haolin Pan, Lianghong Huang, Xvlin Zhou, Mingjie Xing, Yanjun Wu

Affiliations: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes a world-model-inspired evaluator that rolls out scheduling actions in a continuous latent space with a lightweight transition model, avoiding expensive AST mutation and repeated code encoding. Implemented in TVM AutoScheduler, it achieves better latency and measurement efficiency than Ansor.

Details
Abstract

Tensor program optimization is essential for modern machine learning systems, but its search space is enormous. Existing auto-schedulers reduce measurement cost with learned cost models, yet they usually evaluate each candidate as a static code snapshot, ignoring the schedule trajectory that produced it. This makes them insensitive to action dependencies and vulnerable to superficial code variations. We propose a world-model-inspired evaluator that models schedule evaluation as action-conditioned latent dynamics over program states. Starting from the initial program, it rolls out scheduling actions in a continuous latent space with a lightweight transition model, avoiding expensive AST mutation and repeated code encoding. The final dynamic representation is combined with action and hardware features to rank candidates. Implemented in TVM AutoScheduler, our method improves representative-subgraph latency over Ansor by 1.37× on GPU and 1.54× on CPU under the same 64-trial budget. It also matches Ansor-10K within 2.2% geometric mean using 10× fewer measurements, and accelerates full-model inference over PyTorch/PyTorch-opt(cuDNN) by 4.61×/3.67× geometric mean.
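The idea of rolling out scheduling actions in a latent space, instead of re-encoding the program after every step, might be sketched as follows. This is only a toy under stated assumptions: the two-dimensional latent, the one-hot action vectors, the hand-picked weight matrices `W_STATE` and `W_ACTION`, and the tanh update are all hypothetical stand-ins, not the paper's learned transition model.

```python
import math

# Hypothetical, hand-picked weights; a real evaluator would learn these.
W_STATE = [[0.5, 0.1], [-0.2, 0.4]]
W_ACTION = [[1.0, 0.0], [0.0, 1.0]]

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def transition(latent, action):
    """One latent step: z' = tanh(W_STATE z + W_ACTION a)."""
    pre = [s + a for s, a in zip(matvec(W_STATE, latent), matvec(W_ACTION, action))]
    return [math.tanh(p) for p in pre]

def rollout(initial_latent, actions):
    """Apply the actions in order, starting from the encoded initial program."""
    latent = list(initial_latent)
    for action in actions:
        latent = transition(latent, action)
    return latent

if __name__ == "__main__":
    tile, unroll = [1.0, 0.0], [0.0, 1.0]   # hypothetical action encodings
    print(rollout([0.0, 0.0], [tile, unroll]))
    print(rollout([0.0, 0.0], [unroll, tile]))
```

Because the update is nonlinear, the two orderings end in different latent states, which is the kind of action dependence a static code snapshot cannot express.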

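2606.09311 2026-06-09 cs.AI New submission

FF-JEPA: Long-Horizon Planning in World Models with Latent Planners

Sergi Masip, Jonathan Swinnen, Yutong Hu, Renaud Detry, Tinne Tuytelaars

Affiliations: KU Leuven

AI summary: Proposes FF-JEPA, a hierarchical method that introduces an action-free latent planner to predict subgoals and decomposes complex trajectories into short-horizon optimization problems, addressing the high computational cost of long-horizon planning and the need for goal images.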

Details
Abstract

Joint Embedding Predictive Architectures (JEPAs) have shown promising world modeling capabilities, enabling planning in latent space by optimizing action trajectories using methods like the Cross-Entropy Method (CEM). These methods are, however, too computationally expensive and ineffective for long-horizon planning. Furthermore, these methods typically require an explicit image of the goal state, which is not always possible in real-world tasks. In this work, we tackle these limitations by proposing Forward-Forward-JEPA (FF-JEPA), a hierarchical approach leveraging two forward dynamics models. Alongside a standard action-conditioned forward model, we introduce an action-free latent planner that predicts the next subgoal given the current state. This approach removes the need for goal images and enables long-horizon planning by decomposing complex trajectories into a sequence of tractable, short-term optimization problems. Preliminary results on PushT demonstrate that FF-JEPA successfully overcomes flat world models' long-horizon collapse, highlighting this approach as a promising direction for goal-free planning.
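For readers unfamiliar with the Cross-Entropy Method mentioned above, a minimal sketch of CEM over action sequences follows. It is not FF-JEPA: the one-dimensional state, the additive dynamics in `rollout_cost`, the action bound of 1, and every hyperparameter are assumptions chosen to keep the example small.

```python
import random

def rollout_cost(start, actions, goal):
    """Toy forward model: each action shifts a 1-D state; cost is squared final distance."""
    state = start
    for a in actions:
        state += a
    return (state - goal) ** 2

def cem_plan(start, goal, horizon=5, n_samples=64, n_elite=8, n_iters=10, seed=0):
    """Sample action sequences, keep the elites, refit the sampling distribution, repeat."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    best_plan, best_cost = list(mean), rollout_cost(start, mean, goal)
    for _ in range(n_iters):
        samples = []
        for _ in range(n_samples):
            # Actions are clipped to [-1, 1] (an assumed actuator limit).
            plan = [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            samples.append((rollout_cost(start, plan, goal), plan))
        samples.sort(key=lambda pair: pair[0])
        if samples[0][0] < best_cost:
            best_cost, best_plan = samples[0]
        elites = [plan for _, plan in samples[:n_elite]]
        mean = [sum(p[i] for p in elites) / n_elite for i in range(horizon)]
        # A small floor on the spread keeps the search from collapsing too early.
        std = [
            max(0.05, (sum((p[i] - mean[i]) ** 2 for p in elites) / n_elite) ** 0.5)
            for i in range(horizon)
        ]
    return best_plan, best_cost

if __name__ == "__main__":
    plan, cost = cem_plan(start=0.0, goal=3.0)
    print(plan, cost)
```

The number of sampled sequences needed grows with the horizon, which is one way to see why decomposing a long task into short subgoal-directed problems, as the abstract proposes, can be attractive.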

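2606.09304 2026-06-09 cs.CL cs.LG New submission

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, Xiaosong Yuan

Affiliations: Zhejiang University; Hunan University; Tianjin University; Shanghai Jiao Tong University; Jilin University

AI summary: Addresses two assumptions of on-policy distillation that often fail in practice, trajectory-level alignment and uniform reliability of teacher preferences. The proposed SG-OPD improves distillation through sign-consistency gating and phased teacher sampling, with average gains of 1.98 and 7.50 at the per-sample and per-question levels on competition-level mathematical reasoning tasks.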

Details
Abstract

On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.

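2606.09303 2026-06-09 cs.CV New submission

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

Xinyan Gao, Haoran Hao, Xiangyu Yue

Affiliations: The Chinese University of Hong Kong; Nanjing University

AI summary: Proposes Rea2Seg, a two-stage framework that first generates candidate masks from attention maps and then scores them through multimodal large language model reasoning, recasting segmentation as candidate discovery followed by discriminative selection. Also introduces a new benchmark, ReasonSeg-SGDR, that comprehensively evaluates perception, grounding, and reasoning abilities.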

Details
Comments
Project page: https://snowball521.github.io/Rea2Seg-Project/

Abstract

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework Rea2Seg for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign scores to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection. We also notice that a large portion of questions in existing benchmarks focus on commonsense reasoning, and these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark called ReasonSeg-SGDR that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.

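2606.09301 2026-06-09 cs.LG New submission

PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

Zekai Chen, Miao Zhang, Jiayang Xing, Xunkai Li, Xun Wu, Rong-Hua Li, Guoren Wang

Affiliations: Beijing Institute of Technology

AI summary: Targets client-level modality deficiency in federated graph learning with PRISM, a topology-aware cross-modal imputation framework that retrieves missing-modality semantics from the federation and injects them into local graph propagation under topology-aware control. It improves over baselines by 4.48% on average across six multimodal graph datasets.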

Details
Abstract

Multimodal federated graph learning (MM-FGL) aims to collaboratively learn from decentralized graphs with text and images. However, real-world clients may not share a common modality basis: a visual-search client may contain image-interaction graphs but no seller descriptions, while a catalog client may provide text but no product images. We refer to this practical setting as client-level modality deficiency. Unlike random instance-wise missingness, a deficient client lacks the local semantic basis needed to reconstruct the absent modality. More importantly, in graph learning, incomplete representations initialize message passing, so imputation errors can be filtered, mixed, and amplified by the receiving topology. To address this gap, we propose PRISM (Proactive Retrieval and Imputation via Structural Meta-prompting), a topology-aware federated cross-modal imputation framework. Rather than reconstructing the missing modality solely from local observations, PRISM recovers missing-modality semantics from the federation and introduces them into local graph propagation under topology-aware control. Experiments on six multimodal graph datasets across graph-centric and modality-centric tasks show that PRISM consistently improves modality-deficient clients, outperforming state-of-the-art baselines by 4.48% on average.

2606.09295 2026-06-09 cs.CL New submission

NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

Hongkun Yang, Xinhui Yi, Xiyan Zhao, Yibo Meng, Lionel Z. Wang, Lixu Wang, Yaqi Zhang, Ruiqi Chen, Xuanyue Zhao, Lanxin Zhang, Yu Zeng, Weijia Chu, Yiming Ma, Chenyu Liu, Jianghao Lin, Xin Xu

Affiliations: Ocean University of China; The Hong Kong Polytechnic University; Cornell University; Nanyang Technological University; Shanghai Jiao Tong University; University of Michigan–Ann Arbor; University of Science and Technology of China; Harbin Institute of Technology

AI summary: Addresses the scarcity of Nüshu speech data by proposing the NüshuVoice benchmark and Nüshu-PitchVITS, an F0-conditioned VITS framework that uses five-level pitch notation as a prosodic prior. It outperforms strong baselines in spectral fidelity, pitch reconstruction, and intelligibility.

Details
Comments
12 pages, 3 figures

Abstract

Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplored. Building a Nüshu text-to-speech (TTS) system is particularly challenging because available recordings are extremely limited and mostly consist of isolated syllable-level pronunciations rather than natural sentence-level utterances. In this work, we introduce NüshuVoice, the first TTS benchmark for Nüshu. We construct a sentence-level Nüshu text-to-audio dataset that aligns standardized Unicode Nüshu text, phonetic transcriptions, standard Chinese translations, and archival recordings. To synthesize speech under this extreme low-resource setting, we propose Nüshu-PitchVITS, an F0-conditioned VITS framework that leverages Nüshu's five-level pitch notation as an explicit prosodic inductive bias. Experimental results show that Nüshu-PitchVITS outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility. We publicly release the dataset and code at: https://anonymous.4open.science/r/Nvshu-TTS-2EB6.

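2606.09293 2026-06-09 cs.CL New submission

One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems

Mingzhe Li, Jing Xiang, Enguo Zhou, Lang Gao, Tai Li, Qishen Zhang, Xiangliang Zhang, Xiuying Chen

Affiliations: ByteDance; MBZUAI; University of Notre Dame

AI summary: Proposes MORE, an adaptive multi-objective reinforcement learning framework that treats reasoning functions as constraints guiding policy optimization and introduces an adaptive multi-reward mechanism to balance linguistic objectives. It improves both reasoning accuracy and linguistic naturalness in e-commerce dialogue systems, and raises reached conversion by 30.09% in online experiments.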

Details
Comments
Accepted by KDD 2026

Abstract

Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also generating natural and faithful responses. These goals are complementary but not identical. In this work, we propose MORE, an adaptive Multi-Objective REinforcement learning framework that jointly optimizes reasoning accuracy and linguistic naturalness. Our preliminary experiments show that directly mixing rewards with diverging optimization dynamics can cause oscillations and unstable learning. Thus, instead of optimizing a single mixed reward, we treat reasoning functions as constraints that guide policy optimization. At inference time, the system directly generates responses without explicit reasoning steps, while still benefiting from a reasoning-enhanced scaffold and avoiding additional inference overhead. To better balance linguistic objectives during response generation, we introduce an adaptive multi-reward mechanism that aggregates signals such as fluency and naturalness and dynamically reweights them via gradient feedback. We evaluate MORE on two real-world dialogue systems at ByteDance and the MultiWOZ 2.2 benchmark, where it consistently outperforms strong baselines. In 14-day online experiments on ByteDance production traffic, MORE improves overall and reached conversion by 16.53% and 30.09%, respectively, while increasing user satisfaction and reducing handoff rates. Notably, in a human-machine comparison, MORE recovers about 60% of the incremental conversion lift achieved by human agents.

2606.09292 2026-06-09 cs.RO cs.SY eess.SY New submission

Dual Quaternion-Based Unscented Kalman Filter with Visual Inertial Odometry for Navigation in GPS-Denied Environments

Mohamed Khalifa, Hashim A. Hashim

Affiliations: Carleton University

AI summary: Proposes a Dual Quaternion-Based Unscented Kalman Filter (DQUKF) combined with Visual Inertial Odometry (VIO) for accurate state estimation in GPS-denied environments, achieving a position RMSE of 0.2584 m on the EuRoC dataset.

Details
Abstract

Reliable navigation in GPS-denied environments remains a fundamental challenge in robotics, aerospace, and autonomous vehicle applications. This paper presents a Dual Quaternion-Based Unscented Kalman Filter (DQUKF) equipped with a Visual Inertial Odometry (VIO) algorithm for accurate state estimation enabling navigation in GPS-denied locations. The proposed framework formulates the DQUKF in error-state form, where the nominal pose is represented by a unit dual quaternion and the local pose error is represented by a 6-dimensional twistor parameterization used for sigma point generation, covariance propagation, and measurement correction. In parallel, the VIO algorithm tracks features across image frames, synchronizes measurements between the IMU and camera, and provides visual constraints that complement inertial propagation. Simulation results on the EuRoC MAV dataset show that the proposed DQUKF converges under high initialization uncertainty and achieves a position RMSE of 0.2584 m in the difficult flight sequence, outperforming the benchmark filters.
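As background for the pose representation named above, the sketch below shows how a rotation and a translation can be packed into a unit dual quaternion and composed. It covers only this standard construction, not the filter or the twistor error state. The convention (Hamilton product, components ordered w, x, y, z, dual part equal to one half of the translation times the rotation) and all function names are assumptions of this example; the paper may use a different convention.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def pose_to_dual_quat(rotation, translation):
    """Pack a unit rotation quaternion and a translation into (real, dual) parts."""
    t = (0.0, *translation)                      # translation as a pure quaternion
    dual = tuple(0.5 * c for c in qmul(t, rotation))
    return rotation, dual

def dual_quat_to_translation(real, dual):
    """Recover the translation: t = 2 * dual * conj(real)."""
    t = qmul(tuple(2.0 * c for c in dual), qconj(real))
    return t[1:]

def dq_mul(a, b):
    """Compose two poses: apply b first, then a."""
    (ar, ad), (br, bd) = a, b
    real = qmul(ar, br)
    dual = tuple(x + y for x, y in zip(qmul(ar, bd), qmul(ad, br)))
    return real, dual

if __name__ == "__main__":
    c = math.sqrt(0.5)
    turn = pose_to_dual_quat((c, 0.0, 0.0, c), (1.0, 0.0, 0.0))        # 90 degrees about z, then shift
    step = pose_to_dual_quat((1.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0))    # pure shift along x
    print(dual_quat_to_translation(*dq_mul(turn, step)))
```

One appeal of this representation is that rotation and translation are composed by a single product, which avoids treating attitude and position as separate states.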

2606.09290 2026-06-09 cs.CV New submission

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Zizhao Tong, Xiaofeng Zhang, Xiaosong Yuan

Affiliations: Zhejiang University; Hunan University; Tianjin University; University of Chinese Academy of Sciences; Shanghai Jiao Tong University; Jilin University

AI summary: Proposes Visual Para-Thinker++, a framework in which a shared MLLM policy is instantiated as multiple role agents that reason in parallel. Combined with Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, it mitigates early perceptual commitment and hallucination in visual reasoning.

Details
Abstract

Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.

2606.09286 2026-06-09 cs.RO New submission

VAIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands

Dongting Li, Qianyang Wu, Xingyu Chen, Liang Li, Yuhang Lin, Sikai Wu, Guoyao Zhang, Mingliang Zhou, Diyun Xiang, Qiang Zhang, Renjing Xu, Jianzhu Ma

Affiliations: Tsinghua University; HKUST (Guangzhou); Xiaomi Robotics Lab

AI summary: Proposes VAIC, a framework that uses decoupled commands and a two-stage distillation paradigm to achieve agile humanoid object interaction from onboard depth and historical proprioception alone, outperforming baselines on dynamic tasks such as box carrying, cart pushing, and skateboarding.

Details
Comments
Webpage: https://vaic-humanoid.github.io/

Abstract

Humanoid robots hold immense potential for real-world assistance, yet agile interaction with objects in unstructured environments demands tightly coupled whole-body coordination. Despite recent advancements, current controllers face a critical deployment gap: they rely heavily on dense reference trajectories and perfect state observability, which inherently limits physical generalization. We present Vision Guided Agile Interaction Control (VAIC), a unified framework that bridges this gap by operating exclusively on onboard depth, historical proprioception, and a decoupled user command interface. VAIC employs a two-stage distillation paradigm. First, a privileged teacher policy masters diverse interaction skills using precise object kinematics and exact environmental states. Second, a deployable student policy distills these capabilities by replacing full body tracking with velocity targets across multiple axes and an interaction indicator for each frame. The student utilizes a recurrent object adaptation module to implicitly infer unobservable object dynamics from raw depth streams and proprioception. Evaluations and real-world deployments on the humanoid robot demonstrate that a single VAIC policy successfully executes highly diverse dynamic tasks, including box carrying, cart interaction, and skateboarding, while consistently outperforming baselines and advancing autonomous humanoid deployment.

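2606.09278 2026-06-09 cs.LG cs.AI New submission

Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation

Rafael Cabral, Pang Zixi, Ziyi Shou, Shen Xin

Affiliations: Huawei Celia Team

AI summary: To address hallucination by large language models in precision-critical domains such as technical diagramming and mechanical design, proposes PyGeoX, a programmable geometric DSL, and PyGeoX-Bench, a stratified benchmark. Also designs Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms to resolve Outlier Gradient Masking, making an 8B model competitive with much larger frontier systems on the benchmark.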

Details
Abstract

Large Language Models frequently hallucinate in precision-critical domains such as technical diagramming and mechanical design, where outputs must satisfy strict geometric constraints. We study open-ended geometric synthesis from natural language: translating free-form descriptions into precise constructions whose entities must simultaneously satisfy dozens of interacting constraints. To make this tractable, we release PyGeoX, a programmable geometric DSL that compiles declarative constraints into a differentiable loss, and PyGeoX-Bench, a stratified suite of 300 problems with per-constraint verifiable rewards. Using PyGeoX as a verifier, we identify a failure mode we call Outlier Gradient Masking: under global-norm rewards (any scheme that aggregates residuals through a single norm, for example, exp(-MSE)), a single outlier constraint can nullify the learning signal across all others. To address this, we propose Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms, preserving partial progress and ensuring consistent gradients even under severe violations. Against MSE-based rewards, the natural baseline for geometry solvers, SAR improves the hard-tier solving rate by 2.3×, and the resulting 8B model is competitive with much larger frontier systems on this benchmark. We release the engine, benchmark, and data at https://github.com/Huawei-AI4Math/PyGeoX.
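The masking effect described above can be checked with a few lines of arithmetic. The sketch compares exp(-MSE), the global-norm example named in the abstract, with a bounded additive reward. The abstract does not give the SAR formula, so the per-constraint term exp(-r²) and the averaging used in `saturating_additive_reward` are assumptions made purely for illustration, as are the residual values.

```python
import math

def global_norm_reward(residuals):
    """exp(-MSE): all residuals are aggregated through a single norm."""
    mse = sum(r * r for r in residuals) / len(residuals)
    return math.exp(-mse)

def saturating_additive_reward(residuals):
    """Assumed SAR-style form: the mean of bounded per-constraint terms exp(-r^2)."""
    return sum(math.exp(-r * r) for r in residuals) / len(residuals)

if __name__ == "__main__":
    # Hypothetical residuals: nine constraints nearly satisfied, one badly violated.
    before = [0.1] * 9 + [10.0]
    # The nine small residuals are fixed; the outlier remains.
    after = [0.0] * 9 + [10.0]
    for name, reward in [("global norm", global_norm_reward),
                         ("saturating additive", saturating_additive_reward)]:
        print(f"{name}: {reward(before):.3e} -> {reward(after):.3e}")
```

With these numbers the global-norm reward barely changes when the nine constraints are fixed, because the outlier dominates the MSE, whereas the additive form credits the partial progress.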

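2606.09276 2026-06-09 cs.LG New submission

ERBench: A Benchmark and Testsuite for Equation Discovery Algorithms

Paul Kahlmeyer, Henrik Voigt, Michael Habeck, Joachim Giesen

Affiliations: University of Jena

AI summary: Proposes the ERBench benchmark, which evaluates symbolic regression algorithms through equation recovery tasks and emphasizes robustness under changing dimensionality, sample size, sampling distribution, and sampling domain, filling a gap in existing benchmarks.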

Details
Abstract

Equation discovery aims to automate the discovery of scientific models in the form of mathematical equations from data. Technically, equation discovery is implemented by symbolic regression algorithms. Performance of symbolic regression for equation discovery is measured along two dimensions: prediction accuracy on test data, and recovery of known ground-truth formulas. For standard regression, accuracy is typically measured on in-domain test data, for instance, by splitting a data set randomly into training and test data. While this makes sense for in-domain interpolation, which is the common goal in ordinary regression, it can be a misleading proxy for true model discovery and generalization. The obvious alternative is to measure out-of-domain accuracy. However, obtaining challenging out-of-domain test data is a non-trivial problem. Therefore, we focus on equation recovery for evaluating symbolic regression algorithms for equation discovery. The rationale is that symbolic regression algorithms that perform well in recovering known ground-truth formulas are good candidates to perform well in unknown equation discovery. Existing benchmarks for symbolic regression include equation recovery tasks, but with only a small number of ground-truth formulas that are publicly known. Moreover, these benchmarks place less emphasis on evaluating the robustness of algorithms in terms of their behavior under changing dimensionality, sample size, sampling distribution, and sampling domain. This, however, is of central importance to practitioners wanting to discover equations for modeling natural phenomena, since data is almost certainly noisy and comes from diverse domains, distributions, and sample sizes. To fill this gap, we introduce the Equation Recovery Benchmark (ERBench), a new evaluation framework designed to rigorously assess algorithms explicitly targeting the task of equation discovery.
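To make the difference between recovery and mere accuracy concrete, the sketch below tests whether a candidate expression agrees with a ground-truth formula at randomly sampled points. The abstract does not state how ERBench decides that a formula has been recovered, so this numeric check, the function name `numerically_equivalent`, the sampling range, and the tolerance are all assumptions; a symbolic comparison would be stricter.

```python
import math
import random

def numerically_equivalent(candidate, ground_truth, n_vars,
                           n_points=200, low=0.5, high=2.0, tol=1e-9, seed=0):
    """Return True if both callables agree (within tol) at every sampled point."""
    rng = random.Random(seed)
    for _ in range(n_points):
        point = [rng.uniform(low, high) for _ in range(n_vars)]
        a, b = candidate(*point), ground_truth(*point)
        if not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
            return False
    return True

if __name__ == "__main__":
    kinetic_energy = lambda m, v: 0.5 * m * v ** 2        # example ground truth
    rewritten = lambda m, v: m * v * v / 2                # same formula, different form
    close_but_wrong = lambda m, v: 0.5 * m * v ** 2 + 0.01
    print(numerically_equivalent(rewritten, kinetic_energy, n_vars=2))
    print(numerically_equivalent(close_but_wrong, kinetic_energy, n_vars=2))
```

The third expression would score well on prediction accuracy yet is not the ground-truth formula, which is the gap between the two performance dimensions that the abstract draws attention to.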

2606.09273 2026-06-09 cs.CV New submission

EditSSC: Toward Editable Semantic Occupancy Scenes with Unconditional Diffusion Models

Fatima Balde, Raoul de Charette, Alexandre Boulch

Affiliations: Inria; Valeo.ai

AI summary: Proposes EditSSC, which uses 2D BEV representations and an off-the-shelf latent diffusion network for 3D semantic scene generation and training-free editing, outperforming existing 3D-specific baselines on SemanticKITTI.

Details
Comments
Accepted at CVPR 2026 Workshop

Abstract

3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex 3D-specific architectures such as triplane encoders and adapted diffusion networks, limiting both their simplicity and their editing capabilities. We propose EditSSC, an editing-ready method for 3D semantic scene generation using 2D Bird's Eye View (BEV) representations and an off-the-shelf latent diffusion network. Our approach reshapes 3D semantic occupancy grids into multi-channel BEV images and leverages the quantized autoencoder and UNet from Stable Diffusion with minimal modifications. We perform diffusion on the latents after quantization, which enables training-free editing capabilities. By exploiting class-to-code correspondences in the codebook, our method supports sketch-guided generation, inpainting, and outpainting without any retraining. On SemanticKITTI, EditSSC outperforms existing 3D-specific baselines on unconditional generation, demonstrating that well-established 2D architectures can be effectively repurposed for 3D scene generation and editing.
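The reshaping step mentioned above, turning a 3D occupancy grid into a multi-channel BEV image, might look like the following sketch. The layout is an assumption: the grid is indexed as `grid[x][y][z]` with one class id per voxel, and the height axis `z` is folded into image channels, giving `bev[z][x][y]`. The paper's actual axis order and any label encoding are not given in the abstract.

```python
def occupancy_to_bev(grid):
    """Fold the height axis into channels: (X, Y, Z) class ids -> (Z, X, Y) image."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    return [[[grid[x][y][z] for y in range(ny)] for x in range(nx)] for z in range(nz)]

def bev_to_occupancy(bev):
    """Inverse mapping: (Z, X, Y) image -> (X, Y, Z) occupancy grid."""
    nz, nx, ny = len(bev), len(bev[0]), len(bev[0][0])
    return [[[bev[z][x][y] for z in range(nz)] for y in range(ny)] for x in range(nx)]

if __name__ == "__main__":
    # Tiny hypothetical grid: 2 x 3 ground cells, 4 height levels, values encode the index.
    grid = [[[100 * x + 10 * y + z for z in range(4)] for y in range(3)] for x in range(2)]
    bev = occupancy_to_bev(grid)
    print(len(bev), len(bev[0]), len(bev[0][0]))   # channels, height, width of the image
    print(bev_to_occupancy(bev) == grid)
```

Because the mapping is a lossless permutation of axes, anything generated or edited in the 2D image can be mapped back to a 3D grid without additional decoding.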

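2606.09271 2026-06-09 cs.SD cs.LG New submission

Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

George Theodosiou, Loukas Ilias, Dimitris Askounis

Affiliations: National Technical University of Athens

AI summary: Proposes a multi-branch deep learning framework that fuses three complementary speech modalities (Log-Mel spectrograms, MFCCs, and HuBERT embeddings) and weights them dynamically through a context-guided cross-modal attention mechanism. It reaches 91.51% accuracy and 95.97% AUC on the PC-GITA corpus, supporting the value of heterogeneous speech modeling for Parkinson's disease detection.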

Details
Abstract

Parkinson's disease (PD) is a progressive neurodegenerative disorder that frequently causes speech impairments associated with hypokinetic dysarthria. As speech production relies on the precise coordination of complex neuromuscular mechanisms, speech analysis has emerged as a promising non-invasive and cost-effective biomarker for early PD detection. Recent deep learning approaches have shown encouraging results; however, most existing methods rely on a single speech representation, potentially overlooking complementary pathological information encoded across different feature spaces. In this work, we propose a multi-branch deep learning framework for automatic PD detection from speech. Each recording is segmented into 5-second chunks and represented using three complementary modalities: Log-Mel spectrograms, MFCCs, and HuBERT embeddings extracted from raw waveforms. The spectrograms are processed using a pre-trained ResNet-18 encoder, MFCC sequences are modeled through a BiLSTM network, and raw speech is encoded using a pre-trained HuBERT model. To effectively integrate these heterogeneous representations, we introduce a context-guided cross-modal attention mechanism that dynamically weights temporal HuBERT embeddings according to the global acoustic context derived from the spectrogram and MFCC branches. Experiments conducted on the publicly available Spanish PC-GITA corpus under strict speaker-independent 5-fold cross-validation demonstrate the effectiveness of the proposed approach. The proposed architecture achieves an accuracy of 91.51%, an F1-score of 91.24%, and an AUC of 95.97%. Furthermore, ablation studies confirm the contribution of both the proposed context-guided cross-modal attention mechanism and the integration of complementary speech representations. These findings highlight the potential of heterogeneous speech modeling for robust and clinically reliable PD detection.
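The idea of weighting temporal embeddings by a global context can be sketched as follows. This is not the authors' architecture: it assumes the context is simply the mean of the two branch vectors, that scores are scaled dot products, and that all vectors share one dimension. The function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_guided_pool(frames, context):
    """Score each frame against the context, then return the
    attention-weighted sum of frames together with the weights."""
    dim = len(context)
    scores = [sum(f * c for f, c in zip(frame, context)) / math.sqrt(dim)
              for frame in frames]
    weights = softmax(scores)
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


# Toy inputs: global vectors from the spectrogram and MFCC branches,
# and three time steps standing in for HuBERT frame embeddings.
spec_vec = [1.0, 0.0]
mfcc_vec = [1.0, 0.0]
context = [(s + m) / 2 for s, m in zip(spec_vec, mfcc_vec)]
frames = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
pooled, weights = context_guided_pool(frames, context)
```

Frames aligned with the context receive the largest weights, so the pooled vector is dominated by the time steps the other two branches consider relevant.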

2606.09268 2026-06-09 cs.RO 新提交

VGP-Nav: Metric-Aware Visual Geometric Perception for Robot Navigation

VGP-Nav:用于机器人导航的度量感知视觉几何感知

Hewei Pan, Weiye Zhu, Zekai Zhang, Zitong Huang, Rongtao Xu, Jinbao Wang, Feng Zheng

发表机构 * Southern University of Science and Technology(南方科技大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Shenzhen University(深圳大学) SpatialTemporal AI(时空人工智能)

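AI summary: Proposes VGP-Nav, a framework that relies only on monocular RGB input and resolves scale ambiguity through ground-plane geometric constraints, unifying metric localization and obstacle perception.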

详情
AI中文摘要

可靠的机器人导航需要精确的全局定位和稠密、度量一致的障碍物感知的无缝集成。实现这些能力的常见策略涉及集成多种传感模态:相机提供丰富的视觉特征用于定位,而主动传感器如LiDAR提供直接的度量测量。然而,这种多传感器配置需要复杂的时空校准并增加部署开销。尽管纯视觉方法提供了低成本且可扩展的替代方案,现有的单目视觉系统通常难以同时实现高效、全局一致的定位和稠密、度量一致的几何感知。为弥合这一差距,我们提出VGP-Nav,一个统一的度量感知视觉几何感知框架,仅依赖单目RGB输入,联合支持度量定位和障碍物感知。我们的关键洞察是将基于定位的视觉几何锚定到从地面平面几何导出的物理上有意义的尺度约束,从而为单目感知提供可靠的度量参考。VGP-Nav在线解决单目尺度模糊,并生成可直接用于下游规划的、基于定位的度量障碍物表示。大量实验证明了其在多种环境中的强泛化能力以及在真实移动机器人上的成功部署,突显了该方法在可扩展、低成本且安全的自主导航中的实用性。

英文摘要

Reliable robotic navigation necessitates the seamless integration of accurate global localization and dense, metric-consistent obstacle perception. A common strategy to achieve these capabilities involves integrating diverse sensing modalities: cameras offer rich visual features for localization, while active sensors like LiDAR provide direct metric measurements. However, such multi-sensor configurations necessitate complex spatial-temporal calibration and increase deployment overhead. Although vision-only approaches offer a low-cost and scalable alternative, existing monocular visual systems typically struggle to simultaneously achieve efficient, globally consistent localization and dense, metric-consistent geometric perception. To bridge this gap, we propose VGP-Nav, a unified framework for Metric-Aware Visual Geometric Perception that relies solely on monocular RGB input to jointly support metric localization and obstacle perception. Our key insight is to anchor localization-grounded visual geometry to physically meaningful scale constraints derived from ground-plane geometry, thereby providing a reliable metric reference for monocular perception. VGP-Nav resolves monocular scale ambiguity online and produces localization-grounded, metric obstacle representations that are directly applicable to downstream planning. Extensive experiments demonstrate strong generalization across diverse environments and successful deployment on real mobile robots, highlighting the practicality of our approach for scalable, low-cost, and safe autonomous navigation.
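One common way a ground plane can fix monocular scale is sketched below. It assumes the camera's mounting height above the ground is known and that a ground plane has already been fitted in the up-to-scale reconstruction; the abstract does not say that VGP-Nav derives its constraint this way, and the function names and numbers are hypothetical.

```python
import math


def metric_scale_from_ground(normal, offset, camera_height_m):
    """Plane is n . p + d = 0 in up-to-scale camera coordinates, with the
    camera at the origin. The unscaled camera height is |d| / ||n||, and
    the metric scale is the known height divided by that value."""
    norm = math.sqrt(sum(c * c for c in normal))
    if norm == 0.0:
        raise ValueError("plane normal must be non-zero")
    unscaled_height = abs(offset) / norm
    if unscaled_height == 0.0:
        raise ValueError("camera lies on the fitted plane")
    return camera_height_m / unscaled_height


def to_metric(points, scale):
    """Rescale up-to-scale 3D points into metres."""
    return [[scale * c for c in p] for p in points]


# Assumed values: the fitted plane sits 0.5 units below the camera,
# and the camera is mounted 1.2 m above the floor.
scale = metric_scale_from_ground([0.0, 1.0, 0.0], -0.5, 1.2)
obstacles_m = to_metric([[1.0, 0.5, 4.0]], scale)
```

With these values the scale is 2.4, so an obstacle reconstructed 4 units ahead lies 9.6 m away in metric terms.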

2606.09266 2026-06-09 cs.SD cs.AI 新提交

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

物理引导的序列生成框架用于声学超材料逆向设计

Yijie Li, Jiahao Xu, Ching-Chih Tsao, Lili Qiu, Jingxian Wang

发表机构 * National University of Singapore(新加坡国立大学) UT Austin(德克萨斯大学奥斯汀分校)

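AI summary: Proposes the MetaSeq framework, which represents acoustic metamaterials as structured sequences and combines a sequence-to-sequence model with a physics solver and reinforcement learning for broadband inverse design, reducing response error by 45% over the best baseline.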

详情
AI中文摘要

声学超材料(AMM)逆向设计对于宽带目标响应尤其具有挑战性,原因是声学色散:在一个频率上匹配期望响应的结构可能在其它频率上偏离,而修改几何以改善一个子带通常会扰动相邻子带。然而,现有的宽带逆向设计方法要么受限于预定义模板,要么依赖于无法保持声学结构所需的几何精度和结构连通性的图像表示。我们提出了MetaSeq,一个物理引导的、基于序列的生成框架,用于声学超材料逆向设计。其核心是,MetaSeq引入了一种语言,将每个AMM表示为结构化序列,而不是像素网格或固定模板。这种表示保留了精确的几何形状,显式编码了连通性,并将逆向设计转化为从目标响应到结构序列的序列到序列任务。MetaSeq进一步构建了一个平衡、高保真的数据集,具有高效的校准和基于复杂度的采样。为了解决逆向设计的一对多性质,MetaSeq结合了监督预训练和基于物理求解器及有效性检查器引导的强化学习微调。针对COMSOL和五个基线的广泛评估表明,MetaSeq在最佳基线基础上将响应误差降低了45%。

英文摘要

Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs neighboring sub-bands. Yet existing broadband inverse-design approaches are either constrained by predefined templates, or rely on image representations that fail to preserve the geometric precision and structural connectivity required by acoustic structures. We present MetaSeq, a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design. At its core, MetaSeq introduces a language that represents each AMM as a structured sequence, rather than as a pixel grid or fixed template. This representation preserves precise geometry, explicitly encodes connectivity, and casts inverse design as a sequence-to-sequence task from target response to structure sequence. MetaSeq further constructs a balanced, high-fidelity dataset with efficient calibration and complexity-based sampling. To address the one-to-many nature of inverse design, MetaSeq combines supervised pretraining with reinforcement learning fine-tuning guided by a physics-based solver and validity checker. Extensive evaluations against COMSOL and five baselines show that MetaSeq reduces response error by 45% over the best baseline.

2606.09262 2026-06-09 cs.CV 新提交

See More, Match Better: Multi-Source Feature Fusion for Two-View Correspondence Learning

看得更多,匹配更好:用于双视图对应学习的多源特征融合

Xiaojie Li, Xin Jiang, Luanyuan Dai, Jinnan Yang, Yongdong Zhang, Zechao Li

发表机构 * Nanjing University of Science and Technology(南京理工大学) People’s Daily Online(人民网) University of Science and Technology of China(中国科学技术大学)

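AI summary: Proposes the TriMatch framework, which fuses geometric, texture semantic, and structural semantic features and, through semantic-guided modulation and hierarchical refinement, improves correspondence discrimination in scenes such as those with repetitive structures.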

详情
Comments
Correspondence Learning, Multi-Source Feature Fusion, Outlier Removal, Camera Pose Estimation
AI中文摘要

双视图对应学习旨在通过利用图像对中真假对应点的内在差异来区分内点和外点。现有方法主要依赖于基于坐标的几何一致性。然而,在包含重复结构、无纹理区域或局部相似几何模式的场景中,它们常常难以处理伪一致的外点。为了解决这一限制,我们提出了TriMatch,一个用于双视图对应学习的多源特征融合框架,由特征提取和特征细化两部分组成。在特征提取中,TriMatch联合提取几何、纹理语义和结构语义特征,为对应点判别提供互补证据。为了弥合语义特征与几何特征之间的差距,纹理和结构语义特征分别通过专用的纹理-几何对齐和结构-几何对齐模块与几何特征对齐。我们进一步引入了语义引导的对应点调制模块,该模块利用语义信息调制几何特征,以抑制几何上合理但语义上不一致的对应点。在特征细化中,层次化语义增强的对应点细化策略逐步建模对应点依赖关系并重新校准多上下文特征响应,从而实现更可靠的内点-外点判别。大量实验证明了TriMatch的有效性、鲁棒性和泛化能力。

英文摘要

Two-view correspondence learning aims to distinguish true correspondences (inliers) from false ones (outliers) in image pairs by leveraging their underlying differences. Existing methods mainly rely on coordinate-based geometric consistency. However, they often struggle with pseudo-consistent outliers in scenes containing repetitive structures, textureless regions, or locally similar geometric patterns. To address this limitation, we propose TriMatch, a multi-source feature fusion framework for two-view correspondence learning, which consists of two parts: feature extraction and feature refinement. In feature extraction, TriMatch jointly extracts geometric, texture semantic, and structural semantic features to provide complementary evidence for correspondence discrimination. To bridge the gap between semantic and geometric features, texture and structural semantic features are aligned with geometric features through dedicated Texture-Geometric Alignment and Structural-Geometric Alignment modules, respectively. We further introduce a Semantic-Guided Correspondence Modulation module, which modulates geometric features using semantic information to suppress geometrically plausible but semantically inconsistent correspondences. In feature refinement, a Hierarchical Semantic-Enhanced Correspondence Refinement strategy progressively models correspondence dependencies and recalibrates multi-context feature responses, enabling more reliable inlier-outlier discrimination. Extensive experiments demonstrate the effectiveness, robustness, and generalization capability of TriMatch.

2606.09261 2026-06-09 cs.CV 新提交

Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

自监督学习至关重要:一种用于微手势识别的简单集成方案

Tingyi Liu, Kun Li, Fei Wang, Junjie Chen, Zhiliang Wu, Jihao Gu, Haixu Liu, Dan Guo

发表机构 * Hefei University of Technology(合肥工业大学) United Arab Emirates University(阿拉伯联合酋长国大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院) Anhui Evolution Technology Co., Ltd.(安徽进化科技有限公司) Nanyang Technological University(南洋理工大学) University College London(伦敦大学学院) The University of Sydney(悉尼大学) Beijing QBoson Quantum Technology Co., Ltd.(北京量子芯科技有限公司)

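AI summary: Proposes a framework that ensembles a self-supervised RGB model with supervised multi-stream models and ranked first in the micro-gesture classification track of the MiGA Challenge. Self-supervised pretraining improves performance, and the ensemble reaches 74.419% top-1 accuracy on the iMiGUE test set.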

详情
AI中文摘要

在本文中,我们介绍了XInsight Lab在IJCAI 2026第四届MiGA挑战赛微手势分类赛道中的解决方案,该方案排名第一并取得了新的最先进结果。我们提出了一种多模态集成框架,将基于自监督的RGB模型与先前解决方案中的监督多流模型相结合。自监督RGB模型通过掩码视频建模在12万个未标注片段上进行预训练,然后在iMiGUE上微调。这一简单而有效的RGB基线在iMiGUE测试集上达到了69.224%的top-1准确率,展示了从域内未标注视频中学习可迁移表示的好处。通过将该模型作为互补分支加入,最终集成模型达到了74.419%的top-1准确率,比之前的最先进结果高出1.206个百分点。在iMiGUE上的实验结果,包括对集成策略的消融研究,验证了自监督RGB表示学习在微手势识别中的有效性。

英文摘要

In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.
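The reported figures imply a previous state of the art of 74.419 - 1.206 = 73.213% top-1 accuracy. How a complementary branch can change an ensemble's decision is sketched below with weighted late fusion of class probabilities. The abstract does not describe the actual fusion rule, so the weights, probabilities, and function names are illustrative assumptions.

```python
def fuse(prob_lists, weights):
    """Weighted average of per-model class-probability vectors."""
    total_weight = sum(weights)
    n_classes = len(prob_lists[0])
    return [sum(w * probs[c] for w, probs in zip(weights, prob_lists))
            / total_weight
            for c in range(n_classes)]


def top1(probs):
    """Index of the highest-probability class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical outputs for one clip over three gesture classes.
ssl_rgb = [0.2, 0.7, 0.1]        # self-supervised RGB branch
multi_stream = [0.5, 0.3, 0.2]   # supervised multi-stream branch
fused = fuse([ssl_rgb, multi_stream], weights=[0.6, 0.4])
prediction = top1(fused)
```

Here the multi-stream branch alone would pick class 0, while the fused scores favour class 1, showing how a confident complementary branch can override a weaker vote.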

2606.09258 2026-06-09 cs.RO 新提交

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

回到熟悉的未来:通过预想里程碑选择实现VLA策略的故障恢复

Suyeon Shin, Juwon Kim, Hyeonbin Park, Hyunseo Kim, Hyundo Lee, Hyung-Sin Kim, Byoung-Tak Zhang

发表机构 * Seoul National University(首尔大学) Yonsei University(延世大学) Soongsil University(崇实大学)

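AI summary: Proposes the B2FF framework, which pre-generates milestones of familiar future states and selects one as the recovery goal, letting VLA policies recover robustly from off-trajectory states without fine-tuning. The success rate rises from 56.3% to 74.0%.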

详情
AI中文摘要

视觉-语言-动作(VLA)策略在操作过程中可能偏离标称轨迹,即使任务在物理上仍然可行。从这些偏离中恢复具有挑战性,因为它们将策略推入陌生的状态空间,直接重新规划常常会破坏动作序列的稳定性。我们提出“回到熟悉的未来”(B2FF),一种面向预见性VLA的恢复框架,利用未来视觉条件作为恢复接口。在执行前,VLA基于干净的初始观察生成一个由熟悉未来状态组成的里程碑库。在恢复时,一个可恢复性感知的选择器从该库中选择一个恢复里程碑,并将其强制作为固定的视觉目标。这使得VLA能够将偏离轨迹的观察稳健地映射回熟悉的未来。在注入故障的LIBERO数据集上,在受控的恢复时间与注入故障对齐的情况下,B2FF将基线VLA的平均成功率从56.3%提升至74.0%,证明预想里程碑可以在不微调底层动作生成器的情况下指导恢复。

英文摘要

Vision-language-action (VLA) policies can deviate from nominal trajectories during manipulation, even when tasks remain physically feasible. Recovering from these deviations is challenging, as they push the policy into unfamiliar state spaces where direct re-planning frequently destabilizes action sequences. We propose Back to the Familiar Future (B2FF), a recovery framework for foresight-driven VLAs that leverages future visual conditioning as a recovery interface. Before execution, the VLA generates a milestone bank of familiar future states conditioned on the clean initial observation. At recovery time, a recoverability-aware selector selects a recovery milestone from this bank and enforces it as a fixed visual goal. This enables the VLA to robustly map off-trajectory observations back to a familiar future. On failure-injected LIBERO, under controlled recovery timing aligned with the injected failure, B2FF increases the average success rate of a baseline VLA from 56.3% to 74.0%, demonstrating that pre-imagined milestones can guide recovery without fine-tuning the low-level action generator.

2606.09257 2026-06-09 cs.LG cs.AI stat.ML 新提交

BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

BSTabDiff: 用于高维表格数据生成的块-子单元扩散先验

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Gyawali, Gianfranco Doretto, Donald A. Adjeroh

发表机构 * West Virginia University(西弗吉尼亚大学) The University of Utah(犹他大学)

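AI summary: For high-dimensional, low-sample-size tabular data, proposes the BSTabDiff framework, which partitions features into latent blocks and generates each block from a shared low-dimensional subunit variable, combining diffusion priors with copula dependence for stable synthesis and controllable benchmark generation.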

详情
Comments
Published as a paper at the 2nd DeLTa Workshop, ICLR 2026
AI中文摘要

高维低样本量(HDLSS)表格领域(例如组学)的特点是 $n \ll m$,其中 $n$ = 样本数,$m$ = 特征数。此类领域通常表现出强局部相关组、稀疏跨组依赖、重尾非高斯边缘分布、异方差噪声和结构化缺失,使得在 $\mathbb{R}^m$ 中直接进行密度学习因 $n \ll m$ 而病态。我们提出 BSTabDiff,一种块-子单元生成框架,将 $m$ 个观测特征划分为 $M$ 个潜在块($M \ll m$),并通过共享的低维子单元变量生成每个块,将全局依赖学习集中在紧凑的块潜在空间 $\mathbb{R}^M$ 中,同时通过 copula 驱动的依赖、灵活的逐特征边缘分布和显式缺失机制解码到完整特征空间。BSTabDiff 支持块潜在上的现代深度先验,包括扩散和归一化流,从而在 HDLSS 场景中实现稳定合成和可控基准生成。实验表明,与 HDLSS 数据上的非结构化表格生成器相比,BSTabDiff 能产生更真实和稳定的高维合成数据。

英文摘要

High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ = number of samples, and $m$ = number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned since $n \ll m$. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable, concentrating global dependence learning in the compact block-latent space $\mathbb{R}^M$ while decoding to the full feature space with copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data when compared with unstructured tabular generators on HDLSS data.
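The block structure can be illustrated with a toy generator. This sketch replaces the diffusion prior with a standard normal prior on the $M$ block latents and replaces the copula decoder with a linear Gaussian decoder, so it shows only how block-level latents induce within-block correlation. The block assignment, loadings, and function names are hypothetical.

```python
import math
import random


def sample_row(block_of_feature, loadings, noise_std, rng):
    """Draw one latent per block, then decode every feature as
    loading * (its block's latent) + independent Gaussian noise."""
    n_blocks = max(block_of_feature) + 1
    z = [rng.gauss(0.0, 1.0) for _ in range(n_blocks)]
    return [loadings[j] * z[b] + rng.gauss(0.0, noise_std)
            for j, b in enumerate(block_of_feature)]


def corr(rows, i, j):
    """Pearson correlation between columns i and j."""
    n = len(rows)
    mi = sum(r[i] for r in rows) / n
    mj = sum(r[j] for r in rows) / n
    cov = sum((r[i] - mi) * (r[j] - mj) for r in rows)
    vi = sum((r[i] - mi) ** 2 for r in rows)
    vj = sum((r[j] - mj) ** 2 for r in rows)
    return cov / math.sqrt(vi * vj)


rng = random.Random(0)
block_of_feature = [0, 0, 0, 1, 1, 1]   # m = 6 features in M = 2 blocks
loadings = [1.0, 0.9, 0.8, 1.0, -0.9, 0.7]
rows = [sample_row(block_of_feature, loadings, 0.1, rng)
        for _ in range(2000)]
within_block = corr(rows, 0, 1)   # features sharing block 0
across_block = corr(rows, 0, 3)   # features in different blocks
```

Features in the same block come out strongly correlated while features in different blocks are nearly uncorrelated, which is the local-correlation-group structure the abstract describes for HDLSS data.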

2606.09255 2026-06-09 cs.RO 新提交

RPO-PDT: Demonstrating Role-Play-Based Knowledge Adaptation for Student Support Dialogue (Demonstration System)

RPO-PDT:展示基于角色扮演的知识适应用于学生支持对话(演示系统)

Filip Janik, Ewa Olton, Robert Smales, Harris Spratt, Shea Tait, Md Zia Ullah, Yanchao Yu

发表机构 * Edinburgh Napier University(爱丁堡龙比亚大学)

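AI summary: Proposes the RPO-PDT system, which uses retrieval augmentation and a role-play loop to deliver personalized student-support dialogue in higher education grounded in structured knowledge sources, while ensuring safety and adaptivity.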

详情
Comments
5 pages, 2 figures
AI中文摘要

我们提出RPO-PDT:一个基于检索、角色扮演的对话系统,用于高等教育中的自适应学生支持。RPO-PDT能够:(1)利用结构化知识源提供机构特定的个人发展导师(PDT)指导;(2)受明确的角色、边界、保密性和安全策略约束;(3)围绕反向角色扮演循环设计,其中未解决的交互从学生视角重放,从而生成替代的导师策略并存储为可重用的策略记忆。RPO-PDT支持基于文本和基于Furhat的具身交互,用于演示有据可依、安全且自适应的学生支持对话。

英文摘要

We present RPO-PDT: a retrieval-grounded, role-play-based dialogue system for adaptive student support in higher education. RPO-PDT is: (1) able to provide institution-specific Personal Development Tutor (PDT) guidance using structured knowledge sources; (2) constrained by explicit persona, boundary, confidentiality, and safety policies; and (3) designed around a reverse-roleplay loop where unresolved interactions are replayed from the student perspective, enabling alternative tutor strategies to be generated and stored as reusable strategy memory. RPO-PDT supports both text-based and Furhat-based embodied interaction for demonstrating grounded, safe, and adaptive student-support dialogue.

2606.09251 2026-06-09 cs.CL 新提交

TruthSplit: Operationalizing Conditional Validity in Arguments Through Multi-Perspective Reasoning

TruthSplit:通过多视角推理实现论证中的条件有效性操作化

Benjamin Stieger, Maximilian Terberger, Thomas Huber, Christina Niklaus

发表机构 * University of St. Gallen(圣加仑大学)

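AI summary: Proposes the TruthSplit system, which uses three-layer natural language inference and structured worldview profiles to analyze the conditional validity of arguments from different perspectives and to identify value conflicts and assumption gaps.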

详情
Comments
Demo paper. To appear at ACL 2026
AI中文摘要

我们提出TruthSplit,一个用于多视角论证分析的交互式系统。现有的论证工具通常分析论证本身的属性,如结构、质量、立场或说服力,而将特定视角的背景知识隐含起来。TruthSplit通过支持探索性分析来填补这一空白,即当通过世界观特定的价值观、假设和概念定义来解释时,同一主张如何导致不同的结论。我们将这种依赖于视角的分析称为条件有效性。给定输入的论证文本,TruthSplit提取主张和前提,应用三层自然语言推理(NLI)方法来评估逻辑和世界观特定的规范性一致性,并将大语言模型(LLM)推理条件化为编码核心价值观和决策原则的结构化世界观档案。然后,系统生成特定视角的解释,识别价值冲突和假设差距,并通过交互式分析界面可视化分歧。

英文摘要

We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation tools typically analyze properties of the argument itself, such as structure, quality, stance, or persuasiveness, while leaving perspective-specific background knowledge implicit. TruthSplit addresses this gap by supporting an exploratory analysis of how the same claim can lead to different conclusions when interpreted through worldview-specific values, assumptions, and conceptual definitions. We refer to this perspective-dependent analysis as conditional validity. Given an input argumentative text, TruthSplit extracts claims and premises, applies a three-layer natural language inference (NLI) approach to assess both logical and worldview-specific normative consistency, and conditions large language model (LLM) reasoning on structured worldview profiles that encode core values and decision principles. The system then generates perspective-specific interpretations, identifies value conflicts and assumption gaps, and visualizes divergence through interactive analytical interfaces.

2606.09249 2026-06-09 cs.CV 新提交

MAGIS: Evidence-Based Multi-Agent Reasoning for Interpretable Strabismus Clinical Decision-Making

MAGIS:基于证据的多智能体推理用于可解释的斜视临床决策

Xikai Tang, Yifan Wang, Jiafan Zhuang, Li Luo, Jinming Guo, Xiaoling Xie, Jiacheng Liu, Peiwei Wei, Lihao Zhong, Xiaoli Kang, Jie Cen, Guangqiang Yin, Kunliang Qiu, Ce Zheng, Zhun Fan

发表机构 * School of Information and Software Engineering, University of Electronic Science and Technology of China(电子科技大学信息与软件工程学院) Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China(电子科技大学深圳高等研究院) Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong(汕头大学·香港中文大学联合汕头国际眼科中心) School of Artificial Intelligence, Guangzhou City Polytechnic(广州城市职业学院人工智能学院) Medical College, Shantou University(汕头大学医学院) College of Engineering, Shantou University(汕头大学工学院) Department of Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine(上海交通大学医学院附属新华医院眼科) Shenzhen Loop Area Institute(深圳环路区域研究所)

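AI summary: Proposes the MAGIS framework, which uses multi-agent collaboration, a dual-evidence constrained context, and an evidence-based corrective verification mechanism to turn strabismus diagnosis from black-box generation into structured reasoning. On a fine-grained strabismus benchmark it raises the weighted F1 score from 72.0% to 91.3% and markedly improves the clinical reliability of diagnostic reports.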

详情
AI中文摘要

斜视是一种常见的眼部疾病,需要细粒度亚型诊断以制定个性化治疗方案。然而,现有的深度学习方法主要提供诊断预测,缺乏透明推理;而近期的大视觉语言模型(LVLMs)虽然在联合图像理解和报告生成方面有前景,但在这种对证据敏感且规则驱动的医学任务中极易产生幻觉。为解决这些问题,我们提出了MAGIS,一个基于证据的多智能体可解释斜视诊断推理框架。MAGIS将黑箱端到端生成转变为结构化的诊断过程,包括候选假设生成、双重证据约束上下文、基于证据的纠正验证和报告生成。具体而言,我们引入了双重证据约束上下文(DECC)机制,将来自九个注视方位照片的视觉证据和基于证据的临床诊断规则联合组织成约束上下文,以实现可靠的诊断推理。我们进一步开发了基于证据的纠正验证(EBCV)机制,验证当前诊断假设是否得到视觉证据、基于热图的视觉线索和基于证据的临床诊断规则的支持。当检测到不一致时,触发假设修正。在细粒度斜视基准上的实验表明,MAGIS不仅显著优于其他最先进的诊断系统,将加权F1分数从72.0%提高到91.3%,而且大幅提升了生成诊断报告的临床可靠性(一致性、对齐性和完整性)。这些结果表明,MAGIS为构建准确、基于证据且临床可解释的斜视诊断系统提供了有效解决方案。

英文摘要

Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatment planning. However, existing deep learning methods mainly provide diagnostic predictions without transparent reasoning, while recent large vision-language models (LVLMs), although promising for joint image understanding and report generation, remain highly prone to hallucination in this evidence-sensitive and rule-driven medical task. To address these challenges, we propose MAGIS, an evidence-based Multi-AGent reasoning for Interpretable Strabismus diagnosis framework. MAGIS transforms black-box end-to-end generation into a structured diagnostic process consisting of candidate hypothesis generation, dual-evidence constrained context, evidence-based corrective verification, and report generation. Specifically, we introduce a Dual-Evidence Constrained Context (DECC) mechanism that jointly organizes visual evidence from the photograph of the nine cardinal positions of gaze and evidence-based clinical diagnostic rules into a constrained context for reliable diagnostic reasoning. We further develop an Evidence-Based Corrective Verification (EBCV) mechanism that verifies whether the current diagnostic hypothesis is supported by visual evidence, heatmap-based visual cues, and evidence-based clinical diagnostic rules. Hypothesis refinement is triggered when inconsistency is detected. Experiments on a fine-grained strabismus benchmark demonstrate that MAGIS not only significantly outperforms other state-of-the-art diagnostic systems, improving the weighted F1 score from 72.0% to 91.3%, but also substantially improves the clinical reliability (consistency, alignment, and completeness) of generated diagnostic reports. These results demonstrate that MAGIS provides an effective solution for building accurate, evidence-based, and clinically interpretable strabismus diagnosis systems.
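For readers unfamiliar with the headline metric, weighted F1 averages the per-class F1 scores using each class's share of true samples as the weight. The sketch below computes it from scratch; the two subtype labels and the predictions are hypothetical and are not drawn from the paper's benchmark.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn          # number of true samples of this class
        score += (support / total) * f1
    return score


# Hypothetical labels: three cases of one subtype, one of another.
y_true = ["esotropia", "esotropia", "esotropia", "exotropia"]
y_pred = ["esotropia", "esotropia", "exotropia", "exotropia"]
result = weighted_f1(y_true, y_pred)
```

In this example the per-class F1 scores are 0.8 and 2/3 with supports 3 and 1, so the weighted F1 is 0.75 * 0.8 + 0.25 * 2/3 = 23/30. Weighting by support keeps frequent subtypes from being drowned out by rare ones, which matters for imbalanced clinical data.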

2606.09246 2026-06-09 cs.CV 新提交

SOMA: From Surface Observations to Muscle Anatomy

SOMA:从表面观察到肌肉解剖

Eduardo Alvarado, Emily Kim, Gerrit Nolte, Friedemann Runte, Mario Botsch, Marc Habermann, Christian Theobalt

发表机构 * Max Planck Institute for Informatics, Saarland Informatics Campus(马克斯·普朗克信息学研究所,萨尔兰信息学园区) TU Dortmund University(多特蒙德工业大学)

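AI summary: Proposes the SOMA model, which infers spatio-temporal muscle behavior from surface signals captured by multi-view RGB cameras, and builds the SKIM dataset. It is presented as the first attempt to recover muscle deformations from multi-view RGB data, offering a scalable, low-cost route to anatomical animation.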

详情
AI中文摘要

随着对逼真虚拟人类的需求日益增长,参数化人体模型已成为现代医学、体育和娱乐应用的基石。然而,大多数这些模型固有地存在局限性:它们仅捕捉皮肤的3D表面,无法洞察产生运动的复杂生物力学结构。随着更多应用向生物力学扩展,对超越皮肤的虚拟人类模型的需求日益明显。传统的软组织模拟(如FEM)准确但不可扩展,且对于大多数常见应用而言计算成本过高。或者,现有的生物力学工具可以模拟肌肉力和激活,但不模拟外部形状的变化,限制了激活与实际可观察解剖结构之间的相关性。这激发了一个新的逆向研究问题:直接从可见的表面观测(即从皮肤,从而从姿态)恢复肌肉变形。在这项工作中,我们提出了SOMA(从表面观察到肌肉解剖),一个从使用RGB相机获得的表面信号推断时空肌肉行为的个体特定模型,以及SKIM,一个个体特定的软组织变形数据集。据我们所知,这是首次尝试从多视角RGB数据恢复肌肉变形的方法。我们展示了我们的方法如何提供解剖学基础的动画,而无需传统模拟的复杂性,从而提供可扩展且成本效益高的解决方案。数据和代码已公开。

英文摘要

With the growing demand for realistic virtual humans, parametric body models have become a cornerstone of modern medicine, sports, and entertainment applications. However, most of these models are inherently limited: they only capture the 3D surface of the skin, offering no insight into the complex biomechanical structures that generate motion. As more applications expand towards biomechanics, the need for virtual human models that go beyond the skin has become increasingly evident. Traditional soft-tissue simulations, such as FEM, are accurate but non-scalable and too computationally expensive for most common applications. Alternatively, existing biomechanical tools can simulate muscular forces and activations, but do not model changes in external shape, restricting how activations correlate with actual observable anatomy. This motivates a novel inverse research problem: recovering muscle deformations directly from visible surface observations, i.e., from the skin and thus the pose. In this work, we present SOMA (from Surface Observations to Muscle Anatomy), a person-specific model that infers spatio-temporal muscle behavior from surface signals obtained using RGB cameras, and SKIM, a subject-specific soft-tissue deformation dataset. To the best of our knowledge, this is the first method that attempts to recover muscle deformations from multi-view RGB data. We show how our method provides anatomically grounded animations without the complexity of traditional simulations, leading to a scalable and cost-effective solution. Data and code are available.

2606.09245 2026-06-09 cs.CV cs.AI 新提交

Proposal Refinement for Few-Shot Object Detection

用于少样本目标检测的提议细化

Yuan Zeng, Bin Song, Jie Guo, Yuwen Chen

发表机构 * State Key Laboratory of Integrated Services Networks, Xidian University(西安电子科技大学综合业务网理论及关键技术国家重点实验室)

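AI summary: To address the uneven distribution of region proposals between base and novel classes in few-shot detection, proposes a stage-wise proposal refinement approach: a refinement loss in the base-training phase and a refinement branch in the fine-tuning phase rebalance the proposal distribution, yielding gains of 1% to 6% on benchmarks without increasing inference time.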

详情
AI中文摘要

近年来,少样本目标检测引起了广泛关注。一些优秀的算法已被提出以处理这一任务。然而,这些算法大多依赖于少样本分类的性能。与以往尝试不同,我们的工作聚焦于新类和基类之间区域提议分布不均的问题。为了缓解这种不平衡分布,我们针对不同训练阶段提出了提议细化方法。具体而言,在基类训练阶段设计了细化损失以增强模型对新类的敏感性,在微调阶段引入了细化分支作为RPN(区域提议网络)的辅助分支以生成更多新类提议。通过重新平衡提议分布,所提方法在现有基准上比基线方法提高了约1%~6%,且不增加任何推理时间。通过大量实验,我们证明了为少样本目标检测任务建立了一种新的最先进方法。

英文摘要

Few-shot object detection has gained wide attention in recent years, and some excellent algorithms have been proposed for the task. However, most of these algorithms rely on the performance of few-shot classification. Unlike previous attempts, our work focuses on the unbalanced distribution of region proposals between the novel classes and the base classes. To alleviate this imbalance, we propose a proposal refinement approach for the different training phases. Specifically, a refinement loss is designed for the base training phase to enhance the sensitivity of the model to novel classes, and a refinement branch is introduced as an auxiliary branch of the RPN (Region Proposal Network) to generate more novel-class proposals in the fine-tuning phase. By rebalancing the proposal distribution, the proposed approach outperforms the baseline methods by roughly 1% to 6% on current benchmarks without increasing inference time. Through extensive experiments, we show that the approach establishes a new state of the art for the few-shot object detection task.

2606.09237 2026-06-09 cs.RO cs.SY eess.SY 新提交

Can we stabilize an inverted pendulum with feedback from a time-of-flight camera?

我们能否利用飞行时间相机的反馈来稳定倒立摆?

Anthony Czubarow, Antonio Terpin, Raffaello D'Andrea

发表机构 * Institute for Dynamic Systems and Control, ETH Zürich(苏黎世联邦理工学院动态系统与控制研究所)

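AI summary: Shows that a low-cost, low-resolution time-of-flight camera provides enough feedback to balance an inverted pendulum on a cart reliably and precisely, challenging the common view that such cameras cannot support precise feedback control.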

详情
AI中文摘要

飞行时间相机在机器人领域广受欢迎,因为它们能直接提供深度信息,同时结构紧凑、成本低廉且对光照条件鲁棒,但其低空间分辨率和深度噪声被广泛认为无法用于精确反馈控制。在本文中,我们展示了一款低成本、低分辨率的飞行时间相机能够提供足够的反馈,以可靠且精确地平衡推车上的倒立摆——这是快速、不稳定动力学的典型基准。

英文摘要

Time-of-flight cameras are popular in robotics for providing direct depth information while being compact, inexpensive, and robust to lighting conditions, but their low spatial resolution and depth noise are widely believed to preclude precise feedback control. In this paper, we show that an inexpensive, low-resolution time-of-flight camera provides sufficient feedback to reliably and precisely balance an inverted pendulum on a cart, a canonical benchmark for fast, unstable dynamics.

2606.09219 2026-06-09 cs.CV astro-ph.IM 新提交

Semi-supervised Source Detection in Astronomical Images: New Benchmark and Strong Baseline

天文图像中的半监督源检测:新基准与强基线

Longhan Feng, Zihuang Cao, Ali Luo, Yuanhao Guo, Shuilian Yao, Yixin Guo, Qi Jia, Yu Liu

发表机构 * School of Software Dalian University of Technology(大连理工大学软件学院) National Astronomical Observatories Chinese Academy of Sciences(中国科学院国家天文台) Research Institute of Highway Ministry of Transport(交通运输部公路科学研究院)

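AI summary: To address the challenges of source detection in astronomical images, proposes the LAMOST-DET benchmark dataset and Nova Teacher, a semi-supervised learning framework that detects dense sources under sparse annotation through source light enhancement, confidence-guided pseudo-supervision, and cross-view complementary mining, improving mAP by 4.04% and 5.22%.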

详情
AI中文摘要

在现代观测天文学中,源检测是准确定位和识别恒星源的基石,对于恒星种群合成和宇宙学参数估计等研究至关重要。然而,天文图像的特征,包括高密度、点扩散函数效应和低信噪比,对最新的先进目标检测器提出了重大挑战。此外,由于在天文图像中标注密集、微小和暗弱的源存在显著困难,全监督检测方法几乎不实用。为了解决天文数据集的稀缺性,我们引入了一个新的综合基准(LAMOST-DET),包含18,400张天文图像和728,898个源实例。在该数据集上,我们进一步设计了一个新颖的半监督学习框架,称为Nova Teacher,能够在稀疏标注下有效检测密集源。它集成了光源增强模块、置信度引导的伪监督和跨视图互补挖掘,采用双教师范式。在LAMOST-DET上的大量实验表明,Nova Teacher在两种半监督设置下分别比之前的竞争者持续提高4.04%和5.22%的mAP。此外,我们的方法在自然图像数据集上与其他检测器竞争,验证了其在不同场景下的泛化能力。源代码可在https://github.com/AcWiz/NovaTeacher获取。

英文摘要

Source detection in modern observational astronomy is a cornerstone for localizing and identifying stellar sources accurately. It is crucial for studies such as stellar population synthesis and cosmological parameter estimation. However, the characteristics of astronomical images, including high density, point spread function effects, and low signal-to-noise ratios, significantly challenge the latest object detectors. Moreover, fully supervised detection methods are hardly practical because dense, small, and faint sources in astronomical images are very difficult to annotate. To tackle the scarcity of astronomical datasets, we introduce a new comprehensive benchmark (LAMOST-DET) comprising 18,400 astronomical images and 728,898 source instances. On this dataset, we further devise a novel semi-supervised learning framework, Nova Teacher, capable of detecting dense sources effectively given sparse annotations. It integrates a source light enhancement module, confidence-guided pseudo-supervision, and cross-view complementary mining in a dual-teacher paradigm. Extensive experiments on LAMOST-DET show that Nova Teacher consistently improves over previous competitors by 4.04% and 5.22% mAP under two semi-supervised settings. Additionally, our method is competitive with other detectors on a natural image dataset, validating its generalization ability across scenarios. The source code is available at https://github.com/AcWiz/NovaTeacher.

2606.09218 2026-06-09 cs.CV 新提交

Minimal Solvers for Full-DoF Motion Estimation from Asynchronous Differential SfM

全自由度运动估计的最小求解器:基于异步差分SfM

Shuo Pan, Banglei Guan, Bin Li, Zhenbao Yu, Zibin Liu, Zi Wang, Yang Shang, Qifeng Yu

发表机构 * College of Aerospace Science and Engineering, National University of Defense Technology(国防科技大学空天科学学院) The Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation(湖南省图像测量与视觉导航重点实验室)

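AI summary: Proposes a method for estimating full-DoF egomotion directly from asynchronous optical flow. It decouples the differential epipolar constraint, jointly recovers angular and linear velocities from at least five points, and designs the first algebraic minimal 5-point solver along with an accelerated version.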

详情
AI中文摘要

作为一种仿生智能传感器,事件相机以其高时间分辨率、低延迟和低功耗为特点,为时空信息的智能感知和视觉运动估计引入了新范式。然而,其异步数据流对传统的同步帧算法提出了重大挑战。为了解决这些挑战,本文提出了一种新颖的框架,直接从异步光流进行全自由度(DoF)自运动估计,特别针对角速度和线速度的联合恢复。我们将差分对极约束解耦为不同的角速度和线速度分量,并推导出其异步数据的公式。基于该公式,开发了一种优化算法,利用至少五个点实现全自由度自运动估计。此外,通过对旋转动力学应用一阶近似,我们将约束方程转化为多项式形式,从而得到了该公式的第一个代数最小5点求解器。为了确保高速场景下的实时性能,我们还提出了一种通过截断高阶角速度项实现的加速求解器。在合成和真实数据集上的广泛评估表明,异步方法优于传统的同步方法,特别是在对时空噪声的准确性和鲁棒性方面。我们相信,这项工作为高速机器人应用中高效且准确的连续时间运动估计奠定了关键基础。

英文摘要

As a bio-inspired intelligent sensor, event cameras have introduced a new paradigm in the intelligent perception of spatiotemporal information and visual motion estimation, characterized by their high temporal resolution, low latency, and minimal power consumption. However, their asynchronous data streams present significant challenges to traditional synchronous, frame-based algorithms. To address these challenges, this paper presents a novel framework for full degree of freedom (DoF) egomotion estimation directly from asynchronous optical flow, specifically targeting the joint recovery of angular and linear velocities. We decouple the differential epipolar constraint into distinct angular and linear velocity components, and derive its formulation for asynchronous data. Based on this formulation, an optimization algorithm is developed that enables full-DoF egomotion estimation leveraging at least five points. Furthermore, by applying a first-order approximation to rotational dynamics, we transform the constraint equations into a polynomial form, resulting in the first algebraic minimal 5-point solver for this formulation. To ensure real-time performance in high-speed scenarios, we additionally propose an accelerated solver achieved by truncating high-order angular velocity terms. Extensive evaluations on both synthetic and real-world datasets demonstrate that the asynchronous approach outperforms traditional synchronous methods, particularly in its accuracy and robustness to spatiotemporal noise. We believe that this work establishes a critical foundation for efficient and accurate continuous-time motion estimation in high-speed robotics applications.
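As background, the classical synchronous differential epipolar constraint can be checked numerically. For a 3D point $X$ moving relative to the camera as $\dot{X} = \omega \times X + v$, with normalized image point $x = X/Z$, it reads $\dot{x}^\top [v]_\times x + x^\top [\omega]_\times [v]_\times x = 0$. The sketch below verifies this textbook form only; the paper's asynchronous formulation and its solvers are not reproduced, and the motion convention, function names, and values are assumptions.

```python
import random


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(p * q for p, q in zip(a, b))


def epipolar_residual(point, omega, v):
    """Residual of the differential epipolar constraint for one 3D point
    (camera frame, Z > 0) with velocity X_dot = omega x X + v."""
    x_dot_3d = [c + t for c, t in zip(cross(omega, point), v)]
    z, z_dot = point[2], x_dot_3d[2]
    x = [c / z for c in point]                       # normalized image point
    flow = [(xd * z - c * z_dot) / (z * z)           # time derivative of X / Z
            for c, xd in zip(point, x_dot_3d)]
    v_cross_x = cross(v, x)
    return dot(flow, v_cross_x) + dot(x, cross(omega, v_cross_x))


rng = random.Random(1)
omega = [0.1, -0.2, 0.05]   # angular velocity (assumed values)
v = [0.3, 0.0, 1.0]         # linear velocity (assumed values)
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(2, 5)]
          for _ in range(5)]
residuals = [epipolar_residual(p, omega, v) for p in points]
```

Every residual vanishes up to floating-point error. Since the constraint is homogeneous in $v$, the linear velocity is recoverable only up to scale, which leaves five unknowns and is consistent with a five-point minimal problem.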

2606.09215 2026-06-09 cs.RO New submission

MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

Chinese title (translated): MotionWAM: Toward Foundation World Action Models for Real-Time Humanoid Whole-Body Manipulation

Jia Zheng, Teli Ma, Yudong Fan, Zifan Wang, Shuo Yang, Junwei Liang

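Affiliations: Mondo Robotics; HKUST (GZ) (The Hong Kong University of Science and Technology (Guangzhou)); HKUST (The Hong Kong University of Science and Technology)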

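AI summary: Proposes MotionWAM, a real-time world action model that uses a unified motion latent and whole-body motion tokens to drive autonomous humanoid loco-manipulation from a single camera, with an overall success rate on real-world tasks more than 30% higher than VLA baselines.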

Details
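AI-generated Chinese abstract (translated)

World Action Models (WAMs) couple a video dynamics prior with a policy and have shown encouraging results in tabletop manipulation, but iterative denoising over high-dimensional video-action latents makes them too slow for real-time humanoid loco-manipulation. The dominant hierarchical paradigm compounds the problem: a high-level manipulation policy controls only the upper body while a low-level controller tracks coarse base commands, which places the upper and lower body in inconsistent action spaces and reduces the legs to balance-preserving locomotion. We propose MotionWAM, a real-time WAM that drives autonomous humanoid loco-manipulation from a single egocentric camera by conditioning the policy on the intermediate denoising features of a video world model. MotionWAM replaces the upper-lower body split with a unified motion latent and predicts whole-body motion tokens that jointly cover locomotion, torso motion, height regulation, foot interaction, and hand manipulation in a single action space. A three-stage learning framework progressively adapts the video world model to egocentric visual dynamics and to the target humanoid embodiment. On nine real-world Unitree G1 tasks, MotionWAM runs in real time, exceeds Vision-Language-Action (VLA) baselines fine-tuned on the same demonstrations by more than 30% in overall success rate, and performs task-driven foot interaction that decoupled upper-lower body policies cannot achieve. Our results suggest that video-pretrained WAMs can be lifted from tabletop manipulation to coordinated, human-like whole-body humanoid control.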

English abstract

World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on tabletop manipulation, but iterative denoising over high-dimensional video-action latents leaves them too slow for real-time humanoid loco-manipulation. The problem is compounded by the dominant hierarchical paradigm, in which a high-level manipulation policy controls only the upper body while a low-level controller tracks coarse base commands -- placing upper and lower body in inconsistent action spaces and reducing the legs to balance-preserving locomotion. We present MotionWAM, a real-time WAM that drives autonomous humanoid loco-manipulation from a single egocentric camera by conditioning the policy on the intermediate denoising features of a video world model. MotionWAM replaces the upper-lower split with a unified motion latent and predicts whole-body motion tokens that jointly cover locomotion, torso motion, height regulation, foot interaction, and hand manipulation in a single action space. A three-stage learning framework progressively adapts the video world model to egocentric visual dynamics and to the target humanoid embodiment. On nine real-world Unitree G1 tasks, MotionWAM runs in real time, substantially outperforms Vision-Language-Action (VLA) baselines fine-tuned on the same demonstrations by over 30% in overall success rate, and executes task-driven foot interaction that decoupled upper-lower policies cannot reach. Our results suggest that video-pretrained WAMs can be lifted from tabletop manipulation to coordinated, human-like whole-body humanoid control.