arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.11841 2026-06-11 cs.CV New submission

Scene-Adaptive Nonlinear Tone Curves for Pseudo Ground-Truth Generation in Low-Light 3D Gaussian Splatting

Mingzhe Lyu, Jinqiang Cui, Hong Zhang

Affiliations: Southern University of Science and Technology; Pengcheng Laboratory

AI summary: Addresses pseudo ground-truth generation for low-light 3D reconstruction with a scene-adaptive nonlinear tone-curve framework. Two curves (ASE and AP3) replace the linear gain and improve PSNR by up to 4.34 dB across three benchmarks.

Abstract

Low-light novel view synthesis is challenging because dark multi-view images contain noise, weak structural detail, and compressed dynamic range. Recent 3D Gaussian Splatting (3DGS) methods address these challenges by generating pseudo ground-truth (pseudo-GT) images as supervision targets when paired normal-light references are unavailable. Existing pseudo-GT methods apply a uniform linear gain to all pixels, which clips bright regions while providing insufficient enhancement in dark regions, limiting reconstruction quality. We observe that nonlinear tone mappings, long established in 2D low-light enhancement, have not been explored for pseudo-GT generation in 3D reconstruction. Accordingly, we propose a scene-adaptive nonlinear tone-curve framework that replaces linear pseudo-GT with nonlinear alternatives. The framework introduces percentile-based normalisation for scene-agnostic curve application, a scene-adaptive offset for automatic black-level adjustment, and two complementary curves: Adaptive SoftExp (ASE), a bounded exponential curve, and Adaptive Poly3 (AP3), a data-driven cubic polynomial. The module changes only the pseudo-GT computation and leaves the 3DGS backbone unchanged. Experiments on three benchmarks covering 21 scenes show that both curves consistently outperform the linear baseline, with PSNR improvements of up to +4.34 dB on LOM and +3.25 dB on RealX3D. Both curves achieve similar performance despite their different mathematical forms, suggesting the improvement is curve-agnostic. Code is publicly available.
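
To make the contrast in this abstract concrete, the following minimal Python sketch compares a clipped uniform linear gain with a bounded exponential curve applied after percentile normalisation. The curve form `(1 - exp(-k*x)) / (1 - exp(-k))`, the 99th-percentile white point, the gain of 8, and every function name are assumptions chosen for illustration; they are not the paper's ASE or AP3 definitions, which the abstract does not give.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def linear_gain(pixels, gain):
    """Uniform linear gain, clipped to [0, 1] (the baseline behaviour)."""
    return [min(1.0, p * gain) for p in pixels]

def soft_exp_curve(pixels, k=4.0, q=99.0):
    """Hypothetical bounded exponential curve after percentile normalisation."""
    white = percentile(pixels, q) or 1.0   # fall back to 1.0 for an all-black input
    out = []
    for p in pixels:
        x = min(1.0, p / white)            # scene-agnostic input in [0, 1]
        out.append((1 - math.exp(-k * x)) / (1 - math.exp(-k)))
    return out

dark_image = [0.01, 0.02, 0.05, 0.10, 0.20]   # hypothetical pixel intensities
print(linear_gain(dark_image, 8.0))            # brightest pixel is clipped at 1.0
print(soft_exp_curve(dark_image))              # dark pixels lifted, output bounded
```

In this toy input the linear gain saturates the brightest pixel while leaving the darkest one near 0.08, whereas the bounded curve lifts the darkest pixel further and stays within [0, 1] by construction.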

2606.11838 2026-06-11 cs.CV New submission

Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding

Hyomin Kim, Junghye Kim, Joanie Hayoun Chung, Yoonjin Oh, Kyungjae Lee, Sungbin Lim, Sungwoong Kim

Affiliations: Korea University

AI summary: Proposes SG-PVR, a video reward model that uses plan-and-verify reasoning and spatio-temporal scene graphs to systematically verify every condition in the prompt, achieving fine-grained semantic alignment and improving compositional alignment in text-to-video generation.

Abstract

Reward models for text-to-video (T2V) generation guide post-training but often fail at fine-grained semantic alignment. We trace this to two structural weaknesses in existing reasoning-based reward models: they do not systematically verify every condition described in the prompt, and the visual evidence supporting each judgment remains implicit in their free-form reasoning. We propose SG-PVR, a video reward model that addresses these limitations through plan-and-verify reasoning grounded in spatio-temporal scene graphs. The verification plan decomposes the prompt into atomic claims, ensuring every requirement is checked. The spatio-temporal scene graph, encoding entities, attributes, and temporally-grounded relations, is extracted from the video and maintained as a persistent structured visual reference throughout reasoning. Each claim is verified against both the video and the scene graph, anchoring judgments in explicit visual evidence. SG-PVR achieves strong performance on semantic alignment, including fine-grained temporal semantics. As a test-time reranker, it further enhances compositional alignment in T2V generation.

2606.11837 2026-06-11 cs.CV cs.AI New submission

LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

Liwen Yi, Xianlin Zhang, Yue Zhang, Yue Ming, Xueming Li

Affiliations: Beijing University of Posts and Telecommunications

AI summary: Proposes LASA, which aggregates Vision Transformer attention maps across layers to perform open-vocabulary semantic segmentation of scene sketches under weak supervision, substantially improving segmentation accuracy and spatial coherence.

Abstract

Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon Layer-wise Accumulated Structural Attention (LASA), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by +3.43, +8.01, and +15.74, respectively, over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.

2606.11836 2026-06-11 cs.SD cs.AI eess.AS New submission

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

Haoning Xu, Zhaoqing Li, Huimeng Wang, Youjun Chen, Chengxi Deng, Mengzhe Geng, Xunying Liu

Affiliations: The Chinese University of Hong Kong; National Research Council Canada

AI summary: Proposes a data-free, training-free compression method based on k-means channel clustering, with fine-grained mixed-sparsity pruning via a layer-varying number of parameter clusters. It reduces WER relative to magnitude-based pruning on HuBERT-large and Whisper-large-v3.

Comments: Accepted by Interspeech 2026

Abstract

This paper presents a novel data-free and training-free compression approach for speech foundation models using channel-wise clustering via k-means. More fine-grained, mixed-sparsity pruning, in which the number of parameter clusters varies by layer, is also explored. Experiments on the LibriSpeech dataset suggest that, at a pruning sparsity of 50% on HuBERT-large, the approach obtains consistent WER reductions over magnitude-based pruning of 27.73%/18.61% absolute (34.37%/21.91% relative) on the test-clean and test-other subsets before fine-tuning, and of 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning for only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) over magnitude-based pruning were observed on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
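
As a rough illustration of channel-wise clustering with k-means, the sketch below clusters the rows (channels) of a small weight matrix and replaces each channel with its cluster centroid, so only `k` distinct channels need to be stored. It needs neither data nor training, in the spirit of the abstract, but the initialisation, the use of centroids as replacements, and all names are assumptions; the abstract does not say how the authors map clusters to a pruning sparsity.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_channels(weights, k, iters=20):
    """Cluster the rows (channels) of `weights`; return (centroids, labels)."""
    centroids = [list(row) for row in weights[:k]]  # deterministic init (assumption)
    labels = [0] * len(weights)
    for _ in range(iters):
        # Assignment step: nearest centroid for each channel.
        labels = [min(range(k), key=lambda c: sq_dist(row, centroids[c]))
                  for row in weights]
        # Update step: mean of the channels assigned to each centroid.
        for c in range(k):
            members = [row for row, lab in zip(weights, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def compress(weights, k):
    """Replace every channel by its cluster centroid."""
    centroids, labels = kmeans_channels(weights, k)
    return [centroids[lab] for lab in labels]

# Hypothetical 4-channel layer containing two near-duplicate pairs.
W = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(compress(W, 2))
```

With `k=2` the two near-duplicate pairs collapse onto two shared centroids, which is the redundancy this kind of compression exploits.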

2606.11833 2026-06-11 cs.LG q-bio.NC New submission

Flow Matching with In-Context Priors for Out-of-Distribution Brain Dynamics

Sam Gijsen, Michał Łukomski, Marc-André Schulz, Kerstin Ritter

Affiliations: Hertie Institute for AI in Brain Health, University of Tübingen; Tübingen AI Center, University of Tübingen; Charité – Universitätsmedizin Berlin, Department of Psychiatry and Psychotherapy; German Center for Mental Health (DZPG), partner site Tübingen

AI summary: Proposes a per-timestep conditioned diffusion transformer that injects compositional language and optional spatial priors to generate fMRI brain dynamics zero-shot for unseen cognitive tasks, supporting counterfactual neuroscience.

Comments: Code and pretrained models are publicly available.

Abstract

Flow matching and diffusion models enable conditional generation across domains ranging from images to proteins, with recent extensions to out-of-distribution contexts. Yet generative models of neural time series have largely remained restricted to categorical conditioning, precluding compositional and zero-shot generalization. In this work, we propose a per-timestep conditioned diffusion transformer for generating realistic fMRI brain dynamics during unseen cognitive tasks by injecting both compositional language and optional spatial priors in-context. Such zero-shot generation could enable counterfactual neuroscience by supporting in-silico design and evaluation of novel cognitive experiments before empirical validation. Leveraging this model, we evaluate across hundreds of held-out task conditions and characterize predictive performance in relation to the training manifold. From language alone, the model recovers region-specific recruitment across tasks and held-out spatial activation patterns. Spatial priors, when available, complement the text pathway by anchoring generation in regions of task space where language alone degrades, while retaining the compositional structure needed for counterfactual task specification. To our knowledge this is the first generative model of whole-cortex fMRI dynamics for unseen cognitive tasks, advancing counterfactual neuroscience and data-driven experimental design.

2606.11831 2026-06-11 cs.LG cs.AI New submission

From Uniform to Learned Graph Priors: Diffusion for Structure Discovery

Qi Shao, Hao Guo, Jiawen Chen, Duxin Chen, Wenwu Yu

Affiliations: School of Mathematics, Southeast University

AI summary: Proposes Diff-prior, a diffusion-parameterized adaptive prior that applies a learnable denoising-style structured calibration to edge posteriors, improving the reliability of structure discovery in neural relational inference methods.

Comments: 15 pages, 3 figures; accepted by KDD 2026

Abstract

Neural relational inference (NRI) methods discover interaction graphs from trajectories through variational inference over discrete latent edges. However, these methods typically rely on oversimplified, factorized graph priors. Such priors, typically close to uniform, treat edges as independent entities. This systematic mismatch with real-world systems yields diffuse and indecisive edge posteriors, limiting the reliability of structure discovery. To address this, we propose Diff-prior, a diffusion-parameterized adaptive prior used to calibrate the latent graph distribution rather than to generate graphs. Our core insight is to reframe prior integration as a learnable denoising-style calibration, trainable with the diffusion model, that organizes scattered, uncertain edge posteriors into a more reliable overall structure. Diff-prior learns an adaptive structure prior that performs structured calibration on the edge posteriors during inference, guiding them towards a distribution closer to the underlying structure. Diff-prior operates before structural sampling and acts as a denoising calibrator directly on the encoder edge distribution, which provides a generic training paradigm over structured variables. Experiments on standard benchmarks validate our framework: the results indicate that Diff-prior improves structure-inference performance and yields more decisive edge posteriors across multiple NRI-family architectures. The code is publicly available.

2606.11830 2026-06-11 cs.AI New submission

Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

Qianyu Yao, Fei Sun, Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen, Wenjie Xu, Bo li, Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang

Affiliations: AIPOCH PTE. LTD.

AI summary: Using a non-small cell lung cancer immunotherapy biomarker task, this study evaluates whether skill-augmented AI agents produce higher-quality transcriptomic research-analysis outputs than native AI. The quality signal was directional but did not reach statistical significance.

Abstract

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.
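
For readers unfamiliar with the interval reported in the Results, here is a minimal sketch of a percentile bootstrap confidence interval for a difference in group means. The ratings are hypothetical placeholders and are not the study's data; the resampling scheme (independent resampling within each group, 5000 draws, fixed seed) is also an assumption, since the abstract does not describe the procedure.

```python
import random
from statistics import mean

def bootstrap_diff_ci(a, b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each group."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return mean(a) - mean(b), (lo, hi)

# Hypothetical ratings for illustration only, NOT the study's data.
skill = [7, 5, 6, 6, 4, 6, 5, 7]
native = [5, 4, 6, 5, 5, 4]
diff, (lo, hi) = bootstrap_diff_ci(skill, native)
print(f"difference={diff:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

An interval that includes zero, as the abstract reports for both reviewer types, means the resampled differences do not consistently favour one group.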

2606.11828 2026-06-11 cs.SD cs.AI cs.CR cs.MM New submission

Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions

Haiyun Li, Shuhai Peng, Zhisheng Zhang, Jingran Xie, Xiaofeng Xie, Hanyang Peng, Zhiyong Wu

Affiliations: Shenzhen International Graduate School, Tsinghua University; Shenzhen Key Laboratory of Intelligent Media and Content Understanding; Tencent AI Lab

AI summary: Proposes a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, raising watermark energy while preserving imperceptibility and strengthening robustness to speech reconstruction models.

Comments: Accepted by ICME 2026

Abstract

Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction models. Improving robustness is challenging due to the inherent robustness-fidelity trade-off in existing designs, where increasing watermark energy improves robustness but reduces fidelity. To address this problem, we propose a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, allowing higher watermark energy to improve robustness while preserving imperceptibility. We use a pretrained speech codec to generate a pseudo-speech watermark and fuse it into the spectrogram of the input audio, with VAD loss and perceptual losses guiding embedding within voiced regions. Experiments show that our method maintains imperceptibility comparable to existing approaches while substantially improving robustness under both seen and unseen speech reconstruction models.

2606.11826 2026-06-11 cs.RO New submission

Modular Anthropomorphic Hand Design via Multi-Parameter Finger Benchmarking and Selection

Yu Zhang, Huijiang Wang, Josie Hughes

Affiliations: The CREATE Lab, Institute of Mechanical Engineering, Swiss Federal Institute of Technology in Lausanne (EPFL)

AI summary: Proposes a modular anthropomorphic hand design framework that optimizes finger designs through multi-parameter benchmarking to improve whole-hand performance, validated on multi-object grasping and light bulb screwing tasks.

Comments: 14 pages, 13 figures. Submitted to an IEEE journal for possible publication

Abstract

Designing anthropomorphic dexterous robotic hands remains challenging because the design space spans morphology, actuation, and sensing properties, and performance metrics include both task-dependent and task-agnostic measures. Existing optimization methods are often unstructured or consider only a single performance metric, limiting systematic comparison and targeted refinement. While the design considerations of the entire hand are significant, the individual finger properties play a key role in dexterity. By developing a robotic hand platform where fingers can be modularly integrated into a full teleoperated hand, we propose that optimizing the fingers can significantly improve overall hand performance. This approach enables rapid screening of different finger-level prototypes through a number of quantitative benchmarks before their integration into the hand for task-level validation. Candidate finger designs (incorporating variations in joint, bone, skin, and sensor placement) are assessed using both mechanism-oriented and task-relevant metrics, which establish a quantitative link between component design and full hand embodiment. The framework is validated through the development of an anthropomorphic robotic hand with optimized fingers, demonstrating how these fingers enable performance improvements across tasks, including multi-object grasping and light bulb screwing.

2606.11818 2026-06-11 cs.RO New submission

Human-Guided Co-Manipulation of Carbon Fiber Plies

Rami Ojanen, James Fant-Male, Roel Pieters

Affiliations: Automation Technology and Mechanical Engineering, Tampere University

AI summary: To address the difficulty of automating the handling of flexible materials, this paper proposes a multimodal approach that combines speech commands, vision-based wrist tracking, and force/compliant control for efficient human-robot co-manipulation of carbon fiber plies.

Comments: Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

Abstract

The handling of flexible materials is a difficult task to fully automate due to the challenges caused by the deformability of these types of objects. Meanwhile, a fully manual process can be ergonomically challenging, tedious and inefficient. Thus, human-robot collaboration (HRC) and cooperative manipulation (co-manipulation) have received increasing interest in this field as they enable human involvement when needed while also improving productivity. To enable efficient co-manipulation and interaction between the human operator and the robot, different modalities and control methods are required. In this paper, we present and examine different control methods for co-manipulation of carbon fiber plies, evaluating the pros and cons of each method in a controlled setting. We propose that a multimodal combination of speech commands, wrist-tracking through vision, and force with compliant control would provide the best solution for complete and intuitive control of the task.

2606.11816 2026-06-11 cs.CL cs.AI New submission

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

Yizhou Chi, Eric Chamoun, Zifeng Ding, Andreas Vlachos

Affiliations: Department of Computer Science and Technology, University of Cambridge

AI summary: Proposes the WorldReasoner framework, which evaluates the event-forecasting ability of language-model agents along three axes (outcome quality, evidence quality, and reasoning quality against causal event graphs) and finds that temporally valid retrieval is the strongest driver of outcome accuracy.

Abstract

Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model may be correct by recalling memorized training facts, citing fabricated evidence, or producing an unsupported causal story. We present WorldReasoner, an evaluation framework for temporally valid event forecasting. Each task gives an agent a resolved forecasting question, a simulated forecast date, and access only to evidence available before that date; after resolution, the framework scores the submitted probability, cited evidence, and optional causal event graph. WorldReasoner reports three complementary axes: outcome quality against resolved answers, evidence quality over cited sources, and reasoning quality against post-resolution hindsight graphs. The benchmark is built by an agentic construction pipeline that generates forecasting questions, collects time-stamped evidence, and builds hindsight reference graphs at scale, yielding 345 resolved tasks derived from 14,141 articles with graphs covering 8,087 extracted events. Across six controlled agent settings, temporally valid retrieval is the strongest driver of outcome accuracy; causal graph construction improves key-event recovery; and correct graph-enabled forecasts are more strongly grounded in key events and relevant sources, yet agents still struggle to convert grounded evidence into calibrated probabilities.
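
Two ideas in this abstract lend themselves to a short sketch: restricting an agent to evidence dated before the simulated forecast date, and scoring a submitted probability against the resolved outcome. The abstract does not name its scoring rule, so the Brier score below is an assumption (it is a common proper scoring rule), and the article records and function names are hypothetical.

```python
from datetime import date

def visible_evidence(articles, forecast_date):
    """Keep only articles published strictly before the simulated forecast date."""
    return [a for a in articles if a["published"] < forecast_date]

def brier_score(forecasts):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in forecasts) / len(forecasts)

# Hypothetical articles and forecasts, for illustration only.
articles = [
    {"title": "early report", "published": date(2025, 1, 10)},
    {"title": "post-resolution recap", "published": date(2025, 3, 2)},
]
usable = visible_evidence(articles, forecast_date=date(2025, 2, 1))
print([a["title"] for a in usable])

confident = [(0.9, 1), (0.1, 0), (0.8, 1)]   # (probability, resolved outcome)
hedged = [(0.5, 1), (0.5, 0), (0.5, 1)]
print(brier_score(confident), brier_score(hedged))
```

Lower Brier scores are better, so well-calibrated confident forecasts beat uniform 0.5 hedging; the date filter is what keeps post-resolution articles from leaking into the agent's evidence.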

2606.11806 2026-06-11 cs.CL New submission

External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs

Lin Sun, Heming Zhang, Xiangzheng Zhang

Affiliations: Qiyuan Tech

AI summary: Studies the quality-cost trade-off of injecting external experience into production LLM systems. It finds that selective retrieval outperforms global injection, that retrieval quality matters more than increasing Top-K, and that cost-benefit varies with the task's output length.

Abstract

Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost under realistic constraints. Injecting external experience can improve task quality, yet it also increases prompt burden, latency, and serving pressure. We study external experience serving as a deployment-oriented quality-cost trade-off problem. We evaluate this question in a real production moderation setting, with tool-use and GPQA as supporting contrast tasks that expose different output-cost regimes. We compare no-experience baselines, random experience controls, global prompt injection, and retrieval-based selective injection, and analyze both task quality and serving cost. The results show that, once experience becomes case-dependent, selective retrieval provides a stronger operating point than unconditional global injection. They further show that retrieval quality matters more than simply increasing Top-K, and that the same serving policy can exhibit substantially different cost-benefit profiles across short-output and decode-heavy regimes. These findings suggest that external experience is best treated as a selective, cost-aware serving decision rather than as a universal add-on. Overall, in the settings studied here, external experience pays off only when both the serving interface and the task-specific cost structure make its quality gains worth the online cost.
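
The difference between global prompt injection and retrieval-based selective injection can be sketched in a few lines. Everything here is an assumption made for illustration: the token-overlap (Jaccard) score stands in for whatever retriever the authors use, and the threshold, `top_k` value, prompt template, and experience notes are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity, a stand-in for a real retriever's score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def select_experience(query, store, top_k=2, min_score=0.2):
    """Return at most top_k stored notes whose score clears min_score."""
    scored = sorted(((jaccard(query, note), note) for note in store), reverse=True)
    return [note for score, note in scored[:top_k] if score >= min_score]

def build_prompt(query, notes):
    """Prepend the given notes to the task; add nothing when none qualify."""
    if not notes:
        return f"Task: {query}"
    bullet_list = "\n".join(f"- {n}" for n in notes)
    return f"Relevant experience:\n{bullet_list}\n\nTask: {query}"

# Hypothetical experience store and query.
store = [
    "refund requests over 30 days need manager approval",
    "spam links in comments should be removed",
    "profanity in usernames is not allowed",
]
query = "should spam links in this comment be removed"

selective = build_prompt(query, select_experience(query, store))
global_injection = build_prompt(query, store)
print(len(selective), len(global_injection))
```

The score threshold matters as much as `top_k` here: raising `top_k` alone would admit weakly related notes and lengthen the prompt, which is one way to read the abstract's point that retrieval quality outweighs a larger Top-K.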

2606.11805 2026-06-11 cs.CV cs.AI New submission

TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization

Zixiong Hao, Zhencun Jiang

Affiliations: Technical University of Munich; Tongji University; Shanghai Research Institute for Intelligent Autonomous Systems

AI summary: Proposes the TextHOI-3D framework, which links text-conditioned generation and geometric recovery through a discrete multi-view representation to produce text-driven 3D hand-object meshes, substantially reducing object Chamfer distance and penetration volume.

Comments: 11 pages, 8 figures, 3 tables

Abstract

Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging: the output must preserve language semantics, cross-view consistency, object geometry, articulated hand shape, and physically plausible contact. We present TextHOI-3D, a staged framework that uses generated multi-view observations as an explicit interface between text-conditioned visual generation and geometry-aware hand-object recovery. TextHOI-3D learns a compact VQ token space for fixed-camera hand-object observations, predicts multi-view visual tokens from text with a CLIP-conditioned visual autoregressive model, and recovers a unified hand-object mesh through prior initialization, multi-view joint optimization, and anti-penetration refinement. The design separates semantic generation from geometric recovery while keeping both stages connected by a discrete multi-view representation. On HO3D-derived evaluations, the multi-view setting reduces object Chamfer distance (CD) from 17.26 mm to 4.92 mm and penetration volume from 5.3721 cm³ to 0.2193 cm³ compared with a single-view counterpart, while improving hand errors and surface F-scores. These results support multi-view visual tokens as an effective intermediate representation for text-driven 3D hand-object mesh creation.
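
The object CD figures above are read here as Chamfer distance, as the listing's own summary indicates. A minimal sketch of one common variant follows: the symmetric sum of mean nearest-neighbour Euclidean distances between two point sets. Chamfer definitions differ across papers (squared versus unsquared distances, sum versus average of the two directions), so this particular form and the toy point sets are assumptions rather than the paper's evaluation code.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Hypothetical point sets sampled from a predicted and a reference surface.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, reference))
```

This brute-force version costs O(|A| x |B|) distance evaluations; real evaluations on dense meshes typically use a spatial index, which is omitted here for brevity.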

2606.11804 2026-06-11 cs.AI cs.CR cs.LG New submission

Toward Trustworthy AI: Multi-Target Adversarial Attacks and Robust Defenses for Continuous Data Summarization

Yuefang Lian, Longkun Guo, Zhongrui Zhao, Zhigang Lu, Yanan Cai, Shuchao Pang, Dachuan Xu, Jason Xue

Affiliations: Nankai University; James Cook University; Western Sydney University; Beijing University of Technology; Fuzhou University; Nanjing University of Science and Technology; CSIRO's Data 61; The University of Adelaide

AI summary: Studies adversarial attacks on continuous data summarization under similarity-level perturbations via DR-submodular optimization, and proposes approximation algorithms for multi-target attack generation and robust defense. Experiments show the attack is effective and the defense improves the robustness-mitigation trade-off.

Comments: Submitted to IEEE Transactions on Information Forensics and Security (IEEE TIFS)

Abstract

Trustworthy AI requires reliable data-processing pipelines, not only robust downstream predictive models. As an upstream component, data summarization determines which information is retained and passed to subsequent learning or decision modules. Therefore, adversarial perturbations to the summarization process can compromise trustworthy AI in an upstream manner: they may alter the selected summary, reduce its representativeness, and further degrade the utility of subsequent learning tasks. In this paper, we study adversarial attacks on continuous data summarization under similarity-level perturbations through DR-submodular optimization. We show that a class of multi-resolution image summarization objectives can be formulated as multilinear extensions of non-negative submodular set functions and satisfy DR-submodularity with $m$-weak monotonicity. We then formulate multi-target attack generation as a min-max problem, where one admissible perturbation of the similarity structure is optimized to degrade multiple target summarization models. To mitigate such perturbations, we formulate robust defense against mixed attack types as a regularized max-min problem. For both problems, we develop approximation algorithms with theoretical guarantees. Experiments on real-data and controlled clustered benchmarks show that the proposed attack is effective in representative low-to-moderate budget regimes and can induce downstream task-performance loss. The proposed defense improves the robustness-mitigation trade-off in structured settings, while also revealing the parameter sensitivity of robust protection on real data.
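
For reference, the two standard notions this abstract relies on can be written out as follows. These are the textbook definitions of the multilinear extension and of DR-submodularity on the unit cube; the symbols $V$, $f$, $F$, $\mathbf{e}_i$, and $k$ are notation assumed here, and the paper's $m$-weak monotonicity condition and its min-max and max-min formulations are not reproduced because the abstract does not state them.

```latex
% Multilinear extension of a set function f : 2^V -> R_{>= 0}
F(\mathbf{x}) = \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{j \in V \setminus S} (1 - x_j),
\qquad \mathbf{x} \in [0,1]^{V}.

% DR-submodularity (diminishing returns) of F on [0,1]^V
F(\mathbf{x} + k\,\mathbf{e}_i) - F(\mathbf{x}) \;\ge\; F(\mathbf{y} + k\,\mathbf{e}_i) - F(\mathbf{y})
\quad \text{for all } \mathbf{x} \le \mathbf{y},\; i \in V,\; k \ge 0
\text{ with } \mathbf{x} + k\,\mathbf{e}_i,\ \mathbf{y} + k\,\mathbf{e}_i \in [0,1]^{V}.
```

Here $F(\mathbf{x})$ is the expected value of $f$ on a random set that includes each element $i$ independently with probability $x_i$, and the multilinear extension of a submodular $f$ is known to be DR-submodular.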

2606.11797 2026-06-11 cs.LG 新提交

Space-sampled Value Decay: Forgetting Mechanisms for Non-stationary Deep Reinforcement Learning

空间采样值衰减:非平稳深度强化学习的遗忘机制

Felix Störck, Fabian Hinder, Barbara Hammer

发表机构 * CITEC, Faculty of Technology, Bielefeld University(比勒费尔德大学技术学院CITEC)

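AI summary: Inspired by forgetting behavior in rodents, proposes Space-sampled Value Decay as an explicit forgetting mechanism that helps deep reinforcement learning cope with environment drift, and examines its benefits and limitations on DQN and SAC.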

详情
Comments
Accepted at The 2nd Workshop on Epistemic Intelligence in Machine Learning, EIML@ICML 2026 (non-archival)
AI中文摘要

对小鼠等啮齿动物的研究表明,即使没有提供关于变化的信息(不确定性),它们也能适应环境参数的变化(“漂移”)——这种行为可以通过遗忘机制建模。非平稳强化学习(NSRL)致力于改进最先进的强化学习方法以应对变化的环境:然而,这些方法通常需要关于漂移的(部分)完美信息,如“任务ID”或“上下文”。为了减轻漂移的影响,本文开发了空间采样值衰减,作为基于值的深度强化学习架构的一种显式遗忘机制,这是一种简单而有效的方法。特别地,我们展示并讨论了在非平稳环境中评估深度Q网络(DQN)和软演员-评论家(SAC)的修改时,在获得的回报方面的积极效果以及局限性。

英文摘要

Studies on rodents such as mice have shown that they can adapt their behavior to changing parameters ("drift") of the environment even when no information about the change is provided (uncertainty), a behavior that can be modeled by forgetting mechanisms. Non-stationary Reinforcement Learning (NSRL) deals with adapting state-of-the-art RL methods to changing environments; these methods, however, usually require (partially) perfect information about the drift, such as "task IDs" or "context". To mitigate the effects of drift, this work develops Space-sampled Value Decay, a simple yet effective explicit forgetting mechanism for value-based deep RL architectures. In particular, we demonstrate and discuss positive effects, but also limitations, in the returns achieved by modified Deep Q-Networks (DQN) and Soft Actor-Critic (SAC) when evaluated on non-stationary environments.

2606.11793 2026-06-11 cs.LG cs.AI physics.ao-ph 新提交

AI4Land: Scalable Deep Learning for Global High-Resolution Land Use Reconstruction

AI4Land: 面向全球高分辨率土地利用重建的可扩展深度学习

Amirpasha Mozaffari, Marina Castaño, Stefano Materia, Etienne Tourigny, Oscar Molina-Sedano, Jordi Varela-Agrelo, Dario Garcia-Gasulla, Miguel Castrillo Melguizo, Mario Acosta, Amanda Duarte

发表机构 * Barcelona Supercomputing Center(巴塞罗那超级计算中心)

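AI summary: Proposes the AI4Land framework, a two-phase U-Net approach that combines coarse-resolution scenario data with static geophysical features to reconstruct high-resolution annual land use and land cover, aiming to reduce terrestrial carbon-cycle uncertainty and support climate simulations.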

详情
AI中文摘要

陆地碳循环的不确定性仍是气候预测的主要制约因素,部分源于地球系统模型中陆面表征和变率的不确定性。为解决此问题,我们提出了数据驱动框架AI4Land,用于生成关键陆面变量的高分辨率历史重建和未来预测。该框架采用U-Net架构的两阶段方法。在第一阶段(本文重点),它通过整合粗分辨率情景数据与静态地理特征,重建年度土地利用与土地覆盖。在计划的第二阶段,生成的高分辨率地图将用于在更细时间尺度上预测动态生物物理变量,特别是叶面积指数。模型基于地球观测数据训练,学习再现空间明确且物理一致的陆面模式,并将时间覆盖扩展到缺乏直接观测的时期。AI4Land在MareNostrum5上开发和训练,展示了GPU加速的高性能计算基础设施如何支持全球尺度的气候AI流水线。最终产品是一套开源模拟器,旨在与数字孪生平台(如Destination Earth计划下开发的平台)实时耦合。通过按需提供逼真且演变的陆面条件,本工作旨在减少关键不确定性,提高下一代气候模拟的预测能力。

英文摘要

Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting the land surface representation and variability in Earth system models. To address this limitation, we present a data-driven framework AI4Land, for generating high-resolution historical reconstructions and future projections of key land surface variables. The framework follows a two-phase approach using a U-Net architecture. In the first phase, which is the focus of this work, it reconstructs annual land use and land cover by integrating coarse-resolution scenario data with static geophysical features. In a planned second phase, the resulting high-resolution maps will be used to predict dynamic biophysical variables, particularly leaf area index, at finer temporal scales. Trained on Earth observation data, the models learn to reproduce spatially explicit and physically consistent land surface patterns, extending temporal coverage to periods lacking direct observations. AI4Land was developed and trained on MareNostrum5, demonstrating how GPU-accelerated HPC infrastructure enables global-scale climate AI pipelines. The final product is a suite of open-source emulators designed for real-time coupling with digital twin platforms, such as those developed under the Destination Earth initiative. By delivering realistic and evolving land surface conditions on demand, this work aims to reduce critical uncertainties and improve the predictive power of next-generation climate simulations.

2606.11792 2026-06-11 cs.CV cs.AI cs.CL 新提交

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

MultiToP:学习修补视觉令牌以减轻视频大型多模态模型中的幻觉

Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学) Sun Yat-sen University(中山大学) East China Normal University(华东师范大学)

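AI summary: Proposes the MultiToP framework, in which a lightweight Visual Token Patcher dynamically replaces unreliable visual tokens; combined with information-guided rank calibration and sparsity regularization, it reduces hallucinations in video large multimodal models without modifying the original model, improving F1 scores and question-answering accuracy.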

详情
Comments
Preprint
AI中文摘要

视频大型多模态模型在视频理解方面取得了显著进展,但仍容易产生幻觉,即生成的响应未能忠实于输入视频。在本文中,我们提出MultiToP,一种多模态上下文感知的视觉令牌修补框架,通过在语言生成之前优化不可靠的视觉令牌来减轻幻觉。MultiToP引入了一个轻量级的视觉令牌修补器,用于预测令牌级替换分布,并选择性地用动态全局修补令牌替换不可靠的视觉令牌。为了有效训练修补器,我们进一步提出了信息引导的排名校准,利用从主干网络派生的答案条件帧级信息线索来指导令牌替换。结合真实答案监督和稀疏正则化,MultiToP实现了局部视觉证据优化,而无需修改原始模型。大量实验表明,MultiToP在Vript-HAL上有效减少了幻觉,且推理开销可忽略不计,将Qwen3-VL-4B-Instruct的F1分数相比原始模型提高了50.60%。同时,MultiToP保持了通用的视频理解能力,在ActivityNet-QA上为Video-LLaVA-7B带来了18.58%的相对准确率提升。

英文摘要

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.

2606.11786 2026-06-11 cs.CL 新提交

Lius: Translation Model Based Instructional Lingustic Using Continual Instruction Tuning In Kupang Malay

Lius:基于持续指令调优的库邦马来语教学语言学翻译模型

Joanito Agili Lopo, Yunita Sari, Guntur Budi Herwanto

发表机构 * Universitas Gadjah Mada(加札马达大学)

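AI summary: For the low-resource language Kupang Malay, designs instructions from the lexical and semantic features of a bilingual dictionary and fine-tunes a large language model with Continual Instruction Tuning (CIT), beating the baseline by 4-6 points and NMT and multilingual LLMs by 10-13 points on several metrics.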

详情
Comments
This paper is the result of the Master Thesis in Master of Artificial Intelligence at Universitas Gadjah Mada
AI中文摘要

大语言模型(LLM)为翻译任务提供了新的潜力,但在处理低资源语言时常常出现性能下降。为了解决这一限制,我们提出了一种针对低资源语言库邦马来语微调LLM的方法。我们的方法涉及利用双语词典的显式词汇和语义特征设计一组指令,并引入持续指令调优(CIT),一种支持基于迭代指令训练的训练范式。实验结果表明,我们名为Lius的模型在多个评估指标上比标准指令调优模型提高了4-6分,并超越了神经机器翻译(NMT)和多语言LLM模型10-13分。这些发现突显了我们的方法在减轻低资源语言翻译中对大规模并行数据依赖的潜力。

英文摘要

Large Language Models (LLMs) offer new potential for translation tasks but often suffer performance degradation on low-resource languages. To address this limitation, we propose an approach for fine-tuning LLMs on a low-resource language, Kupang Malay. Our approach involves designing a set of instructions that leverage explicit lexical and semantic features from a bilingual dictionary, and introducing Continual Instruction Tuning (CIT), a training paradigm that enables iterative instruction-based training. Experimental results demonstrate that our model, named Lius, outperforms standard instruction-tuned models by 4-6 points and surpasses both Neural Machine Translation (NMT) and multilingual LLM models by 10-13 points on several evaluation metrics. These findings highlight the potential of our approach to reduce the reliance on large-scale parallel data in low-resource language translation.

2606.11782 2026-06-11 cs.CV 新提交

Seeing What Matters: Perceptual Wrapper with Common Randomness for 3D Gaussian Splatting

看见重要之处:基于公共随机性的感知包装器用于3D高斯泼溅

He-Bi Yang, Jing-Zhong Chen, Yen-Kuan Ho, Sang NguyenQuang, Fan-Yi Hsu, Yun-Yu Lee, Jui-Chiu Chiang, Wen-Hsiao Peng

发表机构 * National Yang Ming Chiao Tung University(国立阳明交通大学) National Chung Cheng University(国立中正大学)

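AI summary: To address the difficulty 3D Gaussian Splatting has in synthesizing high-frequency textures in memory-constrained and rate-distortion-optimized pipelines, proposes a 2D perceptual wrapper that uses pseudo-random Gaussian noise and Wasserstein Distortion supervision to enhance rendered outputs in a content- and view-dependent manner, improving perceptual quality at reduced model size.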

详情
Comments
18 pages, 9 figures
AI中文摘要

虽然3D高斯泼溅(3DGS)实现了令人印象深刻的实时渲染,但它经常难以合成高频纹理,这一限制在内存受限和率失真优化(RDO)管道中尤为严重。为了解决这个问题,我们提出了一种通用的2D感知包装器,它以内容和视角相关的方式增强现有3DGS表示的渲染输出。我们的方法利用一个以伪随机高斯噪声为条件的轻量级合成网络来合成感知上合理的纹理。在Wasserstein失真的监督下,该网络学习匹配局部特征统计,而不是严格强制逐像素重建保真度,从而有效缓解标准框架中固有的模糊性。我们展示了我们的即插即用方法在普通、内存受限和RDO 3DGS方法中的广泛适用性。全面的主观和客观实验证实,我们的方法显著优于现有基线,在急剧减小文件或模型尺寸的同时,实现了卓越的感知质量。

英文摘要

While 3D Gaussian Splatting (3DGS) achieves impressive real-time rendering, it frequently struggles to synthesize high-frequency textures, a limitation heavily exacerbated in memory-constrained and rate-distortion-optimized (RDO) pipelines. To address this, we propose a versatile 2D perceptual wrapper that enhances the rendered outputs of existing 3DGS representations in a content- and view-dependent manner. Our method leverages a lightweight synthesis network conditioned on pseudo-random Gaussian noise to synthesize perceptually plausible textures. Supervised by Wasserstein Distortion, the network learns to match local feature statistics rather than strictly enforcing pixel-wise reconstruction fidelity, effectively mitigating the blurriness inherent in standard frameworks. We demonstrate the broad applicability of our plug-and-play approach across vanilla, memory-constrained, and RDO 3DGS methods. Comprehensive subjective and objective experiments confirm that our method significantly improves over existing baselines, yielding superior perceptual quality at sharply reduced file or model sizes.
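To make "matching local feature statistics rather than pixel-wise fidelity" concrete, here is a minimal sketch that compares two feature patches through their mean and standard deviation only, using the closed-form squared 2-Wasserstein distance between one-dimensional Gaussians. Treating each patch as a 1-D Gaussian, and the function name `gaussian_w2_squared`, are assumptions for illustration; the abstract does not specify the exact form of the loss used in the paper.

```python
import statistics

def gaussian_w2_squared(features_a, features_b):
    """Squared W2 distance between 1-D Gaussians fitted to two feature patches."""
    mean_a = statistics.fmean(features_a)
    mean_b = statistics.fmean(features_b)
    std_a = statistics.pstdev(features_a)
    std_b = statistics.pstdev(features_b)
    return (mean_a - mean_b) ** 2 + (std_a - std_b) ** 2

def mean_squared_error(features_a, features_b):
    """Pixel-wise reconstruction error, for comparison."""
    return statistics.fmean((a - b) ** 2 for a, b in zip(features_a, features_b))

reference = [0.0, 1.0, 2.0, 3.0]
shuffled = [3.0, 1.0, 0.0, 2.0]   # same statistics, different arrangement
shifted = [1.0, 2.0, 3.0, 4.0]    # same spread, different mean
```

The shuffled patch has zero statistical distance from the reference even though its pixel-wise error is large, which is the kind of freedom that lets a network produce plausible texture instead of a blurred average.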

2606.11779 2026-06-11 cs.CV 新提交

Battery detection of XRay images using transfer learning

基于迁移学习的X射线图像电池检测

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

发表机构 * Ruhr West University of Applied Sciences(鲁尔西应用科学大学)

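AI summary: Uses transfer learning with a YOLOv5m model to detect batteries in X-ray images and classify three types of lithium-ion batteries, reaching 94% detection precision with a 22 ms inference time.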

详情
Comments
Published at the European Symposium on Artificial Neural Networks (ESANN 2022)
AI中文摘要

在许多应用中,检测和分类电池的需求急剧增加。本研究证明了迁移学习在预测图像中是否包含电池、定位电池以及识别三种类型的锂离子电池(即棱柱形、软包和圆柱形)方面的潜力。特别地,它关注于两种应用中的迁移学习方法:使用预训练的YOLOv5m训练大规模数据集以检测电子设备,然后利用这些训练后的权重来检测和分类电池。电池检测精度达到94%,比预训练的YOLOv5m权重高出5%,推理时间为22毫秒。

英文摘要

The need to detect and sort batteries is increasing drastically across many applications. This study demonstrates the potential of transfer learning for predicting whether an image contains a battery, locating it, and identifying three types of Lithium-Ion Batteries (LIB): prismatic, pouch, and cylindrical. In particular, it applies transfer learning in two successive applications: first training a pre-trained YOLOv5m on a large-scale dataset to detect electronic devices, then using the resulting weights to detect and classify batteries. Battery detection reaches a precision of 94%, which outperforms the pretrained YOLOv5m weights by 5%, at an inference time of 22 ms.

2606.11770 2026-06-11 cs.AI 新提交

SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning

SVoT: 基于强化学习的空间推理状态感知思维可视化

Chao Lei, Yanbei Jiang, Markus Hiller, Zhijian Zhou, Xunye Tian, Krista A. Ehinger, Nir Lipovetzky

发表机构 * School of Computing and Information Systems, The University of Melbourne(墨尔本大学计算与信息系统学院)

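AI summary: Proposes the SVoT framework, which uses reinforcement learning to generate verifiable intermediate states and visualizations, interleaving textual and visual reasoning chains to make multi-hop spatial reasoning in multimodal large language models more reliable.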

详情
AI中文摘要

空间推理对多模态大语言模型(MLLMs)仍是一个挑战,因为它需要在中间状态和状态转换上进行可靠的多跳推理。当前研究通常不验证中间状态,并将状态转换视为隐式过程,这限制了多跳空间推理的可靠性。为解决这一问题,我们提出状态感知思维可视化(SVoT),一种强化学习框架,生成交错、可验证的中间状态和可视化。SVoT将转换推理链整合到生成过程中,使模型能够通过交错的文本和视觉推理验证动作前提和效果。我们通过组相对策略优化(GRPO)训练SVoT,通过奖励设计实例化验证,并评估不同细粒度奖励的效果。由于现有基准将状态转换简化为单变量更新,大大简化了问题,我们通过扩展经典环境并引入两个需要多对象交互和数值推理的新领域Pacman和Gather,建立了五个领域。这些领域支持对多跳空间推理的系统评估,并对生成的中间状态和转换推理进行定量验证。具有转换感知监督的SVoT在引入的领域中达到了最先进的性能,在分布外测试集上实现了高达65%的绝对准确率提升。

英文摘要

Spatial reasoning remains a challenge for Multimodal Large Language Models (MLLMs), as it requires reliable multi-hop inference over both intermediate states and state transitions. Current studies often leave intermediate states unverified and treat state transitions as implicit processes, which limits reliability in multi-hop spatial reasoning. To address this, we propose State-aware Visualization-of-Thought (SVoT), a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations. SVoT integrates transition reasoning chains into the generation processes, enabling the model to verify action preconditions and effects through interleaved textual and visual reasoning. We train SVoT via Group Relative Policy Optimization (GRPO), instantiating verification through reward design and evaluating the efficacy of different fine-grained rewards. As existing benchmarks reduce state transitions to single-variable updates, substantially simplifying the problems, we establish five domains by extending classical environments and introducing two novel domains, Pacman and Gather, that require multi-object interactions and numerical reasoning. These domains support systematic evaluation of multi-hop spatial reasoning with quantitative verification of generated intermediate states and transition reasoning. SVoT with transition-aware supervision achieves state-of-the-art performance across the introduced domains, yielding up to a 65% absolute accuracy gain on out-of-distribution test sets.
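The core of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses drawn for the same prompt, so no learned value function is needed. A minimal sketch of that normalization step follows; the reward values are hypothetical, and the choice of population standard deviation with a small `eps` is an assumption rather than a detail taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four rollouts of one spatial-reasoning prompt, e.g. an
# answer reward plus a reward for verified intermediate states.
group = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(group)
```

Rollouts that beat their group average receive positive advantages and are reinforced, while below-average rollouts are pushed down; finer-grained rewards change the inputs to this step, not the step itself.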

2606.11769 2026-06-11 cs.AI cs.LG 新提交

When Do Data-Driven Systems Exhibit the Capability to Infer?

数据驱动系统何时展现出推理能力?

Maximilian Poretschkin, Tabea Naeven

发表机构 * Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS)(弗劳恩霍夫智能分析与信息系统研究所) University of Bonn(波恩大学) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习和人工智能研究所)

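AI summary: To address the unclear definition of the capability to infer in the EU AI Act, proposes a graded framework based on statistical learning theory and uses credit scoring case studies to show how to judge whether a system has that capability.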

详情
AI中文摘要

欧盟AI法案是第一部全面的人工智能法规,为所谓高风险和通用AI系统规定了广泛的义务。AI法案下AI系统的一个关键区别特征是推理能力。由于AI法案未明确定义推理,某些数据驱动系统存在灰色地带。一个具体例子是信用评分系统,被AI法案附件三列出。然而,这些系统通常使用统计模型实现,不清楚它们是否具有推理能力,从而是否属于AI法案的AI定义。受统计学习理论启发,本文开发了一个分级不同推理能力水平的框架。基于AI法案和委员会关于人工智能系统定义的指南,我们分析了哪些水平构成AI法案意义上的充分推理能力,以及哪些地方需要进一步的监管明确性。我们通过创建两个现实的信用评分工作流程来说明该框架,并展示推理是否以及在哪里发生。我们的分析表明,不仅需要考虑单个模型,还需要考虑整个数据处理工作流程。它还表明,开发过程中人类专家的参与可能对推理能力产生重大影响。代码可在此https URL找到。

英文摘要

The European AI Act is the first comprehensive regulation of artificial intelligence (AI), setting out extensive obligations, particularly for so-called high-risk and general-purpose AI systems. A key distinguishing feature of AI systems under the AI Act is the capability to infer. Since the AI Act does not clearly define what inference is, there is a gray area for certain data-driven systems. A specific example is credit scoring systems, which are listed by Annex III of the AI Act. At the same time, however, these are often implemented using statistical models for which it is unclear whether they have the capability to infer and thus fall under the AI definition of the AI Act at all. Motivated by statistical learning theory, this work develops a framework for grading different levels of the capability to infer. Based on the AI Act and the Commission Guidelines on the definition of an artificial intelligence system, we analyze which levels constitute sufficient capability to infer within the meaning of the AI Act and where further regulatory clarity is needed. We illustrate the framework by creating two realistic credit scoring workflows and show whether and where inference occurs in them. Our analysis illustrates that not only individual models but the entire data processing workflow must be considered. It also shows that the involvement of human experts during development can have significant influence on the capability to infer. Code can be found at this https URL.

2606.11767 2026-06-11 cs.RO cs.AI 新提交

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

通过真实到仿真到真实触觉策略学习的盲操作灵巧抓取

Shengcheng Luo, Xiyan Huang, Zhe Xu, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

发表机构 * ShanghaiTech University(上海科技大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院)

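AI summary: Proposes a framework combining Real2Sim tactile calibration, a layout-aware tactile encoder, and a tactile-conditioned Diffusion Policy for tactile-only blind grasping with a dexterous hand, reaching a 27% success rate across 20 objects on a real robot.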

详情
Comments
23 pages, 6 figures
AI中文摘要

使用灵巧手进行盲抓取是一项关键的操作能力。然而,由于触觉的仿真到真实差距以及稀疏触觉信号的有限表达能力,为真实机器人学习这种仅依赖触觉的策略仍然具有挑战性。为了弥合这一差距,我们提出了一个仅依赖触觉的盲抓取框架,该框架可部署在物理多指机器人手上。我们的方法结合了三个关键组成部分。首先,我们引入了一个Real2Sim触觉校准流程,构建了一个接触校准的数字孪生模拟器,能够复现真实的触觉信号。其次,我们使用布局感知触觉编码器改进了稀疏触觉观测的表达能力,该编码器通过自监督预训练融入了传感器几何先验。第三,为了提高对未见物体的泛化能力,我们在校准后的模拟器中训练了特定物体的强化学习专家,并将其成功的抓取轨迹聚合为触觉条件扩散策略。我们在配备分布式触觉传感的物理LEAP手上评估了我们的方法,涉及10个见过和10个未见过的物体。部署的策略在所有20个物体上实现了27%的真实世界抓取成功率,无需真实世界的抓取演示或视觉输入。仿真消融实验表明,布局感知触觉预训练提高了抓取性能,而传感级评估确认Real2Sim校准增加了仿真与硬件之间触觉接触事件的一致性。这些结果表明,接触事件校准、几何感知触觉表示学习和基于扩散的策略聚合为真实灵巧机器人手上的仅触觉盲抓取提供了一条有效路径。项目页面:此HTTP URL。

英文摘要

Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: this http URL.

2606.11762 2026-06-11 cs.CL cs.AI 新提交

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

语言模型在开放式任务中的自动化创造力评估

Min Sen Tan, Zachary Kit Chun Choy, Syed Ali Redha Alsagoff, Nadya Yuki Wangsajaya, Mohor Banerjee, Swaagat Bikash Saikia, Alvin Chan

发表机构 * Raffles Institution(莱佛士书院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) Lee Kong Chian School of Medicine, Nanyang Technological University(南洋理工大学李光前医学院) Centre of AI in Medicine (C-AIM), Nanyang Technological University(南洋理工大学人工智能医学中心)

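AI summary: Proposes a domain-agnostic automated framework that quantifies divergent and convergent creativity of LLMs on open-ended tasks using semantic entropy and a retrieval-based multi-agent judge, validated on problem-solving, research ideation, and creative writing.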

详情
Comments
Accepted to ACL 2026 (Main Conference). 35 pages, 16 figures. Code: this https URL
AI中文摘要

大型语言模型(LLMs)在语言理解、推理和生成方面取得了显著进展,激发了对其创造潜力的日益关注。实现这一潜力需要系统化和可扩展的方法来评估跨不同任务的创造力。然而,大多数现有的创造力指标与特定任务紧密耦合,将领域假设嵌入评估过程,限制了可扩展性和通用性。为解决这一差距,我们引入了一个自动化、领域无关的框架,用于量化LLM在开放式任务中的创造力。我们的方法将测量装置与创造性任务本身分离,实现了可扩展、任务无关的评估。发散创造力通过语义熵(一种无参考且稳健的新颖性和多样性指标)进行测量,并针对人类注释、基于LLM的新颖性判断和基线多样性度量进行了验证。收敛创造力通过一种新颖的基于检索的多智能体评判框架进行评估,该框架提供上下文敏感的任务完成评估,效率提升超过60%。我们在三个性质不同的领域验证了我们的框架:问题解决(MacGyver)、研究构思(HypoGen)和创意写作(BookMIA),使用了广泛的LLM套件。实证结果表明,我们的框架可靠地捕捉了创造力的关键方面,包括新颖性、多样性和任务完成,并揭示了模型属性(如大小、温度、时效性和推理)如何影响创造性表现。我们的工作为自动化的LLM创造力评估建立了可重复和可泛化的标准,为可扩展的基准测试铺平了道路,并加速了创造性AI的进展。

英文摘要

Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks, embedding domain assumptions into the evaluation process, and limiting scalability and generality. To address this gap, we introduce an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task-agnostic assessment. Divergent creativity is measured using semantic entropy, a reference-free and robust metric for novelty and diversity, validated against human annotations, LLM-based novelty judgments and baseline diversity measures. Convergent creativity is assessed via a novel retrieval-based multi-agent judge framework that delivers context-sensitive evaluation of task fulfilment with over 60% improved efficiency. We validate our framework in three qualitatively distinct domains: problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties, such as size, temperature, recency, and reasoning, impact creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.
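Semantic entropy can be read as Shannon entropy over groups of responses that mean the same thing, rather than over surface strings. The sketch below assumes the grouping has already been done and that each response carries a cluster label; how the paper forms those clusters is not stated in the abstract, and the labels shown are hypothetical.

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels):
    """Shannon entropy (in nats) of the empirical distribution over semantic clusters."""
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical cluster labels for responses sampled from two models on one task.
repetitive_model = ["tape", "tape", "tape", "tape"]
diverse_model = ["tape", "magnet", "lever", "string"]
```

A model that keeps returning the same idea scores zero, while one whose four responses fall into four distinct clusters reaches the maximum of log 4 for that sample size.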

2606.11761 2026-06-11 cs.LG 新提交

RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning

RCAP: 鲁棒的、类别感知的、概率性动态数据集剪枝

Atif Hassan, Swanand Khare, Jiaul H. Paik

发表机构 * IIT Kharagpur(印度理工学院卡哈拉格普尔分校)

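AI summary: Proposes the RCAP algorithm, which estimates a per-class retention fraction with a closed-form solution, adapts it during training, and prioritizes high-loss samples; it outperforms existing methods across datasets and training paradigms, and with only 10% of the data improves performance on class-imbalanced datasets by more than 1%.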

详情
Comments
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence (UAI 2025)
AI中文摘要

动态数据剪枝技术旨在通过模型训练期间定期选择输入数据的代表性子集来降低计算成本,同时最小化信息损失。然而,现有方法在平衡和不平衡数据集中,特别是在高剪枝率下,往往难以保持较强的最差组准确率。为了解决这一挑战,我们提出了RCAP,一种用于分类任务的鲁棒的、类别感知的、概率性动态数据集剪枝算法。RCAP应用闭式解来估计每个类别应包含在训练子集中的样本比例。该比例通过类别聚合损失在每个epoch自适应调整。随后,它采用自适应采样策略,优先选择具有高损失的样本来填充类别子集。我们在六个从类别平衡到高度不平衡的多样化数据集上,使用五种不同的模型,在三种训练范式(从头训练、迁移学习和微调)下评估了RCAP。我们的方法在所有剪枝率下始终优于最先进的数据集剪枝方法,实现了卓越的最差组准确率。值得注意的是,仅使用10%的数据,RCAP在类别不平衡数据集上相比全数据训练性能提升超过1%,同时平均加速8.69倍。代码可在此https URL获取。

英文摘要

Dynamic data pruning techniques aim to reduce computational cost while minimizing information loss by periodically selecting representative subsets of input data during model training. However, existing methods often struggle to maintain strong worst-group accuracy, particularly at high pruning rates, across balanced and imbalanced datasets. To address this challenge, we propose RCAP, a Robust, Class-Aware, Probabilistic dynamic dataset pruning algorithm for classification tasks. RCAP applies a closed-form solution to estimate the fraction of samples to be included in the training subset for each individual class. This fraction is adaptively adjusted in every epoch using class-wise aggregated loss. Thereafter, it employs an adaptive sampling strategy that prioritizes samples having high loss for populating the class-wise subsets. We evaluate RCAP on six diverse datasets ranging from class-balanced to highly imbalanced using five distinct models across three training paradigms: training from scratch, transfer learning, and fine-tuning. Our approach consistently outperforms state-of-the-art dataset pruning methods, achieving superior worst-group accuracy at all pruning rates. Remarkably, with only $10\%$ data, RCAP delivers $>1\%$ improvement in performance on class-imbalanced datasets compared to full data training while providing an average $8.69\times$ speedup. The code can be accessed at this https URL
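The two ingredients named in the abstract, a per-class keep fraction driven by class-wise loss and sampling that favors high-loss examples, can be sketched as follows. The proportional allocation rule in `class_fractions` is a stand-in assumption, since the abstract does not give RCAP's closed-form solution, and all function names are hypothetical.

```python
import random
from collections import defaultdict

def class_fractions(losses, labels, budget):
    """Split an overall keep budget across classes in proportion to mean class loss."""
    by_class = defaultdict(list)
    for loss, label in zip(losses, labels):
        by_class[label].append(loss)
    mean_loss = {c: sum(v) / len(v) for c, v in by_class.items()}
    total = sum(mean_loss.values())
    n = len(losses)
    fractions = {}
    for c, v in by_class.items():
        target = budget * n * mean_loss[c] / total  # samples requested for class c
        fractions[c] = min(1.0, target / len(v))
    return fractions

def sample_subset(losses, labels, fractions, rng):
    """Per class, draw samples without replacement, favoring high-loss samples."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for c, indices in by_class.items():
        k = max(1, round(fractions[c] * len(indices)))
        # Weighted sampling without replacement: keep the k largest keys u ** (1 / w).
        keyed = sorted(
            indices,
            key=lambda i: rng.random() ** (1.0 / (losses[i] + 1e-12)),
            reverse=True,
        )
        kept.extend(keyed[:k])
    return sorted(kept)
```

In a training loop these two functions would be re-run each epoch with fresh per-sample losses, so that classes the model currently handles poorly keep a larger share of the pruned subset.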

2606.11751 2026-06-11 cs.CV cs.AI 新提交

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

AnchorEdit: 通过因果记忆在多轮图像编辑中保持时间一致性

Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学) JD Explore Academy(京东探索研究院)

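AI summary: Proposes AnchorEdit, described as the first autoregressive diffusion framework for this setting, which uses a causal memory mechanism and a self-rollout strategy to counter identity drift and error accumulation in multi-turn editing, maintaining high fidelity over 10+ interaction rounds.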

详情
Comments
Code: this https URL
AI中文摘要

多轮图像编辑对于迭代设计至关重要,但当前模型在连续步骤中常面临身份漂移和误差累积。现有研究利用视频先验保持一致性,但其依赖的双向注意力与交互式编辑的因果、顺序性质根本不符。本文提出AnchorEdit,首个专为高分辨率、长期多轮编辑设计的自回归(AR)扩散框架。AnchorEdit通过三阶段训练课程弥合视频先验与因果推理之间的差距:保持身份的单轮预训练、使用新颖的自展开策略进行因果AR强制微调以缓解暴露偏差,以及用于高效4步生成的一致性蒸馏。在推理过程中,我们引入记忆机制来锚定初始主体身份,并确保在扩展编辑轨迹上的稳定外推。为评估性能,我们提供了一个新的高分辨率多轮编辑基准,旨在压力测试长期稳定性。大量实验表明,AnchorEdit达到了最先进的结果,即使在10轮以上的交互中也能保持卓越的主体保真度和指令遵循能力。

英文摘要

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.

2606.11744 2026-06-11 cs.CL cs.AI 新提交

Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild

嘿,聊天机器人,你能教我吗?为人类学习构建结构化苏格拉底式对话

Sidney Tio, Arunesh Sinha, Pradeep Varakantham

发表机构 * School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院) Department of Management Science and Information Systems, Rutgers Business School(罗格斯大学商学院管理科学与信息系统系)

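AI summary: To address the poor tutoring performance of LLMs over long dialogues, proposes a system that separates curriculum sequencing, Socratic dialogue, and knowledge-state inference, using a PPO policy to decide the teaching order; it outperforms baseline models on STEM and non-STEM topics.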

详情
Comments
10 Main Body Pages, with Appendices
AI中文摘要

大型语言模型现在被广泛用于日常学习,但底层交互通常是非结构化的聊天,而不是遵循课程。与正式的在线学习系统不同,这些交互没有学生的先前记录,因此对学生已知内容的任何估计都必须从对话本身推断。我们表明,仅通过扩展模型并不能弥补这一差距。前沿和教育调优的LLM在要求长时间辅导学生时表现不佳,因为这需要同时做三件事:导师必须安排课程顺序,进行苏格拉底式对话,并从对话中推断学生的知识状态。我们建议分离这些职责。给定学生查询,我们的系统构建一个先决知识图谱,其中子主题是节点,依赖关系是边,并将辅导视为决定下一个要教授哪个节点以及在该节点上花费多少轮对话后再继续。一个轻量级的PPO策略处理这个顺序决策,而LLM在所选节点进行苏格拉底式交流并返回学生进展信号。在保留的STEM和非STEM主题上,我们的PPO配对导师优于启发式基线、前沿通用模型以及专门用于苏格拉底式对话的模型:无论是在学生达到完全课程掌握的速度上,还是在所需的对话轮数上。明确的课程结构带来了底层模型扩展所无法提供的收益。

英文摘要

Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats rather than following a curriculum. Unlike formal online learning systems, these interactions carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself. We show that this gap is not closed by scaling models alone. Frontier and education-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once. The tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student's knowledge state from that dialogue. We propose separating these responsibilities. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. Across held-out STEM and non-STEM topics, our PPO-paired tutor outperforms heuristic baselines, frontier general-purpose models, and a model specialised for Socratic dialogue: on both the rate at which students reach full curriculum mastery and the number of turns required. Explicit curriculum structure delivers gains that scaling the underlying model does not.
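The prerequisite knowledge graph described above can be pictured as a mapping from each subtopic to the subtopics it depends on. The sketch below only computes which nodes are currently teachable given what the student has mastered; the calculus topics and the `teachable_nodes` helper are hypothetical, and in the paper the choice among candidates (and the number of turns to spend) is made by the PPO policy, which is not reproduced here.

```python
# Hypothetical prerequisite graph: each subtopic maps to the subtopics it depends on.
PREREQS = {
    "limits": [],
    "derivatives": ["limits"],
    "integrals": ["limits"],
    "fundamental_theorem": ["derivatives", "integrals"],
}

def teachable_nodes(prereqs, mastered):
    """Return unmastered subtopics whose prerequisites have all been mastered."""
    return sorted(
        topic
        for topic, deps in prereqs.items()
        if topic not in mastered and all(d in mastered for d in deps)
    )
```

A simple heuristic baseline might always teach the first teachable node until mastery; a learned sequencing policy would instead pick among the candidates using the progress signal returned after each dialogue turn.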

2606.11743 2026-06-11 cs.RO cs.GR cs.LG 新提交

TacCoRL: Integrating Tactile Feedback into VLA via Simulation

TacCoRL: 通过仿真将触觉反馈集成到视觉-语言-动作模型中

Siyu Ma, Yuqi Liang, Chang Yu, Yunuo Chen, Hao Su, Yixin Zhu, Yin Yang, Chenfanfu Jiang

发表机构 * University of California, Los Angeles(加利福尼亚大学洛杉矶分校) University of California, San Diego(加利福尼亚大学圣迭戈分校) University of Electronic Science and Technology of China(电子科技大学) Peking University(北京大学) University of Utah(犹他大学)

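AI summary: Proposes the TacCoRL framework, which injects tactile feedback into vision-language-action policies through sim-real co-training and reinforcement learning, raising the average success rate on contact-rich tasks by 22.5 percentage points (from 50.0% to 72.5%).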

详情
AI中文摘要

视觉-语言-动作(VLA)模型为机器人操作提供了强大的视觉、语言和动作先验,但仅凭视觉观察往往缺失接触密集型任务所需的局部接触状态。我们提出TacCoRL,一个可扩展的框架,将触觉反馈注入VLA策略,并通过仿真-真实联合训练和基于仿真的强化学习(RL)进行改进,无需大规模触觉预训练或广泛的真实世界接触探索。关键思想不仅是添加触觉作为输入,而是学习在接近失败状态下接触读数应如何调节动作响应,这些状态在演示中罕见且在硬件上收集风险高。我们使用真实对齐的仿真器作为接触交互的闭环训练环境。混合的仿真和真实轨迹首先在预训练策略中热启动触觉条件动作。具有可验证任务奖励的强化学习随后通过仿真接触回滚优化策略。它强化导致任务完成的触觉条件动作,而真实轨迹上的监督目标将精炼策略锚定到部署的视觉、触觉和动作分布。所得策略直接转移到真实机器人,无需特权仿真状态或在线真实世界RL。在四个双臂接触密集型任务中,最终的视觉-触觉策略平均成功率达到72.5%,而基线为50.0%。结果视频和更多细节见此链接。

英文摘要

Vision-language-action (VLA) models provide strong visual, language, and action priors for robot manipulation, but visual observations alone often miss the local contact state required for contact-rich tasks. We present TacCoRL, a scalable framework that injects Tactile feedback into VLA policies and improves them through sim-real Co-training and simulation-based reinforcement learning (RL), without requiring large-scale tactile pretraining or extensive real-world contact exploration. The key idea is not only to add touch as an input, but to learn how contact readings should modulate action responses in near-failure states that are rare in demonstrations and risky to collect on hardware. We use a real-aligned simulator as a closed-loop training environment for contact interaction. Mixed simulated and real trajectories first warm-start tactile-conditioned actions in the pretrained policy. Reinforcement learning with verifiable task rewards then optimizes the policy using simulated contact rollouts. It reinforces tactile-conditioned actions that lead to task completion, while a supervised objective on real trajectories keeps the refined policy anchored to the deployment-time visual, tactile, and action distributions. The resulting policy transfers directly to the real robot without privileged simulation state or online real-world RL. Across four bimanual contact-rich tasks, the final visuo-tactile policy achieves an average success rate of 72.5%, compared to a baseline of 50.0%. Result videos and more details are available at this https URL.

2606.11740 2026-06-11 cs.CV cs.CL 新提交

UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

UniReason-Med: 用于医学VQA中二维到三维迁移的共享基础推理接口

Mengzhuo Chen, Yan Shu, Chi Liu, Hongming Piao, Xidong Wang, Derek Li, Bryan Dai

发表机构 * IQuest Research

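AI summary: Proposes the UniReason-Med framework, which transfers reasoning ability from 2D medical images to 3D medical VQA through a shared grounded reasoning interface, combining supervised fine-tuning with reinforcement learning to substantially improve 3D reasoning.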

详情
AI中文摘要

我们研究了当两种输入类型通过共同的推理接口对齐时,来自丰富2D医学图像的基础推理监督是否能够改善3D医学VQA。我们引入了UniReason-Med,一个单一检查点框架,在推理时处理2D图像或切片序列化的3D体积,通过共享框语法、区域标记注入和共同的基础推理策略生成交错文本推理和局部视觉证据。为了训练这个接口,我们构建了UniMed-CoT,一个包含220K样本的指令微调数据集,具有交错的文本推理和基础视觉证据,包括170K 2D和50K 3D样本。通过监督微调后接结果级强化学习,UniReason-Med学会生成基础推理轨迹,而在强化学习期间无需基于IoU/Dice的定位奖励。数据混合和组件消融实验表明,与仅3D训练相比,联合2D+3D基础监督显著改善了3D推理,而基础化和区域标记注入对2D和3D任务都有持续益处。这些结果表明,共享的基础推理接口可以将推理结构从2D图像迁移到切片序列化的体积医学理解。代码和数据公开于此 https URL。

Abstract

We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at this https URL.
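As a rough illustration of what "slice-serialized" input could look like, the sketch below turns either a 2D image or a 3D volume into a list of 2D slices, so that a single model interface can consume both. It is not taken from the UniReason-Med code: the function names, the evenly spaced sampling rule, and the `max_slices` default are all assumptions made for this example.

```python
from typing import List

Slice = List[List[float]]  # one 2D slice as rows of pixel intensities


def sample_slice_indices(depth: int, max_slices: int) -> List[int]:
    """Pick up to max_slices evenly spaced indices along the depth axis.

    Hypothetical sampling rule; the paper's actual strategy is not stated
    in the abstract.
    """
    if depth <= 0 or max_slices <= 0:
        raise ValueError("depth and max_slices must be positive")
    if depth <= max_slices:
        return list(range(depth))  # keep every slice of a shallow volume
    if max_slices == 1:
        return [depth // 2]  # fall back to the middle slice
    step = (depth - 1) / (max_slices - 1)
    return [round(i * step) for i in range(max_slices)]


def serialize_input(data, max_slices: int = 8) -> List[Slice]:
    """Return a list of 2D slices for a 2D image or a 3D volume.

    A 2D image (rows of numbers) becomes a one-element sequence; a 3D volume
    (a list of 2D slices) is subsampled along its first axis.
    """
    is_volume = isinstance(data[0][0], (list, tuple))
    if not is_volume:
        return [data]
    return [data[i] for i in sample_slice_indices(len(data), max_slices)]
```

Under these assumptions, a 2D image and a 10-slice volume both come out as a plain list of slices, which is the property a shared reasoning interface would rely on.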

2606.11739 2026-06-11 cs.CV cs.AI New submission

Multi-View In-Cabin Monitoring System for Public Transport Vehicles

Evgeny Gorelik, Kenny Dean Karrow, Fikret Sivrikaya, Sahin Albayrak, Christian Baumann

Affiliations: Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI)

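AI summary: Presents a multi-view in-cabin monitoring dataset with synchronized RGB-D images and LiDAR data, annotated with 3D human poses and bounding boxes, to support the evaluation of multi-view 3D detection models.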

Details
Comments
Submitted to ICDM2026
Abstract

We introduce a multi-view in-cabin monitoring dataset for public transportation with synchronized RGB and depth images from four inward-facing cameras and a rotating LiDAR covering the vehicle interior of a digitalized and partly automated German city bus. The dataset contains 9,136 synchronized samples with annotations and is accompanied by a calibration and pseudo-labeling pipeline that generates 3D human pose estimates and oriented 3D bounding boxes for occupants. We further provide a nuScenes-format conversion and benchmark representative multi-view 3D detection models (e.g., Lift-Splat-Shoot and BEVFusion), supporting comparative evaluation and small-scale training of multi-view in-cabin perception models. The dataset and tools are available at this https URL.