arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.11606 2026-06-11 cs.CV New submission

Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation

Raajitha Muthyala, Zhenan Yin, Alekhya Jilla, Frank Li, Theo Dapamede, Bardia Khosravi, Mohammadreza Chavoshi, Judy Gichoya, Saptarshi Purkayastha

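Affiliations: Department of Biomedical Engineering and Informatics, Indiana University; Department of Radiology and Imaging Sciences, Emory University

AI summary: This study systematically quantifies where five frozen vision-transformer foundation models retain or lose small-scale, low-contrast signal in chest radiography. It finds that the global-aggregation step silently suppresses small-scale signal, which remains recoverable from patch tokens.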

English abstract

Frozen vision-transformer (ViT) foundation-model embeddings increasingly serve as the substrate for downstream chest-radiography (CXR) pipelines, yet where small-scale, low-contrast signal is retained or lost in the frozen forward pass has not been systematically quantified across architectures, pretraining domains, and objectives. We probed five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 architectural control across three large CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR; aggregate pool n=492,724) and ChestX-Det10 (n=3,543; 1,462 small-lesion bounding boxes across Calcification, Nodule, Mass). Each model was evaluated with a small-scale-perturbation panel and a region-aware bounding-box-stratified probe on real lesions, comparing three pooling modes from the same forward pass: classification token (CLS), patch-mean (mean over all final-layer patch tokens), and bounding-box-restricted patch-local. On the perturbation panel, CLS embeddings sat at the chance floor (area under the ROC curve [AUC] 0.500-0.524); patch-mean was indistinguishable from CLS on iso-blur and reticular-fine cells but rose with CLS on larger directional-blur footprints, while disease AUC on globally decided tasks ranged 0.642-0.913. Patch-local probes recovered AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488); the ResNet-50 control reproduced the chance floor. On ChestX-Det10, image-level CLS classification showed within-class small-versus-large stratum gaps up to +0.243 AUC; bounding-box-level patch-local pooling on the same forward pass recovered AUC >= 0.899 on every (model x class) cell. Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.
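
As a rough illustration of the three pooling modes this abstract compares, the sketch below pools one set of final-layer tokens three ways. It is not the authors' code: the function name `pool_tokens`, the row-major patch layout, the pixel-overlap rule for selecting box patches, and the fallback when no patch overlaps are all assumptions made for the example.

```python
from statistics import fmean

def pool_tokens(cls_token, patch_tokens, grid_w, patch_px, box):
    """Return (cls, patch_mean, patch_local) from one forward pass.

    cls_token:    list of floats (the CLS embedding).
    patch_tokens: list of per-patch embeddings in row-major grid order (assumed).
    grid_w:       number of patches per row.
    patch_px:     patch side length in pixels.
    box:          (x0, y0, x1, y1) bounding box in pixels.
    """
    dim = len(cls_token)
    # Patch-mean: average over every final-layer patch token.
    patch_mean = [fmean(tok[d] for tok in patch_tokens) for d in range(dim)]

    # Patch-local: average only the patches whose pixel extent overlaps the box.
    x0, y0, x1, y1 = box
    local = []
    for idx, tok in enumerate(patch_tokens):
        row, col = divmod(idx, grid_w)
        px0, py0 = col * patch_px, row * patch_px
        if px0 < x1 and px0 + patch_px > x0 and py0 < y1 and py0 + patch_px > y0:
            local.append(tok)
    # Hypothetical fallback: use the global mean if the box hits no patch.
    if local:
        patch_local = [fmean(tok[d] for tok in local) for d in range(dim)]
    else:
        patch_local = patch_mean
    return list(cls_token), patch_mean, patch_local
```

With a 2x2 grid in which only one patch carries signal, the patch-mean dilutes that signal by the number of patches while the box-restricted mean keeps it intact, which is the dilution effect the abstract attributes to global aggregation.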

2606.11602 2026-06-11 cs.CV New submission

On Aligning Hierarchical Standardized Embedding for Audio-visual Generalized Zero-shot Learning

Zihan Zhang, Jie Hong, Siyuan Fan, Yanghao Zhou, Pengfei Fang

Affiliations: Southeast University; The University of Hong Kong; Beijing Institute of Technology; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education; School of Computer Science and Engineering, Southeast University

AI summary: Proposes AHSE, which uses Z-score standardization and a hierarchical alignment strategy (semantic, class, and batch levels) to address distributional and structural differences between the audio-visual and textual modalities, achieving competitive performance on three benchmark datasets.

English abstract

Audio-visual Generalized Zero-shot Learning (AV-GZSL) is a challenging task that aims to classify both seen and unseen objects or scenes by integrating data from audio and visual modalities. Recent studies primarily focus on fusing or aligning audio and visual features to generate more informative audio-visual embeddings. Moreover, most existing methods align audio-visual and textual features solely through their optimization objectives. However, these methods neglect the inherent distributional and structural differences between audio-visual and textual modalities. To address this limitation, we propose a method termed Aligning Hierarchical Standardized Embedding (AHSE), which enables hierarchical alignment of standardized audio-visual and textual embeddings within a shared embedding space. Specifically, we first apply Z-score standardization to the fused audio-visual and textual embeddings to reduce distributional mismatches. We then introduce a hierarchical alignment strategy that minimizes discrepancies at the semantic, class, and batch levels, thereby constructing a more robust and well-structured embedding space. This strategy not only preserves semantic and inter-class relationships but also maintains spatial consistency within each batch. Extensive experiments on three benchmark datasets (VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL) demonstrate that AHSE achieves competitive performance in zero-shot learning.
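
The Z-score standardization step mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming each embedding vector is standardized across its own feature dimensions; the abstract does not state the axis, and the helper name `zscore` and the `eps` guard are hypothetical.

```python
from statistics import fmean, pstdev

def zscore(embedding, eps=1e-8):
    """Standardize one embedding to zero mean and (near) unit variance.

    Assumption: statistics are taken over the feature dimensions of a single
    vector; eps avoids division by zero for constant vectors.
    """
    mu = fmean(embedding)
    sigma = pstdev(embedding)
    return [(x - mu) / (sigma + eps) for x in embedding]

# Two embeddings that differ only in scale land on (almost) the same point,
# which is the distribution-matching effect standardization is meant to give.
audio_visual = zscore([1.0, 2.0, 3.0])
textual = zscore([10.0, 20.0, 30.0])
```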

2606.11601 2026-06-11 cs.CV New submission

Spatially Coupled Phase-to-Depth Calibration for Fringe Projection Profilometry

Sehoon Tak, Jae-Sang Hyun

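Affiliations: Department of Mechanical Engineering, Yonsei University

AI summary: Proposes a spatially coupled phase-to-depth transformation in which all pixels share one mapping, built from global phase scalars and affine spatial terms, in place of independent per-pixel calibration, improving spatial consistency and reducing surface artifacts.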

English abstract

In fringe projection profilometry (FPP), depth is commonly recovered by fitting a phase-to-depth relation independently at each camera pixel. Although such pixel-wise calibration achieves high local accuracy, neighboring pixels can acquire markedly different calibration functions even when they observe the same smooth surface, producing spatially inconsistent geometry and structured surface artifacts. We propose a spatially coupled phase-depth transformation in which all pixels share a single low-dimensional mapping (global phase scalars combined with affine spatial terms on the undistorted reference-camera grid) rather than independent per-pixel fits, optionally augmented by a bounded, spatially smooth correction field. We further introduce a native-grid pairing scheme that constructs phase-depth calibration pairs directly on the reference-camera grid: when depth supervision comes from a rectified active-stereo pipeline, planes are fitted in stereo 3D and sampled back onto the camera grid along native rays, so the phase maps are never rectified. On a dental target with high-resolution scanner ground truth, the proposed model attains point-to-surface RMSE comparable to an active-stereo reference (about 12 µm aggregate) while substantially improving spatial coherence over pixel-wise polynomial and rational calibration, and reduces the runtime mapping to a few element-wise operations per pixel with negligible parameter storage.

2606.11599 2026-06-11 cs.CL cs.LG New submission

When is Your LLM Steerable?

Chenrui Fan, Yize Cheng, Ming Li, Soheil Feizi, Tianyi Zhou

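Affiliations: University of Maryland, College Park; MBZUAI, UAE

AI summary: Proposes predicting whether activation steering will succeed from the model's internal states early in generation, and uses that predictor to guide the search over steering strength, reducing decoding cost.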

English abstract

Activation steering offers a lightweight approach to controlling language models' behavior at inference time, but whether it succeeds or fails depends heavily on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve steering success rate. To this end, we first introduce ASTEER, a testbed of 1.4M steered generations spanning 150 concepts, each labeled for steering success or failure. Leveraging this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how steering's effects propagate along layers and token positions, providing key information for steerability prediction. We then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring full rollout. Our predictor achieves around 0.7 macro-F1 score on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further leverage this steerability predictor as guidance for steering-strength search, achieving near-optimal performance with a small fraction of decoding cost.

2606.11585 2026-06-11 cs.LG cs.CL nlin.AO New submission

Kuramoto Attention: Synchronizing Self-Attention on the Torus

Joshua Nunley

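Affiliations: Department of Informatics, Luddy School of Informatics, Computing, and Engineering, Cognitive Science Program, Indiana University Bloomington

AI summary: Proposes the Kuramoto attention layer, which treats hidden coordinates as angles and implements self-attention through gated cosine similarity and a circular-mean update that is equivalent to the Kuramoto coupling term; on character-level language modeling it performs close to a strong baseline.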

Comments
13 pages, 2 figures, 3 tables
English abstract

We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Because the values are the raw phase states, this update is exactly the Kuramoto coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$, with the attention matrix acting as an adaptive, content-dependent coupling kernel. Equivalently, the gated score is a learned metric on the torus that selects which tokens couple, and the update pulls each token toward the circular mean of the tokens it selects, tightening their phase agreement. The same two ingredients, an invariant similarity score and an on-manifold mean, define such a layer on any compact group; the torus is the abelian case, where both are closed-form. The softmax weights solve an entropy-regularized phase-retrieval problem, and rotary position enters as a position-dependent phase drift in the score. On enwiki8 character-level language modeling, the layer trains as a functional language model whose bits-per-character stays close to a strong matched RoPE+SwiGLU transformer: within $0.02$ BPC at one million parameters ($1.637\pm0.010$ versus $1.616\pm0.004$) and level on the median at five million ($1.448$ versus $1.452$ over five seeds) with the transformer ahead on the mean ($1.468$ versus $1.456$). These experiments establish that the constrained geometric structure is a viable language model at this scale; the structure itself, and its synchronization reading, is the contribution. Ablations isolate the load-bearing components, and the result gives a compact bridge between self-attention and phase synchronization.
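
The update rule quoted in this abstract, $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$, can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the paper's layer: the gating is reduced to fixed per-coordinate weights `gate`, the causal window includes the token itself (which contributes zero to the update), rotary position is omitted, and `step` is a hypothetical step size.

```python
import math

def kuramoto_attention_step(theta, gate, step=1.0):
    """Apply one causal Kuramoto-style attention update.

    theta: list of tokens, each a list of angles in radians.
    gate:  per-coordinate score weights (hypothetical stand-in for the gating).
    """
    updated = []
    for t, theta_t in enumerate(theta):
        # Score every token u <= t by gated cosine similarity of phases.
        scores = [
            sum(g * math.cos(a - b) for g, a, b in zip(gate, theta_t, theta[u]))
            for u in range(t + 1)
        ]
        # Softmax over the causal window (max subtracted for stability).
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        attn = [w / total for w in weights]
        # Kuramoto coupling per coordinate: sum_u A[t][u] * sin(theta_u - theta_t).
        updated.append([
            angle + step * sum(
                attn[u] * math.sin(theta[u][d] - angle) for u in range(t + 1)
            )
            for d, angle in enumerate(theta_t)
        ])
    return updated
```

For two one-dimensional tokens at phases 0 and pi/2, the first token has nothing earlier to couple to and stays put, while the second is pulled toward the first by its attention weight on that token.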

2606.11583 2026-06-11 cs.LG New submission

Beyond the Golden Teacher: Enhancing Graph Learning through LLM-GNN Co-teaching

Zhuoyi Peng, Hanlin Gu, Lixin Fan, Yi Yang

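Affiliations: The Hong Kong University of Science and Technology; WeBank

AI summary: For few-shot learning on text-attributed graphs, proposes an LLM-GNN co-teaching framework that avoids a fixed teacher model and uses bidirectional pseudo-label exchange with round-based preference optimization to substantially improve graph learning performance.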

Comments
Code: this https URL
English abstract

Text-attributed graphs (TAGs) underlie real-world applications such as citation networks, social media, and e-commerce. Few-shot graph learning on TAGs is hard: with only a handful of labels per class and the rest of the graph unannotated, neither GNNs nor LLMs can learn well on their own. GNNs read topology and fail on cold nodes; LLMs read text and fail on text-ambiguous nodes. Existing LLM-GNN methods all follow the same recipe: designate one model as the golden teacher and use its outputs (e.g., features or pseudo-labels) to supervise the other. We argue this golden-teacher assumption breaks under sparse supervision: neither model is golden, and treating either as such transfers its blind spots into the student. We therefore ask: can we avoid designating either model as the golden teacher, and still perform effective graph learning? We answer with LLM-GNN Co-Teaching, a bidirectional co-teaching framework in which neither model is fixed as teacher. The GNN and LLM exchange their most confident pseudo-labels under an architecture-specific small-loss criterion, and both update every round. Supervision is then mined from the trajectory: whenever a node moves from cross-model contradiction at round t to cross-model agreement at round t+1, the LLM's two answers on the same input form a preference pair (old contradicting self < new peer-endorsed self) for DPO training. We call this Round-based Pseudo-Label Preference Optimization (RPL-PO). On six benchmarks, LLM-GNN Co-Teaching consistently outperforms GNN-as-Judge and all prior methods, with absolute 3-shot gains of 7.86% on Cora and 7.73% on ogbn-arxiv; improvements carry over to 5-shot and to zero-shot cross-dataset transfer. Error-structure analysis further shows that abandoning the golden-teacher assumption substantially improves the LLM's graph learning capability on challenging samples.
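
The trajectory-mining rule described in this abstract (contradiction at round t, agreement at round t+1) can be sketched as below. This is a hedged illustration, not the released code: the dictionary-based inputs, the function name `mine_preference_pairs`, and the extra requirement that the LLM's answer actually changed between rounds are assumptions.

```python
def mine_preference_pairs(llm_prev, gnn_prev, llm_next, gnn_next):
    """Collect (rejected, chosen) LLM answers for DPO-style training.

    Each argument maps node id -> predicted label, at round t (prev)
    or round t+1 (next).
    """
    pairs = []
    for node in llm_prev:
        contradicted = llm_prev[node] != gnn_prev[node]
        agrees_now = llm_next[node] == gnn_next[node]
        # Assumption: a pair needs two different LLM answers, so nodes where
        # only the GNN moved are skipped.
        llm_changed = llm_prev[node] != llm_next[node]
        if contradicted and agrees_now and llm_changed:
            pairs.append({
                "node": node,
                "rejected": llm_prev[node],  # old contradicting answer
                "chosen": llm_next[node],    # new peer-endorsed answer
            })
    return pairs
```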

2606.11577 2026-06-11 cs.RO New submission

Distortion-Resilient Robotic Imitation Learning for Autonomous Cable Routing

Hao Wang, Fu-Zhao Ou, Shiqi Wang, Zhaolin Wan, Xiaopeng Fan

Affiliations: School of Artificial Intelligence, Harbin Institute of Technology; Department of Computer Science, City University of Hong Kong; Pengcheng Laboratory; Suzhou Research Institute, Harbin Institute of Technology

AI summary: Proposes a robotic imitation learning framework comprising an image quality assessment module, confidence-based learning, and a decision-making module that maintains high performance under image distortion; experiments validate its effectiveness.

English abstract

The rapid development of intelligent control methodologies has endowed robots with powerful autonomous intelligence. Cable routing, a ubiquitous foundational task in industry, provides a rigorous benchmark for robotic dexterity and sequential decision-making. In these practical scenarios, image observation distortion frequently occurs. Samples characterized by low-quality image observations often hinder accurate model training, posing challenges to the reliability and accuracy of intelligent control systems. Nevertheless, no dedicated intelligent control solution has been proposed for scenarios of image signal distortion. Meanwhile, image quality information has not been sufficiently exploited to further enhance the performance of intelligent control methodologies. To this end, we propose a novel robotic imitation learning framework that comprises an image quality assessment module, a confidence-based learning mechanism, and a decision-making module, which is designed to maintain high performance even under distorted image observations. In the proposed framework, the image quality assessment module synergizes with the confidence-based learning mechanism to enhance the efficacy of the decision-making module. Specifically, the image quality assessment module is incorporated to extract image quality information from image observations, while the confidence-based learning mechanism adaptively prioritizes challenging samples to improve learning effectiveness. The decision-making module determines appropriate discrete skills or continuous actions. Experimental results demonstrate that our formulated framework enhances the overall performance of the decision-making module.

2606.11576 2026-06-11 cs.CV cs.AI New submission

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Ahmadreza Jeddi, Minh Ngoc Le, Amirhossein Kazerouni, Hakki Can Karaimer, Hue Nguyen, Iqbal Mohomed, Michael Brudno, Alex Levinshtein, Konstantinos G. Derpanis, Babak Taati, Radek Grzeszczuk

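Affiliations: AI Center-Toronto, Samsung Electronics; University of Toronto; Vector Institute; York University

AI summary: Proposes AVIS, a lightweight policy that jointly adapts visual context scaling and reasoning scaling, using training-free Key Diversity Visual pruning and adaptive self-consistency to improve the accuracy-compute trade-off across multiple benchmarks.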

Comments
Project page: this https URL
English abstract

Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cost through two coupled axes: Visual Context Scaling (VCS), which controls how much visual evidence is passed to the language model, and Visual Reasoning Scaling (VRS), which controls how much inference-time reasoning search is performed. Existing methods typically optimize one axis at a time, leaving the joint allocation of compute across these axes underexplored. We introduce Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both VCS and VRS per query. AVIS realizes VCS through Key Diversity Visual (KDV) pruning, a training-free $O(N)$ key-based rule for removing redundant visual tokens before prefilling, and realizes VRS through adaptive self-consistency, using a learned difficulty predictor to select the number of reasoning rollouts. AVIS is deployment-friendly and compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Across diverse image and video reasoning benchmarks, AVIS improves the accuracy-compute trade-off relative to VCS-only and VRS-only baselines, and remains effective on top of RL post-trained VLMs while keeping compute and latency low.
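
The adaptive self-consistency idea in this abstract, where a difficulty estimate sets how many rollouts are sampled before a majority vote, can be sketched as follows. The linear difficulty-to-rollouts schedule, the bounds `min_k` and `max_k`, and both function names are hypothetical; in the paper the difficulty comes from a learned predictor, which is replaced here by a plain number in [0, 1].

```python
from collections import Counter

def rollouts_for(difficulty, min_k=1, max_k=8):
    """Map a predicted difficulty in [0, 1] to a rollout count.

    Assumption: a simple linear schedule between min_k and max_k.
    """
    d = min(max(difficulty, 0.0), 1.0)
    return min_k + round(d * (max_k - min_k))

def adaptive_self_consistency(sample_answer, difficulty):
    """Sample k answers, where k depends on difficulty, and majority-vote them.

    sample_answer: callable taking a rollout index and returning an answer.
    Returns (winning answer, number of rollouts used).
    """
    k = rollouts_for(difficulty)
    votes = Counter(sample_answer(i) for i in range(k))
    return votes.most_common(1)[0][0], k
```

An easy query spends one rollout and accepts its answer, while a hard query spends the full budget and lets the vote override an unlucky first sample.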

2606.11574 2026-06-11 cs.LG cond-mat.mtrl-sci physics.chem-ph stat.ML New submission

Range-Aware Bayesian Optimization for Discovering Diverse Designs within Target Property Windows

Shengli Jiang, Jason Wu, Charles M. Schroeder, Michael A. Webb

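Affiliations: Department of Chemical and Biological Engineering, Princeton University

AI summary: Proposes a range-aware Bayesian optimization framework whose acquisition function directly scores the posterior probability that a candidate satisfies a target range; on benchmark tasks and practical case studies it discovers more diverse valid designs than standard methods.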

Comments
64 pages, 6 main text figures, 17 supporting figures, 6 supporting tables
English abstract

In many materials and product design problems, desirable candidates exhibit properties that fall within an acceptable range rather than achieve a single optimum. Recovering multiple, distinct solutions that satisfy such specifications is also practically valuable, as some candidates may be preferred for reasons of cost, processability, or robustness that are difficult to encode directly in an objective function. Here, we develop a range-aware Bayesian optimization (BO) framework in which the acquisition function directly scores the posterior probability that a candidate satisfies a target range. The framework naturally extends to parallel pursuit of multiple distinct specifications over a shared candidate space. Across benchmark tasks, range-aware acquisition consistently recovers larger and more diverse sets of valid designs than standard BO baselines and recent goal-seeking methods. Its utility is further demonstrated in two practically motivated design case studies involving optimizing reaction conditions for polymer synthesis and sequence-defined oligomer discovery for prescribed optical absorption bands, supported by quantum chemical calculations. These results suggest that range-aware BO can provide a practical and sample-efficient foundation for specification-driven design, particularly when design flexibility and solution diversity are important considerations.
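
An acquisition that scores "the posterior probability that a candidate satisfies a target range" has a closed form when the surrogate's posterior at a candidate is Gaussian. The sketch below assumes exactly that (for example, a Gaussian-process surrogate); the Gaussian assumption, the function names, and the plain arg-max selection are illustrative choices, not details taken from the paper.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_in_range(mu, sigma, lo, hi):
    """P(lo <= y <= hi) for y ~ Normal(mu, sigma^2) (assumed posterior)."""
    if sigma <= 0.0:
        return 1.0 if lo <= mu <= hi else 0.0
    return normal_cdf((hi - mu) / sigma) - normal_cdf((lo - mu) / sigma)

def pick_next(candidates, lo, hi):
    """Choose the candidate most likely to land inside the target window.

    candidates: dict mapping name -> (posterior mean, posterior std).
    """
    return max(candidates, key=lambda name: prob_in_range(*candidates[name], lo, hi))
```

For a window of [1.5, 2.5], a candidate with posterior mean 2.0 is preferred over one with a confidently higher mean of 5.0, which is the difference from an acquisition that simply seeks the largest value.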

2606.11573 2026-06-11 cs.CV New submission

Understanding Cross-Sensor Feature Variations for Generalizable 3D Perception

Xin Qiu, Wenjie Liu, Fuyuan Ai, YuChen Tan, Zhiwei Xu, Chunyi Song

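Affiliations: Zhejiang University

AI summary: To address the cross-dataset performance drop of radar-camera BEV perception, proposes a framework that models scene variation in the frequency domain, synthesizes diverse source-domain views, and regularizes the fused representation, improving 3D detector robustness without target-domain samples.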

English abstract

Radar-camera BEV perception often suffers from degraded performance when evaluated across datasets, as changes in driving scenes, sensor configurations, and environmental conditions can alter both the input observations and the internal fused representations. This work studies this issue from the perspective of source-domain variation modeling, aiming to improve the robustness of BEV-based 3D detectors without relying on target-domain samples. We introduce a framework that characterizes visual scene variations in the frequency domain and uses them to synthesize diverse source-domain views. By comparing the resulting fused BEV representations, the framework further captures how image-level variations influence multi-modal BEV features. These variation patterns are then used to regularize the detector, encouraging the learned fusion space to remain stable under latent scene changes. The proposed method is applied only during training and leaves the inference pipeline unchanged. Experiments on cross-dataset radar-camera 3D detection between View-of-Delft and TJ4DRadSet demonstrate consistent improvements over multiple BEV fusion backbones, and the gains remain effective when a small amount of target-domain data is available.

2606.11572 2026-06-11 cs.CV New submission

FreqKD: Frequency-Decoupled Cross-Modal Knowledge Distillation for Infrared Object Detection

Keval Thaker, Venkatraman Narayanan, Abdalmalek Aburaddaha, Samir A. Rawashdeh

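Affiliations: University of Michigan-Dearborn

AI summary: To address the modality gap between RGB and infrared images, proposes FreqKD, a frequency-decoupled distillation framework that applies a strict MSE loss to low-frequency components and a relaxed log-MSE loss to high-frequency components, improving on the DINOv2 baseline by 2.4 mAP50 on the KAIST dataset.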

English abstract

Transfer learning from large-scale RGB foundation models to infrared (IR) imagery through knowledge distillation (KD) remains challenging due to fundamental differences in image formation physics. We investigate the spectral structure of the RGB-IR modality gap and observe that feature divergence is not uniform across spatial frequencies: low-frequency components (shape, layout) show greater cross-modal alignment than high-frequency components (texture, fine edges), which reflect modality-specific characteristics. Based on this analysis, we propose FreqKD, a frequency-decoupled distillation framework that applies asymmetric supervision adapted to each band's cross-modal consistency. The method employs strict mean squared error (MSE) on the low-frequency band to preserve shared structural information and a relaxed log-MSE loss (weighted at 0.1) on the high-frequency band to provide edge guidance while tolerating texture differences. Spectral divergence analysis on 500 paired samples shows that high-frequency divergence exceeds low-frequency divergence by a factor of 2.4x on average across all analysed transformer layers. On KAIST multispectral pedestrian detection, FreqKD achieves 64.1 mAP50, improving 2.4 points over the DINOv2 baseline. The learned representation transfers across datasets (FLIR ADAS, +2.1 mAP50), tasks (MFNet segmentation, +1.85 mean intersection-over-union), and architectures (ResNet-50, +1.0 mAP50). Code is available at: this https URL
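
The asymmetric supervision described in this abstract (strict MSE on the low band, a log-MSE term weighted at 0.1 on the high band) can be sketched as a single loss. This is only a guess at the shape of such a loss: the abstract does not define log-MSE, so it is taken here to be log(1 + MSE), and the features are assumed to have been split into low- and high-frequency parts upstream. All names are hypothetical.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def freq_decoupled_loss(student_low, teacher_low, student_high, teacher_high,
                        high_weight=0.1):
    """Strict MSE on the low band plus a down-weighted log-MSE on the high band.

    Assumption: log-MSE is log(1 + MSE), which grows slowly for large errors
    and so tolerates texture differences more than plain MSE would.
    """
    low_term = mse(student_low, teacher_low)
    high_term = math.log1p(mse(student_high, teacher_high))
    return low_term + high_weight * high_term
```

Under these assumptions, the same unit-sized mismatch costs 1.0 in the low band but only about 0.07 in the high band.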

2606.11569 2026-06-11 cs.RO cs.AI New submission

ConsistencyPlanner: Real-time Planning with Fast-Sampling Consistency Models

Qichao Zhang, Xing Fang, Jiaqi Fang, Zhenwen Cai, Jie Ling, Qiankun Yu, Dongbin Zhao

Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Guangzhou Zaofu Intelligent Technology Co., Ltd.

AI summary: Proposes the Consistency Planner framework, which uses fast-sampling consistency models for efficient multimodal sampling and an attention-enhanced decoder to fuse heterogeneous features, markedly improving safety and real-time performance in the Waymax simulator.

English abstract

Closed-loop planning in complex, real-world driving scenarios presents a critical challenge for autonomous driving systems. While traditional rule-based methods are interpretable, their predefined heuristics lack the adaptability needed for dynamic traffic environments. Learning-based approaches have shown considerable promise, yet they struggle to balance modeling diverse, multimodal driving behaviors against real-time planning, often leading to indecisive or unsafe actions. To address this limitation, we propose Consistency Planner, a real-time planning framework with fast-sampling consistency models. Our approach is built upon two key technical contributions. Efficient Multimodal Sampling: We employ fast-sampling consistency models to generate a diverse set of plausible future trajectories. This enables efficient, real-time exploration of multimodal actions, overcoming the computational bottlenecks of previous iterative generative methods. Heterogeneous Feature Fusion: We introduce an attention-enhanced decoder that dynamically integrates heterogeneous input features (including scene features and action tokens) into a cohesive representation for robust planning. Extensive evaluation in the Waymax simulator demonstrates superior performance in safety metrics compared to existing methods, with particularly strong results in challenging dynamic scenarios.

2606.11568 2026-06-11 cs.CV New submission

4DP-QA: Scalable QA for 4D Perception in Vision Language Models

Seokju Cho, Abhishek Badki, Hang Su, Jindong Jiang, Ziyao Zeng, Seungryong Kim, Sifei Liu, Orazio Gallo

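Affiliations: NVIDIA; Yale University; KAIST AI

AI summary: To address the difficulty vision-language models have in understanding dynamic scenes, proposes a QA generation pipeline focused on motion-related scene understanding that disentangles object and camera motion through True-Motion Tracking, producing the large-scale dataset 4DP-QA and the benchmark 4DP-QA-Bench; training existing models on the dataset improves performance on an external benchmark.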

Comments
Project page: this https URL
English abstract

Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.

2606.11562 2026-06-11 cs.LG cs.CL 新提交

GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs

GraphInfer-Bench:评估LLM在图上的推理能力基准

Zhuoyi Peng, Jingzhou Jiang, Hanlin Gu, Lixin Fan, Yi Yang

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Webank(微众银行)

AI总结 提出GraphInfer-Bench基准,通过五个任务(描述与比较)测试LLM能否从节点及其邻域推断出无法从单节点或路径检索的答案,发现所有方法均存在差距。

详情
Comments
Code: this https URL; Dataset: this https URL
AI中文摘要

图分析支撑着许多应用,这些应用的答案无法从单个记录中查找或沿路径检索:洗钱团伙、药物重定位、用户偏好和科学主题都是从节点及其邻域推断出来的。我们引入GraphInfer-Bench,一个评估LLM是否能够执行这种图推理的基准:产生一个开放式的答案,该答案没有单个节点支持,也没有路径可检索。现有的图问答协议无法测试这种能力:算法模拟、节点分类、单节点描述、KG-QA和GraphRAG都允许从单个节点或沿路径检索答案。GraphInfer-Bench定义了五个任务,涵盖描述(区域是什么)和比较(区域如何不同),每个任务的设计使得真实答案不存在于任何单个节点中。发布版本包含42,000个样本,跨越六个真实世界图,自动生成并通过四层质量控制协议筛选。我们评估了四种方法族在相同任务上的表现:图-令牌对齐模型、零样本前沿闭源LLM、Graph2Text监督微调以及作为结构参考的普通GNN。没有方法族能够弥合差距。图-令牌对齐部分处理描述任务(关系、主题),但在比较任务上失败。前沿LLM在基于LLM的方法中在离群点检测和社区划分上领先,但在掩码节点预测上落后。Graph2Text SFT在描述方面是最强的基于LLM的方法,但在比较方面落后于前沿LLM。在每个任务上,普通GNN匹配或击败了最强的基于LLM的方法,在社区检测上差距最大。GraphInfer-Bench揭示了图推理是一个开放的能力差距,而不是任何单一架构的属性。

英文摘要

Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark for whether LLMs can perform this graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along Description (what a region is) and Comparison (how regions differ), each constructed so the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, produced automatically and screened by a four-layer quality-control protocol. We evaluate four method families against the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Frontier LLMs lead on outlier detection and community partition among LLM-based methods but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. Across every task, plain GNNs match or beat the strongest LLM-based row, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture.

2606.11559 2026-06-11 cs.AI 新提交

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

HERO: 基于环境观察的后见增强反思的智能体自蒸馏

Haoran Liu, Yuwei Zhang, Xiyao Li, Bohan Lyu, Jingbo Shang

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Independent Researcher(独立研究员) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出HERO框架,利用环境观察作为局部对齐反馈进行自蒸馏,解决多轮设置中特权反馈与当前决策上下文不对齐导致的性能下降问题,在TauBench和WebShop上提升任务成功率并减少冗余轮次。

详情
AI中文摘要

强化学习通常通过轨迹的终端结果来提升多轮智能体能力,这使得难以确定每个中间轮的信用分配。最近的在线自蒸馏方法通过自教师将特权反馈转化为密集的令牌级监督,提供了一种有前景的替代方案。我们的研究动机是观察到当朴素地将此范式扩展到多轮设置时出现意外的性能下降,我们将其归因于特权反馈(如成功轨迹或终端结果)与学生当前决策上下文之间缺乏对齐。我们引入了HERO,一种后见增强的自蒸馏框架,它使用下一个环境观察作为局部对齐反馈。每次轨迹展开后,HERO反思完成的交互,将每个观察转化为紧凑的轮级诊断,捕获关于原始动作的可操作反馈,如其必要性、有效性或失败原因。在TauBench和WebShop上,HERO比仅环境反馈的自蒸馏和GRPO提高了任务成功率并减少了不必要的轮次。在训练轮次预算有限(成功轨迹稀少且GRPO提供弱奖励对比信号)的情况下,它尤其有效。

英文摘要

Reinforcement learning typically improves multi-turn agent capabilities through the terminal outcome of trajectories, which makes it difficult to assign credit to each intermediate turn. Recent on-policy self-distillation methods offer a promising alternative by converting privileged feedback into dense token-level supervision through a self-teacher. Our study is motivated by the unexpected performance degradation observed when naively extending this paradigm to multi-turn settings, which we attribute to a lack of alignment between privileged feedback, such as successful trajectories or terminal outcomes, and the student's current decision context. We introduce HERO, a hindsight-enhanced self-distillation framework that uses next environment observations as locally aligned feedback. After each rollout, HERO reflects on the completed interaction to convert each observation into a compact turn-level diagnosis that captures actionable feedback about the original action, such as its necessity, validity, or failure cause. On TauBench and WebShop, HERO improves task success and reduces unnecessary turns over environment-feedback-only self-distillation and GRPO. It is especially effective under limited training turn budgets, where successful rollouts are rare and GRPO provides weak reward-contrast signals.

2606.11553 2026-06-11 cs.LG 新提交

APEX: A Network-Native Time-Series Foundation Model for Forecasting and Anomaly Detection for Wireless Edge Operations

APEX:面向无线边缘运维的预测与异常检测的网络原生时间序列基础模型

Swadhin Pradhan, Niloo Bahadori, Peiman Amini

发表机构 * Cisco Systems, USA(思科系统公司)

AI总结 提出网络原生解码器Transformer APEX,针对企业AP遥测数据预训练,在DHCP退化基准上MAE比最强基线降低18%,异常检测F1=0.93,边缘版本实现亚秒级隐私保护推理。

详情
Comments
5 pages, 1 figure, 4 tables. Discusses a network-native time-series foundation model for wireless edge operations
AI中文摘要

通用时间序列基础模型对无线网络遥测数据的迁移效果较差,因为这些信号具有突发性、零膨胀性且跨协议层耦合。我们提出APEX,一个网络原生的、仅解码器的Transformer,用于预测企业AP遥测数据,并以DHCP退化作为代表性网络任务进行评估。APEX在来自约4500个生产无线网络的10通道多变量遥测数据(约10万AP时间序列,每个AP 34个指标)上预训练,并提供APEX-Large(269M参数,云端)和APEX-Edge(10.5M参数,边缘)两个版本。在192步(4天)的DHCP退化基准上,APEX-Large比最强的基础模型基线(Toto)MAE降低18%,比SARIMA降低38%,异常检测F1=0.93,而APEX-Edge能够在AP级边缘硬件上实现亚秒级、保护隐私的推理。这些结果表明,网络原生预训练是主动无线运维的实用基础。

英文摘要

Generic time-series foundation models transfer poorly to wireless network telemetry whose signals are bursty, zero-inflated, and coupled across protocol layers. We present APEX, a network-native, decoder-only transformer for forecasting enterprise AP telemetry, and evaluate it on DHCP degradation as a representative network task. APEX is pre-trained on 10-channel multivariate telemetry from ~4,500 production wireless networks (~100K AP time series, 34 metrics per AP), and is available as APEX-Large (269M, cloud) and APEX-Edge (10.5M, edge). On a 192-step (4-day) DHCP degradation benchmark, APEX-Large reduces MAE by 18% over the strongest foundation-model baseline (Toto) and 38% over SARIMA, with anomaly-detection F1 = 0.93, while APEX-Edge enables sub-second, privacy-preserving inference on AP-class edge hardware. These results suggest network-native pre-training is a practical foundation for proactive wireless operations.

2606.11546 2026-06-11 cs.CV 新提交

VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

VL-DINO: 利用CLIP视觉-语言知识进行开放词汇目标检测

Hao Zhang, Qinran Lin, Linqi Song, Yong Li

发表机构 * Chongqing University(重庆大学) City University of Hong Kong(香港城市大学)

AI总结 提出VL-DINO,通过QPSC模块构建高质量正样本增强视觉-语言对齐,VSE模块蒸馏CLIP视觉知识,ORSA模块对齐区域特征与文本嵌入,在LVIS零样本检测上达到36.3/38.1 AP。

详情
AI中文摘要

像CLIP这样的视觉-语言模型可以为开放词汇目标检测提供丰富的语义先验。然而,将文本和视觉知识联合集成到检测架构中仍然具有挑战性。在本文中,我们提出了VL-DINO,一种通过更有效地利用CLIP的视觉-语言知识来增强DINO的开放词汇检测器。具体来说,首先开发了一个查询引导的正样本构建(QPSC)模块,以构建额外的高质量正样本,使原始DINO框架能够更好地适应跨异构数据源的混合训练,同时提供更多的视觉-语言对齐信号,从而在训练过程中融入更丰富的文本知识。然后引入了一个视觉语义编码器(VSE)模块,将CLIP视觉知识蒸馏到骨干网络提取的特征中,生成用于后续编码器精炼的融合特征。基于融合特征,一个目标-区域语义对齐(ORSA)模块提取以目标为中心的区域特征,并将其与相应的文本嵌入对齐,进一步融入文本线索。在零样本设置下,VL-DINO-T和VL-DINO-L在LVIS基准上分别达到了36.3和38.1 AP,持续优于先前的高级方法。大量实验证明了所提出设计的有效性和竞争性能。

英文摘要

Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge. Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training. A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.

2606.11543 2026-06-11 cs.AI cs.SE 新提交

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

SkillJuror:衡量智能体技能组织如何改变运行时行为

Zhiyu Chen, Zihan Guo, Bo Huang, Bingwei Lu, Jianghao Lin, Yuanjian Zhou, Weinan Zhang

发表机构 * Tongji University(同济大学) Shanghai Innovation Institute(上海创新研究院) Sun Yat-sen University(中山大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出SkillJuror框架,通过渐进式披露与扁平基线对比,发现技能组织方式改变智能体搜索和应用程序知识的行为,并在82个任务中提升4.1%的验证通过率。

详情
AI中文摘要

Agent技能在推理时为大语言模型(LLM)智能体提供程序性知识,但当前的基准测试很少区分技能的内容与其组织方式。我们通过渐进式披露(Progressive Disclosure)研究这种区别,其中简洁的根文件按需引导智能体访问支持资源,并将其与归一化的扁平基线进行比较。我们提出SkillJuror,一个通过语义控制变体、匹配的多试验评估和轨迹证据来评估技能编写范式的框架,同时保持任务知识固定。在82个任务的SkillsBench研究中,渐进式披露在总体结果之前改变了运行时行为:每个轨迹触及的不同技能资源从1.18增加到3.85,有效采纳事件从1.33增加到3.92。在410个匹配试验中,它还产生了17个额外的验证通过试验(比归一化扁平基线提高4.1%)。收益取决于任务。当支持资源指导实现、检查或修复时,渐进式披露有帮助,但当成功取决于精确的输出约定、数值阈值或长工件生成流水线时,效果较弱。这些结果表明,技能组织不仅仅是呈现方式:它可以改变智能体搜索和应用程序知识的方式,而结果收益取决于暴露的资源是否对任务可操作。代码见:this https URL。

英文摘要

Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting resources on demand, and compare it with a normalized flat baseline. We present SkillJuror, a framework for evaluating Skill writing paradigms through semantically controlled variants, matched multi-trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82-task SkillsBench study, Progressive Disclosure changes runtime behavior before aggregate outcomes: distinct Skill resources touched per trajectory rise from 1.18 to 3.85, and effective uptake events rise from 1.33 to 3.92. It also yields 17 additional verifier-passing trials out of 410 matched trials (+4.1%) over the normalized flat baseline. The benefit is task-dependent. Progressive Disclosure helps when supporting resources guide implementation, checking, or repair, but is weaker when success hinges on exact output conventions, numerical thresholds, or long artifact-generation pipelines. These results show that Skill organization is not mere presentation: it can change how agents search and apply procedural knowledge, while outcome gains depend on whether the exposed resources are actionable for the task. Code is available at this https URL.

2606.11542 2026-06-11 cs.CL cs.AI 新提交

Pretrained self-supervised speech models can recognize unseen consonants

预训练自监督语音模型能够识别未见过的辅音

Chihiro Taguchi, Éric Le Ferrand, Hirosi Nakagawa, Hitomi Ono, Kanji Kato, Emily Prud'hommeaux, David Chiang

发表机构 * University of Notre Dame(圣母大学) University at Buffalo(纽约州立大学布法罗分校) Tokyo University of Foreign Studies(东京外国语大学) Reitaku University(丽泽大学) Boston College(波士顿学院)

AI总结 研究预训练自监督语音模型(Wav2Vec2、HuBERT)对Khoisan语言中罕见吸气辅音的识别能力,发现模型对吸气辅音的识别准确率高于非吸气辅音,表明自监督学习能泛化到稀有音素。

详情
Comments
6 pages, 3 figures, 3 tables, accepted at Interspeech 2026
AI中文摘要

现代预训练自监督自动语音识别模型在大规模音频数据上训练,将语音编码为上下文表示。然而,它们的训练数据严重偏向高资源语言,低资源语言数据很少,这引发了对类型学上不常见的语音声音(如主要出现在Khoisan语言中的吸气辅音)可能代表性不足的担忧。这引出了我们的核心研究问题:这些模型能否像识别其他语音声音一样准确地识别吸气辅音?为了解决这个问题,我们在两种富含吸气辅音的Khoisan语言(G|ui和West !Xoon)的数据上微调并比较了预训练自监督语音模型(Wav2Vec2和HuBERT)。我们的结果显示,微调后的模型一致地更准确地识别吸气辅音而非非吸气辅音,表明自监督学习能够泛化到包括稀有音素在内的人类语音声音。

英文摘要

Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource languages with little data from low-resource languages, raising concerns about the potential underrepresentation of typologically uncommon speech sounds such as click consonants primarily found in Khoisan languages. This leads to our central research question: Can these models recognize click consonants as accurately as other speech sounds? To address this question, we fine-tune and compare pretrained self-supervised speech models (Wav2Vec2 and HuBERT) on data from two click-rich Khoisan languages (G|ui and West !Xoon). Our results reveal that the fine-tuned models consistently recognize clicks more accurately than non-clicks, suggesting that self-supervision enables generalization across human speech sounds including rare phonemes.

2606.11537 2026-06-11 cs.AI cs.CE 新提交

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

MoCA-Agent: 一种用于金融和数值推理的声明市场代码智能体

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

发表机构 * University of Innsbruck(因斯布鲁克大学) University of British Columbia(不列颠哥伦比亚大学) Toronto Metropolitan University(多伦多都会大学)

AI总结 提出MoCA-Agent,通过声明级验证和代码生成解决金融表格问答中的数值推理错误,在十个基准上取得强性能。

详情
AI中文摘要

金融和表格问答不仅需要流畅的推理:答案必须基于支持它们的确切事实、公式、单位、符号和尺度。单个误读的单元格或错误操作可能会悄无声息地产生看似合理但错误的结果。我们引入了MoCA-Agent,一种声明市场代码智能体,它用声明级验证取代了自由形式的多智能体辩论。该系统将每个问题分解为类型化的原子声明,要求专业交易智能体买入或卖出这些声明,将其订单清算为置信度加权的接受/拒绝决策,并从市场支持的证据中合成可执行的Python程序。然后,一个代码感知验证器检查程序的执行、结构一致性和常见的金融推理错误,最多进行一次市场感知修复轮次。在涵盖金融数值推理、通用表格推理、ESG问答和多模态图表推理的十个公开基准上,MoCA-Agent使用固定的Qwen3.6-27B骨干网络实现了强劲性能,包括在FinQA上达到78.3%,在FinanceMath上达到76.0%,在MultiHiertt上达到71.2%,在ESGenius上达到86.9%,以及在FinChart-Bench上平均达到85.6%。这些结果表明,在原子声明级别聚合证据,而不是整个答案,提高了高风险数值推理的鲁棒性。代码和数据可在以下网址获取:this https URL。

英文摘要

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. The code and data are available at this https URL.

2606.11535 2026-06-11 cs.RO 新提交

Adversarial Attacks on Learned Policies for Surgical Robotic Tasks

针对手术机器人任务学习策略的对抗攻击

Shutong Jin, Ziyang Chen, Preethi Satish, Paavan Gupta, Florian T. Pokorny, Ken Goldberg

发表机构 * University of California, Berkeley(加州大学伯克利分校) KTH Royal Institute of Technology(瑞典皇家理工学院)

AI总结 研究学习型策略在机器人辅助手术中易受对抗攻击的脆弱性,提出破坏性和引导性攻击方法,实验表明攻击可使手术子任务成功率平均降低61%。

详情
AI中文摘要

基于学习的策略正被考虑用于增强机器人辅助手术中人类外科医生的灵巧性。从视觉观察到机器人动作的端到端映射是否容易受到对抗性攻击,从而可能导致患者受伤?在本文中,我们首次研究了手术机器人中学习型策略面临的对抗性威胁。我们研究了两种威胁模式:(a) 破坏性攻击,其中难以察觉的视觉扰动中断策略执行,以及 (b) 引导性攻击,其中此类扰动将策略动作引导至攻击者指定的方向。我们提出了三种对抗性攻击方法,每种方法对策略信息的访问权限逐渐增加,并评估了它们对两个手术子任务(清创和缝合)的影响。我们的评估涵盖了三种端到端策略架构:ACT、扩散策略和Pi0。此外,我们引入了一类新的光度对抗攻击,它模仿自然视觉变化(如光照变化)来生成有效且视觉上合理的扰动。使用清创和缝合模型进行的560次物理实验结果表明,最先进的策略可能受到显著干扰,导致手术子任务成功率平均降低61%。项目页面:此 https URL

英文摘要

Learning-based policies are being considered to augment the dexterity of human surgeons in robot-assisted surgery. Can the end-to-end mapping from visual observations to robot actions be vulnerable to adversarial attacks, potentially leading to patient injury? In this paper, we present the first study of adversarial threats to learning-based policies in surgical robotics. We investigate two threat modes: (a) disruptive attacks, where imperceptible visual perturbations interrupt policy execution, and (b) steering attacks, where such perturbations steer policy actions toward attacker-specified directions. We formulate three adversarial attack methods, each with increasing access to policy information, and evaluate their impact on two surgical subtasks: debridement and suturing. Our evaluation covers three end-to-end policy architectures: ACT, Diffusion Policy, and Pi0. In addition, we introduce a new class of photometric adversarial attacks that mimic natural visual changes, such as lighting variations, to generate effective yet visually plausible perturbations. Results from 560 physical experiments using phantoms for debridement and suturing suggest that state-of-the-art policies can be significantly disrupted, resulting in an average 61% reduction in surgical subtask success rates. Project page: this https URL

2606.11531 2026-06-11 cs.CL cs.IT 新提交

Measuring language complexity from hierarchical reuse of recurring patterns

从重复模式的层次复用测量语言复杂度

Junyi Zhou, Rui Liu, Pengyu Liu, Yu Liu

发表机构 * Department of Systems Science, Faculty of Arts and Sciences, Beijing Normal University(北京师范大学文理学院系统科学系) International Academic Center of Complex Systems, Beijing Normal University(北京师范大学国际复杂系统学术中心) Department of Chinese Language and Literature, Faculty of Arts and Sciences, Beijing Normal University(北京师范大学文理学院中国语言文学系) Center for Linguistic Sciences, Beijing Normal University(北京师范大学语言学科学中心) School of Systems Science, Beijing Normal University(北京师范大学系统科学学院) Department of Mathematics and Applied Mathematical Sciences, University of Rhode Island(罗德岛大学数学与应用数学科学系) Department of Cell and Molecular Biology, University of Rhode Island(罗德岛大学细胞与分子生物学系)

AI总结 提出基于算法信息论的梯径指数,通过层次复用重复子结构测量语言复杂度,在21个平行语料库中验证了等复杂度假说和权衡假说。

详情
Comments
17 pages, 4 figures
AI中文摘要

我们引入梯径指数作为基于算法信息论的语言复杂度度量。它通过层次复用重复子结构来重建序列所需的最小步骤数,捕捉了一种可精确计算但受约束的算法可压缩性形式,与Kolmogorov复杂度相关但不同。我们将梯径方法应用于Parallel Universal Dependencies数据集中的21个平行语料库。梯径指数在不同语言间近似不变,且变化远小于语料库长度。当所有语料库映射到统一的二进制表示时,这一现象更为明显,从表示无关的角度为等复杂度假说提供了证据。我们还观察到字符库存大小与语料库长度之间的权衡,以及词汇级和语料库级重建复杂度之间的权衡,支持了总复杂度守恒并在语言层次间重新分布的权衡假说。梯径方法识别出的可重用子结构(无需任何语言输入)与自然词汇中存在的单词和形态成分重叠。梯径方法捕获的层次复用与认知科学中提出的组块机制相似,即人类认知系统在共享记忆和处理约束下将语言输入压缩为嵌套的、可重用的单元。认知组块与梯径方法之间的这种联系为等复杂度假说和权衡假说提供了新的解释,将两者都根植于支撑所有人类语言处理的共享认知架构中。

英文摘要

We introduce the ladderpath index as a measure of language complexity grounded in algorithmic information theory. It counts the minimum steps needed to reconstruct a sequence through hierarchical reuse of repeated substructures, capturing an exactly computable but constrained form of algorithmic compressibility related to, but distinct from, Kolmogorov complexity. We apply the ladderpath approach to 21 parallel corpora from the Parallel Universal Dependencies dataset. The ladderpath index is approximately invariant across the languages, and varies much less than the corpus length. This is more pronounced when all corpora are mapped to a unified binary representation, providing evidence for the equi-complexity hypothesis from a representation-independent perspective. We also observe trade-offs between character inventory size and corpus length, and between vocabulary-level and corpus-level reconstruction complexity, supporting the trade-off hypothesis that total complexity is conserved and redistributed across linguistic levels. The reusable substructures identified by the ladderpath approach, without any linguistic input, overlap with words and morphological components attested in the natural vocabulary. The hierarchical reuse captured by the ladderpath approach parallels the chunking mechanisms proposed in cognitive science, where the human cognitive system compresses linguistic input into nested, reusable units under shared memory and processing constraints. This connection between cognitive chunking and the ladderpath approach provides a new interpretation for the equi-complexity and trade-off hypotheses, grounding both in the shared cognitive architecture that underlies language processing across human languages.

2606.11525 2026-06-11 cs.RO cs.LG 新提交

Learning Object Manipulation from Scratch via Contrastive Interaction

通过对比交互从零开始学习物体操作

Tongle Shen, Caleb Chuck, Fan Feng, Biwei Huang

发表机构 * UC San Diego(加州大学圣地亚哥分校) UT Austin(德克萨斯大学奥斯汀分校)

AI总结 针对对比强化学习在交互密集操作任务中表现不佳的问题,提出交互加权重采样方法,通过保留模式边界提升多模态分段非线性可达性表示,在仿真和真实机器人空气曲棍球任务中取得显著改进。

详情
AI中文摘要

对比强化学习(CRL)通过学习动力学的结构化表示,在多种目标条件机器人任务中取得了近期成功。然而,尽管在运动控制和简单控制领域表现优异,CRL在交互密集的操作任务中常常遇到困难。我们认为这一困难的关键来源是物体中心交互,如接触或抓取,这些交互会引起潜在动态模式的显著变化。在这项工作中,我们将操作动力学建模为分段平滑马尔可夫过程,并证明交互引起的模式变化产生了分段非线性可达性结构,这使得标准CRL能量函数难以表示和规划。基于这一分析,我们引入了交互加权重采样(IWR)。IWR在交互前、中、后阶段进行交互感知重采样,鼓励学习到的表示保留决定未来可达性的模式边界,以捕获多模态和分段非线性可达性。在包括2D动态控制、机器人操作和机器人空气曲棍球在内的交互中心环境中,IWR相比先前的CRL方法提高了样本效率和整体性能,在仿真中平均提升19.8%。最后,通过使用IWR训练的策略进行仿真到现实的迁移,我们展示了首个能够击打目标的真实世界目标条件机器人空气曲棍球智能体,成功率从25%提升到60%。项目页面:此 http URL。

英文摘要

Contrastive Reinforcement Learning (CRL) has seen recent success in a wide variety of goal-conditioned robotics tasks by learning structured representations of the dynamics. However, despite its success in locomotion and simpler control domains, CRL often struggles in interaction-rich manipulation. We argue that a key source of this difficulty is object-centric interaction, such as contact or grasping, that induces distinct changes in the underlying dynamic modes. In this work, we formulate manipulation dynamics as a piecewise-smooth Markov process and show that interaction-induced mode changes create piecewise nonlinear reachability structures that are difficult for standard CRL energy functions to represent and plan over. Based on this analysis, we introduce Interaction-weighted Resampling (IWR). IWR performs interaction-aware resampling around phases before, during, and after interactions, encouraging the learned representation to preserve the mode boundaries that determine future reachability to capture multi-modal and piecewise nonlinear reachability. Across interaction-centric environments, including 2D dynamic control, robotic manipulation, and robot air hockey, IWR improves both sample efficiency and overall performance over prior CRL methods, with 19.8% average improvement in simulation. Finally, using a sim-to-real pipeline with policies trained by IWR, we demonstrate the first real-world goal-conditioned robot air hockey agent capable of hitting goals, improving success from 25% to 60%. Project Page: this http URL.

2606.11522 2026-06-11 cs.AI cs.LG 新提交

Search Discipline for Long-Horizon Research Agents

长周期研究智能体的搜索纪律

Adithya Srinivasan, Devesh Paragiri

发表机构 * North Carolina State University(北卡罗来纳州立大学) University of Maryland(马里兰大学)

AI总结 针对研究智能体使用聚合指标评估候选方案导致科学有效性反转的问题,提出一种外部审计协议,基于分解行为而非单一分数进行决策。

详情
Comments
9 pages, 1 figure
AI中文摘要

自主研究智能体现在根据指标提出、评估和选择科学候选方案,该指标通常是在区域、切片或队列的异质空间上聚合的简化值。我们表明,当科学有效性存在于这种分解结构中时,聚合值可能错误地将候选方案排在首位。总体数字改善,但底层结构反转,因此基于该数字的决策会接受一个悄然破坏模型的候选方案。这种失败并非领域特定,只要候选方案的有效性是多维的,而其验证器是单一简化值,就会出现。我们在生态系统人口模型中的火灾模型任务上展示了这种反转。得分最高的候选方案和略低的候选方案在全球得分上处于噪声范围内,但得分最高的候选方案破坏了受保护的北方区域,而另一个则保护了它们。区分它们的是每个区域的行为,而不是总体数字。这个决策不应留给产生候选方案的智能体。优化分数的智能体是最不可能发现分数错误的一方,一旦智能体停止,提示就没有剩余轮次。我们将决策移到一个外部控制循环,该循环根据每个候选方案的分解行为进行审计,并在智能体决策后采取行动。它可以降级智能体本会接受的候选方案,也可以重新打开智能体声明已完成的运行。我们的贡献在于反转发现本身,以及一种搜索纪律协议,该协议基于可审查的候选效果证据而非分数进行决策。

英文摘要

Autoresearch agents now propose, evaluate, and select scientific candidates against a metric, and that metric is usually an aggregate reduced over a heterogeneous space of regions, slices, or cohorts. We show that when scientific validity lives in that disaggregated structure, the aggregate can rank the wrong candidate first. The headline number improves while the structure underneath inverts, so a decision made on the number accepts a candidate that quietly breaks the model. The failure is not domain-specific. It appears wherever a candidate's validity is multi-dimensional but its verifier is a single reduction. We demonstrate the inversion on a fire-model task in the Ecosystem Demography model. The highest-scoring candidate and a slightly lower one are within noise of each other on global score, yet the top-scoring one collapses the protected boreal regions while the other preserves them. What separates them is the per-region behavior, not the headline number. This decision should not be left to the agent that produced the candidates. The agent optimizing the score is the last party likely to catch the score being wrong, and a prompt has no remaining turn once the agent has stopped. We move the decision to an external control loop that audits each candidate on its disaggregated behavior and acts after the agent has decided. It can demote a candidate the agent would have accepted, and it can reopen a run the agent had declared finished. Our contribution is the inversion finding itself, and a search-discipline protocol that decides on reviewable candidate-effect evidence instead of the score.
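The inversion described in this abstract can be illustrated with a toy example. This is a minimal sketch, not the authors' protocol: the region names, scores, the `PROTECTED` set, and the `FLOOR` threshold are all invented for illustration. It shows a candidate that wins narrowly on the aggregate while collapsing a protected region, and an external audit step that selects the other candidate.

```python
from statistics import mean

# Hypothetical per-region scores (higher is better); values are invented.
candidates = {
    "A": {"tropical": 0.95, "temperate": 0.93, "boreal": 0.30},
    "B": {"tropical": 0.78, "temperate": 0.76, "boreal": 0.62},
}
PROTECTED = {"boreal"}  # assumed: regions whose behaviour must not collapse
FLOOR = 0.5             # assumed: minimum acceptable per-region score


def aggregate(scores):
    """Headline number: the mean over all regions."""
    return mean(scores.values())


def audit(scores):
    """Return the protected regions that fall below the floor."""
    return sorted(r for r in PROTECTED if scores[r] < FLOOR)


def select(cands):
    """Pick the best aggregate among candidates that pass the audit."""
    passing = {name: s for name, s in cands.items() if not audit(s)}
    pool = passing or cands  # fall back to all candidates if none pass
    return max(pool, key=lambda name: aggregate(pool[name]))


best_by_score = max(candidates, key=lambda n: aggregate(candidates[n]))
best_by_audit = select(candidates)
print(best_by_score, best_by_audit)
```

Here the aggregate alone prefers candidate A, while the audit rejects it because its boreal score is below the floor; the selection step runs outside whatever process produced the candidates.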

2606.11521 2026-06-11 cs.LG 新提交

Counterexample Guided Learning in the Large using Reasoning Agents

使用推理代理的大规模反例引导学习

Hongyi Liu, Frederic Sala, Thomas Reps, Adithya Murali

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 提出反例引导的LLM正则表达式归纳框架,通过验证器反馈和代理策略(如反思与修复循环)显著提升样本效率和复杂任务成功率。

详情
Comments
Code, data, and resources are publicly available for research purposes: this https URL
AI中文摘要

LLM和LLM代理在获得反馈时应能改进,但识别其何时能做到这一点很困难:反馈是异质的、领域特定的且难以控制。我们通过要求LLM执行正则表达式归纳来应对这一挑战,这是一个经典的符号学习问题,其中存在以反例形式存在的精确反馈机制。在反例引导学习中,学习者(LLM)从正/负标记字符串中提出候选正则表达式,教师(验证器)返回反例,展示候选语言与目标语言之间的差异。我们识别出新的反例引导细化策略,如正则化和符号反例聚类,这些策略能够实现有效的正则表达式学习。我们还探索了代理策略,如反思和修复循环。实验发现,验证器反馈显著提高了具有挑战性的正则表达式归纳任务的样本效率,减少了所需标记示例的数量,并使得能够学习标准提示失败时的复杂目标表达式。例如,在最困难的任务组上,我们的反例引导框架在两个不同的正则表达式领域将成功率从3.2%提高到38.1%,从38.9%提高到74.1%。这些结果表明,LLM可以从丰富的反馈中受益,而不仅仅将其视为额外数据,为基于LLM的程序合成和形式推理的鲁棒验证器引导方法打开了大门。

英文摘要

LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain-specific, and difficult to control. We approach this challenge by asking LLMs to perform regular-expression induction, a classical symbolic learning problem where precise mechanisms for feedback exist in the form of counterexamples. In counterexample-guided learning, a learner (LLM) proposes candidate regular expressions from positive/negative-labeled strings, and the teacher (verifier) returns counterexamples showcasing the difference between the candidate and target languages. We identify novel counterexample-guided refinement strategies that enable effective regex learning, such as regularization and symbolic counterexample clusters. We also explore agentic strategies such as reflection and repair loops. Empirically, we find that verifier feedback substantially improves sample efficiency on challenging regex-induction tasks, reducing the number of labeled examples required and enabling learning of complex target expressions where standard prompting fails. For example, on the hardest task groups, our counterexample-guided framework improves success from 3.2% to 38.1% and from 38.9% to 74.1% on two different regex domains. These results suggest that LLMs can benefit from rich feedback beyond treating it as additional data, opening the door for robust verifier-guided methods for LLM-based program synthesis and formal reasoning.
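The learner/teacher loop described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's framework: the LLM learner is replaced by a stand-in (`propose`) that picks the first regex from a fixed hypothetical pool consistent with the labelled examples, and the verifier checks agreement only on strings up to a bounded length over a two-letter alphabet rather than testing exact language equivalence.

```python
import re
from itertools import product

ALPHABET = "ab"
MAX_LEN = 4  # assumption: bounded check, not an exact equivalence test


def find_counterexample(candidate, target):
    """Teacher: return (string, label) where candidate and target disagree."""
    cand, targ = re.compile(candidate), re.compile(target)
    for n in range(MAX_LEN + 1):
        for chars in product(ALPHABET, repeat=n):
            s = "".join(chars)
            in_target = targ.fullmatch(s) is not None
            if (cand.fullmatch(s) is not None) != in_target:
                return s, in_target
    return None


def propose(examples, pool):
    """Stand-in learner: first pooled regex consistent with all examples."""
    for rx in pool:
        if all((re.fullmatch(rx, s) is not None) == label
               for s, label in examples):
            return rx
    return None


def learn(target, pool, examples, max_rounds=10):
    """Alternate proposals and counterexamples until the verifier is satisfied."""
    examples = list(examples)
    for _ in range(max_rounds):
        candidate = propose(examples, pool)
        if candidate is None:
            return None, examples
        cex = find_counterexample(candidate, target)
        if cex is None:
            return candidate, examples
        examples.append(cex)  # the counterexample becomes a labelled example
    return None, examples


POOL = ["a*", "(a|b)*", "a(a|b)*", "(ab)*"]  # hypothetical candidate pool
result, seen = learn("(ab)*", POOL, [("", True), ("ab", True)])
print(result, seen)
```

In this run the two positive examples admit the over-general `(a|b)*`; the verifier returns the negative counterexample `"a"`, which rules it out and leaves `(ab)*` as the next consistent candidate.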

2606.11518 2026-06-11 cs.LG cs.AI 新提交

SirenFNO: Efficient and Full Frequency Learning of Fourier Neural Operators

SirenFNO:高效且全频率学习的傅里叶神经算子

Pengqing Shi, Jie Yin, Stephen Tierney, Junbin Gao

发表机构 * The University of Sydney(悉尼大学)

AI总结 提出SirenFNO框架,利用正弦表示网络学习隐式神经表示并进行模态核参数化,消除频率截断,实现全频谱学习,在多个PDE基准上以最多73倍参数减少取得性能提升。

详情
Comments
9 pages, accepted by IJCAI 2026
AI中文摘要

傅里叶神经算子(FNO)是近似求解偏微分方程的有效且高效的替代方法,并能跨离散化泛化。然而,由于依赖频率截断以保持FNO的学习效率,实证研究表明FNO对低频信息存在频谱偏差,这可能阻碍学习能力,尤其是对于某些具有强烈高频振荡的偏微分方程。为了解决这一局限性,我们提出了SirenFNO,一种利用正弦表示网络(SIREN)学习隐式神经表示并进行模态核参数化的新颖框架。我们的SIREN参数化以常数且与离散化无关的参数数量学习全网格频谱,从而消除了频率截断的需要。我们进一步通过函数张量分解扩展SirenFNO,以提高参数和学习效率。实证结果表明,我们的SirenFNO在保持离散化不变性的情况下,以约4到15倍的参数减少持续优于FNO,并且我们的函数分解变体在多个PDE基准上以最多73倍的参数减少获得了性能提升。

英文摘要

Fourier neural operators (FNOs) are effective and efficient surrogates for approximating solutions of PDEs and generalize across discretizations. However, owing to the reliance on frequency truncation to maintain learning efficiency of FNOs, empirical studies suggest that FNOs exhibit spectral bias toward low-frequency information, which may hinder the learning capability especially for certain PDEs with strong high-frequency oscillations. To address this limitation, we propose SirenFNO, a novel framework that leverages sinusoidal representation networks (SIRENs) to learn implicit neural representations and performs mode-wise kernel parameterization. Our SIREN parameterization learns a full-grid spectrum with a constant and discretization-independent parameter count, thereby eliminating the need for frequency truncation. We further extend SirenFNO with functional tensor decompositions to enhance parameter and learning efficiency. Empirical results show that our SirenFNO consistently outperforms FNO with approximately 4 to 15 times parameter reductions with preserved discretization invariance, and our functional decomposition variants obtain performance improvements with a maximum of 73 times fewer parameters across multiple PDE benchmarks.

2606.11514 2026-06-11 cs.SD 新提交

CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech

CS-YODAS:一个挖掘自真实环境的代码转换语音数据集

Brian Yan, Qingzheng Wang, Matthew Wiesner, Anuj Diwan, Olga Iakovenko, Alexander Polok, Injy Hamed, Shuichiro Shimizu, Iris Emerman, Thomas Hain, David R. Mortensen, Peter Viechnicki, Shinji Watanabe

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Johns Hopkins University(约翰霍普金斯大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Sheffield(谢菲尔德大学) Brno University of Technology(布尔诺理工大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Kyoto University(京都大学)

AI总结 本文提出CS-YODAS数据集,通过可扩展的人机协同流程从多语言YouTube数据中挖掘真实代码转换语音,涵盖7种基质语言共313小时,并分析其分布特征与语言对切换模式。

详情
AI中文摘要

我们提出CS-YODAS,一个基于Creative Commons许可的数据集,包含从多语言YouTube数据中挖掘的真实环境代码转换语音。代码转换(CS),即在话语或对话中交替使用不同语言,在多语言环境中很常见,但在现有的CS语音资源中代表性不足,这些资源通常规模小、领域特定或人为构建。基于YODAS语料库,我们开发了一个可扩展的人机协同流程,用于识别和验证自然发生的代码转换。最终数据集总计313小时,涵盖7种基质语言,提供了多样化的真实世界自发性代码转换语音示例。我们进一步分析了真实环境中代码转换的分布和特征,考察了语言对频率和切换模式,并报告了口语语言识别的基线结果。我们希望CS-YODAS能够促进对代码转换语音更广泛和全面的研究。数据集链接:此https URL。

英文摘要

We present CS-YODAS, a Creative Commons-licensed dataset of in-the-wild code-switched speech mined from multilingual YouTube data. Code-switching (CS), or the alternation between languages within an utterance or conversation, is common in multilingual settings but remains underrepresented in existing CS speech resources, which are typically small, domain-specific, or artificially constructed. Building on the YODAS corpus, we develop a scalable, human-in-the-loop pipeline for identifying and validating naturally occurring code-switching. The resulting dataset, which totals 313 hours and spans 7 matrix languages, provides diverse, real-world examples of spontaneous code-switched speech. We further analyze the distribution and characteristics of code-switching in the wild, examining language-pair frequencies and switching patterns, and report baseline results for spoken language identification. We hope that CS-YODAS will encourage broader and more comprehensive research on code-switched speech. Dataset link: this https URL.

2606.11512 2026-06-11 cs.CL New submission

SAGE: Answer-Conditioned Uncertainty Targets for Verbal Uncertainty Alignment

Kaiwen Shi, Zheyuan Zhang, Yanfang Ye

Affiliations: University of Notre Dame

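AI summary: Proposes SAGE, a group-level uncertainty target built from the model's sampled responses through an answer-conditioned uncertainty geometry, and applies it with the GUPO training framework to optimize verbal uncertainty expressions. Across several reasoning tasks it improves uncertainty ranking and reduces calibration error and overconfidence.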

Details
English abstract

Large language models increasingly express uncertainty through natural-language statements, yet these expressions often fail to reflect the model's sampled behavior. We study verbal uncertainty alignment as a distributional calibration problem: the appropriate uncertainty target for a prompt should be estimated from repeated model outputs rather than from an isolated response. However, group rollouts alone are insufficient, since the resulting target must provide a useful training signal. Existing targets only partially satisfy this requirement. We propose SAGE, Semantic-Answer Guided Entropy, a group-level uncertainty target that constructs an answer-conditioned uncertainty geometry over sampled responses. SAGE preserves categorical, numeric, and symbolic answer distinctions while maintaining a smooth and scale-preserving calibration signal. We further apply this target through Group-Uncertainty Preference Optimization, or GUPO, an uncertainty-channel training framework that supervises verbal uncertainty expressions rather than the full response. Experiments across factual, mathematical, and multiple-choice reasoning tasks show improved uncertainty ranking, lower calibration error, and reduced overconfidence.
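
As a rough illustration of estimating an uncertainty target from repeated model outputs rather than from one isolated response, the sketch below computes a normalized Shannon entropy over exact-match answer groups. It is not SAGE itself: the abstract describes an answer-conditioned geometry that preserves categorical, numeric, and symbolic distinctions, whereas this sketch only uses plain string matching. The names `normalize_answer` and `group_entropy_target`, and the log(n) normalization, are assumptions made for illustration.

```python
import math
from collections import Counter


def normalize_answer(answer: str) -> str:
    """Hypothetical canonicalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def group_entropy_target(sampled_answers: list[str]) -> float:
    """Normalized entropy in [0, 1] of the answers sampled for one prompt.

    0.0 means every sample agrees; 1.0 means every sample is different.
    Assumption: answers are grouped by exact match after normalization.
    """
    n = len(sampled_answers)
    if n < 2:
        return 0.0  # a single sample carries no disagreement signal
    counts = Counter(normalize_answer(a) for a in sampled_answers)
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return entropy / math.log(n)  # log(n) is the entropy of n distinct answers


# Three toy groups of sampled answers (made-up data, not from the paper).
consistent = group_entropy_target(["Paris", "paris", " Paris ", "Paris"])
mixed = group_entropy_target(["12", "12", "15", "12"])
scattered = group_entropy_target(["a", "b", "c", "d"])
```

Exact-match grouping treats "12" and "12.0" as unrelated answers, which is the kind of distinction an answer-aware target would presumably need to handle differently.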

2606.11508 2026-06-11 cs.LG q-bio.QM New submission

Probabilistic Contrastive Pretraining for Multi-task ADME Property Prediction

Yifan Xue, Srimukh Prasad Veccham, Saee Paliwal, Tyler Shimko, Micha Livne

Affiliations: NVIDIA

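AI summary: Proposes a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information, optimizing reconstruction, contrastive, and chemistry tasks under a single probabilistic latent-variable objective. Multi-task fine-tuning uses task-specific MLP heads, and the reported improvements over the KERMT baseline on three datasets are 7.6%, 9.9%, and 9.5% (averaged over significantly-improved endpoints).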

Details
English abstract

Accurate prediction of absorption, distribution, metabolism, and excretion (ADME) properties is critical to drug discovery, but remains challenging because ADME endpoints are noisy, interdependent, and often data-limited. We propose a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information machine learning (cMIM). Our method encodes molecular graphs into latent variables, reconstructs SMILES strings from the graph-derived latent codes, and augments the contrastive objective with domain-specific self-supervised chemistry tasks. Rather than treating these tasks as auxiliary regularizers with separately tuned loss weights, we formulate reconstruction, contrastive discrimination, and chemistry-specific supervision as unit-weighted log-probability factors in a single probabilistic latent-variable objective. For fine-tuning, we propose a multi-task GNN readout architecture with task-specific multilayer perceptron heads, preserving shared representation learning while mitigating negative transfer and improving the modeling of heterogeneous, nonlinear task relationships. Across Biogen, ExpansionRX, and ChEMBL-MT, the resulting Contrastive KERMT pretraining improves over the KERMT baseline by 7.6%, 9.9%, and 9.5% respectively (averaged over significantly-improved endpoints). Adding ADME-adjacent molecules to the pretraining corpus further improves transfer, and the contrastive component sharpens chemically meaningful latent neighborhoods.

2606.11507 2026-06-11 cs.CV New submission

SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining

Abdalmalek Aburaddaha, Venkatraman Narayanan, Keval Thaker, Samir A. Rawashdeh

Affiliations: University of Michigan-Dearborn

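AI summary: Proposes SceneMiner, a unified camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass. It identifies a cross-task interference problem and resolves it with identity-preserving multi-task fine-tuning, which zero-initializes new sub-modules and freezes the parameters that feed the shared stream.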

Details
English abstract

Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution). Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: this https URL
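
The zero-initialization idea can be illustrated with a toy residual sub-module. The sketch below uses plain Python lists and a hypothetical `residual_adapter` function (not the SceneMiner code): when the adapter's output projection is all zeros, the shared stream and a frozen sibling head's output are unchanged, while a non-zero projection shifts them. It covers only the state at initialization; the abstract additionally freezes every parameter that feeds the shared stream so that the heads stay preserved during training.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


def residual_adapter(h, w_in, w_out):
    """Hypothetical new sub-module: h + W_out @ relu(W_in @ h)."""
    update = matvec(w_out, relu(matvec(w_in, h)))
    return [a + b for a, b in zip(h, update)]


# Toy shared activation and weights (illustrative values only).
h = [0.5, -1.0, 2.0, 0.25]
w_in = [
    [-0.5, 0.0, 0.5, -0.5],
    [0.0, 0.5, -0.5, 0.0],
    [0.5, -0.5, 0.0, 0.5],
]
w_out_zero = [[0.0] * 3 for _ in range(4)]     # zero-initialized projection
w_out_nonzero = [[0.1] * 3 for _ in range(4)]  # ordinary non-zero projection

# A weight-frozen sibling head that reads the shared stream.
frozen_head = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]

baseline = matvec(frozen_head, h)
with_zero_init = matvec(frozen_head, residual_adapter(h, w_in, w_out_zero))
with_nonzero_init = matvec(frozen_head, residual_adapter(h, w_in, w_out_nonzero))

print(with_zero_init == baseline)     # zero init leaves the frozen head's output as-is
print(with_nonzero_init == baseline)  # non-zero init shifts the shared stream
```

The frozen head's own weights never change in either case, which mirrors the abstract's point that freezing parameters alone does not protect a head from changes in the activations it consumes.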