arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.18746 2026-06-18 cs.AI New submission

What Must Generalist Agents Remember?

Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

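Affiliations: Carnegie Mellon University; Georgia Institute of Technology

AI summary: This paper formally argues that, to act near-optimally across multiple environments and goals, a generalist agent must store domain-relevant information that distinguishes incompatible optimal actions at observational bottlenecks, and proves that memory can be used to reconstruct local transition dynamics.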

Abstract

This paper develops a formal account of what generalist agents must store in memory in order to act near-optimally across multiple environments and goals. It shows that when two domains share an observational bottleneck but require incompatible optimal actions, any uniformly near-optimal policy must induce distinct memory distributions at that bottleneck. The result yields a separation theorem: sufficiently successful agents cannot rely only on current state observations, but must preserve domain-relevant information in memory. The paper further shows that if an agent's memory contains enough information to estimate values for related goals, then that memory can be used to approximately reconstruct the agent's local transition dynamics. Together, these results characterize memory as the substrate that supports domain disambiguation, transition-model reconstruction, and planning for generalist agents.
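The separation claim can be illustrated with a toy example. The sketch below is not the paper's formal construction; it assumes a hypothetical two-domain bottleneck in which both domains emit the same observation, domain "A" rewards "left", and domain "B" rewards "right". All names are illustrative.

```python
# Toy bottleneck: both domains emit the same observation, but domain "A"
# rewards action "left" and domain "B" rewards action "right" (reward 1, else 0).

def worst_case_return(p_left: float) -> float:
    """Worst-case expected reward over the two domains for a memoryless
    policy that picks "left" with probability p_left at the shared observation."""
    return min(p_left, 1.0 - p_left)

# Sweep memoryless policies on a grid: none exceeds 0.5 in the worst case.
best_memoryless = max(worst_case_return(i / 100) for i in range(101))

def memory_policy(memory: str) -> str:
    """A policy that conditions on a remembered domain cue (hypothetical)."""
    return "left" if memory == "A" else "right"

def memory_return(domain: str) -> float:
    """Reward of the memory-based policy when memory holds the true domain."""
    optimal = {"A": "left", "B": "right"}
    return 1.0 if memory_policy(domain) == optimal[domain] else 0.0
```

In this toy setting the best memoryless policy is capped at a worst-case reward of 0.5, while the policy that carries the domain cue in memory is optimal in both domains.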

2606.18738 2026-06-18 cs.SD New submission

GRIDEX: Grid-Grounded Forensic Explanations for Deepfake Spectrogram Analysis

Thi Ngan Ha Do, Tingmin Wu, Alsharif Abuadbba, Kristen Moore

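Affiliations: CSIRO

AI summary: Proposes GRIDEX, a framework that uses two-stage learning (SFT + GRPO) to localize anomalous spectrogram regions and generate structured forensic explanations, improving the interpretability of deepfake detection.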

Abstract

The advancement of speech generation technologies has made artificial speech increasingly realistic. Although modern classification models can achieve high accuracy in deepfake detection, they do not produce evidence, such as where spoof cues appear in the spectrogram and what they imply acoustically, which limits their usefulness in forensic settings. Manual analysis of full spectrograms is resource-intensive, so evidence should narrow attention to the most diagnostic regions. Moreover, existing explainability methods have limited ability to connect contextual attributes to localized evidence, making explanations harder to verify. To overcome this limitation, we propose GRIDEX, a pipeline that, given a deepfake spectrogram, generates forensic explanations of its anomalies. The pipeline (i) selects the top-K anomalous regions in the spectrogram and (ii) produces an explanation for each anomaly. The explanations follow a schema of categorical acoustic fields, including temporal, spectral, and phonetic information and interpretation text. To our knowledge, this is the first framework to generate structured forensic explanations using regional grounding for deepfake spectrograms. GRIDEX is trained with a two-stage learning paradigm that combines supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Experiments on our dataset show improved artifact localization and explanation quality over strong vision-language model (VLM) baselines. The dataset and code will be released upon publication.
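Step (i) of the pipeline, selecting the top-K anomalous regions, can be sketched as follows. This is a minimal illustration and not GRIDEX itself: it assumes the spectrogram has already been divided into a grid and that a hypothetical scorer has assigned each cell an anomaly score, whereas the paper learns region selection through SFT and GRPO.

```python
import heapq

def top_k_regions(scores, k):
    """Return the k grid cells with the highest anomaly score.

    scores is a 2D list where scores[row][col] is the (assumed) anomaly
    score of one spectrogram grid cell. The result is a list of
    (row, col, score) tuples, highest score first.
    """
    cells = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    best = heapq.nlargest(k, cells)
    return [(r, c, s) for s, r, c in best]
```

For a 2x2 grid with scores `[[0.1, 0.9], [0.4, 0.7]]` and `k=2`, the function returns the cells at (0, 1) and (1, 1), which would then be passed to the explanation stage.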

2606.18732 2026-06-18 cs.LG cs.CV New submission

Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

Guillermo Rojas, Gonzalo Soto, Daniel Yunge

Affiliations: School of Electrical Engineering, Pontificia Universidad Católica de Valparaíso, Chile

AI summary: Proposes hybrid SNN-CNN models that learn from event-camera data synthesized from smartphone video, achieving efficient and accurate fall detection.

Comments: 4 pages, 6 figures, presented at ICONS 2025 during the Poster Session, but not published

Abstract

This work presents the development of hybrid models that integrate spiking neural networks (SNNs) with components of convolutional neural networks (CNNs) to learn from simulated event-based camera data (Dynamic Vision Sensor, DVS) generated from conventional smartphone videos. Aimed primarily at human fall detection, the approach leverages the energy efficiency and spatio-temporal processing capabilities of SNNs by converting video frames into event-based data. The proposed models are evaluated through simulations on multiple datasets, comparing their performance to that of traditional machine learning models. Results demonstrate significant gains in efficiency without sacrificing accuracy, underscoring the potential of combining SNNs and DVS technology for complex tasks in real-world environments.
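The conversion of video frames into event-based data can be sketched with a common simplified DVS model, in which a pixel emits an event when its log intensity changes by more than a threshold. The abstract does not say which converter the authors used, so the function below, its threshold, and its `eps` offset are assumptions for illustration only.

```python
import math

def frames_to_events(prev, curr, threshold=0.2, eps=1e-3):
    """Emit (x, y, polarity) events between two grayscale frames.

    prev and curr are same-shaped 2D lists of intensities in [0, 1].
    A pixel fires when its log-intensity change reaches the threshold;
    polarity is +1 for brightening and -1 for darkening.
    """
    events = []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (a, b) in enumerate(zip(row_prev, row_curr)):
            delta = math.log(b + eps) - math.log(a + eps)
            if abs(delta) >= threshold:
                events.append((x, y, 1 if delta > 0 else -1))
    return events
```

Static pixels produce no events, which is the source of the sparsity that spiking networks can exploit.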

2606.18728 2026-06-18 cs.CL New submission

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

Songhan Zuo, Shengbin Yue, Tao Chiang, Guanying Li, Yun Song, Xuanjing Huang, Zhongyu Wei

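Affiliations: Fudan University; Shanghai Innovation Institute; Northwest University of Political Science and Law

AI summary: Proposes LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a five-stage causal chain, built on 75,309 paired judgments, and evaluates how agent capabilities differ across connected litigation stages.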

Abstract

Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. A total of 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent, and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

2606.18726 2026-06-18 cs.LG cs.AI New submission

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

Fang Wang, Ernesto Damiani

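Affiliations: Department of Computer Science, University of Milan

AI summary: Proposes the Graph Grounded Cross Attention Transformer (GGATN), which uses a global process graph as structured memory, Transformer self-attention to encode sequence positions, and graph-grounded cross-attention to inject process topology; combined with Viterbi-style graph-constrained decoding, it generates full event sequences in a single pass and outperforms LLM baselines on six benchmark logs.

Comments: 40 pages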

Abstract

Structurally constrained event sequence generation remains challenging because generated paths must preserve transition feasibility, temporal order, termination, and attribute consistency. In predictive process monitoring (PPM), this challenge appears as full event sequence generation, whereas existing work mainly addresses component tasks such as next-activity, remaining-time, outcome, and attribute prediction. This paper proposes the Graph Grounded Cross Attention Transformer Neural Network (GGATN) for this unified PPM task. GGATN uses a global process graph as structured activity memory, contextualizes sequence positions through Transformer self-attention, and injects process topology through graph-grounded cross-attention. Unlike autoregressive decoding, GGATN generates activities, timestamps, length, and event-level and sequence-level attributes in a single pass, followed by Viterbi-style graph-constrained decoding for feasible paths and explicit termination. Experiments on six benchmark event logs show more reliable generation quality than local instruction-prompted LLM baselines. GGATN achieves strong performance on sequence similarity, Damerau-Levenshtein similarity, bigram-based control-flow similarity, and duration distribution, while maintaining zero hallucinated activities and zero sequence-level attribute inconsistency. Ablation analyses confirm the global graph encoder as a stable structural prior. Interpretability analyses show how graph structure, sequence context, feedback refinement, and constrained decoding shape generation.
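The idea behind Viterbi-style graph-constrained decoding can be sketched as a dynamic program that picks the highest-scoring activity path among those whose every transition is an edge of the process graph. The sketch below is an assumption-laden simplification: it covers only activity feasibility with a fixed start and end activity, and leaves out the timestamps, length prediction, and attributes that the paper's decoder also handles. The function and argument names are hypothetical.

```python
def constrained_viterbi(log_probs, allowed, start, end):
    """Best feasible activity path under per-position scores.

    log_probs: list of dicts mapping activity -> log-probability, one per position.
    allowed:   set of (a, b) pairs that are edges of the process graph.
    The path must begin with `start` and finish with `end`.
    Returns the path as a list, or None if no feasible path exists.
    """
    NEG = float("-inf")
    # Position 0 may only be the start activity.
    best = {a: (lp if a == start else NEG) for a, lp in log_probs[0].items()}
    back = []
    for t in range(1, len(log_probs)):
        new_best, pointers = {}, {}
        for b, lp in log_probs[t].items():
            # Only predecessors that are reachable and connected by an edge.
            cands = [(best[a], a) for a in best
                     if (a, b) in allowed and best[a] > NEG]
            if cands:
                score, prev = max(cands)
                new_best[b] = score + lp
                pointers[b] = prev
            else:
                new_best[b] = NEG
        best = new_best
        back.append(pointers)
    if best.get(end, NEG) == NEG:
        return None
    # Follow back-pointers from the end activity to recover the path.
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Because infeasible transitions are never scored, the decoded path cannot contain a transition that is absent from the graph, even when the unconstrained per-position argmax would.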

2606.18723 2026-06-18 cs.CV cs.LG New submission

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

Yunshu Chen, Litao Yang, Giuseppe Di Giovanni, Jordan Tan, Deval Mehta, Andrew Lin, Derek Chew, Masasi Fujino, Julie Butters, Stephen Nicholls, Zongyuan Ge, Kyung Hoon Cho

Affiliations: AIM For Health Lab, Monash University; Department of Data Science and Artificial Intelligence, Faculty of IT, Monash University; Monash University Victorian Heart Institute; School of Computing Technologies, RMIT University; National Cerebral and Cardiovascular Center; Department of Cardiology, Chonnam National University Hospital and Medical School

AI summary: Proposes GeoCat, a network with dual encoders and a differentiable geometry consistency loss that reduces boundary drift and topology errors in IVUS segmentation and improves the accuracy of clinical geometric measurements.

Comments: MICCAI 2026 accepted

Abstract

Intravascular ultrasound (IVUS) lumen and external elastic membrane (EEM) segmentation is important for quantitative coronary plaque burden assessment. Errors in lumen or EEM delineation directly propagate to plaque area, plaque burden and geometric measurements. However, standard methods prioritising overlap scores often suffer from boundary drift and topology errors, leading to inaccurate clinical measurements. We present GeoCat, a geometry-consistent network that processes 5-frame IVUS clips using dual Cartesian-polar encoders with cross-domain attention and temporal fusion. A differentiable geometry consistency loss directly supervises clinically relevant descriptors including diameters, orientations, and cross-sectional areas. The model is trained on 12,242 annotated frames from 146 patients acquired with two commercial IVUS systems. We evaluate performance using both segmentation accuracy and plaque-relevant clinical metrics, including Dice/IoU, boundary measures (95HD in mm, ASSD), topology violation rate, and clinical geometry errors (dmax/dmin, angles, and areas). On our dataset, GeoCat achieves a Dice of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. Importantly, it significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of ~8 degrees, supporting reliable plaque burden quantification.
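To see how delineation errors propagate to clinical measurements, the sketch below computes plaque burden with the standard definition (EEM area minus lumen area, divided by EEM area) and checks one plausible topology rule, namely that the lumen must lie inside the EEM. The abstract does not define its topology violation criterion, so the containment check is an assumption, and both function names are hypothetical.

```python
def plaque_burden(lumen_area_mm2, eem_area_mm2):
    """Plaque burden in percent: (EEM area - lumen area) / EEM area * 100."""
    if eem_area_mm2 <= 0:
        raise ValueError("EEM area must be positive")
    return 100.0 * (eem_area_mm2 - lumen_area_mm2) / eem_area_mm2

def violates_topology(lumen_mask, eem_mask):
    """True if any lumen pixel falls outside the EEM mask (assumed rule).

    Both masks are same-shaped 2D lists of 0/1 values.
    """
    return any(
        l and not e
        for lumen_row, eem_row in zip(lumen_mask, eem_mask)
        for l, e in zip(lumen_row, eem_row)
    )
```

With an EEM area of 10 mm² and a lumen area of 4 mm², plaque burden is 60%; overestimating the lumen area by 0.5 mm² shifts it to 55%, which shows why small boundary drift matters clinically.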

2606.18721 2026-06-18 cs.CV New submission

Rethinking the Pointer Loss in Table Structure Recognition: Geometry-Aware Pointer Loss for Spatial Locality

Hong-Jun Choi, Jongho Lee, Jaeyoung Kim

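Affiliations: Teamreboott Inc.

AI summary: Observing that 79.6% of pointer-network errors in table structure recognition involve adjacent cells, proposes a geometry-aware pointer loss that reweights the cross-entropy objective by inverse distance, concentrating gradients on nearby cells and improving performance at no extra inference cost.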

Abstract

Table Structure Recognition (TSR) using a pointer network achieves impressive results by predicting HTML sequences while aligning tags to detected text (or cell) regions. However, our analysis reveals that when pointer networks fail, 79.6% of errors occur between spatially adjacent cells (Manhattan distance <= 2). Despite this, standard cross-entropy loss weights all negative candidates equally. In this work, we propose Geometry-Aware Pointer (GAP) Loss, which reweights the cross-entropy objective based on spatial proximity to ground truth. By applying inverse distance weighting, GAP focuses gradient flow where the model struggles most: immediate neighbors receive stronger gradients than distant cells. Our approach requires only a straightforward modification to the loss computation, maintaining the same model architecture with zero additional inference cost. Extensive experiments on PubTabNet and SynthTabNet demonstrate that GAP consistently reduces adjacent-cell errors, achieving new state-of-the-art performance. Our findings suggest that incorporating geometric inductive biases at the loss level provides a simple yet effective approach to robust TSR. Our code is available at https://github.com/teamreboott/GAP
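One way to realize an inverse-distance-weighted cross-entropy is to scale each negative candidate's term in the softmax denominator by the inverse of its Manhattan distance to the ground-truth cell. The abstract does not give the exact GAP formula, so the weighting below, the `alpha` exponent, and the function name are assumptions; see the authors' repository for the actual loss. The sketch also omits the usual max-subtraction for numerical stability.

```python
import math

def gap_loss(logits, cells, target, alpha=1.0):
    """Distance-weighted pointer cross-entropy for one tag (illustrative).

    logits: pointer scores, one per candidate cell.
    cells:  (row, col) grid position of each candidate.
    target: index of the ground-truth cell.
    Negatives at Manhattan distance d get weight 1 / d**alpha, so near
    neighbours contribute more to the denominator than distant cells.
    """
    tr, tc = cells[target]
    terms = []
    for j, (z, (r, c)) in enumerate(zip(logits, cells)):
        if j == target:
            w = 1.0
        else:
            d = abs(r - tr) + abs(c - tc)
            w = 1.0 / max(d, 1) ** alpha
        terms.append(w * math.exp(z))
    return -math.log(terms[target] / sum(terms))
```

Under this form, a confident wrong score on an adjacent cell is penalized more than the same score on a distant cell, and setting `alpha=0` recovers standard cross-entropy.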

2606.18717 2026-06-18 cs.CL cs.AI New submission

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Tolga Şakar

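Affiliations: Independent Researcher

AI summary: Targeting the agglutinative nature of Turkish, proposes Morpheus, a neural morpheme-boundary model that provides lossless, reversible tokenization and structured word embeddings; among reversible tokenizers it attains the lowest bits-per-character (1.425), raises morpheme-alignment F1 to 0.61, and saves about 19% of GPU memory.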

Abstract

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and, in the case of WordPiece and rule-based analyzers, failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers (the only ones valid for generation), Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
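The reversibility property follows from cutting the word only at character boundaries without normalizing the string. The sketch below illustrates that with a plain probability threshold; this is a stand-in chosen for brevity, not the paper's Poisson-binomial dynamic program, and the boundary probabilities in the usage example are made up.

```python
def segment(word, boundary_probs, threshold=0.5):
    """Split a word at character gaps whose boundary probability is high.

    boundary_probs[i] is the (assumed) probability of a morpheme boundary
    between word[i] and word[i + 1]. No characters are altered or dropped,
    so joining the pieces always reproduces the input word.
    """
    if len(boundary_probs) != max(len(word) - 1, 0):
        raise ValueError("need one probability per gap between characters")
    pieces, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p >= threshold:
            pieces.append(word[start:i + 1])
            start = i + 1
    pieces.append(word[start:])
    return pieces

def decode(pieces):
    """Inverse of segment: concatenate the pieces."""
    return "".join(pieces)
```

For the Turkish word "evlerde" ("in the houses") with high boundary probabilities after "ev" and "ler", `segment` yields `["ev", "ler", "de"]`, and `decode` returns the original word unchanged.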

2606.18709 2026-06-18 cs.CL New submission

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Han Chen, Ming Li, Chenguang Wang, Yijun Liang, Dawei Zhou, Hong Jiao, Tianyi Zhou

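Affiliations: MBZUAI; University of Maryland; Virginia Tech

AI summary: Evaluates the ability of 42 LLMs to predict item discrimination in zero-shot settings, finding that direct prediction correlates weakly with human-calibrated discrimination (best Spearman 0.152) and that CTT-based response calibration correlates only modestly (0.241), indicating that LLMs cannot yet reliably capture item discrimination.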

Abstract

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
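Response-based CTT calibration can be sketched as follows: treat each synthetic respondent's answers as a row of 0/1 scores and compute, for each item, the correlation between getting that item right and the score on the remaining items. The abstract does not state which CTT statistic the authors use, so the corrected point-biserial below is one common choice taken as an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def item_discrimination(responses, item):
    """Corrected point-biserial discrimination of one item.

    responses[s][i] is 1 if respondent s answered item i correctly, else 0.
    The item is correlated with the rest score (total minus the item itself)
    so that the item does not inflate its own discrimination.
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)
```

The resulting per-item values would then be rank-correlated (for example with Spearman's coefficient) against human-calibrated discrimination, which is the comparison the abstract reports.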

2606.18707 2026-06-18 cs.CV New submission

PEFT-MedSAM: Efficient Fine-Tuning of Medical Foundation Models for Explainable Skin Lesion Segmentation

Asad Channa, Abdullah Khan, Asghar Ali Chandio, Aamir Akbar, Shahzad Memon, Aqib Hussain, Ameer Hamza

Affiliations: Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology; Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology; Department of Computer Science, Sindh Madressatul Islam University, City Campus, Karachi; Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London

AI summary: Proposes PEFT-MedSAM, a parameter-efficient fine-tuning method that freezes the pretrained encoder and trains only the lightweight decoder, reaching a Dice coefficient of 0.9411 on ISIC 2018, and uses Grad-CAM explainability to strengthen clinical trustworthiness.

Abstract

Automated segmentation of skin lesions using deep learning models for dermoscopic images can be very helpful in finding melanomas earlier than they would normally be detected. However, most deep learning methods available do not perform well. The aim of this paper is to present a parameter-efficient fine-tuning method called PEFT-MedSAM for adapting the Medical Segment Anything Model (MedSAM) to automatically segment dermoscopic skin lesions. The PEFT-MedSAM method trains only the lightweight mask decoder while keeping the pre-trained image encoder and prompt encoder frozen. The experiments performed on the ISIC 2018 benchmark dataset show that PEFT-MedSAM obtains a Dice coefficient of 0.9411 and an intersection over union value of 0.8918, compared with both a fully trained U-Net baseline (0.8715 Dice coefficient) and zero-shot MedSAM inference (0.8997 Dice coefficient). The external validation of the model using the PH2 dataset shows a 0.9467 Dice coefficient with ±0.0310 standard deviation. Supporting evidence for these claims includes a p-value less than 0.0001 for Wilcoxon signed-rank tests comparing the two datasets and bootstrap-estimated 95% confidence intervals of [0.9364, 0.9447] that represent the estimated range of possible values for the average Dice coefficient obtained by repeating the test. To increase clinical trustworthiness, we used Grad-CAM explainability along with a pointing-game-based evaluation methodology to evaluate the CNN baseline model on the validation set. The results showed that we had an accuracy rate of 98.27% on the validation set of 519 images and confirmed that the model classified regions containing skin lesions.
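The two headline metrics have standard definitions: Dice is twice the overlap divided by the sum of the two mask sizes, and intersection over union is the overlap divided by the union. The sketch below computes both for a single pair of binary masks; treating two empty masks as a perfect match is a convention assumed here, not something stated in the abstract.

```python
def dice_and_iou(pred, truth):
    """Dice coefficient and IoU for two same-shaped binary masks.

    pred and truth are 2D lists of 0/1 values. Returns (dice, iou).
    """
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in truth for v in row]
    inter = sum(a and b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    if total == 0:
        # Both masks empty: treated as a perfect match by convention.
        return 1.0, 1.0
    union = total - inter
    return 2.0 * inter / total, inter / union
```

For one image the two metrics are tied by IoU = Dice / (2 - Dice), but averages over a dataset need not satisfy that identity, which is why both are usually reported.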

2606.18704 2026-06-18 cs.RO New submission

Selective Unit-Cell Actuation in Lattice Structures for Distributed Morphology in Soft Robots

Trevor Exley, Altair Coutinho, Lucia Beccai

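Affiliations: Istituto Italiano di Tecnologia (IIT)

AI summary: Proposes an embedded pneumatic unit cell that integrates a curved-strut lattice with a bidirectional bellow actuator and controls global morphology through spatial actuation patterns; experiments verify scalable displacement and force generation as well as bending, grasping, and crawling motion.

Comments: Accepted to IROS 2026, 8 pages, 5 figures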

Abstract

Soft lattice structures are increasingly used in robotics to tailor compliance and guide deformation; however, actuation is typically introduced at the device or module level, with actuators inserted into otherwise passive architectures. In this work, we move actuator-lattice co-design to the unit-cell scale. We present an embedded pneumatic unit cell that integrates curved-strut lattice geometry with a bidirectional bellow actuator within a single monolithic element. When tessellated, the lattice functions as a distributed actuation field in which global morphology is governed by spatial actuation patterns rather than uniform pressurization. Experimental characterization of 1x1, 2x2, and 3x3 tessellations demonstrates scalable displacement and force generation with repeatable cyclic performance. Selective actuation of unit cells in a 3x3x3 array produces distinct global deformation modes, including bending and directional grasping, without altering hardware configuration. Additionally, coupling active and passive unit cells enables bending-driven crawling locomotion, demonstrating that heterogeneous tessellations can translate through asymmetric deformation. These results establish unit-cell-level actuation as a strategy for distributed morphing in lattice-based soft robots and provide a foundation for scalable, monolithic robotic architectures.

2606.18703 2026-06-18 cs.LG q-bio.QM New submission

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

Yanjun Shao, Yundi Chen, Yashvi Patel, Aurelien Pelissier, María Rodríguez Martínez

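Affiliations: Biomedical Informatics and Data Science, Yale School of Medicine

AI summary: Proposes LOGICA, a framework that performs contrastive learning in output-logit space and preserves the pretrained likelihood interface through gated cross-modal adapters, enabling context-conditioned prediction across models with different vocabularies; it outperforms existing methods on protein-ligand binding, TCR-peptide activity, and drug-resistance prediction.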

Abstract

Pretrained biological language models expose per-token probability distributions through masked-token prediction, providing the likelihood interface central to sequence design, variant scoring, and mechanistic interpretation. Yet these distributions are learned from broad unlabeled corpora and are not naturally conditioned on task-specific biological contexts such as interaction partners, cellular environments, or therapeutic interventions. Existing contextual matching methods often distort this interface through pooled embeddings, contrastive latent spaces, or task-specific prediction heads. We introduce LOGICA (Logit-space Contrastive Alignment), a framework for context-conditioned prediction that performs contrastive learning directly in output-logit space. Using gated cross-modal adapters compatible with each model's native token head, LOGICA preserves the pretrained likelihood interface and converts contextualized token log-likelihoods into matching scores. Alignment is defined through context-sensitive token probabilities rather than proximity in a shared embedding space, enabling learning from sparse paired data across models with distinct vocabularies, without a shared tokenizer or decoder. LOGICA is particularly effective for mutation-local variant ranking, where comparisons reduce to context-conditioned likelihoods of mutant tokens at perturbed sites. Across protein-ligand binding, TCR-peptide activity, and drug-conditioned resistance prediction, LOGICA improves over prior state-of-the-art methods, including matched latent-contrastive and conditional MLM baselines, while retaining a token-level interface for interpretation and generation. On held-out-gene single-mutation drug-resistance prediction, LOGICA improves AUC from near-random latent-space baselines of ~0.55 to ~0.65.
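Mutation-local variant ranking can be sketched as a comparison of token log-likelihoods at the mutated site. The example below scores a variant as the log-likelihood of the mutant token minus that of the wild-type token, given logits that are assumed to have been produced under some context. Subtracting the wild-type term is a common log-likelihood-ratio convention adopted here as an assumption; the abstract only says that comparisons reduce to context-conditioned likelihoods of mutant tokens.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def variant_score(site_logits, vocab, wild_type, mutant):
    """Log-likelihood ratio of mutant vs. wild-type token at one site.

    site_logits: context-conditioned logits at the mutated position
                 (assumed to come from the adapted language model).
    vocab:       list of token strings aligned with site_logits.
    """
    log_probs = log_softmax(site_logits)
    return log_probs[vocab.index(mutant)] - log_probs[vocab.index(wild_type)]
```

Scoring the same mutation under two different contexts (for example, two drugs) and comparing the results gives a context-dependent ranking without leaving the token-level interface.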

2606.18702 2026-06-18 cs.CV New submission

UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

Lin Zhang, Sicheng Mo, Zefan Cai, Jinhong Lin, Zihao Lin, Jiuxiang Gu, Krishna Kumar Singh, Yuheng Li, Yin Li

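Affiliations: University of Wisconsin-Madison; Adobe Research; University of California, Los Angeles; University of California, Davis

AI summary: Proposes UniTemp, a framework that trains a single autoregressive model via bidirectional distillation to generate video in any temporal direction (forward, backward, in-between), resolving the discontinuities that the causal 3D VAE causes in backward generation and improving controllability.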

Abstract

Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: https://lzhangbj.github.io/projects/unitemp/

2606.18699 2026-06-18 cs.CL cs.AI cs.IR New submission

TW-LegalBench: Measuring Taiwanese Legal Understanding

Fei-Yueh Chen, Chun Huang Lin, Chan Wei Hsu, Kuan Hsuan Yeh, Zih-Ching Chen, Kuan-Ming Chen, Patrick Chung-Chia Huang

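Affiliations: University of Rochester; National Taiwan University; NVIDIA

AI summary: Proposes the TW-LegalBench benchmark, comprising multiple-choice, open-ended essay, and legal judgment prediction tasks, and evaluates 13 LLMs on Taiwanese law, finding that top models pass the lawyer examination but fall short of the standard for judges and prosecutors, and struggle to cite statutes.

Comments: 10 pages, 2 figures, to appear in ICAIL 2026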

Abstract

Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench, which draws on the Taiwanese legal system's rich, publicly available official corpus to fill a gap in evaluating LLMs on Taiwanese law, a gap left between common-law benchmarks built on English sources and civil-law benchmarks built on Simplified Chinese sources. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1-2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

2606.18697 2026-06-18 cs.LG cs.CR cs.RO 新提交

Stealthy World Model Manipulation via Data Poisoning

通过数据投毒进行隐蔽的世界模型操纵

Yibin Hu, Xiaolin Sun, Zizhan Zheng

发表机构 * Department of Computer Science(计算机科学系)

AI总结 提出SWAAP框架,通过两阶段数据投毒(双层优化寻找有害目标模型+隐蔽约束的梯度匹配实现该目标)操纵学习到的世界模型,导致规划性能显著下降,且能规避多种防御检测。

Comments 41 pages, 8 figures, 11 tables. Submitted to NeurIPS 2026

详情
AI中文摘要

基于模型的学习智能体使用学习到的世界模型来预测未来状态、规划行动并适应新环境。然而,从收集的经验中更新世界模型的过程创造了一个训练时攻击面:对抗性投毒的微调轨迹可以操纵学习到的动力学,从而破坏下游规划。在本文中,我们提出了SWAAP,这是第一个针对学习到的世界模型的两阶段数据投毒框架。在第一阶段,SWAAP利用由转移梯度定理支持的一阶双层优化,识别出一个有害的目标世界模型,该模型在规划下诱导低回报行为,同时保持接近干净动力学。在第二阶段,SWAAP通过隐蔽约束的梯度匹配实现该目标,仅修改有限比例的微调转移目标,使得诱导的训练梯度将受害者模型引向对抗目标,同时预测误差正则化器鼓励投毒目标保持接近世界模型的自然近似误差。为了评估攻击的隐蔽性,我们在投毒流程的三个阶段评估了防御和可检测性:训练前对投毒转移的检测、微调期间的鲁棒训练以及测试时对所得世界模型的监控。在多种连续控制任务中,SWAAP导致显著的性能下降,同时保持投毒转移接近干净数据,并规避了所评估的非自适应残差/CUSUM/TRIM风格防御。这些结果揭示了世界模型适应流程中的实际漏洞,并强调需要能够同时保护世界模型训练数据和所学动力学的鲁棒性方法。

英文摘要

Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.

2606.18688 2026-06-18 cs.LG cs.AI 新提交

Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow

双通道接地世界建模 (DCGWM):通过异构外部接地与内向梯度流结构性防止目标干扰崩溃

Akshay Hazare

发表机构 * Independent Researcher(独立研究者)

AI总结 提出双通道接地世界建模(DCGWM),通过分区潜空间和内向梯度流,结构性防止联合嵌入预测架构中多目标接地导致的目标干扰崩溃。

Comments Position paper. Experimental validation in progress

详情
AI中文摘要

联合嵌入预测架构(JEPAs)是世界模型表示学习的主要方法。我们识别出基于JEPA的世界模型在接地于两种性质不同的外部信号时存在一种失败模式:物理动力学(稀疏、高幅度、满足约束的梯度修正)和社会行为动力学(扩散、分布匹配的修正)。我们将其称为目标干扰崩溃(OIC):我们认为在共享潜空间中的联合学习会导致主导通道系统地崩溃从属通道的表示子空间,且仅通过损失加权无法解决。我们提出双通道接地世界建模(DCGWM),通过分区潜空间(物理子空间Z_p,行为子空间Z_b)和内向梯度流,从结构上防止OIC。物理接地通道通过VICReg风格的对齐到物理测量仅更新Z_p;社会行为接地通道通过对齐到涌现多智能体模拟的轨迹仅更新Z_b。通道间接口模块在任务级别耦合子空间,而不产生跨子空间梯度。非对称接地遵循损失用硬铰链项惩罚物理违反、用软KL项惩罚行为发散,从而惩罚推演(rollout)漂移。生成渲染层在架构上与潜世界模型隔离。我们给出三个理论结果:分区消除了与OIC相关的梯度干扰路径;每个接地子空间从其对齐目标继承抗崩溃保证;在关于生成目标几何形状的一项明确假设下,生成隔离是必要的。本文建立了问题表述和架构;实验验证正在进行中,将在未来修订中报告。

英文摘要

Joint Embedding Predictive Architectures (JEPAs) are a leading approach to world model representation learning. We identify a failure mode in JEPA-based world models grounded against two qualitatively distinct external signals: physical dynamics (sparse, high-magnitude, constraint-satisfying gradient corrections) and social-behavioral dynamics (diffuse, distribution-matching corrections). We term this Objective Interference Collapse (OIC): we argue that joint learning in a shared latent space causes the dominant channel to systematically collapse the subordinate channel's representational subspace, in a manner not resolvable by loss weighting alone. We propose Dual-Channel Grounded World Modeling (DCGWM), designed to structurally prevent OIC through a partitioned latent space (physical subspace Z_p, behavioral subspace Z_b) with inward-only gradient flow. A Physical Grounding Channel updates only Z_p via VICReg-style alignment to physical measurements; a Social-Behavioral Grounding Channel updates only Z_b via alignment to trajectories from an emergent multi-agent simulation. An Inter-Channel Interface Module couples the subspaces at the task level without cross-subspace gradients. An Asymmetric Grounding Adherence Loss penalizes rollout drift with a hard hinge for physical violations and a soft KL for behavioral divergence. A Generative Rendering Layer is architecturally isolated from the latent world model. We present three theoretical results: the partition removes the gradient-interference pathway implicated in OIC; each grounded subspace inherits anti-collapse guarantees from its alignment objective; and generative isolation is necessary under a stated assumption on the generative objective's geometry. This manuscript establishes the problem formulation and architecture; experimental validation is ongoing and will be reported in a future revision.

2606.18687 2026-06-18 cs.CV cs.RO 新提交

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

空间分层蒸馏用于异构雷达位置识别

Sagun Singh Shrestha, Samuel Harding, Abdelwahed Khamis, Saimunur Rahman, Peyman Moghadam

发表机构 * CSIRO Robotics(澳大利亚联邦科学与工业研究组织机器人实验室) University of Queensland(昆士兰大学)

AI总结 针对4D汽车雷达与密集旋转雷达之间的异构位置识别,提出空间分层蒸馏(SSD)方法,通过基于雷达回波的物理空间非对称对齐,在重叠区域强制特征对齐,在稀疏区域降低蒸馏权重,在HeRCULES数据集上达到最先进性能。

Comments IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

详情
AI中文摘要

可扩展的全天候位置识别越来越依赖于异构雷达位置识别来桥接不同的硬件平台。一个显著的应用是将来自经济高效的4D汽车雷达的查询与由密集旋转雷达构建的高保真参考地图进行匹配。这一过程从根本上受到4D传感器极端稀疏性(和窄视场)的限制,该传感器仅捕获旋转雷达数据库中存在的结构密度的一小部分。先前的工作通过统一不同的雷达信号来解决这个问题,即将两种信号投影到共同的表示空间。然而,它们在多会话环境中性能下降。在本文中,我们提出了空间分层蒸馏(SSD),这是一种用直接从物理雷达回波导出的非对称空间对齐取代标准均匀蒸馏的策略。在两个雷达都有重叠回波的区域,SSD强制进行强特征对齐。关键的是,在4D学生雷达缺乏回波但教师雷达在共享视场内包含有效结构的稀疏区域,SSD应用大幅折扣的蒸馏权重。在最近的HeRCULES数据集上的广泛评估表明,SSD显著优于先前的位置识别方法,在其具有挑战性的动态序列上取得了最先进的结果。

英文摘要

Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field of view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying the different radar signals, that is, by projecting both signals into a common representational space. Yet they suffer performance degradation in multi-session environments. In this paper, we propose spatially stratified distillation (SSD), a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations on the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.
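
The region-dependent weighting described in this abstract can be illustrated with a small sketch. It is only an assumption-laden illustration, not the authors' loss: the per-cell feature layout, the `discount` value, the mean-squared-error form, and the choice to ignore cells where the teacher has no return are all hypothetical.

```python
def stratified_distillation_loss(student, teacher, student_hit, teacher_hit,
                                 discount=0.1):
    """Weighted mean squared error between per-cell student and teacher features.

    student, teacher: lists of per-cell feature vectors (lists of floats).
    student_hit, teacher_hit: per-cell booleans, True where that radar has a return.
    discount: hypothetical weight for cells that only the teacher observes.
    """
    total, weight_sum = 0.0, 0.0
    for s, t, s_hit, t_hit in zip(student, teacher, student_hit, teacher_hit):
        if s_hit and t_hit:
            weight = 1.0       # overlapping returns: full-strength alignment
        elif t_hit:
            weight = discount  # teacher-only structure: heavily discounted
        else:
            weight = 0.0       # no teacher return: ignored (an assumption)
        error = sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
        total += weight * error
        weight_sum += weight
    # Normalise by the total weight so the loss scale does not depend on sparsity.
    return total / weight_sum if weight_sum > 0 else 0.0
```

With uniform distillation every cell would carry weight 1.0; here a cell the student cannot observe contributes only a fraction of its error, so the student is not pushed hard to reproduce structure it has no returns for.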

2606.18686 2026-06-18 cs.AI cs.CL cs.LG 新提交

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

ForecastBench-Sim:一个模拟世界预测基准

Jaeho Lee, Nick Merrill, Ezra Karger

发表机构 * Forecasting Research Institute(预测研究所)

AI总结 提出基于Freeciv游戏模拟的预测基准ForecastBench-Sim,通过游戏回滚生成可控、即时可解的预测问题,用于评估AI系统的概率推理能力。

Comments 15 pages, 5 main figures, 6 appendix figures. Spotlight presentation at Forecasting as a New Frontier of Intelligence / Workshop on AI Forecasting, ICML 2026

详情
AI中文摘要

通用AI系统的预测基准通常继承现实世界的约束:结果缓慢显现、尾部事件罕见、反事实问题难以评分。我们引入ForecastBench-Sim,一个基于Freeciv(一款以文明系列为模型的回合制策略游戏)游戏回滚的模拟世界预测基准。预测者接收固定的世界报告(当前游戏状态的结构化快照),并回答关于隐藏未来状态的问题;然后基准继续模拟并对预测进行评分。由于世界是模拟的,同一设置可以生成任意时间跨度的连续或二元预测问题、用于条件或因果问题的配对干预世界,以及罕见或破坏性结果的已解决示例。我们描述了基准流程、问题族、评分协议和发布工件,并报告了来自模型评估和匿名人工试点的验证切片。ForecastBench-Sim旨在通过提供受控、即时可解的任务来补充现实世界预测基准,用于研究动态世界状态下的概率推理。

英文摘要

Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.

2606.18682 2026-06-18 cs.CV 新提交

Multi-Class Brain Tumor Classification Using Advanced Deep Learning Models: A Comparative Study

使用先进深度学习模型的多类脑肿瘤分类:一项比较研究

Asad Channa, Asghar Ali Chandio, Akhtar Hussain Jalbani, Mehwish Leghari, Shahzad Memon

发表机构 * Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学计算机科学系) Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学人工智能系) The Faculty of Artificial Intelligence and Cyber Security, Universiti Teknikal Malaysia Melaka(马来西亚梅拉卡技术大学人工智能与网络安全学院) Department of Data Science, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学数据科学系) Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London(东伦敦大学建筑、计算与工程学院计算机科学与数字技术系)

AI总结 本研究比较五种CNN架构(包括定制模型和四种预训练模型)在约10,000张MRI图像上的多类脑肿瘤分类性能,发现EfficientNetB0以95%准确率最优,尤其显著提高了脑膜瘤的召回率(89%)。

详情
AI中文摘要

尽管深度学习最近取得了进展,但从MRI图像中准确分类脑肿瘤仍然面临挑战。在本研究中,我们对五种不同的卷积神经网络(CNN)架构进行了全面评估,包括一个定制的基线模型和四个预训练模型,用于使用临床来源的约10,000张MRI图像数据集对多类脑肿瘤进行分类。四个预训练模型为VGG16、VGG19、DenseNet121和EfficientNetB0;全部五种架构都在相同的实验框架内进行了训练和测试。性能通过总体准确率和各肿瘤类别的召回率来衡量,以评估每种架构的临床相关性能。我们发现,EfficientNetB0的整体分类准确率最高,达95%,其余依次为VGG16(94.37%)、VGG19(92.29%)、DenseNet121(90.91%)和定制CNN(78.00%)。我们研究的一个特别重要的发现是,在检测脑膜瘤方面有显著改进;具体而言,简单的CNN检测脑膜瘤的召回率约为20%,而EfficientNetB0的召回率达到89%。脑膜瘤通常难以检测,因为它们在MRI图像上可能表现得非常微妙。此外,一个有趣的发现是,更深的VGG19性能不如较浅的VGG16。这表明,在处理医学图像时,CNN模型的架构效率在许多情况下可能比其深度更重要。总体而言,EfficientNetB0似乎在分类准确率、模型参数数量和具有临床意义的性能之间提供了最佳权衡。

英文摘要

Despite recent advancements in deep learning, accurately classifying brain tumors from MRI images continues to pose challenges. In this research, we present a comprehensive evaluation of five convolutional neural network (CNN) architectures, a customized baseline model and four pre-trained models, for multi-class brain tumor classification on a clinically sourced dataset of approximately 10,000 MRI images. The four pre-trained models are VGG16, VGG19, DenseNet121, and EfficientNetB0; all five architectures were trained and tested within an identical experimental framework. Performance was measured by both overall accuracy and tumor-wise recall to assess the clinically relevant performance of each architecture. EfficientNetB0 achieved the best overall classification accuracy at 95%, followed by VGG16 (94.37%), VGG19 (92.29%), DenseNet121 (90.91%), and the customized CNN (78.00%). An especially important finding was the considerable improvement in detecting meningiomas: while simple CNNs detected meningiomas with a recall of approximately 20%, EfficientNetB0 detected them with a recall of 89%. Meningiomas are often difficult to detect because they can appear very subtly on MRI images. Additionally, the deeper VGG19 performed worse than the shallower VGG16. This indicates that in many cases the architectural efficiency of a CNN model may be more important than its depth when working with medical images. Overall, EfficientNetB0 appears to provide the best trade-off between classification accuracy, number of model parameters, and clinically meaningful performance.

2606.18681 2026-06-18 cs.CV 新提交

Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

超越多样性:将视觉令牌剪枝视为子空间重建以实现高效视觉语言模型

Jaeyeon Lee, Shunjie Wen, Dong-Wan Choi

发表机构 * Inha University(仁荷大学)

AI总结 提出SPARE方法,将令牌剪枝重构为子空间重建问题,通过迭代选择投影残差大的令牌进行剪枝,并引入反相关性机制保留上下文信息,在LLaVA上剪枝94%令牌仍保持95%性能。

Comments ECCV 2026 Under Review

详情
AI中文摘要

尽管视觉语言模型(VLM)性能卓越,但由于大量视觉令牌的存在,它们产生了巨大的计算开销。虽然多样性最大化已成为令牌减少的主流策略,但现有方法依赖于基于余弦的归一化相似度,忽略了幅度信息,无法忠实逼近原始特征表示,导致性能次优,尤其是在组合多技能推理任务上。本文提出SPARE,一种子空间重建方法,将令牌剪枝重新表述为列子集选择问题,并显式最小化重建误差。通过迭代选择投影残差大的令牌,SPARE在角度多样性之外实现了重建驱动的剪枝。此外,我们揭示了一个反直觉的反相关性现象:图像-文本相关性得分较低的令牌能更好地保留上下文信息。基于这一发现,我们将反相关性作为额外的选择标准纳入SPARE,以促进上下文感知的令牌选择。在多个VLM和基准上的大量实验表明,SPARE始终达到最先进的性能,在组合任务上取得显著提升。当应用于LLaVA时,SPARE在完全无需训练的情况下,可移除高达94%的视觉令牌,同时保留95%的基线性能。

英文摘要

Despite their remarkable performance, Vision Language Models (VLMs) incur substantial computational overhead due to the large number of visual tokens. While diversity maximization has become a dominant strategy for token reduction, existing methods rely on cosine-based normalized similarity that discards magnitude information, failing to faithfully approximate the original feature representation and leading to suboptimal performance, particularly on compositional multi-skill reasoning tasks. In this paper, we introduce SPARE, a subspace reconstruction method that reformulates token pruning as a column subset selection problem and explicitly minimizes reconstruction error. By iteratively selecting tokens with large projection residuals, SPARE performs reconstruction-driven pruning beyond angular diversity. Moreover, we reveal a counterintuitive anti-relevance phenomenon: tokens with lower image-text relevance scores can better preserve contextual information. Based on this finding, we incorporate anti-relevance into SPARE as an additional selection criterion to promote context-aware token selection. Extensive experiments across multiple VLMs and benchmarks demonstrate that SPARE consistently achieves state-of-the-art performance, with strong gains on compositional tasks. When applied to LLaVA, SPARE removes up to 94% of visual tokens while retaining 95% of the baseline performance, all in a fully training-free manner.
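
To make "iteratively selecting tokens with large projection residuals" concrete, the following is a minimal sketch of a generic greedy column-subset-selection heuristic. It is not the SPARE implementation: the function name and the stopping tolerance are hypothetical, and the anti-relevance criterion from the abstract is left out.

```python
import math

def greedy_residual_selection(tokens, k):
    """Pick up to k token indices by repeatedly taking the largest projection residual.

    tokens: list of equal-length feature vectors (lists of floats).
    Generic greedy column subset selection, shown for illustration only.
    """
    residuals = [list(t) for t in tokens]  # part of each token not yet spanned
    selected = []
    for _ in range(min(k, len(tokens))):
        # Squared residual norm of every token w.r.t. the span of the selection.
        norms = [sum(x * x for x in r) for r in residuals]
        best = max(range(len(tokens)), key=lambda i: norms[i])
        if norms[best] < 1e-12:  # every remaining token is already spanned
            break
        selected.append(best)
        scale = math.sqrt(norms[best])
        q = [x / scale for x in residuals[best]]  # new orthonormal direction
        # Deflate: remove the component along q from every residual.
        for r in residuals:
            dot = sum(a * b for a, b in zip(r, q))
            for j in range(len(r)):
                r[j] -= dot * q[j]
    return selected
```

Because the residuals are never normalised, a token with a large norm is favoured over a small one pointing in the same direction, which is the magnitude information that cosine similarity discards.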

2606.18680 2026-06-18 cs.RO 新提交

High-Degree-of-Freedom Lightweight Bioinspired Leg for Enhanced Mobility in Small Robots

高自由度轻量化仿生腿:提升小型机器人机动性

Haoqi Han, Yifei Yu, Jiaming Zhang, Xinru Cui, Linxi Feng, Hesheng Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai University of Electric Power(上海电力大学)

AI总结 针对微型机器人腿部自由度受限问题,提出一种四自由度并联腿机构,通过同心设计简化运动学,实现轻量化(18.9g)和大工作空间(>22255 mm³),显著提升运动灵活性。

详情
Journal ref
2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
AI中文摘要

在微型机器人领域,如何在严格的空间限制下通过增加腿部机构的自由度来增强运动能力仍然是一个重大挑战。受昆虫运动启发,本文提出了一种新型的微型四自由度并联腿机构,并系统分析了其机械设计、电气系统和运动学。该设计采用两个球形五杆连杆机构,在并联四杆配置中实现空间运动。此外,采用同心设计策略简化了腿部运动学的解析解。由于采用并联系统架构,所有执行器均位于主体上,与传统高自由度腿部结构相比,大大降低了运动部件的等效惯性。系统总质量仅为18.9 g,末端执行器输出力约为0.5 N,工作空间超过22255 mm³。实验结果表明,所提出的单腿机构具有优异的运动灵活性,凸显了其在微型仿生机器人领域的潜力。

英文摘要

In microrobotics, enhancing locomotion capabilities by increasing the degrees of freedom (DoF) of leg mechanisms under severe spatial constraints remains a significant challenge. Inspired by insect locomotion, this paper presents a novel micro-scale parallel leg mechanism with four degrees of freedom, and systematically analyzes its mechanical design, electrical system, and kinematics. The design incorporates two spherical five-bar linkages to achieve spatial motion within a parallel four-bar configuration. Furthermore, a concentric design strategy is employed to simplify the analytical solution of the leg kinematics. Due to the parallel system architecture, all actuators are located on the main body, substantially reducing the equivalent inertia of moving parts compared to traditional high-DoF leg structures. The total mass of the system is only 18.9 g, with an end-effector output force of approximately 0.5 N and a workspace exceeding 22255 mm³. Experimental results demonstrate that the proposed single-leg mechanism achieves excellent motion flexibility, highlighting its potential for micro bio-inspired robotics.

2606.18677 2026-06-18 cs.LG cs.AI 新提交

Bounded Context Management for Tabular Foundation Models on Stream Learning

表格基础模型在流学习中的有界上下文管理

Jinmo Lee, Doyun Choi, Moongi Choi, Jaemin Yoo

发表机构 * Seoul National University(首尔大学) KAIST(韩国科学技术院)

AI总结 针对表格流学习中分布漂移问题,提出上下文管理策略CURE,通过不确定性门控准入和冗余感知驱逐管理上下文,在七个流上相对提升最高27.0%。

Comments Accepted as a spotlight oral (top 5%) at the 2nd ICML Workshop on Foundation Models for Structured Data (FMSD@ICML2026)

详情
AI中文摘要

表格流学习需要在分布漂移下对顺序到达的样本进行预测。虽然标准方法通过更新模型状态来适应,但表格基础模型(TFMs)以上下文学习的方式基于带标签的上下文进行预测,使其成为流学习的自然替代方案。这便将挑战从如何更新模型转移到如何管理上下文。我们提出一种未来信息视角,为上下文管理导出三个实际需求:保留最近样本、保留不确定样本、移除冗余样本。我们将这些需求实例化为CURE(通过不确定性感知准入和冗余感知驱逐的上下文管理),一种具有熵门控准入和冗余感知驱逐的上下文管理策略。在七个流上,CURE相比经典流学习器相对提升高达27.0%,在多个TFM骨干上保持鲁棒,并在其他策略变体中排名第一。代码和数据集可在 https://github.com/morcellinus/CURE-ICML-FMSD 获取。

英文摘要

Tabular stream learning requires predictions on sequentially arriving examples under distribution shift. While standard methods adapt by updating model states, tabular foundation models (TFMs) make predictions in context, conditioned on a labeled context set, making them a natural alternative for stream learning. This shifts the challenge from how to update the model to how to manage the context. We propose a future information view that yields three practical requirements for context management: preserve recent examples, retain uncertain examples, and remove redundant examples. We instantiate these requirements as CURE (Context management via Uncertainty-aware admission and Redundancy-aware Eviction), a context-managing policy with entropy-gated admission and redundancy-aware eviction. Across seven streams, CURE shows up to 27.0% relative improvement over classical stream learners, remains robust across multiple TFM backbones, and ranks first among other policy variants. Code and datasets are available at https://github.com/morcellinus/CURE-ICML-FMSD.
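
The three requirements above (preserve recent, retain uncertain, remove redundant) can be sketched as a small bounded-context update. This is a hypothetical illustration rather than the CURE policy from the linked repository: the function names, the `capacity` and `min_entropy` knobs, and the nearest-same-label-neighbour notion of redundancy are assumptions, and `capacity` is assumed to be at least 1.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def update_context(context, features, label, probs, capacity, min_entropy):
    """Return a new context after optionally admitting (features, label).

    context: list of (features, label) pairs, oldest first.
    probs: class probabilities predicted for the new example before its
        label was revealed.
    capacity, min_entropy: hypothetical knobs, not values from the paper.
    """
    if entropy(probs) < min_entropy:
        return list(context)  # confident prediction: do not admit
    updated = list(context) + [(features, label)]
    while len(updated) > capacity:
        def nearest_same_label(i):
            # Small distance to a same-label neighbour means the item is redundant.
            distances = [squared_distance(updated[i][0], f)
                         for j, (f, y) in enumerate(updated)
                         if j != i and y == updated[i][1]]
            return min(distances) if distances else math.inf
        # The newest item (last index) is never evicted, keeping recent examples.
        victim = min(range(len(updated) - 1), key=nearest_same_label)
        updated.pop(victim)
    return updated
```

The gate uses the prediction made before the label arrives, so only examples the model was unsure about consume context slots; on a tie in redundancy, the older item is evicted first.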

2606.18676 2026-06-18 cs.LG cs.CV 新提交

InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search

InTrain: 面向零成本神经架构搜索的内在可训练性

Qinqin Zhou, Fuhai Chen, Jipeng Wu, Zhiwei Chen, Zhikai Hu, Weiwei Cai

发表机构 * School of Computer and Data Science, Fuzhou University(福州大学计算机与数据科学学院) School of Computer and Data Science, Minjiang University(闽江学院计算机与数据科学学院) School of Artificial Intelligence, Nanchang University(南昌大学人工智能学院) Department of Computer Science, Hong Kong Baptist University(香港浸会大学计算机科学系) School of Interdisciplinary Medicine and Engineering, Harbin Medical University(哈尔滨医科大学跨学科医学与工程学院)

AI总结 提出统一理论代理InTrain,通过几何容量和优化韧性两个协同成分形式化架构的可训练性,在NAS基准上达到与集成方法相当的排序相关性。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
AI中文摘要

免训练神经架构搜索有望在不进行昂贵训练的情况下高效发现高性能网络。然而,现有的零成本代理依赖于碎片化的启发式方法,未能捕捉基本问题:是什么使一个架构具有可训练性?本文引入内在可训练性(InTrain),一个统一的理论代理,将可训练性形式化为由两个协同成分——几何容量和优化韧性——涌现出的架构不变性。我们通过分析神经信息处理来操作化内在可训练性。几何容量通过激活协方差特征谱的参与比量化,捕捉表示流形的有效维度。优化韧性通过累积梯度健康度测量,评估跨网络深度的反向传播鲁棒性。InTrain通过尺度不变的乘法耦合综合这些维度,我们假设这对于捕捉它们协同、非加性的关系至关重要。在标准NAS基准和搜索空间上的大量实验表明,InTrain达到了与最先进的基于集成的代理相当的排序相关性,并优于其他单指标方法。

英文摘要

Training-free neural architecture search promises efficient discovery of high-performance networks without costly training. However, existing zero-cost proxies rely on fragmented heuristics that fail to capture the fundamental question: what makes an architecture trainable? This paper introduces Intrinsic Trainability (InTrain), a unified theoretical proxy that formalizes trainability as an architectural invariant emerging from two synergistic components: geometric capacity and optimization resilience. We operationalize intrinsic trainability through analysis of neural information processing. Geometric capacity is quantified via the participation ratio of the activation covariance eigenspectrum, capturing the effective dimensionality of representation manifolds. Optimization resilience is measured through cumulative gradient health, assessing the robustness of backpropagation across network depth. InTrain synthesizes these dimensions through a scale-invariant multiplicative coupling, which we hypothesize is essential for capturing their synergistic, non-additive relationship. Extensive experiments on standard NAS benchmarks and search spaces demonstrate that InTrain achieves ranking correlations on par with state-of-the-art ensemble-based proxies and outperforms other single-metric methods.
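
For readers unfamiliar with the participation ratio mentioned above, here is a minimal sketch of the standard definition, PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues), applied to an activation covariance matrix. It covers only that one quantity under the usual textbook definition; how InTrain normalises it or couples it with gradient health is not specified in the abstract and is not reproduced here. The function name is hypothetical, and the input is assumed to contain at least two samples with non-zero variance.

```python
def participation_ratio(activations):
    """Participation ratio of the covariance of `activations`.

    activations: list of samples, each a list of d floats.
    Uses PR = tr(C)^2 / tr(C^2), which equals (sum of eigenvalues)^2 /
    (sum of squared eigenvalues) for the symmetric covariance matrix C,
    so no eigendecomposition is needed.
    """
    n, d = len(activations), len(activations[0])
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in activations]
    # Unbiased sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    trace = sum(cov[a][a] for a in range(d))
    # For symmetric C, tr(C^2) is the sum of squared entries.
    trace_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace * trace / trace_sq
```

The value ranges from 1 (all variance along a single direction) to d (variance spread evenly over all d dimensions), which is why it reads as an effective dimensionality.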

2606.18675 2026-06-18 cs.CV 新提交

BrainFusionNet: a deep learning and XAI model to understand local, global, and sequential features of MRI images for improved brain tumour detection

BrainFusionNet:一种用于理解MRI图像局部、全局和序列特征以改进脑肿瘤检测的深度学习与XAI模型

Md Taimur Ahad, Bo Song, Yan Li

发表机构 * School of Mathematics, Physics and Computing, University of Southern Queensland(南方昆士兰大学数学、物理与计算学院) School of Engineering, University of Southern Queensland(南方昆士兰大学工程学院)

AI总结 提出BrainFusionNet混合模型,结合CNN、ViT和GRU提取MRI空间、上下文和序列特征,并集成SHAP、LIME和GradCAM进行可解释性分析,在公开数据集上达到98%准确率,优于SOTA CNN。

详情
Journal ref
Brain Inf. 13, 21 (2026)
AI中文摘要

磁共振成像(MRI)的噪声给深度学习(DL)带来挑战,当肿瘤边界模糊、肿瘤位置和外观复杂时尤其如此。因此,我们开发了BrainFusionNet,它结合卷积神经网络(CNN)、视觉变换器(ViT)和门控循环单元(GRU),从MRI图像中提取空间、上下文和序列特征,以改进脑肿瘤分类。此外,集成了可解释AI(如SHAP、LIME和GradCAM),以可视化和突出显示有助于BrainFusionNet决策过程的图像区域。所提出的BrainFusionNet模型在两个公开MRI数据集上进行了评估,K折验证表明在两个数据集上准确率均达到98%。该模型与六种最先进的(SOTA)CNN和迁移学习进行了比较。在SOTA CNN中,DenseNet121和VGG16达到了96%的最高准确率。BrainFusionNet的新颖之处在于,该混合模型能够有效提取MRI图像的局部和全局特征,即使在小尺度肿瘤区域和肿瘤尺寸较小的情况下也是如此。该模型具有平衡的序列CNN架构,以捕获低层和深层特征;以及定制的ViT,可捕获局部特征、稳定梯度流并降低MRI图像训练期间梯度消失的风险。CNN和ViT的输出被馈送到GRU以进行最终分类。此外,我们分析像素强度以确定MRI图像质量是否影响图像分类。我们的发现在图像解释方面非常新颖,因为我们发现MRI图像中像素强度的分布会影响DL性能。

英文摘要

The noise of Magnetic Resonance Imaging (MRI) poses challenges for Deep Learning (DL) when tumor boundaries are obscured and tumor location and appearance are complex. Therefore, we develop BrainFusionNet, which combines Convolutional Neural Networks (CNNs), Vision Transformers (ViT), and Gated Recurrent Units (GRUs) to extract spatial, contextual, and sequential features from MRI images for improved brain tumor classification. Furthermore, explainable AI methods such as SHAP, LIME, and GradCAM are integrated to visualise and highlight image regions that contribute to BrainFusionNet's decision-making process. The proposed BrainFusionNet model is evaluated on two publicly available MRI datasets. K-fold validation suggests 98% accuracy on both datasets. The model was compared with six state-of-the-art (SOTA) CNNs and transfer learning. Among the SOTA CNNs, DenseNet121 and VGG16 achieved the highest accuracy of 96%. The novelty of BrainFusionNet is that the hybrid model effectively extracts local and global features from MRI images, even in small-scale tumor regions and small tumor sizes. The model has a balanced sequential CNN architecture to capture low-level and deeper-layer features, and a customized ViT that captures local features, stabilizes gradient flow, and reduces the risk of vanishing gradients during MRI image training. The CNN and ViT outputs are fed into a GRU for final classification. Furthermore, we analyze pixel intensities to determine whether MRI image quality affects image classification. Our findings are very novel in image interpretation, as we found that the distribution of pixel intensities in MRI images affects DL performance.

2606.18672 2026-06-18 cs.LG cs.AI q-bio.GN 新提交

scGTN: Deep Siamese Graph Transformer Network for Single-cell RNA Sequencing Clustering

scGTN:用于单细胞RNA测序聚类的深度孪生图变换网络

Jinke Wu, Yifan Wang, Siyu Yi, Caiyang Yu, Ziyue Qiao, Nan Yin, Jiancheng Lv, Wei Ju

发表机构 * Sichuan University(四川大学) University of International Business and Economics(对外经济贸易大学) Great Bay University(大湾区大学) The Education University of Hong Kong(香港教育大学)

AI总结 提出scGTN框架,通过孪生图变换网络整合基因表达与细胞间结构信息,利用最优传输策略进行自监督聚类,在多个数据集上优于现有方法。

Comments Accepted by Proceedings of the Thirty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2026)

详情
AI中文摘要

单细胞RNA测序(scRNA-seq)在表征细胞水平基因表达、识别细胞类型以及促进对细胞异质性的理解中起着关键作用。尽管scRNA-seq数据聚类取得了显著进展,但我们认为当前方法常常忽略scRNA-seq数据固有的稀疏性和噪声,以及复杂的细胞间结构信息。为此,本文提出了一种基于深度孪生图变换网络(称为scGTN)的新型单细胞RNA-seq聚类框架,该框架明确整合了基因表达谱和细胞间结构依赖关系以进行细胞聚类。具体而言,我们将scRNA-seq数据建模为图,并构建两个增强图视图作为双视图以捕获互补的细胞间信息。然后,采用孪生图变换网络显式整合最短路径信息和节点间距离,以捕获细胞间更丰富的结构关系。最后,我们采用最优传输策略以自监督方式指导细胞聚类。在多个基准scRNA-seq数据集上的大量实验表明,我们的scGTN始终优于现有方法。我们的代码可在以下网址获取:https://github.com/W-RMSL/scGTN。

英文摘要

Single-cell RNA sequencing (scRNA-seq) plays a pivotal role in characterizing gene expression at the cellular level, enabling the identification of cell types and advancing the understanding of cellular heterogeneity. Despite significant progress in scRNA-seq data clustering, we argue that current methods often ignore the sparsity and noise, as well as the complex intercellular structural information, inherent in scRNA-seq data. To this end, we propose a novel single-cell RNA-seq clustering framework based on a deep Siamese Graph Transformer Network (termed scGTN), which explicitly integrates gene expression profiles and intercellular structural dependencies for cell clustering. In particular, we formulate scRNA-seq data as a graph and construct two augmented graph views that serve as dual views to capture complementary intercellular information. Then, a Siamese graph transformer network is employed to explicitly incorporate shortest-path information and node-wise distances for capturing richer structural relationships between cells. Finally, we employ an optimal transport strategy to guide the cell clustering in a self-supervised manner. Extensive experiments on multiple benchmark scRNA-seq datasets demonstrate that our scGTN consistently outperforms existing methods. Our code is available at https://github.com/W-RMSL/scGTN.

2606.18664 2026-06-18 cs.SD cs.AI 新提交

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

NeuralMUSIC: 一种用于机器人声源定位的混合神经-子空间框架

Yizhuo Yang, Junqiao Fan, Shenghai Yuan, Lihua Xie

发表机构 * School of Electrical and Electronic Engineering, Nanyang Technological University(南洋理工大学电气与电子工程学院)

AI总结 提出NeuralMUSIC混合框架,结合神经网络估计空间协方差矩阵与经典MUSIC子空间方法,通过频率注意力融合和自监督学习提升机器人声源定位的鲁棒性和跨域泛化能力。

详情
AI中文摘要

可靠的声源定位是机器人听觉的基础,使自主机器人能够感知空间线索并在动态环境中有效运行。经典方法如多信号分类(MUSIC)具有坚实的理论基础,但在低信噪比下性能下降。基于深度学习的方法虽然取得了有前景的性能,但通常难以在多种条件下泛化。为了解决这些挑战,我们提出了NeuralMUSIC,一种用于机器人声源定位的混合神经-子空间框架。具体来说,神经网络首先从多通道麦克风观测中估计空间协方差矩阵。然后将预测的协方差集成到经典的MUSIC流程中,包括特征值分解(EVD)和伪谱计算,随后通过频率注意力融合(FAF)模块产生最终的DOA估计。为了提高数据效率,我们进一步引入了一种自监督空间相关学习(SSCL)策略,利用未标记的声学数据来捕获空间结构。跨不同机器人任务的广泛实验表明,NeuralMUSIC在实现有竞争力的定位精度的同时,表现出更强的鲁棒性和跨域泛化能力。

英文摘要

Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degrade under low signal-to-noise ratios. While deep learning-based approaches achieve promising performance, they often struggle with limited generalization across conditions. To address these challenges, we propose NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. Specifically, a neural network first estimates the spatial covariance matrix from multichannel microphone observations. The predicted covariance is then integrated into a classical MUSIC pipeline with eigenvalue decomposition (EVD) and pseudo-spectrum computation, followed by a Frequency Attention Fusion (FAF) module to produce the final DOA estimates. To improve data efficiency, we further introduce a Self-supervised Spatial Correlation Learning (SSCL) strategy that leverages unlabeled acoustic data to capture spatial structure. Extensive experiments across different robotic tasks demonstrate that NeuralMUSIC achieves competitive localization accuracy while exhibiting improved robustness and cross-domain generalization.

2606.18663 2026-06-18 cs.CL 新提交

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

RegMix-D: 通过代理训练轨迹实现动态数据混合

Kaiyan Zhao, Zhongtao Miao, Akiko Aizawa, Yoshimasa Tsuruoka

发表机构 * The University of Tokyo(东京大学) National Institute of Informatics(国立信息学研究所)

AI总结 提出RegMix-D,通过代理训练轨迹预测多阶段最优混合比例,实现动态数据混合,在13个下游任务上优于RegMix和DoReMi,且代理计算预算仅为RegMix的25%。

Comments Work in progress

详情
AI中文摘要

数据混合选择对于大型语言模型预训练至关重要。现有方法如RegMix通过在小规模代理运行上拟合回归模型来选择单个静态混合。我们提出RegMix-D,这是RegMix的一个简单扩展,用于动态混合。我们的关键观察是,代理运行不仅产生端点损失,还产生完整的损失轨迹,这些轨迹可用于进一步改进数据混合。通过在这些轨迹上训练回归模型,我们可以预测多个训练阶段的最优混合。RegMix-D支持两种部署模式:一种离线变体,在目标训练之前生成完整的混合计划;另一种在线变体,在训练期间使用观察到的损失自适应调整混合。在Pile数据集的250亿token上使用1B参数目标模型的实验表明,RegMix-D在13个下游任务上一致优于RegMix和DoReMi,同时保持代理高效:即使仅使用128个代理模型(RegMix代理计算预算的25%),它也超越了RegMix。

英文摘要

Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses but also full loss trajectories, which can be used to further improve the data mixture. By training a regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using the observed loss. Experiments on 25B tokens of the Pile dataset with a 1B-parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).

2606.18661 2026-06-18 cs.CV cs.AI 新提交

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

LandslideAgent与多模态LandslideBench:一种面向自主滑坡识别与分析的领域规则增强型智能体

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

发表机构 * Central South University(中南大学)

AI总结 提出指令驱动智能体框架,包含多模态数据集LandslideBench、滑坡专用视觉语言模型LandslideVLM及领域规则增强智能体LandslideAgent,实现自主滑坡识别与分析。

详情
AI中文摘要

智能滑坡灾害解译对于防灾减灾至关重要,然而当前范式难以同时提取视觉特征和高层次地球科学语义,而通用视觉语言模型在复杂地质场景中存在感知局限和领域幻觉。为解决这些挑战,我们提出一个指令驱动的智能体框架,包含三个组成部分。首先,通过多VLM交叉验证和交互式标注构建LandslideBench,这是一个多模态细粒度数据集,包含七个子类型标签、高分辨率图像、像素级掩膜和高质量文本描述。然后,通过LoRA在LandslideBench上微调面向滑坡的VLM——LandslideVLM,以增强地质语义理解。最后,以LandslideVLM为认知核心的领域规则增强智能体LandslideAgent,采用双规则控制器,结合结构化报告元数据约束和交叉验证识别约束,来调控自动化工具调用。实验表明,LandslideBench为五种主流模型在细粒度分类和语义分割上提供了有效基线。LandslideVLM在滑坡判别、细粒度分类和语义描述质量上分别提升了10.96%、32.87%和15.91%。LandslideAgent进一步实现了自主多源空间数据推理,实现了滑坡识别与分析的全流程智能化。

英文摘要

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.18659 2026-06-18 cs.SD new submission

Responsible ASR: Overcoming Challenges of Foundational Models in Narrow-Band and Low-Resource Settings

负责任的ASR:克服窄带和低资源场景下基础模型的挑战

Tejas Godambe, Nutan Choudhary, Sanket Shah, Nagaraj Adiga, Sharath Adavanne

Affiliations * Applied AI

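AI summary This paper evaluates open-source and commercial foundational ASR models on narrow-band conversations in Hindi, a low-resource language, and Indian-accented English, a low-resource accent. Zero-shot performance is poor, and fine-tuning helps, but its effect varies by language and accent.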

AI中文摘要

全球电话对话通常通过窄带信道进行,且往往是自发和口语化的。本文评估了广泛使用的基础自动语音识别(ASR)模型——包括开源和商业模型——在窄带对话中的性能,针对低资源语言印地语和低资源口音印度英语。我们首先在零样本设置下评估这些模型,发现它们的性能整体上仍不理想。强调了ASR模型在窄带和低资源语言场景中面临的挑战后,我们进一步研究了使用有限真实标注录音对开源模型进行微调的影响。我们的发现表明,虽然微调带来了一些改进,但其效果因语言和口音而异,很大程度上受预训练期间遇到的数据量影响。

English abstract

Telephony conversations worldwide are conducted over narrow-band channels and are often spontaneous and colloquial in nature. This paper evaluates the performance of widely used foundational automatic speech recognition (ASR) models -- both open-source and commercial -- on narrow-band conversations in Hindi, a low-resource language, and Indian-accented English, a low-resource accent. We first assess these models in a zero-shot setting and find that their performance remains suboptimal across the board. Highlighting the challenges faced by ASR models in narrow-band and low-resource language scenarios, we further investigate the impact of fine-tuning open-source models using a limited set of real-life annotated recordings. Our findings indicate that while fine-tuning provides some improvements, its effectiveness varies across languages and accents, largely influenced by the amount of data encountered during pretraining.

2606.18658 2026-06-18 cs.CV eess.IV new submission

On-Manifold Variational Learning with Heat-Kernel Priors

基于热核先验的流形变分学习

Jiarui Xing, Tal Zeevi, Nian Wu, Jian Wang

Affiliations * Yale School of Medicine, University of Virginia, Harvard Medical School

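AI summary Proposes a manifold-anchored variational framework whose geometry-aware EM algorithm selects, as each prototype, a graph medoid on a heat-kernel-weighted latent graph, which keeps the prototypes on-manifold. A Dirichlet energy regularizer keeps the latent-space geometry smooth. On cardiac scar and brain MRI benchmarks, the framework attains the highest accuracy and sharp prototypes.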

AI中文摘要

学习医学影像队列的无监督表示可以揭示临床上有意义的原型,而无需专家标签,这些标签通常带有噪声且无法捕捉真实的病理异质性。然而,现有的深度潜变量模型通过欧几里得平均估计高斯混合先验,产生的原型会偏离弯曲的数据流形,并随着子种群数量的增加而退化。我们提出了一种流形锚定变分框架,基于几何感知的期望最大化(EM)算法,其M步骤选择每个子种群原型作为热核加权潜图上具有最高扩散中心性的图中心点,确保每个原型保持在流形上。Dirichlet能量正则化强制潜空间的几何平滑性,每个子种群的不确定性分数实现了无标签的质量评估。流形锚定EM是一种通用几何工具,扩展了标准EM,并易于应用于其他潜变量模型。在心脏瘢痕和脑MRI基准上,我们的框架在所有比较方法中取得了最高精度,产生了迄今为止最清晰的原型,并且在所有基线退化的较大子种群数量下保持稳定。

English abstract

Learning unsupervised representations of medical imaging cohorts can reveal clinically meaningful prototypes without expert labels, which are often noisy and fail to capture true pathological heterogeneity. However, existing deep latent-variable models estimate Gaussian mixture priors via Euclidean averaging, producing prototypes that drift off the curved data manifold and degenerate as the number of sub-populations grows. We propose a manifold-anchored variational framework built on a geometry-aware Expectation-Maximization (EM) algorithm, whose M-step selects each sub-population prototype as the graph medoid with the highest diffusion centrality on a heat-kernel-weighted latent graph, ensuring that every prototype remains on-manifold. A Dirichlet energy regularizer enforces geometric smoothness of the latent space, and a per-sub-population uncertainty score enables label-free quality assessment. The manifold-anchored EM is a general-purpose geometric tool that extends standard EM and applies readily to other latent-variable models beyond this setting. On cardiac scar and brain MRI benchmarks, our framework attains the highest accuracy among all compared methods, produces the sharpest prototypes reported to date, and remains stable at large sub-population counts where all baselines degenerate.