arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.15457 2026-06-16 cs.CV cs.LG New submission

Lesion-DDPM: Lesion-Enhanced 3D Diffusion for MS MRI Synthesis

Weidong Zhang, Yongchan Jung, Shafayat Mowla Anik, Furen Xiao, Vasudevan Janarthanan, Enkhzaya Chuluunbaatar, Byeong Kil Lee, Jeeho Ryoo

Affiliations: University of Texas at Arlington; University of Texas at San Antonio; University of Texas at Dallas; National Taiwan University Hospital; National University of Mongolia; University of Texas at Austin

AI summary: Proposes Lesion-DDPM, a 3D conditional diffusion framework for lesion-aware FLAIR synthesis that uses multi-level anatomical mask injection and a lesion-weighted reconstruction loss. On the downstream MS lesion segmentation task it raises the Dice score to 0.616 with synthetic-only training and to 0.685 when added to real training data.

Abstract

3D FLAIR MRI is widely recommended as one of the standard MRI sequences for brain imaging in multiple sclerosis (MS), but publicly available MS datasets remain relatively small and vary across scanners, acquisition protocols, and lesion patterns. This scarcity and variability hinder the development of robust neuroimaging machine learning models and are particularly challenging for generative models that aim to synthesize images while preserving small, sparse lesions. We propose Lesion-DDPM, a 3D conditional diffusion framework for lesion-aware FLAIR synthesis that incorporates multi-level anatomical mask injection together with a lesion-weighted reconstruction loss to emphasize lesion voxels while maintaining global brain structure. Using a curated subset of the MSLesSeg dataset, we compare Lesion-DDPM with representative state-of-the-art GAN- and diffusion-based models, assessing both image-generation metrics and downstream 3D U-Net segmentation. In our experiments, Lesion-DDPM achieved the lowest lesion-region reconstruction error among all methods. In a downstream 3D U-Net lesion segmentation task, a model trained only on Lesion-DDPM-generated scans and evaluated on real MRIs reached a Dice score of 0.616 compared with 0.569 for the best competing synthetic dataset. When Lesion-DDPM images were added to the real training set, the Dice score further increased to 0.685.
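
A minimal sketch of the two quantities this abstract relies on, assuming the lesion-weighted reconstruction loss is a weighted mean squared error and that Dice is computed on binary masks. The function names, the flat-list voxel layout, and the weight of 5.0 are illustrative assumptions, not details taken from the paper.

```python
def lesion_weighted_mse(pred, target, lesion_mask, lesion_weight=5.0):
    """Weighted MSE in which lesion voxels count lesion_weight times as much."""
    total, weight_sum = 0.0, 0.0
    for p, t, m in zip(pred, target, lesion_mask):
        w = lesion_weight if m else 1.0
        total += w * (p - t) ** 2
        weight_sum += w
    return total / weight_sum


def dice_score(pred_mask, true_mask):
    """Dice overlap between two binary masks given as flat sequences."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    size = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    return 1.0 if size == 0 else 2.0 * inter / size


# Toy volume of four voxels; the last two are lesion voxels.
target = [0.0, 0.0, 1.0, 1.0]
pred = [0.0, 0.0, 1.0, 0.0]      # one lesion voxel is missed
lesion = [0, 0, 1, 1]

plain = lesion_weighted_mse(pred, target, lesion, lesion_weight=1.0)  # 1/4
weighted = lesion_weighted_mse(pred, target, lesion)                  # 5/12
overlap = dice_score([0, 0, 1, 0], lesion)                            # 2/3
```

With the weight applied, the single missed lesion voxel contributes 5/12 of the loss instead of 1/4, which is the sense in which such a loss emphasizes small, sparse lesions.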

2606.15455 2026-06-16 cs.LG cs.AI New submission

Understanding Diversity Collapse in RLVR via the Lens of Overtraining

Suqin Yuan, Jinkun Chen, Jiyang Zheng, Muyang Li, Lei Feng, Dadong Wang, Tao Xiang, Tongliang Liu, Bo An

Affiliations: Sydney AI Centre, The University of Sydney; Southeast University; Microsoft; Data61, CSIRO; Chongqing University; Nanyang Technological University

AI summary: Formalizes diversity collapse in RLVR through the lens of overtraining, finds that most updates in standard training are overtraining, and proposes Bayesian Boundary Gating (BBG), which estimates each problem's marginal contribution to the reasoning boundary to redirect optimization and improves Pass@k on multiple benchmarks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from diversity collapse: Pass@$1$ improves while high-$k$ Pass@$k$ degrades, which is viewed as a narrowing of the model's reasoning boundary. We formalize this diversity collapse through the lens of overtraining: once a problem's contribution to the reference metric has effectively saturated, further updates no longer expand what the model can solve but still concentrate probability mass on the trajectories favored by on-policy sampling. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-$k$ Pass@$k$, so most updates in standard RLVR are overtraining from the boundary perspective. This perspective also suggests a reading of whether RLVR can expand the model's reasoning abilities beyond the base model: since RLVR is structurally biased against high-$k$ Pass@$k$, its aggregate decline does not by itself mean that no new reasoning gains occurred. Interventionally, restricting updates to problems with zero observed success lifts Pass@$256$ above the base model on difficult benchmarks; observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Building on these findings, we propose Bayesian Boundary Gating (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@$k$ across a wide range of $k$.
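
The saturation argument can be illustrated with the idealized relation Pass@k = 1 - (1 - p)^k for a per-sample success rate p, alongside the widely used unbiased estimator computed from n samples with c successes. This is a sketch under those assumptions: the abstract does not say which estimator or rollout count the authors use, and the 8-rollout setting below is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples with c successes (needs k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_from_rate(p, k):
    """Idealized Pass@k for independent samples with success rate p."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical problem: one success observed in 8 rollouts.
low_k = pass_at_k(8, 1, 1)                   # 0.125
high_k = pass_at_k_from_rate(1 / 8, 256)     # very close to 1
unsolved = pass_at_k_from_rate(0.0, 256)     # 0.0
```

Under these assumptions a problem with a single observed success has almost no room left to improve Pass@256, while a problem with no observed success is where a boundary gain could still come from.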

2606.15452 2026-06-16 cs.LG math.AT q-fin.RM stat.ML New submission

PHINN: Persistent Homology Inspired Neural Network for Rare-Event Time Series Generation

Emre Yusuf, Ren Takahashi, Jayabrata Bhaduri

Affiliations: Defense.Codes (a DBA of CapaCloud Corp)

AI summary: Proposes PHINN, a framework that uses dynamic Betti curves as conditioning signals and a persistence landscape loss to maintain homology consistency. It outperforms statistical and diffusion baselines in topological fidelity on financial, epidemiological, and multi-modal benchmarks.

Comments: 15 pages, 4 figures

Abstract

Rare events in time series are critical to model but hard to learn due to data scarcity. Current generative models struggle with extreme values. We observe that rare events leave distinct topological fingerprints - transitions in Betti numbers from point-cloud embeddings - that are more stable and discriminative than statistical moments. We introduce PHINN, a flow-matching framework using dynamic Betti curves as conditioning signals and a persistence landscape loss for homology consistency. It scales to multivariate data, includes a natural-language interface to set Betti targets, supports cross-domain meta-learning and few-shot generation, and provides certified adversarial robustness. On financial, epidemiological, and multi-modal benchmarks, PHINN outperforms statistical and diffusion baselines in topological fidelity (beta-RMSE down 41-63%, transition accuracy up 84%) and matches jump-diffusion models in tail coverage while exceeding them in shape fidelity. All results have 95% confidence intervals.
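
To make the idea of a Betti-number transition concrete, the sketch below computes Betti-0 (the number of connected components) of a delay-embedded point cloud at a fixed scale, using union-find. It is only an illustration: the embedding, the scale `eps`, the homology dimensions, and the function names are assumptions, and the abstract does not describe how PHINN computes its Betti curves.

```python
from math import dist


def delay_embed(series, dim=2, lag=1):
    """Delay embedding of a scalar series into dim-dimensional points."""
    n = len(series) - (dim - 1) * lag
    return [tuple(series[i + d * lag] for d in range(dim)) for i in range(n)]


def betti0(points, eps):
    """Connected components when points within distance eps are linked."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})


calm = [0.0, 0.1, 0.0, 0.1, 0.0, 0.1]
shock = [0.0, 0.1, 0.0, 5.0, 0.1, 0.0]   # one extreme value

calm_b0 = betti0(delay_embed(calm), eps=0.5)    # 1
shock_b0 = betti0(delay_embed(shock), eps=0.5)  # 3
```

In this toy case the extreme value splits the embedded cloud into three components, so Betti-0 jumps from 1 to 3 while the rest of the series is unchanged.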

2606.15449 2026-06-16 cs.CL cs.IR cs.LG New submission

Transfer Learning for FHIR Questionnaire Terminology Binding

Maxim Gorshkov

Affiliations: Department of Computer Science, Stanford University

AI summary: Treats the binding of FHIR Questionnaire items to LOINC codes as a retrieval problem and compares six methods. BioLORD has the best top-1 accuracy, while contrastive fine-tuning performs better at top-5 and top-10; the paper also analyzes distribution shift and error types.

Abstract

Electronic prior authorization workflows require FHIR Questionnaire items to carry LOINC codes, yet most items in the HL7 Da Vinci CDS-Library lack these bindings. We treat this as a retrieval problem: given a Questionnaire item's text, find the correct LOINC code in a pool of 97,314 active codes. We compare six methods (TF-IDF, frozen MiniLM, BioBERT, BioLORD, contrastively fine-tuned MiniLM, and a TF-IDF+GPT reranker) on a 54-item evaluation set spanning three query styles (natural question, medium, and terse). No single method wins on every metric. BioLORD, a frozen encoder pre-trained on biomedical ontology definitions, has the best top-rank accuracy (R@1 = 0.185, MRR = 0.246) despite seeing no task-specific data, while a contrastive fine-tune on raw LHC-Forms pairs takes R@5 (0.389) and R@10 (0.426). A distribution-shift ablation shows why the fine-tune in our main table is not the strongest one: adding GPT-generated paraphrases to the raw pairs drops R@5 from 0.389 to 0.296, so the augmented union underperforms raw-only training on every metric except R@1. Performance peaks at 5k training pairs. Error analysis on BioLORD's R@1 failures shows that wrong-specificity and ambiguous-text cases together account for 59% of errors.
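
Recall@k and mean reciprocal rank (MRR) can be computed as sketched below. The item identifiers are placeholders rather than real LOINC codes, and the last lines only observe that the reported recalls are consistent with whole-number hit counts out of 54 items; the abstract does not state those counts.

```python
def recall_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold code appears in the top k results."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked_lists, gold):
    """Mean of 1/rank of the gold code, counting 0 when it is not retrieved."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Three toy queries with placeholder codes.
ranked = [["c1", "c2", "c3"], ["c4", "c5", "c6"], ["c7", "c8", "c9"]]
gold = ["c1", "c5", "c0"]

r1 = recall_at_k(ranked, gold, 1)          # 1/3
r2 = recall_at_k(ranked, gold, 2)          # 2/3
mrr = mean_reciprocal_rank(ranked, gold)   # (1 + 1/2 + 0) / 3 = 0.5

# With 54 evaluation items, each hit moves recall by about 0.0185.
consistent = [round(h / 54, 3) for h in (10, 16, 21, 23)]
```

On an evaluation set this small, the gap between R@5 = 0.389 and 0.296 corresponds to roughly five items, which is worth keeping in mind when comparing methods.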

2606.15447 2026-06-16 cs.AI New submission

Hierarchical Modeling of ICD Codes in EHR Foundation Models

Megha Thukral, Dong Gyun Kang, Rudra Pratap Singh, Shruthi Kashinath Hiremath, Katrin Hänsel, Thomas Plötz

Affiliations: School of Interactive Computing, Georgia Institute of Technology; Optum AI

AI summary: Studies the ICD-10-CM hierarchy as an inductive bias and improves EHR representation learning through two mechanisms, sequence augmentation and graph injection. Experiments show that explicitly encoding the hierarchy outperforms flat representations on both in-domain and cross-dataset tasks.

Abstract

Electronic health record foundation models typically treat ICD diagnosis codes as flat tokens, overlooking the clinically meaningful hierarchical structure that captures disease families, subcategories, and fine-grained diagnostic detail. As a result, existing EHR representation learning methods do not explicitly exploit the hierarchical structure already present in the coding system. In this work, we study ICD-10-CM hierarchy as a general inductive bias for clinical representation learning. We investigate two complementary mechanisms for incorporating hierarchy: first, by augmenting diagnosis sequences in a BERT-style transformer with tokens corresponding to different levels of the ICD hierarchy, and second, by injecting hierarchy into graph-based code representations through hierarchy-aware edges combined with diagnosis co-occurrence structure. Across these settings, we evaluate whether explicit hierarchy improves downstream prediction, which levels of the hierarchy are most useful, whether hierarchy encoding improves transfer across datasets, and how hierarchy reshapes embedding similarity structure. We conduct experiments on two large-scale real-world clinical datasets: MIMIC-IV, used for pretraining and in-domain evaluation, and eICU, used to assess cross-dataset transfer via frozen encoder probing. Our findings show that explicitly encoding ICD hierarchy improves over flat code representations in both in-domain and cross-dataset settings, while revealing that the most useful level of hierarchy depends on both the task and the modeling approach. More broadly, we focus on hierarchy-aware EHR representation learning and show that the benefits of encoding hierarchy are generalizable across modeling settings and hierarchy levels.
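
The first mechanism, augmenting a diagnosis sequence with hierarchy tokens, might look like the sketch below. It uses only the three-character ICD-10-CM category as the parent level; the `CAT:` token naming, the placement of the parent token before each code, and the choice of a single level are assumptions made for illustration, not the authors' scheme.

```python
def icd_category(code):
    """Three-character ICD-10-CM category, e.g. 'E11.9' -> 'E11'."""
    return code.replace(".", "")[:3]


def augment_with_hierarchy(codes, level_prefix="CAT:"):
    """Insert one parent-category token before each diagnosis code."""
    augmented = []
    for code in codes:
        augmented.append(level_prefix + icd_category(code))
        augmented.append(code)
    return augmented


visit = ["E11.9", "E11.65", "I10"]
tokens = augment_with_hierarchy(visit)
# ['CAT:E11', 'E11.9', 'CAT:E11', 'E11.65', 'CAT:I10', 'I10']
```

Here the two diabetes codes share the `CAT:E11` token, giving the model an explicit signal that they belong to the same disease family even though their flat tokens differ.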

2606.15436 2026-06-16 cs.LG cs.AI eess.AS New submission

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

Mayur Sanap, Prasanna Desikan, Edgar Lobaton

Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes a multi-model, multi-target cough regression benchmark that evaluates five foundation models on six targets. It finds that MLP-small outperforms linear probing, reveals a trade-off between dataset size and head capacity, and shows that cross-dataset transfer is asymmetric.

Comments: Accepted at the ICML 2026 Workshop on Structured Data for Health

Abstract

Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and disease probability estimation in settings where physical measurements are unavailable. We introduce the multi-model, multi-target cough regression benchmark evaluating five FMs (OPERA-CT, OPERA-CE, OPERA-GT, HeAR, M2D+Resp) across six targets on three datasets under subject-disjoint protocols, comparing linear, MLP-small, and full MLP regression heads. MLP-small beats the mean-predictor baseline on all tasks and linear probing in 23 of 30 model x task cases, with full MLP overfitting on small clinical data but recovering on larger sets, revealing a dataset size x head-capacity trade-off. HeAR leads within-dataset age regression on Coswara (9.12 yr MAE); its CIDRZ result is excluded from headline claims owing to possible HeAR-CIDRZ pretraining overlap. OPERA-GT is favored over OPERA-CT on age in all three datasets, with the CIDRZ margin within seed variance, extending a generative-pretraining advantage from breath to cough. HeAR and M2D+Resp reach near-full performance at N = 50 samples while OPERA models require N = 400. Cross-dataset transfer is strongly asymmetric as large diverse data generalises to small clinical populations (CoughVID to CIDRZ: -0.17 yr) but not vice versa (CIDRZ to Coswara: +2.43 yr, +26.6%).

2606.15434 2026-06-16 cs.RO cs.HC cs.SY eess.SY New submission

A Bilateral Teleoperation Framework for Dexterous Manipulation

Stefano Dalla Gasperina, Dong Ho Kang, Haiyun Zhang, Aldo Galvan, Job D. Ramirez, Aaron Kim, Mark Helwig, Kazuto Yokoyama, Takahisa Ueno, Tetsuya Narita, Ann Majewicz-Fey, Ashish D. Deshpande, Luis Sentis

Affiliations: The University of Texas at Austin; Sony Group Corporation; Meta Reality Labs Research

AI summary: Proposes a modular bilateral teleoperation framework that integrates operator-side input with a robot-side dexterous hand and compliant arm, enabling dexterous manipulation through position-based retargeting, differential control, multi-scale haptic feedback, and shared control. It validates coordinated control and contact-aware interaction.

Comments: 4 pages, 7 figures, 1 appendix

Abstract

Dexterous teleoperation requires precise arm-hand coordination, low-latency feedback, and robust interaction in real-world contact-rich environments. This paper presents a modular bilateral teleoperation framework that integrates operator-side input interfaces with a robot-side dexterous hand and compliant robotic arm in a unified control architecture. The system supports position-based hand retargeting, differential arm control, multi-scale haptic feedback, and shared control for stable manipulation. We validate the framework through a real-world dexterous manipulation task, highlighting coordinated arm-hand control and contact-aware interaction. Beyond feasibility, we identify key design insights related to cross-embodiment mismatch, haptic feedback granularity, and shared control. The proposed platform provides a practical teleoperation system and a foundation for collecting high-quality demonstrations for future learning-from-demonstration research.

2606.15431 2026-06-16 cs.RO New submission

A Corridor-Scale CARLA-VISSIM Co-Simulation Framework for Multi-Intersection Urban Traffic

Sima Ashayer, Austin Haris, Mina Sartipi

Affiliations: University of Tennessee at Chattanooga

AI summary: Proposes a bidirectional, step-synchronized CARLA-VISSIM co-simulation framework that integrates microscopic traffic logic with high-fidelity 3D rendering, validated on a corridor of approximately 15 intersections along MLK Boulevard in Chattanooga, Tennessee. It supports mixed-authority control and perception-ready corridor-level traffic studies.

Abstract

This paper presents an implemented CARLA-VISSIM co-simulation framework for an urban corridor comprising approximately fifteen connected intersections centered on Martin Luther King Jr. Boulevard in Chattanooga, Tennessee. The system integrates CARLA 0.10.0 (Unreal Engine 5) with PTV VISSIM 2026 through a bidirectional, step-synchronized interface that couples VISSIM's microscopic vehicle, pedestrian, and signal-controller logic with CARLA's high-fidelity 3D rendering. A LiDAR-derived elevation model and RoadRunner-based High Definition (HD) map provide terrain-accurate road geometry deployed consistently across both simulators. The framework incorporates explicit actor ownership, mirrored lifecycle management, coordinate reconciliation, and a latest-state-per-actor update policy, enabling stable interaction between VISSIM-controlled traffic and a CARLA-controlled ego vehicle. A corridor-scale case study demonstrates consistent traffic-signal mirroring, synchronized vehicle-pedestrian interactions, and stable mixed-authority operation under peak loads of approximately 100 vehicles and 100 pedestrians. The deployment captures the interaction of the five signalized intersections along MLK Street and their connecting upstream and downstream intersections, revealing synchronization challenges unique to multi-intersection corridors. Results indicate that this MLK-centered corridor provides an effective testbed for verifying cross-simulator consistency and that the proposed architecture supports reliable, perception-ready co-simulation for corridor-level traffic studies.
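
One way to read the latest-state-per-actor update policy is as a buffer that keeps only the newest state for each actor between synchronized steps, so a slow consumer never replays stale poses. The class below is a hypothetical sketch of that idea; it does not use the CARLA or VISSIM APIs, and the actor identifiers and state tuples are made up.

```python
class LatestStateBuffer:
    """Holds at most one pending state per actor between synchronized steps."""

    def __init__(self):
        self._pending = {}

    def push(self, actor_id, state):
        # A newer state overwrites any older state that was never applied.
        self._pending[actor_id] = state

    def drain(self):
        """Return all pending states and clear the buffer for the next step."""
        states, self._pending = self._pending, {}
        return states


buffer = LatestStateBuffer()
buffer.push("veh-1", (0.0, 0.0))
buffer.push("veh-1", (1.0, 0.0))   # supersedes the previous veh-1 state
buffer.push("ped-7", (5.0, 5.0))

first_step = buffer.drain()    # {'veh-1': (1.0, 0.0), 'ped-7': (5.0, 5.0)}
second_step = buffer.drain()   # {}
```

Draining once per simulation tick keeps the mirrored simulator at most one step behind the owning simulator, regardless of how many updates arrived in between.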

2606.15427 2026-06-16 cs.LG cs.AI cs.CV New submission

Post-Launch Capability Expansion of Vision-Language Models via Prompting for On-Orbit Spacecraft Inspection

Nicholas A. Welsh, Lennon J. Shikhman, Monty Nehru Attazs, Seemanthini K. Putane, Van Minh Nguyen, Ryan T. White

Affiliations: Florida Institute of Technology; University of Florida

AI summary: Studies prompt-driven vision-language models for extending semantic capabilities on orbit. New spacecraft components can be detected via natural-language prompts without modifying weights, with zero-shot instance segmentation reaching 0.385 mAP@0.5 on 129 images.

Comments: 5 pages, 1 figure, 2 tables. Equal contribution by Nicholas A. Welsh and Lennon Shikhman. Published in the CVPR 2026 Workshop on AI4Space

Abstract

Spaceborne inspection systems often deploy perception models prior to launch, after which updating model weights or expanding fixed label sets becomes operationally impractical. While supervised models can be integrated pre-flight, adding new semantic capabilities in orbit requires retraining and re-uploading parameters. We investigate whether prompt-driven vision-language models can enable post-launch semantic expansion, allowing new spacecraft components to be specified via natural-language prompts without modifying onboard weights. We evaluate zero-shot instance segmentation of spacecraft components under a strictly frozen, single-pass inference protocol on a test set of $129$ images of previously unseen satellites. Under fixed global thresholds and no post-processing, SAM3 achieves $0.385$ mAP@$0.5$ and $0.267$ mAP@$0.5{:}0.95$. Performance is strongly scale-dependent: large structural elements like spacecraft bodies ($0.639$ AP@$0.50$) and solar arrays ($0.598$ AP@$0.5$) localize reliably, while relatively small appendages like antennas ($0.221$ AP@$0.5$) and thrusters ($0.081$ AP@$0.5$) remain difficult. Prompt formulation influences performance, with structured prompts incorporating spatial and geometric descriptors yielding up to $82\%$ improvement over short category-name prompts. The model operates within the memory and compute envelope of contemporary embedded GPUs, suggesting prompt-driven grounding can provide a practical mechanism for post-launch semantic extension of dominant spacecraft structures while highlighting limitations of zero-shot localization for fine-scale components under orbital domain shift.
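
The AP@0.5 figures count a predicted mask as correct only when its intersection-over-union (IoU) with a ground-truth mask reaches 0.5. The sketch below shows that matching test on toy masks stored as sets of pixel coordinates; the representation and the function names are assumptions for illustration, not the paper's evaluation code.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def is_match(pred_mask, true_mask, threshold=0.5):
    """True when the prediction overlaps the ground truth at or above the threshold."""
    return mask_iou(pred_mask, true_mask) >= threshold


# A 2x2 ground-truth component and a prediction shifted one pixel to the right.
truth = {(0, 0), (0, 1), (1, 0), (1, 1)}
shifted = {(0, 1), (0, 2), (1, 1), (1, 2)}

iou = mask_iou(truth, shifted)       # 2 shared pixels / 6 total = 1/3
matched = is_match(truth, shifted)   # False at the 0.5 threshold
```

A one-pixel shift already drops this small mask below the threshold, which suggests why small components such as thrusters may be penalized more heavily than large structures under the same localization error.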

2606.15422 2026-06-16 cs.CL q-bio.BM New submission

Pepti-Agent: An AI Agent for Peptide Design and Optimization

Houxu Chen, Achuth Chandrasekhar, Amir Barati Farimani

Affiliations: Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA

AI summary: Proposes Pepti-Agent, a closed-loop peptide design framework based on the Model Context Protocol (MCP). It combines independently inspectable generation, prediction, and mutation tools with a large language model controller and live property prediction, enabling multi-objective optimization and reproducible benchmarking.

Abstract

Therapeutic peptides occupy a valuable design space between small molecules and biologics, but their development requires satisfying several competing constraints at once: solubility, hemolytic activity, and nonspecific surface fouling are governed by overlapping sequence features, so improving one property often degrades another. Computational design addresses this by pairing generative models with sequence-based property predictors, iteratively proposing and refining candidates. However, these components are typically wired together as monolithic scripts that are difficult to inspect, extend, or reuse, and they often refine sequences by natural-language reasoning rather than by tracking the evolving multi-property state of each candidate. We present Pepti-Agent, a closed-loop, peptide-specific framework that exposes generation, property prediction, and single-residue mutation as independently inspectable Model Context Protocol (MCP) tools. A large language model controller invokes these tools and consults live predictor output between calls, so refinement is guided by each sequence's current property profile rather than by language reasoning alone. Task-specific PeptideGPT models generate candidates, ProtBERT-based classifiers score solubility, hemolysis, and non-fouling, and two interchangeable mutation operators propose sequence edits. By recording a per-step trace of controller decisions, predictor outputs, and accepted mutations, Pepti-Agent offers a reproducible substrate for benchmarking multi-objective design strategies and for prioritizing candidates for experimental validation.

2606.15420 2026-06-16 cs.LG cs.AI New submission

Constitutional Value Potentials: reading and steering internal priority margins in language models

Tong Che, Rui Wu

Affiliations: NVIDIA Research; Rutgers University

AI summary: Proposes Constitutional Value Potentials (CVP), which learn scalar potentials from hidden states to read a model's internal value-priority margins, in order to predict and intervene on value conflicts, with AUROC up to 0.95.

Abstract

A constitution tells a language model what to value, but little tells us whether it does. Adherence is judged from outputs, and output evidence is most fragile on value conflicts, where what matters is not which value a model mentions but which one it is willing to sacrifice. We provide evidence that this arbitration can be read from activations in a structured margin readout. We introduce Constitutional Value Potentials (CVP). For each value we learn a scalar potential from the hidden state: an internal pressure to preserve that value, supervised not by the prompt but by an independent judge's verdict on which value the model's own response actually preserved. The signed difference of two potentials is a priority margin. A constitutional clause becomes the claim that a margin stays positive, and a single monitor score flags when it does not. The monitor predicts conflict violations with AUROC up to 0.95, beats a strong hidden-state probe, and generalizes to held-out synthetic conflicts across three Qwen2.5 scales. The signal appears as the answer begins, from the prompt tail and first response token. Read this early, the same signal reveals whether an adversarial priority hack has actually pushed the model toward a violation, rather than only whether the prompt looks adversarial. The same directions also support intervention tests: under selected steering settings, moving along a value direction shifts judged trade-offs in the intended direction. Together, these results suggest that some constitution-relevant priorities are accessible as activation-space margins, rather than only as output behavior.
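
The margin construction can be sketched as follows, assuming each potential is a linear readout of the hidden state. The linear form, the probe weights, and the function names are illustrative assumptions; the abstract only says that a scalar potential is learned from the hidden state.

```python
def potential(hidden, weights, bias=0.0):
    """Scalar potential as a linear readout of a hidden state (assumed form)."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def priority_margin(hidden, probe_a, probe_b):
    """Signed difference of two value potentials."""
    return potential(hidden, probe_a) - potential(hidden, probe_b)


def flags_violation(hidden, probe_a, probe_b):
    """A clause 'A outranks B' is flagged when the margin is not positive."""
    return priority_margin(hidden, probe_a, probe_b) <= 0.0


# Hypothetical probes for two values and two hidden states.
probe_a = [0.5, 0.0]
probe_b = [0.0, 0.5]

margin_ok = priority_margin([2.0, 1.0], probe_a, probe_b)    # 1.0 - 0.5 = 0.5
margin_bad = priority_margin([1.0, 2.0], probe_a, probe_b)   # 0.5 - 1.0 = -0.5
flagged = flags_violation([1.0, 2.0], probe_a, probe_b)      # True
```

Because the margin is a single signed number, one threshold at zero is enough to turn a clause such as "value A takes priority over value B" into a monitor.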

2606.15419 2026-06-16 cs.CL cs.AI New submission

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

Zaifu Zhan, Shuang Zhou, Rui Zhang

Affiliations: University of Minnesota

AI summary: Proposes a multi-agent peer-reviewed reasoning method in which multiple LLMs independently generate chain-of-thought reasoning and evaluate one another, with the best-rated chain producing the answer. It outperforms single-model and majority-voting methods on three medical question answering datasets.

Comments: Accepted by the Journal of the American Medical Informatics Association

Abstract

Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting. Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. Conclusion: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.
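
The selection step can be sketched as below and contrasted with majority voting. Averaging the ratings each chain receives from the other agents is an assumed aggregation rule, and the agent names, ratings, and data layout are hypothetical; the abstract only states that the highest-rated reasoning chain is selected.

```python
from collections import Counter


def select_by_peer_review(candidates, ratings):
    """Pick the answer of the chain with the highest mean rating from peers.

    candidates: {agent: (reasoning, answer)}
    ratings: {(reviewer, author): score}; self-reviews are ignored.
    Assumes every chain has at least one peer rating.
    """
    mean_rating = {}
    for author in candidates:
        received = [score for (reviewer, rated), score in ratings.items()
                    if rated == author and reviewer != author]
        mean_rating[author] = sum(received) / len(received)
    best = max(mean_rating, key=mean_rating.get)
    return best, candidates[best][1]


def majority_vote(candidates):
    """Most common answer across agents, ignoring reasoning quality."""
    return Counter(answer for _, answer in candidates.values()).most_common(1)[0][0]


candidates = {
    "m1": ("chain 1", "B"),
    "m2": ("chain 2", "B"),
    "m3": ("chain 3", "C"),
}
ratings = {
    ("m2", "m1"): 2, ("m3", "m1"): 3,
    ("m1", "m2"): 2, ("m3", "m2"): 2,
    ("m1", "m3"): 5, ("m2", "m3"): 4,
}

winner, reviewed_answer = select_by_peer_review(candidates, ratings)  # ('m3', 'C')
voted_answer = majority_vote(candidates)                              # 'B'
```

The toy case shows how the two rules can disagree: voting follows the two agents that agree on "B", while peer review follows the single chain its peers rated most sound.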

2606.15417 2026-06-16 cs.CV New submission

From Frames to Temporal Graphs: In-Context Egocentric Action Recognition with Vision-Language Models

Bessie Dominguez-Dager, Francisco Gomez-Donoso, Miguel Cazorla, Marc Pollefeys, Daniel Barath, Zuria Bauer

Affiliations: University of Alicante; ETH Zürich; Microsoft

AI summary: Proposes converting videos into Temporal Action Graphs by generating natural-language narratives through multi-stage prompting and then structuring them, which enables in-context learning. On EGTEA and Epic-Kitchens-100, few-shot graph demonstrations substantially improve action recognition accuracy over zero-shot inference.

Abstract

Action reasoning in egocentric video requires capturing fine-grained transitions of hand-object interactions, a task where general-purpose Vision-Language Models (VLMs) often struggle when operating directly on raw pixels. We propose to decouple visual perception from symbolic reasoning by converting videos into Temporal Action Graphs. In a multi-stage prompting pipeline, we first generate dense natural language narratives over short temporal windows as a semantic bottleneck, then formalize them into structured, open-vocabulary graph representations. On the EGTEA and Epic-Kitchens-100 datasets, the symbolic representation unlocks efficient in-context learning: few-shot graph demonstrations yield substantial accuracy gains over zero-shot frame and graph-based inference alike. Even in the zero-shot setting, graph-based reasoning remains competitive with pixel-based inference despite potential pretraining contamination favoring the latter. Across 11 open-weight VLMs from 6 model families ranging from 2B to 235B parameters, our findings indicate that current VLMs are more effective as symbolic reasoners than as direct visual observers. By projecting video into the language domain, we provide a scalable, fine-tuning-free alternative to end-to-end approaches that better leverages these models' latent reasoning strengths. The code will be made public.

2606.15416 2026-06-16 cs.CL New submission

Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction

Guangyue Peng, Wei Li, Wen Luo, Houfeng Wang

发表机构 * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理国家重点实验室)

AI总结 提出从LLM内部状态提取语法错误表征(GER)用于检索上下文示例,显著提升多语言语法错误纠正的少样本性能,在低资源语言上F0.5提升达1.20倍。

Comments 15 pages, 6 figures

Journal ref Findings of the Association for Computational Linguistics: ACL 2025, pages 21166-21180, Vienna, Austria. Association for Computational Linguistics, 2025

详情
AI中文摘要

语法错误纠正(GEC)涉及检测和纠正语法的错误使用。虽然具有上下文学习(ICL)能力的大型语言模型(LLM)在各种自然语言处理(NLP)任务上取得了显著进展,但它们在GEC上的少样本性能仍然次优。这主要是由于难以检索到能够捕捉错误模式而非语义相似性的合适上下文示例。在本文中,我们证明LLM可以通过其内部状态固有地捕捉与语法错误相关的信息。从这些状态中,我们提取了语法错误表征(GER),这是一种信息丰富且语义中立的语法错误编码。我们基于GER的新型检索方法显著提升了多语言GEC数据集上ICL设置的性能,提高了纠正的精确度。对于高资源语言,我们在8B大小的开源模型上的结果与Deepseek2.5和GPT-4o-mini等闭源模型相当。对于低资源语言,我们的F0.5分数最高可达基线的1.20倍。该方法为多语言GEC提供了一种更精确且资源高效的解决方案,为可解释的GEC研究提供了一个有前景的方向。

英文摘要

Grammatical Error Correction (GEC) involves detecting and correcting the wrong usage of grammar. While large language models (LLMs) with in-context learning (ICL) capabilities have shown significant progress on various natural language processing (NLP) tasks, their few-shot performance on GEC remains suboptimal. This is mainly due to the challenge of retrieving suitable in-context demonstrations that capture error patterns instead of semantic similarity. In this paper, we demonstrate that LLMs can inherently capture information related to grammatical errors through their internal states. From these states, we extract the Grammatical Error Representation (GER), an informative and semantically neutral encoding of grammatical errors. Our novel GER-based retrieval method significantly boosts performance in ICL settings on multilingual GEC datasets, improving the precision of correction. For high-resource languages, our results on 8B-sized open-source models match those of closed-source models such as Deepseek2.5 and GPT-4o-mini. For low-resource languages, our $F_{0.5}$ scores surpass the baseline by up to a factor of 1.20. This method provides a more precise and resource-efficient solution for multilingual GEC, offering a promising direction for interpretable GEC research.

2606.15412 2026-06-16 cs.CL cs.AI 新提交

Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?

基于大语言模型的少样本生物医学关系抽取:监督学习的可行替代方案?

Jakob Mraz, Tomaž Curk, Blaž Zupan

发表机构 * University of Ljubljana(卢布尔雅那大学) Baylor College of Medicine(贝勒医学院)

AI总结 研究利用大语言模型进行少样本生物医学关系抽取,比较成对分类与联合生成两种方法,发现联合生成更精确高效,在宏F1上超越监督基线,尤其在稀有关系类型上表现突出。

详情
AI中文摘要

生物医学关系抽取(BioRE)是将生物医学文献转化为结构化知识的关键步骤。然而,现有方法大多依赖在昂贵标注数据集上训练的监督模型,限制了其在关系类型和领域上的可扩展性和适应性。我们研究了基于提示学习的大语言模型(LLMs)进行少样本BioRE,并比较了两种任务形式:成对分类(预测单个实体对的关系)和联合生成(在单次模型调用中提取多个关系)。在BioREDirect数据集上的实验揭示了明确的精确率-召回率权衡。成对分类实现了更高的召回率,而联合生成更精确且计算效率更高。最佳模型达到了0.44的微F1分数,显著优于之前的少样本结果(0.34),但仍低于监督基线(0.56)。这一差距大部分归因于一个定义模糊的关系类型。当使用宏F1评估时(在类别不平衡设置下更能反映跨关系类型的性能),基于提示的方法优于监督基线(0.45 vs. 0.38),尤其在稀有关系类型上。这些发现突显了LLMs在低资源场景下进行BioRE的潜力,并强调了定义良好的关系模式的重要性。

英文摘要

Biomedical relation extraction (BioRE) is a key step in transforming biomedical literature into structured knowledge. However, most existing approaches rely on supervised models trained on costly annotated datasets, limiting their scalability and adaptability across relation types and domains. We investigate few-shot BioRE using prompt-based learning with large language models (LLMs) and compare two task formulations: pairwise classification, which predicts relations for individual entity pairs, and joint generation, which extracts multiple relations in a single model call. Experiments on the BioREDirect dataset reveal a clear precision-recall trade-off. Pairwise classification achieves higher recall, whereas joint generation is more precise and computationally efficient. The best-performing model achieves a micro-F1 score of 0.44, substantially outperforming previous few-shot results (0.34) while remaining below the supervised baseline (0.56). Much of this gap is attributable to a single ambiguously defined relation type. When evaluated using macro-F1, which better captures performance across relation types in an imbalanced setting, prompt-based approaches outperform the supervised baseline (0.45 vs. 0.38), particularly on rare relation types. These findings highlight the potential of LLMs for BioRE in low-resource settings and underscore the importance of well-defined relation schemas.
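The gap between the micro-F1 and macro-F1 results reported above comes from how the two averages weight relation types. The following is a minimal sketch of both metrics, assuming predictions and gold labels are given as sets of `(entity_pair, relation_type)` tuples; the function names and data layout are illustrative and are not taken from the paper or from the BioREDirect tooling.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; defined here as 0.0 when there are no true positives."""
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def micro_macro_f1(gold, pred):
    """gold, pred: sets of (entity_pair, relation_type) tuples (assumed layout)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pair, rel in pred:
        if (pair, rel) in gold:
            tp[rel] += 1
        else:
            fp[rel] += 1
    for pair, rel in gold:
        if (pair, rel) not in pred:
            fn[rel] += 1
    types = set(tp) | set(fp) | set(fn)
    if not types:
        return 0.0, 0.0
    # Micro: pool all counts, so frequent relation types dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-type F1, so every relation type counts equally.
    macro = sum(f1(tp[t], fp[t], fn[t]) for t in types) / len(types)
    return micro, macro

# Toy example: the frequent type "X" is handled well, the rare type "Y" is missed.
gold = {(("a", "b"), "X"), (("c", "d"), "X"), (("e", "f"), "Y")}
pred = {(("a", "b"), "X"), (("c", "d"), "X"), (("e", "f"), "X")}
micro, macro = micro_macro_f1(gold, pred)
print(round(micro, 3), round(macro, 3))
```

In this toy case the rare type pulls macro-F1 well below micro-F1. The abstract reports the opposite situation, where prompt-based models do comparatively better on rare types and therefore gain under macro averaging.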

2606.15409 2026-06-16 cs.CV 新提交

Segmentation-based Detection for Efficient Multi-Task Spacecraft Perception

基于分割检测的高效多任务航天器感知

Sivaperuman Muniyasamy, Surendar Devasundaram

发表机构 * University of Arizona(亚利桑那大学)

AI总结 针对太空视觉感知中的多任务需求,提出集成MobileNetV3编码器与U-Net风格解码器的轻量架构,由预测部件掩码的并集直接推导检测框,在SPARK 2026挑战赛中获得0.9482综合得分,排名第二。

Comments 8 pages, 2 figures, 6 tables. CVPRW AI4SPACE-SPARK 2026 Challenge Stream-1 First Place Winners. Code is available at https://github.com/sivaastro/segdet-spark

详情
AI中文摘要

基于视觉的感知是空间态势感知以及自主在轨操作(如交会、对接、服务和导航)的基础。然而,该领域的进展受到标注空间图像稀缺以及具有挑战性的视觉域特性(包括剧烈的光照变化、低信噪比和高对比度)的限制。我们针对SPARK 2026挑战赛的Stream 1,该任务要求一个单一模型完成多目标类型的航天器分类、检测和细粒度部件分割。我们提出了一种紧凑架构,集成了MobileNetV3编码器和U-Net风格解码器,结合了计算效率与精确的密集预测。在单航天器场景下,检测通过预测部件掩码的并集解析得到,避免了单独的边界框回归头。我们的方法取得了0.9482的整体排行榜分数,其中分类、检测和分割的任务特定分数分别为1.0000、0.9788和0.8917。所提出的方法在SPARK 2026挑战赛中总体排名第二,表明轻量级编码器-解码器架构能够为实际星载视觉系统提供强大的多任务性能。

英文摘要

Vision-based perception is fundamental to Space Situational Awareness and autonomous on-orbit operations such as rendezvous, docking, servicing, and navigation. However, progress in this area is limited by the scarcity of annotated space imagery and by challenging visual-domain characteristics including severe illumination changes, low signal-to-noise ratio, and high contrast. We address Stream 1 of the SPARK 2026 Challenge, which requires a single model for spacecraft classification, detection, and fine-grained component segmentation across multiple target types. We propose a compact architecture that integrates a MobileNetV3 encoder with a U-Net-style decoder, combining computational efficiency with accurate dense prediction. Detection is derived analytically from the union of predicted component masks, avoiding a separate bounding-box regression head in the single-spacecraft setting. Our method achieved an overall leaderboard score of 0.9482, with task-specific scores of 1.0000 in classification, 0.9788 in detection, and 0.8917 in segmentation. The proposed approach ranked second overall in the SPARK 2026 Challenge, demonstrating that lightweight encoder-decoder architectures can deliver strong multi-task performance for practical onboard space vision systems.
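The step of deriving detection from the union of component masks can be illustrated with a short sketch. It assumes each component mask is a 2D list of 0/1 values on the same pixel grid and that the image contains a single spacecraft; the function name and mask format are hypothetical and do not come from the authors' repository.

```python
def bbox_from_component_masks(masks):
    """Axis-aligned box around the union of binary component masks.

    masks: list of equally sized 2D lists (rows of 0/1), one per component.
    Returns (x_min, y_min, x_max, y_max) in pixel indices, or None if all masks are empty.
    """
    xs, ys = [], []
    for mask in masks:
        for y, row in enumerate(mask):
            for x, value in enumerate(row):
                if value:  # pixel belongs to some component, hence to the union
                    xs.append(x)
                    ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)

# Two toy component masks on a 3x4 grid (e.g. body and solar panel).
body = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
panel = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 1]]
print(bbox_from_component_masks([body, panel]))  # (1, 1, 3, 2)
```

Because the box is computed from the segmentation output, no separate regression head has to be trained, which matches the design choice described in the abstract for the single-spacecraft setting.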

2606.15405 2026-06-16 cs.CL cs.AI 新提交

T-Mem: Memory That Anticipates, Not Archives

T-Mem:预测而非归档的记忆

Weidong Guo, Dakai Wang, Zixuan Wang, Hui Liu, Yu Xu

发表机构 * Tencent(腾讯)

AI总结 提出T-Mem架构,通过写时触发机制覆盖描述性和关联性回忆,解决长对话中语义关联检索问题,在LoCoMo和LoCoMo-Plus上达到SOTA。

详情
AI中文摘要

长期记忆对于对话代理在扩展对话中保持连贯性、遵循多个会话前做出的承诺以及根据每个用户调整行为至关重要。然而,当前基于LLM的长期对话记忆的可达性受限于查询与存储内容之间的相似性(包括词汇相似性和稠密向量相似性)。当查询和记忆共享表面特征(如措辞或命名实体,我们称之为描述性)时,该方法有效。但它忽略了另一类同样有价值的案例,即查询和记忆不共享表面特征,仅通过潜在语义弧(关联性)相连。在这种情形下,现有的长期记忆系统普遍失败。覆盖这另一半使得助手首次能够主动利用过去的对话作为语义资产。在记忆方面,这是认知科学中称为情景未来思维的工程对应物:针对未来需要检索到它的上下文,预演过去的经验。我们将这些写时预演称为触发器。我们提出T-Mem,这是第一个同时覆盖描述性和关联性回忆的长期对话记忆架构。在两种证据粒度(单个事实和完整交流)的每一种上,T-Mem各实例化一个描述性触发器家族和一个关联性触发器家族,使得每个记忆都能从表面相似的查询和仅靠相关性关联的查询中访问。作为实证验证,T-Mem在LoCoMo和LoCoMo-Plus上均达到了最先进水平。

英文摘要

Long-term memory is essential for conversational agents to remain coherent across extended dialogues, follow through on commitments made many sessions earlier, and adapt their behaviour to each user. Current LLM-backed long-term conversational memory, however, is reachability-bounded by the similarity between a query and stored content, both lexical and dense-vector. The approach is effective when query and memory share surface features such as wording or named entities (we call this descriptive). But it misses another, equally valuable class of cases, where query and memory do not share surface features and are tied only by a latent semantic arc (associative). In this regime, prevailing long-term memory systems collectively fail. Covering this other half is what allows an assistant, for the first time, to actively draw on past dialogue as a semantic asset. On the memory side, this is the engineering counterpart of what cognitive science calls episodic future thinking: rehearsing past experience for the future contexts under which it will need to be found. We call these write-time rehearsals triggers. We propose T-Mem, the first long-term conversational memory architecture that covers both descriptive and associative recall. At each of two evidence granularities, single facts and full exchanges, T-Mem instantiates one descriptive trigger family and one associative trigger family, so that every memory remains reachable from both surface-similar and relevance-bound queries. As empirical validation, T-Mem reaches state-of-the-art on both LoCoMo and LoCoMo-Plus.

2606.15396 2026-06-16 cs.CL cs.AI 新提交

CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

CHILLGuard:面向细粒度中文大语言模型安全护栏的可扩展数据构建与模型感知偏好对齐

Wenbo Yu, Bohua Wang, Hao Fang, Kuofeng Gao, Jingru Zeng, Xiaochen Yang, Tianyi Zhang, Xiaoxiao Ma, Jiawei Kong, Hao Wu, Bin Chen, Shu-Tao Xia, Min Zhang

发表机构 * Tsinghua University(清华大学) Beijing Normal University(北京师范大学) South China University of Technology(华南理工大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Shenzhen ShenNong Information Technology Co., Ltd.(深圳神农信息技术有限公司)

AI总结 针对中文场景,提出细粒度风险分类体系(5大类31小类),通过可扩展数据构建管道生成高质量训练数据,并采用模型感知直接偏好优化训练CHILLGuard,在基准上F1分数提升15.92%。

详情
AI中文摘要

大语言模型生成的恶意内容可能带来严重的安全风险和伦理问题。虽然现有的大语言模型安全护栏在英语或多语言环境中表现出色,但它们缺乏对中文特定监管政策、文化背景和语言细微差别的适应,无法支持针对不同部署需求的细粒度风险分类。在本文中,我们引入了一个面向中文场景的5大类、31小类细粒度风险分类体系,并构建了CHILLGuard:一个专门的中文大语言模型内容安全护栏。为了解决高质量标注中文安全数据的严重稀缺问题,我们提出了一个可扩展的多阶段数据构建管道:通过检索增强生成扩展多源语料库,通过提示工程改写生成隐式有害样本,并通过多模型投票的标签校准精炼高质量数据。基于此,我们构建了CHILLGuardTrain,一个包含405,007样本的大规模训练集,以及CHILLGuardTest,一个严格策划的包含51,745样本的标注测试集。然后,我们在生成器-分类器协作框架下,通过模型感知直接偏好优化在CHILLGuardTrain上训练CHILLGuard。在多种设置下的广泛实验证明了CHILLGuard的最先进性能,例如,在我们的基准上,F1分数相比Qwen3Guard-8B-Strict提升了15.92%。我们将在https://github.com/cswbyu/CHILLGuard发布我们的资源。

英文摘要

Malicious content generated from large language models (LLMs) could pose severe safety risks and ethical concerns. While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to Chinese-specific regulatory policies, cultural context and linguistic nuances, failing to support fine-grained risk classification for diverse deployment needs. In this paper, we introduce a 5-macro, 31-micro category fine-grained risk taxonomy for Chinese scenarios, and build CHILLGuard: a dedicated Chinese LLM content safety guardrail. To address the critical scarcity of high-quality annotated Chinese safety data, we propose a scalable multi-stage data construction pipeline: we expand multi-source corpus via retrieval-augmented generation, generate implicit harmful samples through prompt engineering rewriting, and refine high-quality data via multi-model voting-based label calibration. Based on this, we build CHILLGuardTrain, a large-scale training set with 405,007 samples, and CHILLGuardTest, a rigorously curated annotated test set with 51,745 samples. We then train CHILLGuard on CHILLGuardTrain under a generator-classifier collaborative framework via Model-aware Direct Preference Optimization. Extensive experiments under multiple settings demonstrate the state-of-the-art performance of CHILLGuard, e.g., a 15.92% improvement of F1 score over Qwen3Guard-8B-Strict on our benchmark. We will release our resources at https://github.com/cswbyu/CHILLGuard.

2606.15390 2026-06-16 cs.CL cs.AI cs.LG 新提交

Not All Skills Help: Measuring and Repairing Agent Knowledge

并非所有技能都有用:测量与修复智能体知识

Yixuan Wang, Yiyang Zhou, Yiming Liang, Congyu Zhang, Fuxiao Liu, Jiawei Zhou, Huaxiu Yao

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校) Purdue(普渡大学) NVIDIA(英伟达)

AI总结 提出ASSAY框架,通过随机掩码测量技能因果贡献,分离技能生成与筛选,在推理时抑制负面技能,显著提升LLM智能体任务完成率。

Comments 18 pages, 5 figures

详情
AI中文摘要

LLM智能体可以通过从经验中积累自然语言技能来改进,而无需更新权重,但当前系统将所有关于保留哪些技能以及如何应用它们的决策完全交由LLM判断。我们认为这混淆了两个不同的角色:从经验中生成技能是判断擅长的创造性行为,而决定该技能是否真正有帮助则需要跨多个任务的实证证据。通过随机掩码测量每个技能的因果贡献,我们发现技能库表现出普遍的因果异质性:单个技能通常在某些任务类型上有帮助,但在其他任务类型上有害,然而它们的相反效应在总体上相互抵消,使得全局筛选方法无法察觉。我们提出ASSAY,一个将生成与筛选分离的框架:它在小型开发集上计算每个技能的因果归因,离线重组技能库,并为每个测试任务抑制预测效应为负的技能。在跨越四个提供商的七个基础模型以及两个基准(AppWorld和tau-bench)上,ASSAY始终优于先前的技能筛选方法。在AppWorld最难的数据划分上,DeepSeek-V3实现了69.3%的任务目标完成率(相对提升47.4%),在所有已发表方法(包括权重调整方法)中达到了新的最先进水平。在tau-bench零售领域,GPT-4.1相对提升8.7%,在公开排行榜上超越了o4-mini、o1和GPT-4.5,且无需任何权重修改。消融实验将主要收益归因于每任务掩码,证实瓶颈在于推理时将技能与任务匹配,而非全局移除不良技能。代码已开源:https://github.com/aiming-lab/assay。

英文摘要

LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per-skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose ASSAY, a framework that separates generation from curation: it computes a per-skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and tau-bench), ASSAY consistently improves over prior skill-curation approaches. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion (47.4% relative improvement), a new state of the art among all published methods including weight-tuned approaches. On tau-bench retail, GPT-4.1 improves by 8.7% relative, advancing past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at https://github.com/aiming-lab/assay.
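The idea of measuring per-skill causal contribution via randomized masking can be sketched as a difference-in-means estimate per task type. This is only an illustration under stated assumptions: `run_task` is a hypothetical callback returning 1.0 on success and 0.0 on failure, each skill is masked independently with probability 0.5, and the toy simulator below stands in for a real agent. None of these names come from the ASSAY code base.

```python
import random

def estimate_skill_effects(skills, tasks, run_task, trials=200, seed=0):
    """Per-(skill, task type) effect: mean score with the skill minus mean score without it."""
    rng = random.Random(seed)
    sums = {}  # (skill, task_type, present) -> [total_score, count]
    for _ in range(trials):
        task = rng.choice(tasks)
        active = {s for s in skills if rng.random() < 0.5}  # random mask
        score = run_task(task, active)
        for s in skills:
            cell = sums.setdefault((s, task["type"], s in active), [0.0, 0])
            cell[0] += score
            cell[1] += 1
    effects = {}
    for s in skills:
        for t in {task["type"] for task in tasks}:
            on, off = sums.get((s, t, True)), sums.get((s, t, False))
            if on and off:
                effects[(s, t)] = on[0] / on[1] - off[0] / off[1]
    return effects

def skills_for_task(task_type, skills, effects):
    """Per-task masking: drop skills whose estimated effect on this task type is negative."""
    return [s for s in skills if effects.get((s, task_type), 0.0) >= 0.0]

def toy_run_task(task, active):
    """Toy simulator: 'retry' helps on 'api' tasks and hurts on 'form' tasks."""
    helped = "retry" in active
    return 1.0 if helped == (task["type"] == "api") else 0.0

skills = ["retry", "verbose"]
tasks = [{"type": "api"}, {"type": "form"}]
effects = estimate_skill_effects(skills, tasks, toy_run_task)
print(effects[("retry", "api")], effects[("retry", "form")])  # 1.0 -1.0
print(skills_for_task("form", ["retry"], effects))            # []
```

In the toy setup the "retry" skill has opposite effects on the two task types, so its pooled effect over a balanced task mix is close to zero. This is the kind of heterogeneity the abstract says global curation cannot see, and it is why suppression is decided per task rather than once for the whole library.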

2606.15389 2026-06-16 cs.CV 新提交

Timestep Rescheduling in Diffusion Inversion

扩散反演中的时间步重调度

Shangquan Sun, Ting Gong, Zhirui Liu, Jiamin Wu, Runkai Zhao, Mianxin Liu, Wenqi Ren, Xiaochun Cao

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 针对扩散反演中时间步选择影响反演精度的问题,提出一种基于全局重缩放和局部动态规划的非均匀时间步调度器,有效降低反演误差,提升图像重建与编辑性能。

Comments Accepted by ICML 2026. 23 pages, including appendices

详情
AI中文摘要

扩散反演将图像映射回扩散模型的高斯潜在空间,是图像重建和编辑的关键任务。虽然DDIM实现了快速确定性反演,但它固有地引入了累积为明显反演误差的偏差。现有方法通常通过求解不动点问题来解决这一问题,但很大程度上忽略了噪声调度器中扩散时间步的选择如何影响反演保真度。在这项工作中,我们揭示了扩散反演中的偏差尺度强烈依赖于时间步大小,并呈现出抛物线趋势,较大的误差集中在较小和较大的时间步。基于这一发现,我们提出了一种简单而有效的非均匀时间步调度器,该调度器集成了全局重缩放和基于局部动态规划的重调度,实现了计算资源的战略分配,从而最小化整体反演误差并保持更高的反演精度。我们的方法可作为现有反演技术的即插即用增强,无需额外参数或计算开销。通过大量实验,我们验证了集成我们的调度器能够持续提升现有反演方法的性能,在图像重建和编辑中取得更优结果。

英文摘要

Diffusion inversion, which maps images back to the Gaussian latent space of a diffusion model, is a critical task for image reconstruction and editing. While DDIM enables fast deterministic inversion, it inherently introduces deviations that accumulate into noticeable inversion errors. Existing methods often address this by solving a fixed-point problem but largely overlook how the selection of the diffusion timestep in the noise scheduler influences inversion fidelity. In this work, we reveal that the deviation scale in diffusion inversion is strongly dependent on the timestep size, and exhibits a parabolic trend, with larger errors concentrated at both small and large timesteps. Based on this finding, we propose a simple yet effective nonuniform timestep scheduler that integrates a global rescaling with a local dynamic programming based rescheduling, enabling a strategic allocation of computational effort that minimizes the overall inversion error and preserves higher inversion accuracy. Our method serves as an off-the-shelf enhancement for existing inversion techniques and requires no extra parameters or computational overhead. Through extensive experiments, we verify that integrating our scheduler consistently boosts the performance of existing inversion methods, achieving superior results in image reconstruction and editing.

2606.15385 2026-06-16 cs.AI 新提交

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

语言模型智能体中的奖励黑客:重新审视AI安全网格世界

Ömer Veysel Çağatan, Xuandong Zhao

发表机构 * KUIS AI Center, Koç University(科奇大学KUIS人工智能中心) University of California, Berkeley(加州大学伯克利分校)

AI总结 本研究将AI安全网格世界框架改编为文本评估套件,发现语言模型在零样本下即出现规范博弈;直接奖励优化会进一步扩大观察奖励与隐藏奖励之间的差距,且标准缓解措施无效。

Comments 28 pages, 16 figures, 13 tables

详情
AI中文摘要

奖励黑客(AI系统利用错误指定的目标获得高奖励而未实现预期目标)仍然是AI安全的核心挑战。然而,大多数已知实例是在前沿系统中事后发现的,难以进行受控研究。我们将AI安全网格世界框架改编为基于文本的评估套件,将经典的强化学习安全任务重新表述为基于语言的智能体任务。在前沿和中规模模型中,我们发现规范博弈在零样本下即出现:模型在隐藏安全目标上表现不佳的同时,系统性地获得高观察奖励,甚至看似安全的行为也可能反映误解而非原则性安全。强化学习不能纠正这些失败:直接奖励优化扩大了观察奖励和隐藏奖励之间的差距,因为模型的初始能力使其在发现更安全的策略之前就锁定在局部高奖励的策略上。这种模式在不同模型规模(1.5B–14B)中持续存在,并且不能通过更精细的信用分配、探索提示或熵正则化来解决。我们的结果表明,当使用有能力的语言模型智能体优化代理目标时,奖励黑客自然出现,并且抵抗标准缓解措施,这表明智能体场景中的代理奖励失效可能需要超越标准探索和信用分配修复的方法。为了促进可重复性,本工作的代码可在我们的公共仓库中获取:https://github.com/asparius/verl-agent-safety。

英文摘要

Reward hacking, where AI systems exploit misspecified objectives to achieve high reward without satisfying intended goals, remains a central challenge in AI safety. Yet most known instances have been discovered post hoc in frontier systems where controlled study is impractical. We adapt the AI Safety Gridworlds framework into a text-based evaluation suite that reformulates classic reinforcement learning safety tasks for language-based agents. Across frontier and mid-scale models, we find that specification gaming emerges zero-shot: models systematically achieve high observed reward while underperforming on hidden safety objectives, and even apparently safe behaviors can reflect misunderstanding rather than principled safety. Reinforcement learning does not correct these failures: direct reward optimization widens the gap between observed and hidden reward, as the model's initial competence causes it to lock into locally rewarding strategies before discovering safer alternatives. This pattern persists across model scales (1.5B–14B) and is not resolved by finer credit assignment, exploration prompts, or entropy regularization. Our results show that reward hacking arises naturally when optimizing proxy objectives with capable language model agents and resists standard mitigations, suggesting that proxy-reward failures in agentic settings may require approaches beyond standard exploration and credit-assignment fixes. To facilitate reproducibility, the code for this work is available in our public repository at https://github.com/asparius/verl-agent-safety.

2606.15378 2026-06-16 cs.CL cs.LG 新提交

Rethinking the Role of Efficient Attention in Hybrid Architectures

重新思考高效注意力在混合架构中的作用

Ziqing Qiao, Yinuo Xu, Chaojun Xiao, Zhou Su, Zihan Zhou, Yingfa Chen, Xiaoyue Xu, Xu Han, Zhiyuan Liu

发表机构 * Tsinghua University(清华大学) OpenBMB

AI总结 本文系统分析混合架构中高效注意力模块(如滑动窗口注意力和循环序列混合器)的影响,发现其主要影响长上下文能力的涌现速度,并揭示“大窗口惰性”现象,提出仅对全注意力层去除位置编码可提升长上下文性能。

Comments 23 pages, 13 figures

详情
AI中文摘要

现代语言模型越来越多地采用混合架构,将全注意力与高效注意力模块(如滑动窗口注意力(SWA)和循环序列混合器)相结合。然而,这些高效模块如何塑造模型能力仍知之甚少。为填补这一空白,我们从三个角度对混合架构进行了系统分析:缩放行为、机制分析和架构设计。首先,从缩放角度来看,我们发现高效注意力设计主要影响长上下文能力涌现的速度,而不同的混合模型在充分训练下最终会收敛到可比较的长上下文性能。其次,从机制上,我们表明长距离检索主要由全注意力承担,而高效注意力则塑造其优化轨迹。这解释了我们称之为“大窗口惰性”的反直觉现象:更大的SWA窗口可能延迟全注意力层中检索头的形成。第三,受此机制指导,我们表明仅对小型窗口SWA混合架构的全注意力层应用NoPE(无位置编码)可以显著提升长上下文性能,而对短上下文性能影响甚微。

英文摘要

Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.

2606.15377 2026-06-16 cs.LG cs.AI physics.geo-ph 新提交

Learning Earthquake Wave Arrival Time Picking from Labels with Inaccuracies

从不准确标签中学习地震波到时拾取

Sen Li, Xu Yang, S. Mostafa Mousavi, Anye Cao, Keting Fan, Yaoqi Liu, Changbin Wang, Qiang Niu

发表机构 * Department of Earth and Planetary Sciences, Harvard University(哈佛大学地球与行星科学系) School of Computer Science and Technology, China University of Mining and Technology(中国矿业大学计算机科学与技术学院) School of Mines, China University of Mining and Technology(中国矿业大学矿业工程学院) State Key Laboratory of Coal Exploration and Intelligent Mining, China University of Mining and Technology(中国矿业大学煤炭勘探与智能开采国家重点实验室)

AI总结 提出标签噪声对比鲁棒学习(LaNCoR)方法,通过对齐波形特征与标签表示分布来纠正错误标签,在微地震P波到时拾取任务中性能提升高达28.8%。

Comments 28 pages, 10 figures

详情
AI中文摘要

不准确标记的训练数据,或称“标签噪声”,对监督机器学习模型的完整性构成重大威胁。这种污染通过教导模型特征与标签之间的错误映射直接降低性能,导致泛化能力差,并在正确标记的验证和测试数据上准确性降低。当前地震学应用主要依赖大规模训练集或数据增强来减少标签噪声影响,这可能是劳动密集且成本高昂的。在这里,我们介绍一种标签噪声对比鲁棒学习(LaNCoR)方法,该方法可以有效处理地震信号处理任务中的噪声标签,而无需大规模训练数据集。在该方法中,输入波形特征和标签表示分布在特征空间中对齐,以纠正错误标记并减少其对训练过程的影响。我们使用两个基线模型和训练方法展示了LaNCoR在真实微地震数据P波到时拾取任务上的性能。我们的结果表明,LaNCoR在性能指标上可提升高达28.8%。该方法在地震学和地球科学中的模型训练方面具有巨大潜力。

英文摘要

Inaccurately labeled training data, or "label noise", poses a significant threat to the integrity of supervised machine learning models. This corruption directly degrades performance by teaching the model erroneous mappings between features and labels, which leads to poor generalization and reduced accuracy on properly labeled validation and test data. Current seismological applications mainly rely on large-scale training sets or data augmentation to reduce the label-noise impact, which can be labor-intensive and costly. Here, we introduce a Label Noise-Contrastive Robust Learning (LaNCoR) approach that can effectively handle noisy labels in seismic signal processing tasks, without requiring large-scale training datasets. In this approach, the input waveform feature and label representation distributions are aligned in the feature space to correct mislabeling and reduce its impact on the training process. We present LaNCoR's performance on the task of P-phase arrival-time picking of real microseismic data using two baseline models and training approaches. Our results indicate that LaNCoR can improve performance by up to 28.8% across performance metrics. This approach holds great promise for model training in seismology and geosciences.

2606.15373 2026-06-16 cs.RO 新提交

A Hybrid Model-Based and Model-Free Framework for Active Multi-View Viewpoint Optimization in Sonar Target Recognition

一种基于模型与无模型混合框架的主动多视角声纳目标识别视点优化

Yongkyoon Park, Jane Shin

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出基于模型与无模型相结合的混合框架,结合CNN观测似然与基于Radon的方向估计,通过信息增益奖励训练PPO智能体离线学习视点选择策略,部署时仅用CNN信念更新实现实时视点选择,在声纳数据集上以更少感知步数和运动成本达到竞争性识别精度。

详情
AI中文摘要

本文提出了一种基于模型与无模型混合框架,用于使用前视声纳进行主动多视角目标识别。卷积神经网络(CNN)提供数据驱动的观测似然,而基于Radon的方向估计无需角度标注即可实现视点感知。在训练过程中,基于信息增益的奖励引导近端策略优化(PPO)智能体离线学习信念感知的视点选择策略。在部署时,学习到的策略仅使用基于CNN的信念更新进行实时视点选择,无需计算昂贵的在线POMDP树搜索。在海洋垃圾前视声纳数据集上的实验表明,与基于模型的基线方法相比,所提方法在减少感知步数和运动成本的同时实现了具有竞争力的识别精度。

英文摘要

This paper presents a hybrid model-based and model-free framework for active multi-view target recognition using forward-looking sonar. A convolutional neural network (CNN) provides data-driven observation likelihoods, while Radon-based orientation estimation enables viewpoint-aware sensing without requiring angle annotations. During training, an information-gain-based reward guides a Proximal Policy Optimization (PPO) agent to learn a belief-aware viewpoint selection policy offline. At deployment, the learned policy performs real-time viewpoint selection using only CNN-based belief updates, eliminating the need for computationally expensive online POMDP tree search. Experiments on a marine-debris forward-looking sonar dataset demonstrate that the proposed approach achieves competitive recognition accuracy while reducing sensing steps and motion cost compared to model-based baselines.
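The belief update and information-gain reward mentioned above can be made concrete with a small sketch. It assumes a discrete belief over target classes, a likelihood vector standing in for the CNN output at one viewpoint, and a reward defined as the entropy reduction of the belief. The exact reward used in the paper is not given in the abstract, so this definition is an assumption.

```python
import math

def update_belief(belief, likelihood):
    """Bayes rule over classes: posterior is proportional to prior times likelihood."""
    unnormalized = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def entropy(belief):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in belief if p > 0)

def information_gain(belief, likelihood):
    """Assumed reward: how much one observation reduces belief entropy."""
    return entropy(belief) - entropy(update_belief(belief, likelihood))

prior = [0.5, 0.5]          # two candidate target classes, no preference yet
likelihood = [0.9, 0.1]     # hypothetical classifier likelihoods for one sonar view
posterior = update_belief(prior, likelihood)
print([round(p, 3) for p in posterior])              # [0.9, 0.1]
print(information_gain(prior, likelihood) > 0)       # True
```

A viewpoint whose observation sharpens the belief yields a larger reward, which is the signal an offline-trained policy could learn to seek without running a tree search at deployment.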

2606.15370 2026-06-16 cs.CV cs.LG 新提交

MNet++: Extended 2D/3D Networks for Anisotropic Medical Image Segmentation

MNet++: 用于各向异性医学图像分割的扩展2D/3D网络

Kirsten Odendaal, Rade Bajic

发表机构 * School of Computing, Georgia Institute of Technology(佐治亚理工学院计算学院)

AI总结 本文复现并扩展了混合2D/3D卷积网络MNet,引入自适应融合门控和VMamba状态空间模块,在保持各向异性鲁棒性的同时提升分割性能。

详情
AI中文摘要

本工作展示了MNet的完整复现与扩展,MNet是一种专为各向异性医学图像分割设计的混合2D/3D卷积网络。在nnU-Net框架内重新实现了原始架构,以验证其报告的性能和对可变体素间距(即各向异性)的鲁棒性。在匹配的预处理和计算约束下,在PROMISE前列腺MRI和LiTS肝脏CT的受控子集上进行了实验。复现的MNet在PROMISE上达到了89.0 +/- 0.9%的Dice相似系数(DSC),与已发表结果相差0.8%,在LiTS上肝脏和肿瘤分割分别达到94.3 +/- 1.9%和54.6 +/- 3.1%。进一步引入了两种轻量级扩展:(1) 一种学习的融合门控机制,实现自适应2D-3D特征融合;(2) 一个VMamba状态空间模块,用于高效的长程深度建模。空间门控变体以不到3%的推理开销将DSC提高了+0.8%,而VMamba提高了性能一致性,将PROMISE Dice变异降低至+/- 0.7%,并在LiTS肝脏上达到最强性能,Dice为95.8%。两种扩展均保持了MNet对各向异性的鲁棒性,在1-4 mm体素间距下Dice变化为1.5%。总体而言,该研究证实了MNet的可复现性,并表明自适应融合和状态空间建模有潜力进一步增强各向异性条件下的分割可靠性。然而,需要进一步测试才能得出明确结论。

英文摘要

This work demonstrates a full reproduction and extension of MNet, a hybrid 2D/3D convolutional network designed for anisotropic medical image segmentation. The original architecture was re-implemented within the nnU-Net framework to verify its reported performance and robustness to variable voxel spacing, known as anisotropy. Experiments were conducted on PROMISE prostate MRI and a controlled subset of LiTS liver CT under matched preprocessing and compute constraints. The reproduced MNet achieved a Dice similarity coefficient (DSC) of 89.0 +/- 0.9% on PROMISE, within 0.8% of the published result, and 94.3 +/- 1.9% / 54.6 +/- 3.1% for liver and tumor segmentation on LiTS, respectively. Two lightweight extensions were further introduced: (1) a learned Fusion Gating mechanism enabling adaptive 2D-3D feature blending, and (2) a VMamba state-space module for efficient long-range depth modelling. The Spatial Gating variant improved DSC by +0.8% with less than 3% inference overhead, while VMamba improved performance consistency, reducing PROMISE Dice variation to +/- 0.7% and achieving the strongest LiTS liver performance at 95.8% Dice. Both extensions preserved MNet robustness to anisotropy, with delta Dice = 1.5% across 1-4 mm voxel spacing. Overall, the study confirms MNet reproducibility and demonstrates that adaptive fusion and state-space modelling have the potential to further strengthen segmentation reliability under anisotropic conditions. However, further tests are required to provide definitive conclusions.
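All results in this abstract are reported as Dice similarity coefficients. As a reminder of what the metric measures, here is a minimal sketch for flattened binary masks; returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract.

```python
def dice(pred, target):
    """Dice similarity coefficient, 2|A∩B| / (|A| + |B|), for flat binary sequences."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2 * intersection / total

print(dice([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
print(dice([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

For a 3D volume, the same function applies after flattening the voxel grid. Small structures such as tumors are penalized heavily for each missed voxel, which is consistent with the much lower tumor score than liver score on LiTS.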

2606.15369 2026-06-16 cs.LG 新提交

Repeated Bilateral Trade: The Quest for Fairness

重复双边贸易:追求公平

François Bachoc, Roberto Colomboni, Emilie Kaufmann

发表机构 * University of Lille(里尔大学) Institut Universitaire de France (IUF)(法国大学研究院) School of Mathematics, University of Bristol(布里斯托大学数学学院) Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189-CRIStAL(里尔大学、法国国家科学研究中心、法国国家信息与自动化研究所、中央理工-里尔高等电力学院,UMR 9189-CRIStAL)

AI总结 研究重复双边贸易中的公平性,提出Rawls-to-Nash公平增益目标族,并刻画其最优学习率。

详情
AI中文摘要

我们从公平的角度研究重复双边贸易。每轮,一对新的卖方-买方到达,平台在观察交易者估值之前发布价格。只有当双方都接受价格时,交易才会发生。我们考虑的不是最大化贸易收益,而是寻求平衡分配所产生的盈余的平台。我们表明,自然的公平性要求导致了一个单参数的Rawls-to-Nash公平增益目标族,该目标族通过非正Hölder均值聚合卖方和买方的净收益而得到。与标准的贸易收益目标和先前工作中研究的Rawlsian公平增益目标不同,我们提出的目标引入了一种新的统计结构,其中期望奖励通过阈值反馈从二维奇异核积分恒等式中恢复。这导致了一个非标准的纯探索问题,其自然估计量是具有行列依赖和奇异权重的矩形双重和。假设卖方和买方估值序列独立同分布且具有任意未知边际分布,我们刻画了整个Rawls-to-Nash公平增益目标族的最优学习率,给出了匹配的固定置信度样本复杂度和遗憾界(最多相差多对数因子)。

英文摘要

We study repeated bilateral trade from a fairness perspective. At each round, a fresh seller-buyer pair arrives, and the platform posts a price before observing the traders' valuations. Trade occurs only if both agents accept the price. Rather than maximizing only the gain from trade, we consider platforms that seek balanced divisions of the generated surplus. We show that natural fairness desiderata lead to a one-parameter Rawls-to-Nash family of fair-gain objectives, obtained by aggregating the seller's and buyer's net gains through nonpositive Hölder means. Unlike the standard gain-from-trade objective and the Rawlsian fair-gain objective studied in prior work, our proposed objectives induce a new statistical structure in which expected rewards are recovered from threshold feedback through a two-dimensional singular-kernel integral identity. This leads to a nonstandard pure-exploration problem whose natural estimators are rectangular double sums with row-column dependence and singular weights. Assuming independent i.i.d. seller and buyer valuation sequences with arbitrary unknown marginals, we characterize the optimal learning rates for the whole Rawls-to-Nash family of fair-gain objectives, giving matching fixed-confidence sample-complexity and regret bounds up to polylogarithmic factors.
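The Rawls-to-Nash family can be illustrated numerically. The sketch below assumes the usual posted-price convention, which the abstract does not spell out: trade happens when seller valuation ≤ price ≤ buyer valuation, the seller's net gain is price minus valuation, and the buyer's is valuation minus price. The two gains are then aggregated by a power (Hölder) mean with exponent p ≤ 0, where p = −∞ gives the minimum (Rawlsian) and p = 0 the geometric mean (Nash-like). The function names and the exact parametrization are illustrative.

```python
import math

def holder_mean(a, b, p):
    """Power (Hölder) mean of two nonnegative numbers for an exponent p <= 0."""
    if a <= 0 or b <= 0:
        return 0.0  # limiting value for p <= 0 when either term vanishes
    if p == -math.inf:
        return min(a, b)          # Rawlsian end of the family
    if p == 0:
        return math.sqrt(a * b)   # geometric mean, the Nash end
    return ((a ** p + b ** p) / 2) ** (1 / p)

def fair_gain(price, seller_value, buyer_value, p):
    """Fair gain of one round; zero when the posted price is rejected by either side."""
    if not (seller_value <= price <= buyer_value):
        return 0.0
    return holder_mean(price - seller_value, buyer_value - price, p)

# Seller values the item at 2, buyer at 9, posted price 5: net gains are 3 and 4.
for p in (-math.inf, -1, 0):
    print(p, round(fair_gain(5, 2, 9, p), 4))
```

The three printed values increase from the minimum through the harmonic mean to the geometric mean, showing how the parameter moves the objective from protecting the worse-off trader toward rewarding balanced but larger total surplus.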

2606.15367 2026-06-16 cs.AI cs.CL cs.IR cs.LG 新提交

S1-DeepResearch: Beyond Search, Toward Real-World Long-Horizon Research Agents

S1-DeepResearch:超越搜索,迈向真实世界的长周期研究智能体

Yao Dong, Xinglin Xiao, Liwei Dong, Xinlong Jin, Zhengbo Li, Heng Zhang, Duyun Wang, Nan Xu

发表机构 * XScience Lab(XScience实验室) Wenge AI(问格人工智能)

AI总结 提出统一轨迹构建范式,结合封闭式问答与开放式探索,通过基于图的任务构建、智能体轨迹生成和多维验证,合成高质量长链推理轨迹,训练出在20个基准上达到开源最优的32B模型。

详情
AI中文摘要

深度研究智能体旨在通过长周期规划、证据收集、推理和报告生成来解决复杂的知识密集型任务。尽管搜索智能体近期在信息检索和答案验证方面展现出强大能力,但现有训练数据集大多以搜索为中心,主要关注封闭式问答和信息定位。因此,它们主要训练信息寻求行为,而对关键深度研究能力(包括证据整合、知识综合、规划、文件理解和结构化报告生成)的覆盖有限。在这项工作中,我们提出了一种用于深度研究智能体的统一轨迹构建范式,该范式结合了封闭式问答和开放式探索。所提出的框架包括基于图的任务构建、智能体轨迹展开和多维轨迹验证,能够可扩展地合成涵盖长链复杂推理、深度研究指令遵循、报告撰写、文件理解与生成以及技能使用的高质量智能体轨迹。与现有的面向搜索的数据集相比,我们合成的轨迹更强调知识综合、复杂推理和规划。S1-DeepResearch-32B在跨越五个能力维度(包括复杂推理、指令遵循、报告生成、文件理解和技能使用)的20个基准测试中,达到了同等规模开源模型的最先进性能。在几个具有挑战性的深度研究基准上,它接近领先的专有前沿模型的性能。这些结果强调了联合建模信息获取、知识综合和面向规划的智能体行为对于构建有效深度研究智能体的重要性。

英文摘要

Deep research agents aim to solve complex knowledge-intensive tasks through long-horizon planning, evidence gathering, reasoning, and report generation. While recent progress in search agents has demonstrated strong capabilities in information retrieval and answer verification, most existing training datasets remain search-centric, focusing primarily on closed-ended question answering and information localization. As a result, they mainly train information-seeking behavior while providing limited coverage of key deep research capabilities, including evidence integration, knowledge synthesis, planning, file understanding, and structured report generation. In this work, we propose a unified trajectory construction paradigm for deep research agents that combines closed-ended QA and open-ended exploration. The proposed framework consists of graph-grounded task formulation, agentic trajectory rollout, and multi-dimensional trajectory verification, enabling scalable synthesis of high-quality agentic trajectories spanning long-chain complex reasoning, deep research instruction following, report writing, file understanding and generation, and skills usage. Compared with existing search-oriented datasets, our synthesized trajectories place greater emphasis on knowledge synthesis, complex reasoning, and planning. S1-DeepResearch-32B achieves state-of-the-art performance among open-source models of comparable scale across 20 benchmarks spanning five capability dimensions, including complex reasoning, instruction following, report generation, file understanding, and skills usage. On several challenging deep research benchmarks, it approaches the performance of leading proprietary frontier models. These results highlight the importance of jointly modeling information acquisition, knowledge synthesis, and planning-oriented agent behaviors for building effective deep research agents.

2606.15363 2026-06-16 cs.AI New submission

APEX: Adaptive Principle EXtraction -- A Three-Layer Self-Evolution Framework for Production AI Agents

Ya-Chuan Chen, Tien-Jen Lai, Hsiang-Wei Hu

Affiliations: Grace AI Technology

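AI summary: Proposes APEX, a framework that improves AI agent performance through three-layer co-evolution (prompt-harness patching, principle distillation, and workflow topology selection), raising the health score from 0.300 to 0.570 (+90%) on a 15-node compute fleet.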

Comments 8 pages, 1 figure, 4 tables. Evaluated on a production 15-node compute fleet with 114 real task traces. Code available at https://aispark.airlive.com/joe-hackathon/

English abstract

Self-improvement in AI agents has emerged as a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework [1] achieves 14--21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness. However, Self-Harness optimises only one dimension -- the prompt harness -- leaving behavioural principles and workflow topology unchanged. We propose APEX (Adaptive Principle EXtraction), a three-layer co-evolution framework that simultaneously evolves: (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation [2], and (L3) the agent workflow topology via structural fitness-based selection [6]. We implement APEX on Joe [13], a production-grade super AI Agent built on NVIDIA Nemotron and designed as an Edge AI Agent Factory for the NVIDIA Agent Challenge 2026, managing a 15-node compute fleet using 114 real task traces collected over 18 days. APEX achieves an APEX Health Score of 0.570 (+90% vs. baseline 0.300) in a single evolutionary run, distilling 6 novel reusable principles and selecting a research-first workflow topology scoring 0.900 (+20%). Our results demonstrate that multi-dimensional co-evolution substantially outperforms single-axis harness optimisation, at a cost of only 4 LLM calls (~270 s) on a local qwen2.5-coder:32b instance.
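The percentages quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is not from the paper: `relative_gain` is a hypothetical helper, and the implied topology baseline assumes the "+20%" is a relative gain over an unstated baseline score.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Health score figures quoted in the abstract.
health_gain = relative_gain(0.570, 0.300)
print(f"health score gain: {health_gain:+.0%}")

# Assumption: "+20%" for the workflow topology is relative to its own baseline,
# so the implied (unstated) baseline would be 0.900 / 1.20.
implied_topology_baseline = 0.900 / 1.20
print(f"implied topology baseline: {implied_topology_baseline:.3f}")
```

Under that assumption the health score gain works out to 90% and the implied topology baseline to about 0.75; the abstract itself does not state the topology baseline.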

2606.15359 2026-06-16 cs.LG New submission

DiRecT: Safe Diffusion-Based Planning via Receding-Horizon Denoising

Paolo Giaretta, Zeyang Li, Navid Azizan

Affiliations: MIT

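AI summary: Proposes DiRecT, an algorithm that uses stochastic optimal control to enforce constraints only on the final clean trajectory, avoiding over-constraining the intermediate denoising steps and improving both safety and task performance in diffusion-based planning.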

English abstract

Diffusion models have emerged as powerful tools for planning and control by learning multimodal distributions over actions and trajectories. Yet reliable inference-time safety enforcement remains a key barrier to their deployment in safety-critical tasks. Existing approaches typically project each denoising iterate onto the feasible set, even though constraints are defined only on the final clean trajectory. Enforcing feasibility on noisy intermediate samples can therefore overconstrain the sampling dynamics, substantially degrading sample quality. To address this limitation, we introduce DiRecT (Diffusion-based planning via Receding-horizon denoising with Terminal constraints), a training-free algorithm for constrained sampling from diffusion models via stochastic optimal control (SOC). DiRecT enforces constraints only on the final clean sample, avoiding unnecessary restrictions on the intermediate denoising dynamics. Inspired by model predictive control, we derive a principled receding-horizon surrogate for the otherwise intractable constrained SOC formulation, yielding an efficient algorithm that cleanly separates stochastic denoising from constraint satisfaction, progressively steering samples toward feasible final trajectories without distorting the learned diffusion dynamics. Furthermore, DiRecT is highly flexible: it can leverage off-the-shelf or domain-specific optimizers, incorporate priors over environment dynamics, and optimize additional soft rewards. Extensive experiments on safe planning benchmarks demonstrate that DiRecT substantially improves deployment safety and task performance over existing diffusion-based planning baselines.
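To make the distinction in the abstract concrete, the toy sketch below contrasts projecting every denoising iterate onto a feasible box with constraining only the final sample. It is not DiRecT and not taken from the paper: `toy_denoise_step` is a hypothetical stand-in for a learned denoiser, the 1-D two-mode target and the box [1, 3] are invented for illustration, and the terminal-only variant here is a naive final projection rather than the receding-horizon SOC surrogate the authors describe.

```python
import numpy as np


def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]."""
    return np.clip(x, lo, hi)


def toy_denoise_step(x, noise_scale, rng):
    """Hypothetical stand-in for one learned denoising step: pull each sample
    halfway toward the nearer of two modes at -2 and +2, then add noise."""
    target = np.where(x >= 0, 2.0, -2.0)
    return x + 0.5 * (target - x) + noise_scale * rng.standard_normal(x.shape)


def sample(n, steps, lo, hi, project_every_step, seed=0):
    rng = np.random.default_rng(seed)
    x = 3.0 * rng.standard_normal(n)            # start from pure noise
    for k in range(steps, 0, -1):
        noise_scale = 0.5 * (k - 1) / steps     # noise decays to zero at the last step
        x = toy_denoise_step(x, noise_scale, rng)
        if project_every_step:
            x = project_box(x, lo, hi)          # constrain noisy intermediate iterates
    return project_box(x, lo, hi)               # constraint on the final clean sample


per_step = sample(1000, 20, 1.0, 3.0, project_every_step=True)
terminal = sample(1000, 20, 1.0, 3.0, project_every_step=False)
print("per-step projection mean:", per_step.mean())
print("terminal-only projection mean:", terminal.mean())
```

Both variants return feasible samples, but they get there differently: per-step projection forces every iterate into the box from the first step, so the sampler never follows the mode at -2, while the naive terminal projection lets the dynamics run freely and then clips infeasible samples to the boundary. The abstract's claim is that steering toward feasibility, rather than either of these naive strategies, avoids distorting the learned dynamics.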

2606.15355 2026-06-16 cs.CV New submission

Sustainable Face Recognition on Low-Power Devices with VQ-VAE Embeddings

Christos Chronis, Georgios Th. Papadopoulos, Iraklis Varlamis

Affiliations: University of Cambridge

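AI summary: Proposes a sustainable edge face recognition framework based on VQ-VAE that uses compact latent representations and knowledge distillation to reach accuracy comparable to state-of-the-art models on low-power devices while reducing memory and computation requirements.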

English abstract

Face recognition has become a cornerstone of modern AI applications, yet conventional approaches often rely on computationally intensive models deployed in cloud environments, leading to increased network traffic, high energy consumption, and a heavy carbon footprint. This work introduces a sustainable, edge-deployable face recognition framework based on Vector-Quantized Variational Autoencoders (VQ-VAE), which generates compact and semantically rich latent representations of facial images. By leveraging the compression capacity and reconstruction quality of VQ-VAE embeddings on the edge and combining them with the power of pre-trained face embeddings in a knowledge distillation setup, our system achieves comparable accuracy to state-of-the-art face embedding models while significantly reducing memory and computation requirements on the edge, making it suitable for low-power edge devices. The integration of VQ-VAE compression minimizes network overhead while keeping the matching accuracy high by retaining only the most informative facial features in the latent space. As a result, the reconstructed images preserve the key identity characteristics, improving the robustness and overall performance of the face embeddings.
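The compression the abstract relies on comes from the vector-quantization step of a VQ-VAE, in which each latent vector is replaced by the index of its nearest codebook entry. The sketch below shows only that generic lookup, not the authors' model: the codebook size (512), latent dimension (64), and 7x7 latent grid are illustrative assumptions, and the random arrays stand in for a trained encoder and codebook.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every latent (N, D) and every code (K, D).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]


rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 64)).astype(np.float32)  # K=512 codes, D=64 (assumed sizes)
latents = rng.standard_normal((49, 64)).astype(np.float32)    # e.g. a 7x7 latent grid (assumed)

indices, quantized = quantize(latents, codebook)

# Only the indices need to be stored or transmitted; the receiver holds the codebook.
bits_per_index = int(np.ceil(np.log2(len(codebook))))
payload_bits = indices.size * bits_per_index
raw_bits = latents.size * 32
print(f"{payload_bits} bits as indices vs {raw_bits} bits as float32 latents")
```

With these assumed sizes, sending indices instead of raw float32 latents shrinks the payload by more than two orders of magnitude, which is the kind of saving that would reduce network overhead on an edge device. The actual ratio depends on the paper's architecture, which the abstract does not specify.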