arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.12689 2026-06-12 cs.CL New submission

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

Darpan Aswal, Thomas Palmeira Ferraz, Yongxin Zhou, Maxime Peyrard

Affiliations: Université Grenoble Alpes, CNRS, Grenoble INP, LIG; Université Paris-Saclay; NAVER LABS Europe

AI summary: Using controlled experiments and causal interventions, this paper finds that observable patterns in latent reasoning models (such as BFS frontiers) also appear in controls and do not always causally affect behavior. It argues that latent-thought use is graded, that its causal effect concentrates in low-rank directions, and that the geometry becomes more ordered as behavioral influence grows.

Abstract

Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts. Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for internal reasoning mechanisms. Evaluating two LRMs (Coconut and CODI) against controls lacking the proposed recurrence or curriculum, we find these patterns also appear in the controls and do not always causally affect behavior. Causal interventions reveal that latent-thought utilization is not binary but graded, scaling with a thought's causal effect on model behavior. Geometric analyses reveal this effect concentrates in low-rank directions whose step-to-step geometry grows more structured as their behavioral influence increases. Latent thoughts should therefore be treated as hidden computation, not hidden explanation: decodability, attention, or static structure alone cannot establish mechanism. LRM interpretability thus requires matched controls and causal tests.

2606.12687 2026-06-12 cs.LG New submission

Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models

Yunbo Wang, Bolbi Liu

Affiliations: University of California, Irvine; AdsGency AI

AI summary: Addressing graph-based neural marketing mix models that forecast accurately yet fail at attribution, this paper proposes the DICE-MMM framework, which diagnoses and localizes attribution bypass by restricting the decoder's communication paths. Experiments show that low forecasting error does not guarantee correct attribution.

Abstract

Marketing mix models are used to forecast business outcomes and to attribute those outcomes to marketing channels, but these goals are not equivalent. We study a failure mode in graph-based neural MMM called attribution bypass: a high-capacity decoder can obtain low forecasting error through target autoregression, dense communication, co-movement, context, or latent memory while failing to route counterfactual sensitivity through the graph used as the attribution object. We introduce DICE-MMM as a bounded diagnostic and training framework. We do not claim that observational neural MMM identifies causal effects. Instead, DICE separates three questions often conflated in graph-based MMM: graph recovery, forecasting accuracy, and whether the trained decoder's perturbation-induced influence is graph aligned. Stage 1 trains a graph encoder with a restricted graph-mediated decoder. Stage 2 freezes the selected encoder and trains a graph-safe latent decoder whose cross-node communication must pass through the supplied graph. Decoder use is evaluated with CIG, AR-CIG, and graph-swap tests. Across controlled R/d/T swaps and an external multi-graph rawlog stress test, DICE improves stable graph recovery over CausalMMM. The experiments show that forecasting accuracy is not an attribution certificate: in a sparse-target benchmark, no-graph and full-graph decoders achieve MSE@7 around 0.004 while AR-CIG nAUPRC remains near or below zero, whereas an oracle graph reaches 0.807 +/- 0.129 at comparable MSE. Frozen graph-swap localizes the bottleneck: the same DICE-hard-trained decoder moves from nAUPRC -0.044 +/- 0.006 under learned graph inputs to 0.894 +/- 0.027 with the oracle graph. The contribution is a stress test and failure-localization framework showing that low MSE can hide attribution bypass and that the unresolved bottleneck is graph-support selection, not forecasting or decoder capacity.

2606.12683 2026-06-12 cs.AI cs.CY cs.LG New submission

From AGI to ASI

Tim Genewein, Matija Franklin, Alexander Lerchner, Laurent Orseau, Samuel Albanie, Adam Bales, Cole Wyeth, Stephanie Chan, Iason Gabriel, Joel Z. Leibo, Allan Dafoe, Marcus Hutter, Thore Graepel, Shane Legg

Affiliations: Google DeepMind; University of Waterloo; Australian National University; University College London

AI summary: Examines pathways from human-level artificial general intelligence to superintelligence, including scaling, paradigm shifts, recursive improvement, and multi-agent emergence, and analyzes the frictions and bottlenecks along them.

Abstract

Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence. The endpoint of this continuum, Universal AI, is theoretically well understood, which provides some formal grounding for the main focus of this report: the transition from human-level AGI to artificial general superintelligence, which, intuitively, can be understood as a system that is more intelligent and cognitively capable than large organisations of humans. After characterizing ASI, the report discusses four potential pathways from AGI to ASI: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large-scale multi-agent collectives. The report then discusses possible frictions and bottlenecks along these pathways. Determining whether the impact of these frictions will be negligible or substantial raises a number of concrete open research questions. Due to large uncertainties for predicting ASI progress, it cannot be ruled out that AI progress might continue to accelerate over the next years. This could imply that the image of a single transformative step change, caused by the introduction of human-level AGI into our society, could be inaccurate. More apt might be the prospect of a series of transformative societal changes caused by AI-enabled progress and breakthroughs across many areas of science and technology. Preparing for this prospect requires a massively interdisciplinary endeavour of global scope and interest.

2606.12680 2026-06-12 cs.LG stat.ML New submission

How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?

Julia Kostin, Kasra Jalaldoust, Elias Bareinboim, Samory Kpotufe, Fanny Yang

Affiliations: Department of Computer Science, ETH Zurich; Causal Artificial Intelligence Lab, Columbia University; Department of Statistics, Columbia University

AI summary: Studies how causal invariance can improve supervised domain adaptation in linear regression. Matching upper and lower bounds are derived from the target-risk margins between candidate predictors and the finite-sample estimation error, showing that adaptive aggregation avoids negative transfer when the margins are large enough.

Abstract

Machine learning models often degrade when they are deployed on a target distribution that differs from the source distributions they were trained on. Recent work in causality-based domain generalization has shown how shared causal structure between domains can induce invariant predictors, e.g., models on a subset of features which have stable risk across structured domain shifts. However, the extent to which such population-level causal invariances can lead to gains in finite-sample settings remains underexplored. In particular, in practice we often have access to a few labeled target samples, a setting called supervised domain adaptation (sDA). In this paper, we explore when (full or partial) causal knowledge can provably improve supervised domain adaptation. As a first step, we study linear regression, where full or partial causal knowledge specifies a collection of invariant or possibly invariant feature subsets, each yielding a source-trained candidate predictor. We derive matching upper and lower bounds showing that finite-sample gains are governed by the target-risk margins separating the candidates, together with the finite-source estimation error. When these margins are sufficiently large relative to $n_Q$, an adaptive aggregation procedure can match the best candidate predictor while avoiding negative transfer relative to target-only learning. On the other hand, when the margins are too small, no algorithm can reliably exploit the candidate collection to obtain faster finite-sample rates. We further connect these margins to structural shift magnitude in linear SCMs and validate the theory on real-world causal benchmarks.

2606.12679 2026-06-12 cs.LG cs.CR eess.IV New submission

Fed-FBD: Federated Functional Block Diversification for Isolation, Privacy, and Surgical Unlearning

Weijie Chen, Alan B. McMillan

Affiliations: University of Wisconsin–Madison

AI summary: Proposes Fed-FBD, a modular federated architecture that decomposes a ResNet into six functional blocks and maintains a warehouse of color variants. It provides block-level isolation, privacy by design, and sub-second surgical unlearning, trading a small accuracy cost for these guarantees across several datasets.

Comments: 12 pages, 3 figures, 8 tables. Code: https://github.com/wchen-ai/functional-block-diversification

Abstract

Federated learning (FL) enables collaborative model training without sharing raw patient data, but standard approaches such as FedAvg treat each client as a black box and provide no mechanism for isolating an adversarial contributor, auditing per-client influence, or honoring a departed participant's right to be forgotten. We present Fed-FBD (Federated Functional Block Diversification), a modular federated architecture that decomposes a ResNet backbone into six functional blocks (the stem, four residual groups, and the classification head) and maintains a warehouse of N color variants, each assembled from independently tracked and contributor-stamped blocks. Fed-FBD provides three capabilities absent in FedAvg: (i) architecturally guaranteed block-level isolation, so that an adversarial or mislabelled client cannot contaminate the clean colors; (ii) privacy-by-design, where membership inference advantage is already indistinguishable from chance before any privacy mechanism is applied; and (iii) surgical machine unlearning of a departed participant's contribution at sub-second cost and without retraining. Experiments on six MedMNIST-2D datasets, PathMNIST at 224x224, and CIFAR-10 show that Fed-FBD trades a modest 0.3%-3.1% IID accuracy gap on the adequately sized datasets for these guarantees, remains within 0.8%-4.0% of FedAvg at Dirichlet alpha=1.0 on three of four datasets, and confines all six adversarial attacks we study to the poisoned client's own blocks with at most +/-0.01 AUC drift on the clean colors.

2606.12673 2026-06-12 cs.LG cs.AI New submission

A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

Phan Nguyen, Dat Cao, Hien Chu, Khue Hoang

Affiliations: School of Computing, KAIST

AI summary: Proposes the AlignGAD framework, which performs zero-shot cross-domain graph anomaly detection through a global unification module that aligns heterogeneous features, a clustering module that captures group-level abnormal patterns, and a node discrepancy scoring module that aggregates anomaly evidence across views.

Abstract

Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world applications with heterogeneous graph data. However, existing methods often depend on dataset-specific feature semantics and structural patterns, which limits their ability to generalize across different domains. To address this challenge, we propose AlignGAD, a zero-shot generalized graph anomaly detection framework. Our framework is built upon three key components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that constructs cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancy and aggregates anomaly evidence from different graph views. Experiments on multiple real-world datasets demonstrate the effectiveness of AlignGAD under the zero-shot GAD setting.

2606.12662 2026-06-12 cs.SD cs.AI cs.LG New submission

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

Damien Martins Gomes, François Capman

Affiliations: Thales SIX GTS, France

AI summary: Proposes BASENet, which partitions the spectrum into Bark-scale bands, assigns each band an encoder of adapted capacity, and adds a cross-band attention module, achieving high PESQ and STOI with minimal parameters. It is suited to resource-constrained devices.

Abstract

Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and a STOI of 96% on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.
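
As a rough illustration of the Bark-scale partitioning mentioned in this abstract, the sketch below groups FFT bins by the integer Bark band of their centre frequency. The Zwicker-Terhardt approximation, the function names, the 16 kHz sampling rate, and the 512-point FFT are all assumptions made for illustration; the abstract does not say how BASENet derives its bands.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Zwicker-Terhardt approximation of the Bark scale (an assumed choice)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def bark_band_bins(sample_rate: int = 16000, n_fft: int = 512) -> dict:
    """Group FFT bin indices by the integer Bark band of their centre frequency."""
    bands = {}
    for k in range(n_fft // 2 + 1):
        freq = k * sample_rate / n_fft  # centre frequency of bin k in Hz
        band = int(hz_to_bark(freq))    # truncate to the band index
        bands.setdefault(band, []).append(k)
    return bands


bands = bark_band_bins()
for band, bins in sorted(bands.items()):
    print(f"Bark band {band:2d}: {len(bins):3d} bins")
```

Under these assumed settings the low bands cover only a few linear-frequency bins each, while the high bands cover many more, which is the non-uniformity a band-adapted encoder would exploit.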

2606.12658 2026-06-12 cs.LG q-bio.QM stat.ML New submission

Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability

Riya Bisht, Dhruv Agarwal

Affiliations: University of California, Berkeley

AI summary: Applies physics-informed neural networks (PINNs) to chemotherapy pharmacokinetics. The PINN matches the clinical standard method on the linear two-compartment model, exposes parameter non-identifiability in the Michaelis-Menten extension, and partially restores identifiability using sparse tissue observations.

Abstract

Physics-Informed Neural Networks (PINNs) are an attractive tool for partial-observation problems in biology, where the governing dynamics are known but some compartments cannot be measured. Chemotherapy pharmacokinetics (PK) is a clean instance: drug concentration in plasma is routinely measured, but concentration in tissue -- which determines tumour kill and off-target toxicity -- is not. We benchmark a PINN against the standard clinical baseline (nonlinear least-squares on the analytical biexponential plasma solution, hereafter NLS) and a physics-agnostic neural baseline (a data-only MLP) on two PK problems. On the linear two-compartment problem, NLS is near-optimal; the PINN matches it to within a small constant factor while also producing the tissue curve in a single training pass, whereas the data-only MLP fails on tissue by roughly 10x. On a Michaelis-Menten extension (saturable elimination), the biexponential closed form no longer exists, so NLS is mis-specified and silently returns meaningless rate constants. The PINN instead exposes a deeper fact: the Michaelis-Menten two-compartment model is non-identifiable from plasma alone, and the PINN reports this honestly by converging to a basin with k12 -> 0. Adding two sparse tissue observations largely resolves identifiability: across five seeds the PINN recovers k21 to within 1% of truth and Vmax, Km to within one standard-deviation bar, while k12 moves in the correct direction (0.02 -> 0.82) but remains ~2 sigma below truth -- a recovery the closed-form NLS estimator cannot attempt at all, because its biexponential ansatz describes only plasma. Our claim is not that PINNs beat NLS. It is that PINNs offer a uniform recipe that ties the textbook estimator on the textbook problem, exposes structural identifiability that the textbook estimator hides, and absorbs heterogeneous measurements within a single loss.
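
To make the two problems in this abstract concrete, here is a minimal forward simulation of a two-compartment model with an optional Michaelis-Menten elimination term. It is only the governing ODE that a PINN residual would be built on, not a PINN and not the authors' code. Equal compartment volumes, the parameter values, and the function names `pk_rhs` and `simulate` are assumptions for illustration.

```python
def pk_rhs(cp, ct, k12, k21, k10=0.0, vmax=0.0, km=1.0):
    """Right-hand side for plasma (cp) and tissue (ct) concentrations.

    Plasma elimination is linear (k10) plus an optional saturable
    Michaelis-Menten term vmax * cp / (km + cp). Equal compartment
    volumes are assumed so concentrations exchange directly.
    """
    elimination = k10 * cp + vmax * cp / (km + cp)
    dcp = -k12 * cp + k21 * ct - elimination
    dct = k12 * cp - k21 * ct
    return dcp, dct


def simulate(cp0, ct0, t_end, n_steps, **params):
    """Integrate with classical fourth-order Runge-Kutta; returns (t, cp, ct) tuples."""
    h = t_end / n_steps
    cp, ct = cp0, ct0
    out = [(0.0, cp, ct)]
    for i in range(n_steps):
        a1, b1 = pk_rhs(cp, ct, **params)
        a2, b2 = pk_rhs(cp + 0.5 * h * a1, ct + 0.5 * h * b1, **params)
        a3, b3 = pk_rhs(cp + 0.5 * h * a2, ct + 0.5 * h * b2, **params)
        a4, b4 = pk_rhs(cp + h * a3, ct + h * b3, **params)
        cp += h * (a1 + 2 * a2 + 2 * a3 + a4) / 6.0
        ct += h * (b1 + 2 * b2 + 2 * b3 + b4) / 6.0
        out.append(((i + 1) * h, cp, ct))
    return out


# Linear elimination (the biexponential case) versus saturable elimination.
linear = simulate(10.0, 0.0, t_end=24.0, n_steps=2400, k12=0.5, k21=0.3, k10=0.2)
saturable = simulate(10.0, 0.0, t_end=24.0, n_steps=2400, k12=0.5, k21=0.3, vmax=2.0, km=1.5)
print("linear    final (cp, ct):", linear[-1][1:])
print("saturable final (cp, ct):", saturable[-1][1:])
```

Only the plasma curve `cp` would normally be observed; the tissue curve `ct` is the unmeasured compartment the abstract refers to.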

2606.12657 2026-06-12 cs.AI cs.DB cs.RO New submission

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

Siyu Li, Toan Tran, Lingyi Zhao, Khurram Shafique, Li Xiong

Affiliations: Emory University; University of Florida

AI summary: Proposes TrajGenAgent, a hierarchical LLM-agent framework that needs no fine-tuning and generates realistic trajectories through a two-stage orchestrator-worker design, outperforming existing methods in spatiotemporal fidelity, semantic coherence, and individual behavioral realism.

Comments: 14 pages, 2 figures, 8 tables. Accepted by the 27th IEEE International Conference on Mobile Data Management (MDM 2026)

Abstract

Human mobility data is important for transportation, urban planning, and epidemic control, but large-scale trajectory collection is often costly and privacy-constrained, motivating realistic synthetic trajectory generation. Existing LLM-based generators typically rely on either prompt engineering, which preserves zero-shot reasoning but lacks fine-grained spatiotemporal grounding, or trajectory-level fine-tuning, which improves statistical precision but incurs substantial computational cost and may weaken general reasoning. We propose TrajGenAgent, a semantic-aware hierarchical LLM-agent framework for human mobility trajectory generation without model fine-tuning. TrajGenAgent uses a two-stage orchestrator-worker design: an LLM first synthesizes an individual- and weekday-conditioned activity chain from historical evidence via in-context learning, and a deterministic workflow then grounds each activity into a complete visit using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. To evaluate realism beyond aggregate spatiotemporal statistics, we introduce an anomaly-detection-based evaluation framework using two complementary detectors to assess behavioral and semantic plausibility. Experiments on benchmark and large-scale simulation datasets show that TrajGenAgent improves spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over representative neural and LLM-based baselines, while avoiding parameter updates.

2606.12649 2026-06-12 cs.CL New submission

MentalMARBERT: Domain-Adaptive Pre-training and Two-Stage Fine-Tuning for Arabic Mental Health Disorders Detection

Fatimah Almalki, Areej Alhothali, Lulwah Alharigy, Abdulrahman Aladeem

Affiliations: King Abdulaziz University

AI summary: To address dialectal variation, informal language, limited annotated resources, and class imbalance in detecting mental health disorders from Arabic social media text, this paper proposes a framework of domain-adaptive pre-training and two-stage fine-tuning and builds a dataset of about 50,000 tweets. MentalMARBERT reaches a macro-F1 of 0.861 and an accuracy of 0.877.

Comments: 17 pages, 5 figures, 13 tables

Abstract

Detecting mental health disorders from Arabic social media text remains challenging due to dialectal variation, informal language, limited high-quality annotated resources, and severe class imbalance. While English mental health natural language processing (NLP) has progressed substantially, Arabic multi-class disorder classification remains insufficiently studied. This study proposes a two-phase framework for Arabic mental health text classification. In phase 1, three Arabic pre-trained language models, AraBERT, CAMeLBERT, and MARBERT, undergo Domain-Adaptive and Task-Adaptive Pretraining (DAPT and TAPT) using a large-scale corpus of unlabeled Arabic mental health tweets. The adapted models are evaluated under a unified protocol to identify the most effective backbone model. In phase 2, the selected model is assessed across four configurations combining single-stage and hierarchical two-stage classification architectures with full fine-tuning and Low-Rank Adaptation (LoRA). To support this study, we constructed a novel annotated Arabic mental health dataset comprising 50,670 tweets across six categories, with strong inter-annotator agreement (Krippendorff's Alpha = 0.733, average pairwise agreement = 0.797). Experimental results show that the domain-adapted MARBERT (MentalMARBERT) achieves statistically significant improvements over baseline models in both accuracy and macro-F1. The hierarchical two-stage architecture combined with full fine-tuning achieves the best overall performance, reaching a macro-F1 of 0.861 and an accuracy of 0.877. These findings demonstrate the effectiveness of domain-specific adaptive pretraining and hierarchical classification for Arabic mental health disorder detection.

2606.12643 2026-06-12 cs.LG New submission

TEDD: Robust Detection of Unstable Temporal Features

Ricardo Ribeiro Pereira, Bruno Casal Laraña, Nádia Soares, Miguel Araújo

Affiliations: Feedzai

AI summary: Proposes TEDD, which uses a regression model to detect the features responsible for temporal distribution change. It requires no parameter tuning, scales well, and detects univariate and multivariate drift in both numerical and categorical features.

Comments: 8 pages, 9 figures

Abstract

When working with real-world temporal data, it is common to encounter features whose distribution is changing over time. The naive employment of Machine Learning models on this unstable data might lead to rapidly degrading performance, especially if the new distribution is much different from what was previously seen during training. In order to cope with this problem, it is critical to automatically identify features that are changing over time. With these features detected, data scientists and other practitioners will be able to mitigate the issue (for instance, by applying data transformations), deploying more robust models that retain high performance for longer periods of time. In this paper, we describe which temporal changes a feature should not suffer from, and propose TEDD, a technique to a) identify when a dataset might lead to an unstable Machine Learning model and b) automatically detect which features cause such lack of robustness. In order to achieve it, we leverage a regression model to highlight which features contribute to a good prediction of an instance's timestamp. We compare our approach to other methods in real and synthetic data, testing their detection capability on all simple change patterns. We show that our method: detects all types of basic changes, both for numerical and categorical features; can detect multivariate drifts; returns a comparable value measuring the amount of change of each feature; requires no parameter tuning; and is scalable both on number of features and instances of the dataset.
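
The core idea in this abstract, that a feature is unstable if it helps predict an instance's timestamp, can be sketched in a heavily simplified form. The version below fits one univariate least-squares line per feature and uses its R^2 as a drift score. This is an assumption made for brevity, not the TEDD procedure: a single-feature linear fit misses non-monotonic and multivariate drift, and the names `r_squared` and `drift_scores` and the synthetic data are hypothetical.

```python
import random


def r_squared(x, t):
    """R^2 of the least-squares line predicting t from x (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_t = sum(x) / n, sum(t) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    stt = sum((b - mean_t) ** 2 for b in t)
    if sxx == 0.0 or stt == 0.0:
        return 0.0
    sxt = sum((a - mean_x) * (b - mean_t) for a, b in zip(x, t))
    return (sxt * sxt) / (sxx * stt)


def drift_scores(features, timestamps):
    """Score each feature by how well it alone predicts the timestamp."""
    return {name: r_squared(column, timestamps) for name, column in features.items()}


# Synthetic data: one stationary feature, one whose mean shifts over time.
rng = random.Random(0)
timestamps = [float(i) for i in range(500)]
features = {
    "stable": [rng.gauss(0.0, 1.0) for _ in timestamps],
    "drifting": [0.01 * t + rng.gauss(0.0, 1.0) for t in timestamps],
}
scores = drift_scores(features, timestamps)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because every feature is scored on the same 0 to 1 scale, the scores are directly comparable across features, which mirrors the "comparable value" property the abstract lists.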

2606.12640 2026-06-12 cs.LG cs.RO cs.SY eess.SY New submission

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi

Affiliations: Department of Electrical Engineering and Automation, Aalto University; School of Computing and Data Science, Xiamen University Malaysia; Department of Computer Science, University of Toronto

AI summary: Proposes an offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into a diffusion model and recovers control policies through inverse dynamics, substantially improving the safety of generated trajectories while preserving reward.

Comments: Accepted to the 23rd IFAC World Congress, 2026

Abstract

Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.
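
For readers unfamiliar with control barrier functions, the sketch below checks a standard discrete-time barrier condition, h(x_{k+1}) >= (1 - alpha) * h(x_k), along a candidate trajectory for one agent and one circular obstacle. The hand-written distance barrier, the value of `alpha`, and the function names are assumptions for illustration; the paper learns neural barrier functions and uses them to guide a diffusion model, which this sketch does not attempt.

```python
import math


def barrier(pos, obstacle, radius):
    """h(x) >= 0 exactly when the agent is outside the obstacle's safety radius."""
    return math.dist(pos, obstacle) - radius


def cbf_violations(trajectory, obstacle, radius, alpha=0.2):
    """Return step indices k where h(x_{k+1}) < (1 - alpha) * h(x_k)."""
    h = [barrier(p, obstacle, radius) for p in trajectory]
    tol = 1e-12  # guard against floating-point noise at the boundary
    return [k for k in range(len(h) - 1) if h[k + 1] < (1.0 - alpha) * h[k] - tol]


obstacle, radius = (0.0, 0.0), 1.0
moving_away = [(2.0, 0.0), (2.5, 0.0), (3.0, 0.0)]
heading_in = [(2.0, 0.0), (1.2, 0.0), (0.9, 0.0)]

print(cbf_violations(moving_away, obstacle, radius))  # no step breaks the condition
print(cbf_violations(heading_in, obstacle, radius))   # both steps approach too fast
```

A check like this could serve as a filter or a penalty on generated trajectories; how the barrier signal enters the diffusion sampler is specific to the paper and is not reproduced here.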

2606.12639 2026-06-12 cs.LG q-bio.QM New submission

The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

Dhruv Agarwal, Riya Bisht

Affiliations: University of California, Berkeley

AI summary: Using VCPI contest data, this study finds that the ranking of drug-response prediction models inverts with the evaluation metric: simple baselines win under a proxy metric, but under the true metric the deep models significantly outperform the linear fingerprint baseline. The authors describe this as the first demonstration of the metric-calibration effect on real drug chemistry data.

Abstract

Predicting how a cell's transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound's nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 drug-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa -- the textbook "simple baselines win" result. But under the contest's true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers' 0.507 reference), that reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy's winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner -- to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1064 x 12,995 grid.
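
The non-parametric retrieval stage described in this abstract (a Tanimoto-weighted average over a held-out compound's nearest training compounds) might look roughly like the sketch below. Fingerprints are represented as sets of on-bit indices, and the value of `k`, the fallback to the mean training response, the function names, and the toy data are assumptions for illustration rather than details from the paper.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_response(query_fp, train_fps, train_responses, k=3):
    """Tanimoto-weighted average response of the k most similar training compounds."""
    ranked = sorted(
        ((tanimoto(query_fp, fp), i) for i, fp in enumerate(train_fps)),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in ranked)
    n_genes = len(train_responses[0])
    if total == 0.0:
        # No neighbour shares any bit: fall back to the mean training response.
        n = len(train_responses)
        return [sum(r[g] for r in train_responses) / n for g in range(n_genes)]
    return [
        sum(sim * train_responses[i][g] for sim, i in ranked) / total
        for g in range(n_genes)
    ]


# Toy data: three training compounds, two genes each.
train_fps = [frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({7, 8})]
train_responses = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
prediction = retrieve_response(frozenset({1, 2, 3}), train_fps, train_responses, k=2)
print(prediction)
```

In the toy call the query matches the first compound exactly (similarity 1.0) and shares one of four bits with the second (similarity 0.25), so the prediction is weighted 4:1 toward the first compound's response.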

2606.12635 2026-06-12 cs.CV New submission

CD-RCM: Generalizable Continuous-Depth Novel View Synthesis for Reflectance Confocal Microscopy

Tooba Imtiaz, Milind Rajadhyaksha, Kivanc Kose, Jennifer Dy

发表机构 * Northeastern University(东北大学) Memorial Sloan Kettering Cancer Center(纪念斯隆凯特琳癌症中心)

AI总结 针对反射共聚焦显微镜各向异性3D体积,提出首个RCM专用新视角合成方法CD-RCM,通过前馈模型从稀疏z-stack预测连续深度切片,实现亚秒级高保真合成。

详情
AI中文摘要

反射共聚焦显微镜(RCM)通过获取连续深度处的正面图像,形成稀疏z-stack,从而提供人体皮肤 \emph{体内} 的无创、细胞分辨率“光学活检”。由于光学限制,这些堆栈是各向异性的3D体积,横向分辨率(0.5 $\mu$m)比轴向分辨率(由光学切片定义,3 $\mu$m)高约6倍,限制了组织解释。我们的目标是通过插值中间切片并使3D体积各向同性,提供连续深度可视化。这种表示允许任意方向切片,包括类似组织病理学的横截面检查,无需针对每位患者进行优化。为此,我们引入了首个RCM特定的新视角合成(NVS)方法CD-RCM,这是一种前馈模型,可从稀疏采样的RCM堆栈预测逼真的、未见过的深度。经典神经渲染方法侧重于从表面级多视角观测进行重建。与表面级相机视图不同,RCM可以获取组织表面以下至200 $\mu$m的光学切片正面图像。然而,在可视化RCM堆栈时,较浅切片(朝向表面)的观测会遮挡较深切片。这种独特的轴向成像几何和层依赖性解剖结构促使我们开发了定制的架构和训练框架,明确考虑了RCM的深度分辨、遮挡成像物理特性。实验表明,CD-RCM实现了高保真新视角合成,推理时间低于一秒。

English abstract

Reflectance confocal microscopy (RCM) provides noninvasive, cellular-resolution "optical biopsies" of human skin \emph{in vivo} by acquiring en-face images at successive depths, forming a sparse z-stack. Due to optical limitations, these stacks are anisotropic 3D volumes with lateral resolution (0.5 $\mu$m) $\sim$6 times higher compared to axial resolution, which is defined by the optical sectioning (3 $\mu$m), limiting the interpretation of tissue. Our goal is to provide continuous-depth visualization by interpolating intermediate sections and making the 3D volume isotropic. Such a representation permits arbitrary-direction sectioning, including histopathology-like cross-sectional examination, without requiring per-patient optimization. To that end, we introduce the first RCM-specific novel-view synthesis (NVS) approach, CD-RCM, a feedforward model that predicts realistic, unseen depths from sparsely sampled RCM stacks. Classical neural rendering methods focus on reconstruction from surface-level multi-view observations. In contrast to surface-level camera views, RCM can acquire optically sectioned en-face images of tissue beyond the surface up to 200 $\mu$m. However, during visualization of the RCM stacks, observations of the shallower sections (towards the surface) obscure the deeper ones. This unique axial imaging geometry and layer-dependent anatomical organization motivated our development of a tailored architectural and training framework that explicitly accounts for RCM's depth-resolved, occlusive imaging physics. Experiments demonstrate that CD-RCM achieves high-fidelity novel-view synthesis with sub-second inference time.

2606.12634 2026-06-12 cs.LG cs.AI cs.CL New submission

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

保持策略梯度主导:面向长程工具使用智能体的兄弟引导信用蒸馏

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

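Affiliations Amazon Web Services

AI summary To address the sparse trajectory-level advantage signal in long-horizon tool-use reinforcement learning, the paper proposes Sibling-Guided Credit Distillation (SGCD), which dynamically samples successful and failed rollouts and has an external LLM contrast them into a stepwise credit reference, giving dense credit assignment and improved results on AppWorld and $\tau^3$-airline.

Comments 13 pages, 4 figures, 7 tables. Submitted to EMNLP 2026 Industry Track

Details
AI Chinese abstract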

长程工具使用强化学习可以从结果验证中学习,但其轨迹级优势被广播到许多推理、API和答案令牌上。自蒸馏通过重用策略自身的轨迹或特权教师承诺提供更密集的信号。然而,我们表明直接的令牌级自蒸馏会悄然破坏工具使用:它复述教师行为而不知道验证器奖励哪些动作,因此有用技能和有害捷径被一起放大。我们引入兄弟引导信用蒸馏(SGCD),它使用蒸馏进行信用分配而非作为竞争性的演员损失。动态采样产生混合的成功和失败的兄弟轨迹;外部LLM将其对比总结为训练时逐步信用参考;密集的教师/学生散度驱动信用重新分配;有界分离的信用权重重塑GRPO令牌优势。部署的学生看不到外部LLM、兄弟证据或预言机。在AppWorld和τ³-airline上,SGCD优于匹配的GRPO比较器:AppWorld上test_normal的TGC从42.9提升到45.6,test_challenge从24.7提升到27.0;τ³-airline的pass@1从0.583提升到0.602。

English abstract

Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $\tau^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $\tau^3$-airline pass@1 $0.583 \to 0.602$.
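The idea of reshaping a broadcast trajectory-level advantage with bounded per-token credit weights can be sketched as below. This is an illustrative reading of the abstract, not the authors' implementation: the clipping bounds (0.5 and 1.5), the use of the population standard deviation for group normalization, and all function names are assumptions.

```python
import statistics


def group_advantages(rewards):
    """Group-relative advantages: standardize rewards within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    if std == 0.0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def reshape_token_advantages(advantage, credit, low=0.5, high=1.5):
    """Scale one trajectory-level advantage by clipped per-token credit weights.

    The weights are treated as constants ("detached"): in an autograd
    framework no gradient would flow through them. Bounds are hypothetical.
    """
    weights = [min(max(c, low), high) for c in credit]
    return [advantage * w for w in weights]
```

Without the weights, every token in a rollout receives the same advantage; the clipping keeps any single token from dominating or from flipping the sign of the policy-gradient signal.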

2606.12633 2026-06-12 cs.CV cs.LG New submission

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

ECA:面向开放图像到文本生成的高效持续对齐

Jiangtao Kong, Peijun Zhao, Chun-Fu Chen, Youngwook Do, Shaohan Hu, Tianyi Zhou, Huajie Shao

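Affiliations University of Science and Technology of China

AI summary Proposes ECA, which combines a Mixture of Query module, Fisher Dynamic Expansion, and Dictionary Replay to achieve continual alignment without access to old data, mitigating catastrophic forgetting and improving incremental-learning performance for open-ended image-to-text generation.

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Details
AI Chinese abstract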

开放图像到文本生成(OpenITG)的增量学习(IL)使模型能够持续为新的图像生成准确、上下文相关的文本,同时保留先前获得的知识。与先前研究不同,本文处理了一个更实际的场景,其中视觉数据的主要类别随时间推移而演变。在此背景下,我们引入了持续对齐的新概念,它逐步调整预训练VLM中的对齐模块,以保持高质量的跨模态表示。基于这一思想,我们提出了高效持续对齐(ECA),一种用于OpenITG的无样本IL方法。关键挑战是使模型能够获取新的任务特定特征,同时最小化对已建立对齐的干扰,且无需访问先前任务的原始数据。为此,ECA采用了三种核心机制:混合查询(MoQ)模块,用于适应任务特定的查询令牌;Fisher动态扩展(FeDEx),基于Fisher信息矩阵(FIM)度量动态扩展模型结构;以及带有字典重放(DR)的嵌入字典,以保留过去的知识。为了评估ECA的性能,我们构建了四个新的IL OpenITG基准,更好地反映了现实场景。实验结果表明,与基线方法相比,ECA显著缓解了灾难性遗忘并提高了IL性能。代码和基准可在该https URL获取。

English abstract

Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at https://github.com/Snowball0823/ECA.

2606.12628 2026-06-12 cs.CV New submission

Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving

面向自动驾驶中共现对象检测的上下文感知特征融合

Binay Kumar Singh, Niels Da Vitoria Lobo

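Affiliations Department of Computer Science, University of Central Florida

AI summary Proposes the Context-Centric Feature Fusion (CCFF) framework, whose Local Context Fusion Module handles small and occluded objects and whose Global Context Attention Module encodes co-occurrence priors, improving co-occurring object detection; it reaches a Category-level Consistency Strategy (CCS) of 0.973 on Cityscapes and 0.969 on BDD100K and reports AP_S: 14.1% for small-object detection.

Comments 8 pages, 3 figures, CVPR 2026 Precognition Workshop

Details
AI Chinese abstract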

自动驾驶中的目标检测需要精确定位以及对共现对象之间关系上下文的固有理解。在极其复杂的异构环境中,稀有类别、小尺度对象和频繁出现的对象对于标准目标检测框架来说难以处理。在本文中,我们提出了一种新颖的框架,称为上下文中心特征融合(CCFF),它利用两个基于注意力的模块:局部上下文融合模块(LCFM)使用RoI到RoI的自注意力机制来解决空间交互,主要考虑小且部分遮挡的对象;而全局上下文注意力模块(GCAM)通过将top-K RoI特征池化为全局上下文注意力标记来转换对象的共现先验,避免了像素级全局池化的计算开销。这种局部和以对象为中心的全局特征的融合产生了上下文化的嵌入,增强了分类结果和共现对象检测。我们的方法在两个数据集Cityscapes和BDD100K上进行了评估,在关系一致性上显示出显著改进,分别达到了0.973和0.969的类别级一致性策略(CCS)。此外,我们的方法在小目标检测(AP_S: 14.1%)上取得了实质性提升,并成功恢复了通常在大分布中丢失的稀有类别,如“火车”。我们的效率报告显示,该框架以0.2 FPS的开销实时处理图像。代码可在此https URL获取。

English abstract

Object detection in autonomous driving requires precise localization and an inherent understanding of the relational context between co-occurring objects. In extremely complex heterogeneous environments, rare classes, small-scale objects, and frequently appearing objects are difficult for standard object detection frameworks to handle. In this paper, we propose a novel framework called Context-Centric Feature Fusion (CCFF), which utilizes two attention-based modules: the Local Context Fusion Module (LCFM) uses an RoI-to-RoI self-attention mechanism to resolve spatial interactions, mainly considering small and partially obscured objects, while the Global Context Attention Module (GCAM) captures object co-occurrence priors by pooling top-K RoI features into a global context attention token, avoiding the computational overhead of pixel-level global pooling. This fusion of local and object-centric global features yields contextualized embeddings that enhance classification results and co-occurring object detection. Our method is evaluated on two datasets, Cityscapes and BDD100K, and demonstrates significant improvement in relational consistency, achieving a Category-level Consistency Strategy (CCS) of 0.973 and 0.969, respectively. Furthermore, our approach produces substantial gains in small object detection (AP_S: 14.1%) and successfully recovers rare classes such as "Train" that are typically lost in large distributions. Our efficiency report shows that the framework processes images in real time with a 0.2 FPS overhead. The code is available at https://github.com/BinayKSingh/CCFF.
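Pooling the top-K RoI features into a single global context token, as the abstract describes for GCAM, might look like the sketch below. Ranking RoIs by detection score and using mean pooling are assumptions made for illustration, and the function name is hypothetical; the paper's module may select and pool differently.

```python
def global_context_token(roi_features, roi_scores, k):
    """Mean-pool the feature vectors of the k highest-scoring RoIs.

    roi_features: list of equal-length feature vectors, one per RoI.
    roi_scores:   one score per RoI (ranking criterion is an assumption).
    """
    if k < 1 or not roi_features:
        raise ValueError("need at least one RoI and k >= 1")
    order = sorted(
        range(len(roi_scores)), key=lambda i: roi_scores[i], reverse=True
    )[:k]
    dim = len(roi_features[0])
    return [sum(roi_features[i][d] for i in order) / len(order) for d in range(dim)]
```

The cost here depends on the number of RoIs rather than the number of pixels, which is the overhead saving the abstract attributes to avoiding pixel-level global pooling.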

2606.12616 2026-06-12 cs.AI cs.CL New submission

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

PersonaDrive: 面向闭环驾驶模拟的人类风格检索增强VLA智能体

Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki

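Affiliations University of California, Irvine

AI summary Proposes the PersonaDrive pipeline, which conditions a vision-language-action (VLA) driving agent on retrieved human driving demonstrations recorded under style instructions, producing diverse non-ego agent behavior in closed-loop simulation without per-style retraining.

Details
AI Chinese abstract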

闭环驾驶模拟器通常在其环境中填充行为大致相同的非自车交通智能体,这些智能体要么由基于规则的交通管理器生成,要么由训练为单一行为模式的学习模型生成。最近的工作通过观测数据上的事后标签或LLM推断的奖励权重引入风格变化,但这些信号充当了风格应奖励什么的代理,而不是明确要求以该风格驾驶的人类演示。我们提出了PersonaDrive,一个流水线,它根据从风格指令的人类驾驶数据集中检索到的演示来调节视觉-语言-动作(VLA)驾驶智能体,在该数据集中,参与者在驾驶员在环平台上以激进、中性和保守指令驾驶CARLA排行榜路线。该流水线包括三个阶段:(i) 使用组合的图像-文本相似度分数对每种风格的人类驾驶数据进行离线三元组挖掘;(ii) 训练一个轻量级检索头,将冻结的视觉特征与每个风格数据库上的小型控制编码器融合;(iii) 微调单个VLA主干,以在航点预测期间将检索到的上下文点视为上下文行为演示。在推理时,通过切换检索头查询的每个风格数据库,相同的主干可以适应任何风格,因此选择风格无需针对每种风格重新训练,同时为闭环模拟启用人类风格、风格多样的非自车智能体。在Bench2Drive上,PersonaDrive(无风格)的驾驶得分比SimLingo高4.6%,比HiP-AD高2.5%,在风格条件下,每种风格都获得最高驾驶得分,波动范围约2%(其最弱风格超过最强基线DMW 5.4%),而从保守指令到激进指令,平均速度和加速度分别提高18%和25%。

English abstract

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

2606.12615 2026-06-12 cs.LG New submission

Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions

迈向可证明公平的机器学习:用于一致和透明预测的贝叶斯方法

Owen O'Neill, Fintan Costello

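Affiliations University College Dublin

AI summary Proposes the Fair Bayesian classifier, which enforces determinism and statistical consistency, achieves zero consistency error on several datasets while maintaining accuracy and multicalibration, and addresses the inconsistent predictions that regularisation causes for minority groups.

Details
AI Chinese abstract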

部署在高风险领域的机器学习分类器产生的预测质量在不同子组之间存在系统性差异。对于由多个特征交叉定义的细粒度子组,预测通常与观测数据不一致:模型输出与该子组可用的证据相矛盾。正则化通过将小子组合并到较大组中来改善整体性能,从而加剧了这一问题,对人口统计少数群体产生不成比例的影响。我们定义了一致性预测的两个要求:确定性(相同的个体获得相同的预测)和统计一致性(在显著性水平alpha下,我们不能拒绝子组预测来自为该子组推断的贝叶斯最优目标分布的假设)。从这些要求出发,我们推导出公平贝叶斯分类器,该分类器同时强制每个组和子组满足这两个要求,并在无法进行一致确定性预测时弃权。在三个基准数据集(Adult、COMPAS和Bank Marketing)上,标准分类器对相当一部分子组产生统计上不一致的预测。我们的分类器通过构造实现零一致性错误,同时在每个测试数据集上超过基线准确性和多校准。统计一致性为预测质量提供了原则性基础,对算法公平性有直接影响。少数群体人口不成比例地集中在小子组中,而正是在这些子组中频率论推断最不可靠;因此,解决这一推断问题是迈向公平ML的必要步骤。通过在数据支持的最细粒度上强制贝叶斯一致性,我们的分类器证明了在实践中可以实现具有原则性弃权的详尽子组公平性。

English abstract

ML classifiers deployed in high-stakes domains produce predictions whose quality varies systematically across subgroups. For granular subgroups defined by intersections of multiple features, predictions are often inconsistent with the observed data: the model's outputs contradict the evidence available for that subgroup. This problem is exacerbated by regularisation, which improves aggregate performance by collapsing small subgroups into larger groups, disproportionately affecting demographic minorities. We define two requirements for consistent prediction: determinism (identical individuals receive identical predictions) and statistical consistency (we cannot reject, at significance level alpha, the hypothesis that the predictions for a subgroup were drawn from the Bayesian optimal target distribution inferred for that subgroup). From these requirements we derive the Fair Bayesian classifier, which enforces both across every group and subgroup simultaneously and abstains whenever no consistent deterministic prediction is possible. On three benchmark datasets (Adult, COMPAS, and Bank Marketing), standard classifiers produce statistically inconsistent predictions for a substantial proportion of subgroups. Our classifier achieves zero consistency error by construction while exceeding baseline accuracy and multicalibration on every dataset tested. Statistical consistency provides a principled foundation for prediction quality with direct implications for algorithmic fairness. Minority demographics are disproportionately concentrated in small subgroups, precisely where frequentist inference is least reliable; addressing this inference problem is therefore a necessary step toward fair ML. By enforcing Bayesian consistency at the finest resolution the data supports, our classifier demonstrates that exhaustive subgroup fairness with principled abstention is achievable in practice.
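The statistical-consistency requirement (failing to reject, at level alpha, that a subgroup's predictions came from its target distribution) can be illustrated with an exact binomial test. This is only a stand-in: the abstract does not say which test the authors use or how the target rate is inferred, so the two-sided binomial test, the fixed `target_rate` input, and the function names below are all assumptions.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binom_pvalue(k, n, p):
    """Exact two-sided p-value: total mass of outcomes no more likely than k."""
    p_obs = binom_pmf(k, n, p)
    probs = (binom_pmf(i, n, p) for i in range(n + 1))
    # Small relative tolerance guards against floating-point ties.
    return sum(q for q in probs if q <= p_obs * (1 + 1e-9))


def is_consistent(pred_positives, n, target_rate, alpha=0.05):
    """True if the predictions cannot be rejected at level alpha."""
    return two_sided_binom_pvalue(pred_positives, n, target_rate) >= alpha
```

Under this reading, a deterministic rule that labels all ten members of a subgroup positive is inconsistent with a target rate of 0.5, while the same rule on a subgroup of three is not rejected, which hints at why abstention matters most when neither all-positive nor all-negative predictions pass.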

2606.12614 2026-06-12 cs.RO New submission

DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems

DARRMS——资源受限多智能体系统中动态注意力半径的高效算法

Benjamin Alcorn, Eman Hammad

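Affiliations Texas A&M University

AI summary Proposes the DARRMS algorithm, which jointly optimizes the attention radius and decision-making to lower computational demand under resource constraints while improving coordination and scalability.

Details
AI Chinese abstract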

多智能体系统是机器人、网络安全和自动驾驶规划等领域不可或缺的工具。这类系统通常面临计算资源约束,需要高效的轻量级算法。传统决策框架常假设理想条件(如完全可观测性和无限计算能力),这与现实挑战不符。本文提出一种新算法,在不显著牺牲其他性能指标的前提下,降低对计算资源的需求。智能体将可观测性限制在某个注意力半径内,从而有意识地忽略对行动规划可能不必要的环境部分。通过同时优化注意力半径和决策,我们的方法在不确定环境中增强了协调性和可扩展性。通过理论分析和实证验证,我们证明了自适应观测在资源受限系统中提升系统性能并维持稳健决策策略的有效性。

English abstract

Multi-agent systems are integral tools for various domains such as robotics, cybersecurity, and autonomous vehicle planning. These types of systems often have constraints on the computational resources, leading to a need for efficient lightweight algorithms. Traditional decision-making frameworks often assume ideal conditions, such as full observability and unlimited computational capacity, which do not align with real-world challenges. In this paper, we introduce a new algorithm that reduces the demand on computational resources without a large cost to other performance metrics. Agents limit their observability to some attention radius, which intentionally allows them to ignore parts of the environment that might be unnecessary for action planning. By optimizing both the attention radius and decision-making, our approach enhances coordination and scalability in uncertain environments. Through both theoretical analysis and empirical validation, we demonstrate the effectiveness of adaptive observation in improving system performance and maintaining robust decision-making strategies in resource-constrained systems.
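Restricting an agent's observations to an attention radius can be sketched as a simple distance filter. This shows only the observation-limiting step under assumed inputs (positions as coordinate tuples, Euclidean distance); how DARRMS actually adapts the radius is not described in the abstract, and the function name is hypothetical.

```python
import math


def observed_agents(positions, agent, radius):
    """Indices of the other agents within `radius` of `agent`.

    positions: list of coordinate tuples; Euclidean distance is an assumption.
    """
    origin = positions[agent]
    return [
        i
        for i, pos in enumerate(positions)
        if i != agent and math.dist(origin, pos) <= radius
    ]
```

A smaller radius shrinks the set of neighbours an agent has to reason about during planning, which is where the reduction in computational demand would come from.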

2606.12610 2026-06-12 cs.LG New submission

The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

AI寒冬的数学:AI中范式脆弱性的数学分类

Miquel Noguer i Alonso, David Pacheco Aznar

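Affiliations AIFI Staq.io

AI summary Offers a mathematical account of the AI winters, analysing why early AI paradigms failed through bottlenecks such as perceptron impossibility results, neural-network training complexity, high-dimensional nonparametric estimation rates, vanishing gradients, and statistical learning theory, and relating them to later breakthroughs.

Comments 33 pages, 1 figure

Details
AI Chinese abstract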

人工智能研究中两个主要的资金减少和信心下降时期,通常被称为第一次和第二次AI寒冬,通常被解释为工程失败、商业失望和预期膨胀。本文提出一个补充论点:这些时期的主导范式也遇到了真正的形式障碍,包括表示、优化、计算复杂性、统计可学习性和高维近似的限制。贡献是综合性的而非档案性的。我们并不声称特定定理机械地导致了寒冬;相反,我们表明早期AI的几个核心失望与数学上精确的瓶颈相一致。我们通过Minsky和Papert的感知机不可能结果、Blum和Rivest建立的精确神经网络训练的计算复杂性困难、Stone的高维非参数估计的极小化极大率、Hochreiter以及Bengio及其合作者的梯度消失分析,以及Vapnik和Chervonenkis、Valiant、Blumer及其合作者传统的经典统计学习理论来分析这些瓶颈。然后我们将这些障碍与后来缓解(而非消除)它们的突破联系起来。

English abstract

Two major periods of reduced funding and confidence in artificial intelligence research, commonly called the first and second AI winters, are usually explained through engineering failure, commercial disappointment, and inflated expectations. This article develops a complementary thesis: that the dominant paradigms of those periods also met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The contribution is synthetic rather than archival. We do not claim that particular theorems mechanically caused the winters; rather, we show that several central disappointments of early AI were aligned with mathematically precise bottlenecks. We analyse these bottlenecks through the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training established by Blum and Rivest, minimax rates for nonparametric estimation in high dimension due to Stone, vanishing-gradient analyses by Hochreiter and by Bengio and collaborators, and classical statistical learning theory in the tradition of Vapnik and Chervonenkis, Valiant, and Blumer and collaborators. We then relate these barriers to the later breakthroughs that mitigated, rather than eliminated, them.
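As a concrete instance of the kind of representational barrier the abstract refers to, the standard textbook argument that a single threshold unit cannot compute XOR is short enough to write out. It is given here as an illustration only; it is not quoted from the article, and the symbols $w_1$, $w_2$, $b$ are introduced for this sketch.

```latex
% Assume a single unit outputs 1 iff w_1 x_1 + w_2 x_2 + b > 0 and computes XOR.
\begin{align*}
(x_1,x_2)=(0,0)\mapsto 0 &: \quad b \le 0 \\
(x_1,x_2)=(1,1)\mapsto 0 &: \quad w_1 + w_2 + b \le 0 \\
(x_1,x_2)=(1,0)\mapsto 1 &: \quad w_1 + b > 0 \\
(x_1,x_2)=(0,1)\mapsto 1 &: \quad w_2 + b > 0
\end{align*}
% Adding the first two constraints gives  w_1 + w_2 + 2b \le 0,
% adding the last two gives               w_1 + w_2 + 2b > 0,
% a contradiction, so no such (w_1, w_2, b) exists.
```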

2606.12609 2026-06-12 cs.LG q-bio.QM New submission

Viral Proteins Reveal Geometry of Protein Language Models

病毒蛋白质揭示蛋白质语言模型的几何结构

Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang

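Affiliations University of Washington; DeepMind

AI summary Studies how protein language models trained on imbalanced data represent viral proteins; finds a dominant "nativeness" axis in embedding space that orders sequences by model perplexity, that the effect of scaling varies across viral families, and that embeddings still retain viral-specific signal.

Comments Accepted at ICML 2026 GenBio Workshop and FM4LS Workshop. Code available at https://github.com/MisteFr/viral-proteins-plms

Details
AI Chinese abstract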

蛋白质语言模型在高度不平衡的数据集上训练,引发了一个问题:它们如何表示代表性不足的生物序列?以病毒蛋白作为跨ESM模型家族的案例研究,我们在嵌入空间中识别出一个主导的天然性轴,该轴与掩码重建困惑度对齐,将序列从建模良好的细胞蛋白通过病毒蛋白排序到打乱和随机序列。缩放效果在不同病毒家族间不均匀地压缩该轴。尽管如此,蛋白质语言模型嵌入保留了病毒特异性信号:病毒蛋白在零样本困惑度和浅层序列特征之上仍然是线性可分的。这些结果共同表明,pLM表示由天然性的一般概念结构化,同时保留了特定于不同生物群体的信息。

English abstract

Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.
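Masked reconstruction perplexity is commonly computed as a pseudo-perplexity: each position is masked in turn, the model's log-probability of the true residue is recorded, and the exponential of the negative mean is taken. The sketch below assumes those per-position log-probabilities are already available; whether the paper uses exactly this definition is an assumption, and the function name is hypothetical.

```python
import math


def pseudo_perplexity(masked_log_probs):
    """exp of the negative mean log-probability of the true tokens.

    masked_log_probs[i] is the natural-log probability the model assigns to
    the true residue at position i when that position is masked.
    """
    if not masked_log_probs:
        raise ValueError("sequence must contain at least one position")
    return math.exp(-sum(masked_log_probs) / len(masked_log_probs))
```

A model that assigns probability 0.25 to every true residue gets a pseudo-perplexity of 4; shuffled or random sequences would be expected to score much higher than well-modeled cellular proteins.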

2606.12608 2026-06-12 cs.CL cs.LG New submission

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

购物推理基准:面向多轮对话购物助手的专家编写基准

Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia, Jayakrishnan Unnikrishnan, Rowan Musselmann, Yifan Gao, Qingyu Yin, Priyanka Nigam, Bing Yin

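Affiliations Amazon

AI summary Presents a multi-turn conversational shopping reasoning benchmark of 525 missions authored by retail experts, with 10863 weighted rubrics; evaluating nine models shows pass rates of only 57-77%, with performance dropping 4-18 points on multi-turn missions.

Details
AI Chinese abstract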

对话式购物助手现已服务数亿客户,但现有基准均未联合评估真实购物对话所需的开放式多轮推理、领域专业知识和标准级质量。购物推理在语言模型应用中独具特色。与事实性问答或可验证代码生成不同,它需要在多轮对话中平衡主观偏好、预算约束和跨产品权衡,这些能力在以往的电商和通用基准中缺失。我们引入了购物推理基准(Shopping Reasoning Bench),这是一个由零售领域专家编写的基准,包含525个任务(232个单轮,293个多轮)和10863个重要性加权的二元评分标准。这些标准组织在包含五个推理类别和十五个子类别的分类体系下,涵盖偏好细化、权衡分析和兼容性评估等多样化需求。对三个模型系列(GPT、Claude、Gemini)中九个模型的评估显示,整体通过率仅为57-77%。在多轮任务中,所有模型在可选的超越标准上的得分比必需标准低13-29分,并且随着对话进行,性能下降4-18分。这些差距表明,当前模型能处理基本购物辅助,但达不到专家级建议,使购物推理基准成为未来购物助手开发的挑战性测试平台。

English abstract

Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57--77% overall. On multi-turn missions, all models score 13--29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4--18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.
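Scoring a response against importance-weighted binary rubrics can be sketched as a weighted pass rate. The aggregation rule below (passed weight divided by total weight) is an assumption made for illustration; the abstract does not specify how the benchmark combines rubric weights, and the function name is hypothetical.

```python
def weighted_pass_rate(rubrics):
    """Fraction of total rubric weight that was satisfied.

    rubrics: list of (weight, passed) pairs with positive weights.
    """
    total = sum(weight for weight, _ in rubrics)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(weight for weight, passed in rubrics if passed) / total
```

Applying the same function separately to required and optional criteria would give the kind of required-versus-optional gap the abstract reports.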

2606.12604 2026-06-12 cs.RO New submission

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

EgoEngine:从自我中心人类视频到高保真灵巧机器人演示

Yangcen Liu, Shuo Cheng, Xinchen Yin, Woo Chul Shin, Alfred Cueva, Yiran Yang, Zhenyang Chen, Chuye Zhang, Danfei Xu

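Affiliations Georgia Institute of Technology; Tsinghua University

AI summary Proposes the EgoEngine framework, which bridges the visual and action gaps to turn egocentric human videos into high-fidelity robot data, and reports the first zero-shot dexterous policy learning from such videos.

Details
AI Chinese abstract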

灵巧操作受限于大规模机器人演示数据的收集成本。自我中心人类视频提供了多样操作行为的可扩展来源,但直接用于机器人学习需要弥合两个差距:人类与机器人观测之间的视觉差距,以及人类运动与机器人可执行动作之间的动作差距。我们提出EgoEngine,一个可扩展的框架,用于将自我中心人类操作视频转化为高保真机器人数据。给定一个自我中心RGB视频,EgoEngine生成:(i) 高保真机器人观测视频,用机器人替换人类,同时保留场景上下文和时间对齐,以及(ii) 在可行性约束下,与任务对齐、可执行的机器人动作轨迹。在仿真和真实机器人上的实验表明,EgoEngine能够将人类视频可扩展地转化为机器人数据,并且据我们所知,首次展示了无需真实机器人演示,从自我中心人类视频进行零样本视觉运动灵巧策略学习。项目网站:此 https URL。

English abstract

Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable action. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video that replaces the human with a robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: https://egoengine.github.io.

2606.12603 2026-06-12 cs.RO cs.AI New submission

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

从模仿到对齐:面向长距离人行道导航的人类偏好流策略

Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou

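Affiliations University of California, Los Angeles

AI summary Proposes FlowPilot, a mapless navigation policy that uses only a monocular RGB camera; it is pre-trained with anchored flow matching and aligned through human preference learning, improving robustness and social compliance in long-horizon sidewalk navigation.

Details
AI Chinese abstract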

自主长距离人行道导航对于微出行应用(如机器人送餐和辅助电动轮椅)至关重要。与道路上的自动驾驶不同,长距离人行道导航需要在不可预测的人行道地形和行人中精确操作,且感知栈轻量,仅需单个单目RGB相机。虽然从演示中模仿学习(IL)提供了一种实用解决方案,但由此产生的自动驾驶策略常常遭受复合误差、人行道上缺乏社会合规性以及缺乏处理复杂情况的反事实推理能力。为解决这些挑战,我们提出了FlowPilot,一种仅使用单目RGB相机即可实现稳健高效长距离导航性能的无地图导航策略。我们首先提出使用锚定流匹配作为动作表示,用于在大型机器人车队数据上进行策略预训练,并捕捉人行道导航行为的多样、复杂、多模态分布。为弥合模仿与对齐之间的差距,我们进一步设计了一种人在环的偏好学习方案,通过少量人类干预数据调整策略。它增强了模型的反事实推理能力和在人行道上的社会合规性。我们通过在多样化人行道环境中的广泛仿真和真实世界实验评估了FlowPilot。在仿真中,FlowPilot实现了42%的成功率和66%的路线完成率,而FlowPilot-HP进一步提升了真实世界的鲁棒性和社会合规性,相对于基础模型,IR降低了40.0%,NIR降低了52.1%。

English abstract

Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.

2606.12601 2026-06-12 cs.CV New submission

Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

双状态槽注意力:解耦外观与身份用于视频目标中心学习

Sieu Tran, Duc Nguyen, Hao Vo, Khoa Vo, Ngan Le

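Affiliations University of Arkansas

AI summary Proposes Dual-State Slot Attention (DSSA), which splits each slot into a local state (appearance) and an identity state (stable identity) and uses competition-modulated aggregation to reduce interference from weakly matching slots, improving video object segmentation quality and temporal consistency.

Details
AI Chinese abstract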

无监督视频目标中心学习旨在无需监督地将动态场景分解为持久的目标级表示。然而,现有的基于槽的方法在快速运动和部分遮挡等挑战性场景中难以维持稳定的目标身份。首先,它们通常将目标的每帧外观和跨帧身份编码在单个槽向量中,造成目标冲突导致槽交换:重建需要对瞬态视觉变化敏感,而时间一致性需要对它们不变。其次,槽注意力中使用的令牌重归一化可能放大弱注意力槽,使其吸收其他目标的令牌,破坏槽与目标的对应关系。我们提出双状态槽注意力(DSSA),一种完全自监督框架,通过分离外观与身份并减少弱匹配槽的虚假更新来解决这些限制。DSSA将每个槽分解为用于每帧外观的局部状态和用于时间稳定目标信息的身份状态,从而用分离的表示对齐重建和时间一致性。身份状态通过学习的循环转换更新,该转换作为局部状态的时间滤波器,而竞争调制聚合(CMA)降低弱匹配槽的更新权重,防止它们吸收其他目标的令牌。在MOVi-C、MOVi-D和YouTube-VIS上的实验表明,DSSA在分割质量和时间一致性上持续优于先前方法,同时在下游目标识别和视频动态预测中表现更强。代码和模型将在接收后公开。

English abstract

Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.

2606.12595 2026-06-12 cs.LG cs.AI cs.CV New submission

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

地理空间多模态基础模型的新兴灵活设计

Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu, Xiao Wang, Dalton Lunga

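Affiliations Oak Ridge National Laboratory

AI summary Systematically compares geospatial foundation models of different architectures, evaluating their flexibility and performance under a unified setup to provide design guidance for multimodal reasoning.

Details
AI Chinese abstract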

基础模型通过跨多样未标记地理空间模态的可扩展预训练,正在迅速改变地球观测。然而,其架构多样性——从编码器-only到编码器-解码器以及掩码自编码范式——使得以一致方式评估性能权衡变得具有挑战性。在这项工作中,我们对领先的、专为地理空间多模态推理设计的基础模型架构进行了同类比较,特别关注不同光谱波段配置下的灵活性。我们使用相同的自监督学习目标和训练数据集标准化预训练,并在GEOBench基准测试上,在一致参数化下评估所有模型的分类和分割任务。我们的结果为模型灵活性、模态对齐和下游任务性能之间的设计权衡提供了新见解。通过强调受控条件下的架构优势和局限性,本研究为构建能够进行鲁棒多模态推理的下一代地理空间基础模型提供了实用指导。

English abstract

Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading FM architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

2606.12594 2026-06-12 cs.AI New submission

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Pythagoras-Prover: 通过增强型Lean形式化推进高效形式化证明

Joshua Ong Jun Leang, Zheng Zhao, Mihaela Cătălina Stoian, Qiyuan Xu, Haonan Li, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia

Affiliations Imperial College London; University of Edinburgh; Nanyang Technological University; MBZUAI

AI summary Proposes the Pythagoras-Prover family of autoregressive and diffusion models, which scales verified data through curriculum SFT, dynamic filtering, and Augmented Lean Formalisation (ALF), and surpasses DeepSeek-Prover-V2 on MiniF2F-Test with far fewer parameters.

Comments Pythagoras-Prover: Technical Report

Details
AI Chinese abstract

现代Lean定理证明器只有在大量训练和推理计算下才能取得强性能,部分原因是由于稀缺的验证证明数据和形式化证明搜索的长推理轨迹,使得监督微调(SFT)和采样成本高昂。我们介绍了Pythagoras-Prover,一个计算高效的开源Lean定理证明器系列,专为实际计算预算而构建。该系列涵盖两种生成范式:4B和32B参数的自回归模型,以及首个概念验证的基于扩散的证明器(4B),它在推理时迭代地精炼Lean证明。为了提高训练效率,我们构建了一个Lean验证的语料库,按易、中、难问题分层,用于课程SFT,使模型逐步从较短、较简单的证明过渡到较长、较难的证明。在SFT期间,动态证明推理过滤方案保留了信息丰富的证明轨迹,同时将每个实例保持在8k令牌的上下文预算内。我们还引入了增强型Lean形式化(ALF),它将稀缺的验证语料库扩展为形式化语句的变体,通过自蒸馏填充以提供额外训练信号,而无需正式验证每个变异实例。通过扰动已知问题同时保留其形式化特征,ALF减少了对任何语句表面形式的依赖。实验上,Pythagoras-Prover-4B在MiniF2F-Test上的pass@32(86.1% vs 82.4%)超过了DeepSeek-Prover-V2-671B,参数数量约为其1/167,而Pythagoras-Prover-32B在MiniF2F-Test上以93.0%的成绩创下了开源最先进水平,并在672个PutnamBench问题中解决了93个。我们发布了MiniF2F-ALF,一个经ALF变异的对污染敏感的基准,每个评估模型在该基准上的准确率均下降;在此基准上,我们的32B模型仍然最强,而4B模型匹配了先前最先进的Goedel-Prover-V2-32B。

English abstract

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.
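For readers less familiar with the pass@32 figures quoted above, the metric is the chance that at least one of k sampled proofs for a problem is accepted by the Lean checker, averaged over problems. The sketch below uses the widely used unbiased pass@k estimator; the abstract does not say which estimator the authors apply, so treat this as an assumption. The names `pass_at_k` and `benchmark_pass_at_k` and the sample counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: proofs sampled, c: proofs that verified, k: sampling budget.
    Equals 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one verified proof.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes (not data from the report):
# (samples drawn, samples verified).
toy_results = [(32, 0), (32, 1), (32, 32)]
toy_score = benchmark_pass_at_k(toy_results, k=32)

# The "~167x fewer parameters" figure is consistent with 671B / 4B.
parameter_ratio = 671 / 4
```

When exactly k samples are drawn per problem, the estimator reduces to "solved if any sample verifies", which is the usual reading of pass@32.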

2606.12590 2026-06-12 cs.CV cs.AI New submission

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

Affiliations * York University; University of British Columbia; Vector Institute; Queen’s University

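AI summary To address the shortcomings of medical large vision-language models in factual consistency, visual grounding, and clinical alignment, proposes a fine-grained on-policy preference optimization framework that combines bidirectional token-wise KL regularization with a visual-contrastive grounding objective; preference pairs are built by minimally editing model outputs so that only clinically erroneous spans are corrected, significantly improving diagnostic accuracy.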

Details
English abstract

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.
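Limitation (1) above is easier to see with the baseline objective written out. The sketch below implements the standard sequence-level DPO loss, not the authors' proposed method: each response is scored by the sum of its token log-probabilities, so a clinically decisive token and a filler token carry the same weight. The function names, the `beta` value, and the toy log-probabilities are illustrative assumptions.

```python
import math


def sequence_logprob(token_logprobs: list[float]) -> float:
    """Sequence log-probability: every token contributes equally to the sum."""
    return sum(token_logprobs)


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen: list[float],
    policy_rejected: list[float],
    ref_chosen: list[float],
    ref_rejected: list[float],
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair: -log sigmoid(margin)."""
    chosen_ratio = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_ratio = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) == softplus(-m)
    return _softplus(-margin)


# Toy per-token log-probabilities (hypothetical, two tokens per response).
loss_at_reference = dpo_loss([-1.0, -2.0], [-1.0, -3.0], [-1.0, -2.0], [-1.0, -3.0])
loss_after_gain = dpo_loss([-0.5, -2.0], [-1.0, -3.0], [-1.0, -2.0], [-1.0, -3.0])
```

Because only the summed log-probability enters the margin, raising the probability of any one token by the same amount changes the loss identically, which is the uniformity that token-level objectives aim to break.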

2606.12587 2026-06-12 cs.AI cs.HC New submission

Strategic Decision Support for AI Agents

Shayan Kiyani, Sima Noorani, George Pappas, Hamed Hassani

Affiliations * University of Pennsylvania

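AI summary Addressing reliability when AI agents are the primary decision-makers, proposes a strategic decision-support framework, formulated as an optimization problem that minimizes support usage while controlling a counterfactual missed-support error, and develops an online algorithm that adaptively thresholds a support score.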

Details
English abstract

Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become support mechanisms around them. This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints. Departing from the classical view of decision support, we revisit its two basic principles, the cost-value tradeoff of seeking support and the role of uncertainty quantification, in a setting where AI agents are the central actors. We propose a framework for strategic decision support for AI agents through an optimization problem that minimizes support usage subject to controlling a counterfactual missed-support error: the probability that the agent acts alone on instances where support would have materially improved its output. At the population level, we show that the optimal policy is a threshold rule on the value of support. Building on this structure, we develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. We further introduce a calibration-on-the-fly method that reduces unnecessary support calls online. We instantiate this framework across diverse scenarios, including information gathering, human-AI collaboration, and tool use, showing how each can be modeled through the same strategic decision-support lens. Experiments across these settings show that our method reliably controls the target error while substantially reducing support usage in practice.
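The threshold rule with randomized exploration described above can be illustrated with a toy policy. This is not the authors' algorithm: it swaps in a generic stochastic-approximation update that nudges the threshold on explored rounds so that the miss frequency among below-threshold rounds drifts toward a target `alpha`, and it carries none of the paper's guarantees. The class name, parameters, and the assumption that scores lie in [0, 1] are all hypothetical.

```python
import random


class ThresholdSupportPolicy:
    """Toy policy: seek support when the support-value score clears a threshold."""

    def __init__(self, alpha=0.1, step=0.05, explore_prob=0.1,
                 threshold=0.5, seed=0):
        self.alpha = alpha                # target miss frequency (assumed)
        self.step = step                  # update step size (assumed)
        self.explore_prob = explore_prob  # chance of exploring below threshold
        self.threshold = threshold        # scores assumed to lie in [0, 1]
        self.rng = random.Random(seed)

    def decide(self, score):
        """Return (call_support, explored) for one instance."""
        if score >= self.threshold:
            return True, False
        # Below threshold: occasionally call support anyway to learn
        # whether acting alone would have been a miss.
        explored = self.rng.random() < self.explore_prob
        return explored, explored

    def update(self, score, support_helped):
        """Adjust the threshold after an explored round.

        A miss is a below-threshold instance where support materially helped.
        Misses lower the threshold (more support calls); non-misses raise it.
        """
        missed = support_helped and score < self.threshold
        self.threshold += self.step * (self.alpha - (1.0 if missed else 0.0))
        self.threshold = min(1.0, max(0.0, self.threshold))


policy = ThresholdSupportPolicy()
high_score_decision = policy.decide(0.9)   # above threshold: always seek support
policy.update(0.2, support_helped=True)    # a miss: threshold moves down
threshold_after_miss = policy.threshold
policy.update(0.2, support_helped=False)   # no miss: threshold creeps up
threshold_after_ok = policy.threshold
```

The asymmetric step sizes, a large drop after a miss and a small rise otherwise, are what pull the long-run miss frequency toward `alpha` in this kind of update.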