arXivDaily: daily arXiv digest, updated Monday to Friday
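2606.15275 2026-06-16 cs.CV New submission

MamBOA: State-Space Architecture for Video Recognition

Mustafa Bora Çelik

Affiliations: Ankara Medipol University

AI summary: Proposes MamBOA, a framework whose interleaved scan structure turns the selective state-space recurrence (S6) into a motion synthesizer that encodes motion from consecutive features extracted by a backbone, giving efficient temporal modeling for fine-grained action recognition.

Comments: 15 pages, 7 figures. Code available at https://github.com/BOA-clk/MamBOA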

Abstract

Fine-grained action recognition demands temporal reasoning that general-purpose architectures address through different cost-accuracy tradeoffs: 3D dense operators couple computation to the input volume, while difference-based methods approximate motion through rigid, hand-crafted subtraction of uncontextualized features - each reflecting a deliberate design choice with corresponding limitations in expressiveness or flexibility. We present MamBOA, a backbone-agnostic temporal framework built upon a novel interleaved scan structure that recasts the selective state-space recurrence (S6) as a native motion synthesizer. By interleaving consecutive feature representations extracted from a pretrained backbone into a single alternating sequence, the proposed scan structurally drives the recurrence to encode both temporal observations of each position within a shared hidden state, separated by only a single decay step - rendering the inter-frame transition an intrinsic component of the state dynamics rather than an externally computed quantity. A cascade of dedicated alignment and decoding operations then distills this joint encoding into an explicit motion representation, which a dual-path pooling mechanism adaptively aggregates by balancing attention-driven selection with uniform temporal coverage. The framework interfaces seamlessly with CNN, Transformer, and Mamba backbone families, adding only ~2.1 GFLOPs per feature pair. On Diving48, MamBOA achieves 85.02% Top-1 accuracy with an image-pretrained backbone and 86.24% with a video-pretrained backbone processing the entire video in a single forward pass - demonstrating that structurally induced state-space dynamics constitute a principled and general foundation for motion modeling.
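
The interleaving step described in the abstract can be pictured with a minimal sketch. It assumes the per-frame features are already extracted and stored as one token per spatial position; the function name `interleave_pair` and the toy shapes are illustrative and not taken from the MamBOA code. The sketch only builds the alternating sequence, so the S6 recurrence, alignment, decoding, and pooling stages are not modeled.

```python
from typing import List, Sequence

Token = Sequence[float]


def interleave_pair(feat_t: List[Token], feat_t1: List[Token]) -> List[Token]:
    """Interleave the tokens of two consecutive frames position by position.

    Output order is [t[0], t1[0], t[1], t1[1], ...], so the two temporal
    observations of each spatial position sit next to each other in the scan.
    """
    if len(feat_t) != len(feat_t1):
        raise ValueError("both frames must have the same number of positions")
    out: List[Token] = []
    for a, b in zip(feat_t, feat_t1):
        out.append(a)  # observation of this position at frame t
        out.append(b)  # observation of the same position at frame t+1
    return out


# Toy example: 3 positions, 2-dimensional features per position (made-up values).
frame_t = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
frame_t1 = [[1.5, 0.1], [2.5, 0.1], [3.5, 0.1]]
sequence = interleave_pair(frame_t, frame_t1)
```

With this ordering a recurrent scan sees each position's frame-t token exactly one step before its frame-(t+1) token, which is one reading of "separated by only a single decay step".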

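2606.15273 2026-06-16 cs.AI New submission

Feature Attribution in Directed Acyclic Graphs Using Edge Intervention

Qiheng Sun, Junxu Liu, Xiaokai Mao, Haocheng Xia, Jinfei Liu, Kui Ren, Haibo Hu

Affiliations: Zhejiang University; Zhejiang Lab; Hong Kong Polytechnic University

AI summary: Existing feature attribution methods cannot capture feature externality and exogenous influence at the same time. The paper proposes DAG-SHAP, an edge-intervention method that treats each feature edge as an attribution object, adds an approximation method for computing it efficiently, and validates it experimentally.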

Abstract

Shapley value-based feature attribution methods face challenges in scenarios involving complex feature interactions and causal relationships, even when a causal structure is provided. Existing methods typically adopt a node-centric view, attributing importance solely to individual features. Consequently, they often fail to simultaneously capture the externality and exogenous influence of features, leading to unreasonable interpretations. To overcome these limitations, we propose a novel feature attribution method called DAG-SHAP, which is based on edge intervention. DAG-SHAP treats each feature edge as an individual attribution object, ensuring that both externality and exogenous contributions of features are appropriately captured. Additionally, we introduce an approximation method for efficiently computing DAG-SHAP. Extensive experiments on both real and synthetic datasets validate the effectiveness of DAG-SHAP. Our code is available at https://github.com/ZJU-DIVER/DAG-SHAP.
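
For context, the node-centric baseline that the abstract contrasts against is the classical Shapley value. The sketch below computes it exactly by averaging marginal contributions over all orderings. It is not DAG-SHAP's edge-based attribution, and the toy value function and the feature names `x1` and `x2` are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(players: Sequence[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(coalition: FrozenSet[str]) -> float:
    """Hypothetical model payoff: x1 adds 1, x2 adds 2, both together add 1 more."""
    v = 0.0
    if "x1" in coalition:
        v += 1.0
    if "x2" in coalition:
        v += 2.0
    if "x1" in coalition and "x2" in coalition:
        v += 1.0
    return v


phi = shapley_values(["x1", "x2"], toy_value)  # x1 gets 1.5, x2 gets 2.5
```

The interaction bonus is split evenly between the two features, and the attributions sum to the full-coalition value of 4. Exact enumeration grows factorially with the number of features, which is why approximation methods are needed in practice.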

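2606.15268 2026-06-16 cs.LG New submission

When to use what Schatten-$p$ norm in deep learning?

Thomas Pethick

AI summary: Resolves seemingly conflicting observations about the effectiveness of Schatten-$\infty$ optimizers through theoretical analysis, showing that the conclusion depends on the regime: in the low-dimensional regime, which includes Chinchilla scaling, smaller Schatten-$p$ geometries can be optimal. The analysis rests on a new noise-robust acceleration result for the SODA framework for $p>2$.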

Abstract

Schatten-$\infty$ based optimizers such as Muon have shown promising empirical performance, but there remain seemingly conflicting observations regarding whether they are beneficial. We resolve this conflict by showing that the conclusion is regime dependent. Even when the objective is smooth in the Schatten-$\infty$ geometry, smaller Schatten-$p$ geometries can be optimal, specifically in the low-dimensional regime, which we show includes Chinchilla scaling. This conclusion follows from a new noise-robust acceleration result for the SODA framework for $p>2$. The same analysis explains why Muon-like methods do not require warmup, why they naturally favor large batches, and yields a batch size scaling rule for arbitrary $p$.
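
For reference, the Schatten-$p$ norm of a matrix is the $\ell_p$ norm of its singular values. The block below gives the standard textbook definition, not anything specific to the paper; the symbols $A$, $\sigma_i$ and $r$ (the rank) are notation chosen here.

```latex
\|A\|_{S_p} = \Big( \sum_{i=1}^{r} \sigma_i(A)^p \Big)^{1/p}, \quad 1 \le p < \infty,
\qquad
\|A\|_{S_\infty} = \max_{i} \sigma_i(A).

% Special cases: p = 1 is the nuclear norm, p = 2 the Frobenius norm,
% and p = \infty the spectral norm.
% Monotonicity: for 1 \le p \le q \le \infty,
\|A\|_{S_q} \le \|A\|_{S_p}.
```

"Smaller Schatten-$p$ geometries" in the abstract therefore refers to moving from the spectral norm ($p=\infty$) toward norms that weigh all singular values, such as the Frobenius norm.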

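2606.15266 2026-06-16 cs.CL New submission

Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

Yuchen Song, Xi Chen, Mingze Li, Satoshi Nakamura

Affiliations: The Chinese University of Hong Kong, Shenzhen, China; Shenzhen Loop Area Institute, China

AI summary: Targets the under-explored cross-lingual transfer of lexical stress in English-to-Chinese speech-to-speech translation. Builds a stress-annotated Chinese dataset and a Mandarin stress detector, proposes a cross-lingual stress evaluation metric, and fine-tunes CosyVoice3 into a stress-aware S2ST system that outperforms existing systems in stress translation.

Comments: Accepted to Interspeech 2026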

Abstract

Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains heavily underexplored, compounded by a lack of reliable automatic evaluation metrics for tonal languages like Chinese. We investigate English-to-Chinese S2ST stress transfer by constructing a stress-annotated Chinese dataset and an XLS-R-based Mandarin stress detector. Integrating this with the English EmphAssess system, we propose a novel objective metric for cross-lingual stress evaluation. Furthermore, we fine-tune CosyVoice3 to build a stress-aware S2ST system. Experiments demonstrate that our proposed S2ST architecture significantly outperforms existing systems in stress translation capability while maintaining competitive translation quality. Furthermore, our evaluation metric exhibits a strong correlation with human subjective judgments.

2606.15260 2026-06-16 cs.LG cs.AI New submission

Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

Huy Le, Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann

Affiliations: University of Freiburg; Max Planck Institute for Intelligent Systems

AI summary: Proposes TruDi, which uses trust-region optimization to constrain the KL divergence over the diffusion trajectory, giving stable training for massively parallel on-policy RL. It outperforms or matches baselines across 73 tasks.

Abstract

Reinforcement learning with massively parallel simulations has become a standard framework for developing robust, deployable policies; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enable diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.
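
One standard identity shows why a KL constraint over a whole denoising trajectory can be tractable: for two Markov chains that share the same initial noise distribution, the trajectory-level KL divergence decomposes into a sum of expected per-step KL terms. Whether TruDi uses exactly this decomposition is an assumption. The notation is chosen here: $a^{K}$ is the initial noise, $a^{0}$ the final action, and $s$ the state.

```latex
D_{\mathrm{KL}}\!\left( \pi_{\mathrm{old}}(a^{0:K} \mid s) \,\middle\|\, \pi_{\mathrm{new}}(a^{0:K} \mid s) \right)
= \sum_{k=1}^{K} \mathbb{E}_{a^{k} \sim \pi_{\mathrm{old}}}
\left[ D_{\mathrm{KL}}\!\left( \pi_{\mathrm{old}}(a^{k-1} \mid a^{k}, s) \,\middle\|\, \pi_{\mathrm{new}}(a^{k-1} \mid a^{k}, s) \right) \right]
```

When each denoising step is Gaussian, every per-step term has a closed form, so the sum can be estimated from sampled trajectories.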

2606.15258 2026-06-16 cs.AI New submission

Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Jierui Zhang, Siyuan Tan, Xinhang Li, Longzhuangzhi Lin, Dailin Li, Chengfeng Gu, Xinping Li, Yaxian Hao, Shengjia Liang, Yuxiang Ren, Wenhao Liu

Affiliations: School of Computer Science, Beijing University of Posts and Telecommunications; Graduate College for Engineers, Beijing University of Posts and Telecommunications; School of Mathematical Sciences, Fudan University; School of Cyberspace Security, Beijing University of Posts and Telecommunications; School of Computer Science and Technology, Dalian University of Technology; Chu Kochen Honors College, Zhejiang University; Department of Psychological and Cognitive Sciences, Tsinghua University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; School of Intelligence Science and Technology, Nanjing University

AI summary: Proposes the Mask-Proof pipeline, which converts real proofs into automatically checkable masked-step tasks and evaluates model reasoning with an LLM-based equivalence judge. The resulting benchmark contains 292 problems; reasoning-enhanced models outperform standard models by 12% to 27%.

Abstract

Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available at https://github.com/weating/Mask-Proof.
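
The "repeated votes for stability" idea can be sketched as a simple majority vote over several judge calls. This is a minimal illustration, not the Mask-Proof implementation: the function names, the odd-vote requirement, and the stub judge that replays canned verdicts in place of an LLM call are all assumptions.

```python
from typing import Callable, Iterable

Judge = Callable[[str, str], bool]


def judge_with_votes(judge: Judge, reference: str, candidate: str,
                     n_votes: int = 5) -> bool:
    """Call a (possibly noisy) equivalence judge n_votes times; return the majority."""
    if n_votes < 1 or n_votes % 2 == 0:
        raise ValueError("use an odd, positive number of votes to avoid ties")
    yes = sum(1 for _ in range(n_votes) if judge(reference, candidate))
    return yes > n_votes // 2


def make_stub_judge(verdicts: Iterable[bool]) -> Judge:
    """Hypothetical stand-in for an LLM judge: replays a fixed list of verdicts."""
    it = iter(verdicts)

    def judge(reference: str, candidate: str) -> bool:
        return next(it)

    return judge


stub = make_stub_judge([True, False, True, True, False])
verdict = judge_with_votes(stub, "x^2 - 1", "(x - 1)(x + 1)", n_votes=5)  # True (3 of 5)
```

Aggregating an odd number of votes keeps a single inconsistent judgment from flipping the final grade.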

2606.15257 2026-06-16 cs.LG New submission

AI for Social Good: An Investigation of the Causal Relationship Between Environmental Regulations and Their Effects on Air Pollution in London, UK

Yang Han, Jacqueline CK Lam, Victor OK Li, Yiu-Wai Man

Affiliations: Department of Electrical and Electronic Engineering, The University of Hong Kong

AI summary: Proposes an uncertainty-aware Bayesian deep learning framework to estimate the effect of London's air pollution regulations on PM$_{2.5}$ from 2010 to 2020, finding that the regulations were associated with an average PM$_{2.5}$ reduction of 1.88 $\mu$g/m$^3$ (12.35%).

Abstract

Air pollution regulation is central to urban public health governance, but estimating its effects is difficult because policies are implemented non-randomly and pollution trajectories are shaped by meteorology, socioeconomic change, temporal trends, and overlapping interventions. This study develops an uncertainty-aware Bayesian deep learning framework to estimate the aggregate effect of air pollution regulations on PM$_{2.5}$ concentrations in London from 2010 to 2020. The framework integrates daily PM$_{2.5}$ observations from Inner London monitoring stations, meteorological covariates, annual socioeconomic indicators, month-of-year and day-of-week indicators, and daily regulation status data for 32 policy measures. A Bayesian LSTM captures temporal dependencies in environmental and socioeconomic covariates, Bayesian embedding layers represent temporal and regulation status inputs, and a regulation status prediction branch supports propensity score-based adjustment for non-random policy implementation. Regulatory effects are estimated by comparing observed PM$_{2.5}$ concentrations with counterfactual predictions under a hypothetical no-regulation scenario, with uncertainty summarized across repeated Bayesian training runs and bootstrap resampling. Results show that London's regulations were associated with an average PM$_{2.5}$ reduction of 1.88 $\mu$g/m$^3$, a relative reduction of 12.35%, with a 95% confidence interval of 1.64-2.12 $\mu$g/m$^3$. Estimated effects were limited before 2013, became clearer from 2013 to 2017, and were strongest in 2018 and 2019. The findings suggest that sustained and cumulative regulatory interventions contributed to measurable improvements in London's air quality. This study demonstrates how uncertainty-aware causal AI can support environmental accountability, public health protection, and evidence-based governance for environmental decision-making.
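
The headline numbers can be cross-checked with a few lines of arithmetic. The sketch assumes the 12.35% relative reduction is measured against the mean counterfactual (no-regulation) concentration, which the abstract does not state explicitly, so the implied baseline is an inference rather than a reported figure.

```python
# Figures quoted in the abstract (micrograms per cubic metre).
abs_reduction = 1.88
rel_reduction = 0.1235
ci_low, ci_high = 1.64, 2.12

# Assumption: relative reduction = absolute reduction / counterfactual mean.
counterfactual_mean = abs_reduction / rel_reduction   # about 15.22
observed_mean = counterfactual_mean - abs_reduction   # about 13.34

# The same interval expressed relative to the implied counterfactual mean.
rel_ci_low = ci_low / counterfactual_mean             # about 0.108
rel_ci_high = ci_high / counterfactual_mean           # about 0.139

# The reported interval is symmetric around the point estimate.
ci_midpoint = (ci_low + ci_high) / 2
```

Under that assumption the implied no-regulation mean is roughly 15.2 $\mu$g/m$^3$, and the interval corresponds to a relative reduction of roughly 10.8% to 13.9%.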

2606.15255 2026-06-16 cs.RO New submission

OSDAG: Online Scheduling for Efficient Multi-Robot Collaboration

Thanh Nguyen Canh, Thang Tran Viet, Phuc Van Dinh, Xiem HoangVan, Nak Young Chong

Affiliations: Japan Advanced Institute of Science and Technology; University of Engineering and Technology, Vietnam National University; Hanyang University

AI summary: Proposes OSDAG, which combines LLM task reasoning with online DAG scheduling: an instruction is decomposed once into a dependency graph, and ready tasks are assigned to idle agents in real time. Reasoning is 5-15x faster than dialogue-based methods, and makespan drops by up to 38% over sequential baselines.

Abstract

Coordinating heterogeneous multi-robot systems (MRS) for complex, long-horizon tasks requires both flexible high-level reasoning and efficient low-level scheduling. Existing LLM-based approaches address the reasoning side but introduce two critical bottlenecks: (1) repeated LLM inference during execution, which inflates latency with agent count, and (2) offline, pre-committed scheduling, which forces robots to idle while waiting for sequentially ordered predecessors even when independent work is available. This paper presents OSDAG, a novel framework that integrates LLM-based task reasoning with Directed Acyclic Graph (DAG) representation and constraint-aware online scheduling. The LLM is invoked once to decompose a natural-language instruction into a dependency-annotated task graph, and a lightweight online scheduler then allocates ready tasks to idle agents in real time. The DAG representation encodes both precedence and resource constraints, ensuring correctness while exposing all available parallelism. Experiments across five benchmark scenarios demonstrate that OSDAG achieves 5-15x faster reasoning time compared to dialogue-based methods, reduces makespan by up to 38% over sequential baselines, and maintains competitive success rates. Both simulation and real-world experiments on dual-arm manipulation tasks validate the effectiveness and practicality of the proposed approach for efficient multi-robot coordination. The website and resources are available at http://thanhnguyencanh.github.io/LLM_DAG4MultiRobot
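
The scheduling half of this design can be illustrated with a greedy online list scheduler over a task DAG. This is a minimal sketch under stated assumptions, not the OSDAG scheduler: task names and durations are invented, every agent can run every task, and only precedence constraints are modeled (no resource constraints).

```python
import heapq
from typing import Dict, List, Tuple


def online_schedule(durations: Dict[str, float],
                    deps: Dict[str, List[str]],
                    n_agents: int) -> Tuple[float, List[Tuple[str, int, float]]]:
    """Whenever an agent is idle, hand it a task whose predecessors are all done.

    Returns the makespan and a log of (task, agent, start_time) assignments.
    """
    remaining = {t: len(deps.get(t, [])) for t in durations}
    children: Dict[str, List[str]] = {t: [] for t in durations}
    for task, preds in deps.items():
        for p in preds:
            children[p].append(task)

    ready = sorted(t for t, n in remaining.items() if n == 0)
    idle = list(range(n_agents))
    running: List[Tuple[float, str, int]] = []  # heap of (finish_time, task, agent)
    log: List[Tuple[str, int, float]] = []
    now, done = 0.0, 0

    while done < len(durations):
        while ready and idle:  # dispatch as many ready tasks as there are idle agents
            task, agent = ready.pop(0), idle.pop(0)
            heapq.heappush(running, (now + durations[task], task, agent))
            log.append((task, agent, now))
        if not running:
            raise ValueError("no runnable task: the dependency graph has a cycle")
        now, task, agent = heapq.heappop(running)  # advance to the next completion
        done += 1
        idle.append(agent)
        for child in children[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return now, log


# Hypothetical dual-arm job: two pick/place chains followed by a joint assembly.
durations = {"pick_a": 2.0, "pick_b": 2.0, "place_a": 1.0, "place_b": 1.0, "assemble": 3.0}
deps = {"place_a": ["pick_a"], "place_b": ["pick_b"], "assemble": ["place_a", "place_b"]}

makespan_two, _ = online_schedule(durations, deps, n_agents=2)  # 6.0
makespan_one, _ = online_schedule(durations, deps, n_agents=1)  # 9.0
```

In this toy job the two independent pick/place chains run in parallel, so two agents finish in 6 time units against 9 for one agent. The numbers only illustrate the mechanism and are unrelated to the paper's reported results.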

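2606.15253 2026-06-16 cs.CV New submission

Focus, Align, and Sustain: Counteracting Gradient Dilution in Incremental Object Detection

Aoting Zhang, Dongbao Yang, Chang Liu, Xiaopeng Hong, Yu Zhou

Affiliations: University of Science and Technology of China

AI summary: Proposes FAS, which counters the performance degradation caused by gradient dilution in incremental object detection: prior-injected queries focus discriminative signals, deterministic anchor distillation aligns assignments, and manifold-support replay sustains the old-class distribution.

Comments: Accepted by ICML 2026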

Abstract

Adapting Detection Transformers to Incremental Object Detection (IOD) poses a systemic challenge, as set-based optimization is inherently destabilized by sequential learning. In this work, we identify Gradient Dilution as the root cause of performance degradation, wherein optimization signals required to preserve old knowledge are progressively weakened. This phenomenon manifests as a cascading erosion of preservation gradients in magnitude, direction, and support coverage, driven by three tightly coupled factors: Signal Dispersion, where foreground gradients are overwhelmed by background noise; Assignment Drift, where stochastic query-target matching induces inconsistent gradient trajectories; and Support Attrition, where gradients from retained samples insufficiently cover the old-class feature space, weakening decision boundaries under interference from new classes. To counteract this, we propose FAS, a unified framework that Focuses, Aligns, and Sustains gradient flow throughout incremental learning. Specifically, we introduce prior-injected queries to focus discriminative signals by filtering background interference at the source. We further propose deterministic anchor distillation to align query-target assignments and enforce semantic consistency across stages under unstable matching. Finally, we devise manifold-support replay to sustain distributional support of old classes, counteracting representational erosion induced by continual updates. Extensive experiments show that FAS restores robust optimization dynamics and outperforms state-of-the-art methods, achieving over 5.0 AP improvement in the challenging 40+10x4 incremental setting.

2606.15251 2026-06-16 cs.RO cs.AI cs.LG New submission

Driving, Fast or Slow? Neuro-Symbolic Guidance for Motion Prediction in Multi-Modal Ground Mobility

Simon Kohaut, Felix Divo, Julius Hahnewald, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

Affiliations: Artificial Intelligence and Machine Learning Lab, TU Darmstadt; Honda Research Institute; Hessian Center for AI (hessian.AI); Centre for Cognitive Science; German Center for AI (DFKI); Uncertainty in Artificial Intelligence Lab, TU Eindhoven

AI summary: Proposes TraCS, a neuro-symbolic framework that encodes traffic regulations as probabilistic first-order logic to make black-box motion prediction models more interpretable and compliant, consistently improving state-of-the-art backbones on Argoverse 2.

Abstract

Accurate and interpretable motion prediction for heterogeneous traffic spaces, including pedestrians, bicycles, cars, and trucks, is essential for safe autonomous navigation. Nevertheless, state-of-the-art approaches remain predominantly black-box, lacking explicit encoding of the regulatory and behavioral constraints of real-world mobility. We propose Trajectory Compliance-Shaping (TraCS), a neuro-symbolic framework that augments existing black-box motion prediction backbones with interpretable and probabilistic first-order logic. To do so, TraCS employs an agentic code-generation pipeline to bridge the gap between natural-language descriptions of traffic regulations and probabilistic motion prediction. Furthermore, TraCS employs a reactive data-streaming inference engine that maintains and efficiently updates compliance landscapes as scenes evolve. To prevent TraCS from overconfidently steering the backbone's predictions in the wrong direction, we propose a neural confidence rating learned as a context-aware attenuation of the compliance signal. We demonstrate on the Argoverse 2 benchmark how TraCS consistently improves state-of-the-art prediction backbones, showing that probabilistic and symbolic compliance reasoning is a broadly applicable and computationally efficient complement to purely neural motion predictors.

2606.15250 2026-06-16 cs.CV cs.AI New submission

Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs

Zhisen Hu, Antti Kemppainen, David Johnson, Egor Panfilov, Huy Hoang Nguyen, Timothy Cootes, Claudia Lindner, Aleksei Tiulpin

Affiliations: Division of Informatics, Imaging and Data Sciences, The University of Manchester; Research Unit of Health Sciences and Technology, University of Oulu; Medical Research Center Oulu, University of Oulu and Oulu University Hospital; Department of Trauma and Orthopaedics, Stockport NHS Foundation Trust, Stepping Hill Hospital; School of Health and Society, University of Salford; School of Biological Sciences, The University of Manchester; Weill Cornell Medicine, Cornell University

AI summary: Proposes an Implicit Neural Shape Function (INSF) approach that needs no explicit landmarks: anatomical shape is encoded into a latent space and clinical alignment measurements are regressed directly from it. This automates lower-limb alignment assessment with performance comparable to existing methods and easy extension to new tasks.

Comments: Accepted to MICCAI 2026

Abstract

Radiographic assessment of lower-limb alignment (LLA) is important for predicting joint health and surgical outcomes in total knee arthroplasty. Traditional measurement methods are manual and time-consuming, while recent machine learning approaches typically rely on locating a fixed set of anatomical landmarks. This dependence limits flexibility and may require re-annotation when clinical definitions change. To address this, we propose an automated workflow using Implicit Neural Shape Functions (INSF). Rather than relying on explicit landmark coordinates, we encode the anatomy into a compact latent space and regress clinical alignment measurements directly from these latent codes. This architecture allows for rapid extendability to new tasks without altering the backbone representation. We trained our method on an internal dataset of 566 knee radiographs, each annotated with the outline of the femur and tibia. We evaluated it on both an internal test dataset of 50 patients and a separate external set of 402 preoperative cases from the MRKR dataset. Manual clinical measurements are available for these data, and the MRKR measurements will be made publicly accessible. Performance was comparable to state-of-the-art landmark-based methods and manual agreement, while offering a flexible shape representation that can be extended to additional measurement tasks.

2606.15247 2026-06-16 cs.LG cs.AI New submission

Exploring Starts Are Not Enough: Counterexamples and a Fix for Monte Carlo Exploring Starts

Octave Oliviers, Glenn Vinnicombe

Affiliations: Department of Engineering, University of Cambridge

AI summary: Constructs counterexamples showing that, in the tabular setting, Monte Carlo Exploring Starts (MCES) can converge to suboptimal solutions, and proposes a fix that scales learning rates on a state-by-state basis to restore convergence to optimality.

Abstract

The asymptotic behaviour of Monte Carlo Exploring Starts (MCES) is a long-standing open question in reinforcement learning, even in the tabular setting. We investigated the convergence properties of tabular MCES by constructing examples in which the algorithm converges to suboptimal solutions. This paper presents new counterexamples for both initial-visit and first-visit MCES and gives a convergence-restoring modification for the initial-visit case. We show that stable suboptimal solutions may exist for initial-visit MCES with sample-average updates even when greedy actions are updated more often than non-greedy actions on average. However, by scaling learning rates inversely to update frequencies on a state-by-state basis, convergence to optimality is guaranteed. Unlike previous uniformisation methods, this modification is applicable to large-scale problems that require approximating the estimated value function. We then extend the example to show that sample-average first-visit MCES may also converge to suboptimal solutions. This largely settles a fundamental open problem and shows that exploring starts alone do not guarantee convergence to optimality. More broadly, these results highlight that convergence depends critically on the relative size and frequency of updates applied to different actions, making the choice of learning rates and the balance between exploration and exploitation central to the analysis of MCES and the implementation of scalable Monte Carlo control methods.

2606.15244 2026-06-16 cs.LG New submission

M-CTX: Exact and Scalable Spatial Context Retrieval for Trajectory Analytics

Kun Ma, Qilong Han, Chengjing Song, Jingzheng Yao, Xiao Han, Yuee Zhou, Changmao Wu

Affiliations: Harbin Engineering University; Wuhan University of Technology; University of Chinese Academy of Sciences; Alibaba Group

AI summary: Proposes M-CTX, which recasts spatial context construction as a spatial database workload served by index-backed operators, achieving a 226x speed-up and removing the systems bottleneck of context construction for trajectory prediction.

Comments: 14 pages, 10 figures, 12 tables. Submitted to ICDE 2027

Abstract

Modern trajectory predictors increasingly condition on external spatial context, such as map geometry, signed distance fields (SDFs), and nearby moving agents. While this context improves prediction quality, constructing it for every training anchor has become a hidden systems bottleneck. In a representative maritime AIS pipeline, spatial context construction requires roughly 17 CPU-days for a 5.48M-anchor corpus, dominating the cost of the downstream predictor. We present M-CTX, an exact and scalable spatial context-retrieval framework for trajectory analytics. M-CTX recasts context construction as an ingest-once, query-many spatial database workload and replaces three brute-force stages -- OSM range retrieval, SDF computation, and moving-vessel neighbour lookup -- with composable, index-backed operators. Its learned range-index backend, BR-LZ, provides recall-complete MBR-overlap range retrieval and reduces candidate amplification by 1.1x--2.7x relative to global-expansion one-curve baselines. Across four maritime regions, eight baseline systems, synthetic workloads with up to 40M spatial features, and 10^7-record AIS streams, M-CTX reproduces the reference context exactly. On the 5.48M-anchor corpus, it reduces context construction from about 17 CPU-days to 1.8 hours, a measured 226x end-to-end speed-up. An optional storage mode further compresses SDF context by 64x with only a 0.04 m ADE change. These results establish exact spatial context retrieval as a first-class database problem in modern trajectory analytics. Code and datasets are publicly available at https://github.com/mark000071/M-CTX-Traj.
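
The reported speed-up is consistent with the two timings quoted in the abstract, as a quick calculation shows. The sketch treats "17 CPU-days" and "1.8 hours" as directly comparable durations, which is an assumption: the abstract does not say whether the 1.8 hours is CPU time or wall-clock time.

```python
# Timings quoted in the abstract.
baseline_hours = 17 * 24        # about 17 CPU-days = 408 hours
optimized_hours = 1.8
anchors = 5.48e6                # training anchors in the corpus

speedup = baseline_hours / optimized_hours                  # about 226.7

# Average cost per anchor before and after.
per_anchor_before_s = baseline_hours * 3600 / anchors       # about 0.268 s
per_anchor_after_ms = optimized_hours * 3600 / anchors * 1000  # about 1.18 ms
```

The ratio of about 226.7 matches the measured 226x once the rounding in "about 17 CPU-days" is taken into account.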

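2606.15243 2026-06-16 cs.CV New submission

SPARK: Spatial Policy-driven Adaptive Reinforcement learning for Knowledge distillation

Mohamed Jismy Aashik Rasool, Shabir Ahmad, Gisong Oh, Teag Kuen Whangbo

Affiliations: Gachon University

AI summary: Proposes SPARK, which uses a lightweight reinforcement learning policy network to allocate distillation effort adaptively: a spatial weight map modulates the knowledge distillation loss during quantization-aware training, improving low-bit quantized image restoration networks.

Comments: 13 pages, 3 figures, 5 tables, BMVC submission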

Abstract

Low-bit quantization enables deployment of image restoration (IR) networks on resource-constrained devices, but introduces rounding noise that disproportionately degrades high-frequency regions such as edges and fine textures. Existing knowledge distillation (KD) methods apply distillation signals uniformly across all spatial locations, overlooking the varying reconstruction difficulty across image regions. To address this, we propose SPARK (Spatial Policy-driven Adaptive Reinforcement Learning for Knowledge Distillation), a framework that adaptively allocates distillation effort using a lightweight reinforcement learning (RL) policy network. At each training step, a difficulty feature extractor computes four signals, namely Laplacian variance, pixel variance, student reconstruction error, and teacher-student knowledge gap, which are fed into a compact policy CNN that produces a stochastic spatial weight map to modulate the KD loss during quantization-aware training (QAT). SPARK is IR task-agnostic, adds no inference cost, and integrates into any existing QAT pipeline without architectural changes. Experiments on benchmark datasets demonstrate that SPARK consistently outperforms PTQ, QAT, and state-of-the-art (SOTA) KD approaches across multiple student architectures, achieving reconstruction quality closest to the full-precision teacher under significant computational constraints.
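
Two of the four difficulty signals named in the abstract, Laplacian variance and pixel variance, are standard image statistics and can be sketched in a few lines. The 4-neighbour Laplacian stencil, the border handling, and the function names are assumptions made here; the student error, the teacher-student gap, and the policy CNN are not modeled.

```python
from statistics import pvariance
from typing import List, Tuple

Patch = List[List[float]]


def laplacian_response(patch: Patch) -> List[float]:
    """4-neighbour Laplacian at every interior pixel (border pixels are skipped)."""
    h, w = len(patch), len(patch[0])
    out: List[float] = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out.append(patch[y - 1][x] + patch[y + 1][x]
                       + patch[y][x - 1] + patch[y][x + 1]
                       - 4.0 * patch[y][x])
    return out


def difficulty_signals(patch: Patch) -> Tuple[float, float]:
    """Return (Laplacian variance, pixel variance) for a patch of at least 3x3 pixels."""
    pixels = [v for row in patch for v in row]
    return pvariance(laplacian_response(patch)), pvariance(pixels)


flat = [[0.5] * 4 for _ in range(4)]              # smooth region
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]   # vertical step edge

flat_signals = difficulty_signals(flat)   # (0.0, 0.0)
edge_signals = difficulty_signals(edge)   # (1.0, 0.25)
```

The flat patch scores zero on both signals while the step edge scores high, which is the kind of contrast a spatial weighting scheme could use to direct more distillation effort to edges and textures.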

2606.15240 2026-06-16 cs.LG 新提交

EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction

EnvShip-Bench:一种环境增强的短期船舶轨迹预测基准

Kun Ma, Qilong Han, Chengjing Song, Jingzheng Yao, Hao Wang, Changmao Wu

发表机构 * Harbin Engineering University(哈尔滨工程大学) Politecnico di Torino(都灵理工大学) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所)

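AI Summary To address the lack of unified protocols and environmental context in existing vessel trajectory prediction benchmarks, proposes EnvShip-Bench, built from raw AIS data from the Danish Maritime Authority and NOAA, with a standardized forecasting protocol and environmental and nearby-vessel context extensions that support unified evaluation of trajectory-only, environment-aware, and interaction-aware forecasting.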

Comments Submitted to ACM MM 2026

详情
AI中文摘要

船舶轨迹预测对于智能航运、海上监视和航行安全至关重要。然而,现有的公共海事AIS资源通常受到预测协议不一致、数据质量不均匀以及缺乏基准就绪的上下文注释的限制,这阻碍了公平比较和上下文感知建模。为解决这一问题,我们提出了EnvShip-Bench,这是一个用于短期船舶轨迹预测的统一基准,通过通用处理流程从丹麦海事局(DMA)和美国国家海洋和大气管理局(NOAA)的大规模原始AIS数据构建而成。EnvShip-Bench采用标准化的预测协议,包括10分钟观测、10分钟预测和20秒采样,使用以船舶为中心的局部公制坐标。除了大规模核心基准外,它还提供了一个质量优先的紧凑子集,用于高效且可重复的实验,以及同步的环境和邻近船舶上下文扩展。因此,EnvShip-Bench在统一评估框架下支持仅轨迹、环境感知和交互感知预测。广泛的基准统计和分析表明,EnvShip-Bench为海上轨迹预测研究提供了标准化、可扩展且上下文感知的基础。

英文摘要

Vessel trajectory prediction is important for intelligent shipping, maritime surveillance, and navigation safety. However, existing public maritime AIS resources are often limited by inconsistent forecasting protocols, uneven data quality, and the lack of benchmark-ready contextual annotations, which hinder fair comparison and context-aware modeling. To address this gap, we present EnvShip-Bench, a unified benchmark for short-term vessel trajectory prediction built from large-scale raw AIS data from the Danish Maritime Authority (DMA) and NOAA through a common processing pipeline. EnvShip-Bench adopts a standardized forecasting protocol with 10 minutes of observation, 10 minutes of prediction, and 20-second sampling in vessel-centric local metric coordinates. Beyond the large-scale core benchmark, it provides a quality-first compact subset for efficient and reproducible experimentation, together with synchronized environmental and nearby-vessel context extensions. As a result, EnvShip-Bench supports trajectory-only, environment-aware, and interaction-aware forecasting under a unified evaluation framework. Extensive benchmark statistics and analysis demonstrate that EnvShip-Bench offers a standardized, extensible, and context-aware foundation for maritime trajectory forecasting research.
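
The protocol in the abstract (10 minutes observed, 10 minutes predicted, 20-second sampling) implies 30 observed and 30 predicted steps per sample. The sketch below shows one way such a sample could be cut from a resampled track and expressed in vessel-centric local metric coordinates. The equirectangular projection and the choice of the last observed fix as the origin are assumptions made for illustration; the benchmark's actual pipeline may differ.

```python
import math

OBS_SECONDS = 600       # 10 minutes of observation (from the abstract)
PRED_SECONDS = 600      # 10 minutes of prediction (from the abstract)
STEP_SECONDS = 20       # 20-second sampling (from the abstract)
OBS_STEPS = OBS_SECONDS // STEP_SECONDS     # 30 steps
PRED_STEPS = PRED_SECONDS // STEP_SECONDS   # 30 steps
EARTH_RADIUS_M = 6_371_000.0

def to_local_metres(lat, lon, lat0, lon0):
    """Equirectangular projection of (lat, lon) around the anchor (lat0, lon0)."""
    east = math.radians(lon - lon0) * EARTH_RADIUS_M * math.cos(math.radians(lat0))
    north = math.radians(lat - lat0) * EARTH_RADIUS_M
    return east, north

def make_sample(track, start):
    """Split a resampled track into (observation, prediction) in local metres.

    `track` is a list of (lat, lon) fixes already resampled to STEP_SECONDS.
    The anchor is the last observed fix, so it maps to (0.0, 0.0).
    """
    window = track[start:start + OBS_STEPS + PRED_STEPS]
    if len(window) < OBS_STEPS + PRED_STEPS:
        raise ValueError("track too short for one sample")
    lat0, lon0 = window[OBS_STEPS - 1]
    local = [to_local_metres(lat, lon, lat0, lon0) for lat, lon in window]
    return local[:OBS_STEPS], local[OBS_STEPS:]
```

Anchoring at the last observed fix keeps coordinates small and comparable across regions, which is one plausible reason to prefer a vessel-centric frame over raw latitude and longitude.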

2606.15239 2026-06-16 cs.RO cs.CY cs.HC 新提交

Co-Creating Buildable and Open Social Robot Study Companions with University Students

与大学生共同创造可构建且开放的社会机器人学习伙伴

Farnaz Baksh, Matevž B. Zorec, Feiazie Baksh, Karl Kruusamäe

发表机构 * University of Tartu(塔尔图大学) University of Guyana Robotics Club(圭亚那大学机器人俱乐部)

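AI Summary To lower the build barrier of open-source robots, applies the Double Diamond framework to co-design the Robot Study Companion v4.1 with university students; assembly- and disassembly-oriented features such as twist-lock fasteners and snap-fit joints raise system usability from Poor to Excellent (SUS 59.4 to 89.4), with perceived workload trending downward.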

Comments Accepted for 18th International Conference on Social Robotics (ICSR + ART 2026), London, UK | 1-4 July 2026

详情
AI中文摘要

开源社会机器人提供了可访问性、可修复性和学生赋权,但构建本身往往是一个障碍。现有平台要么预组装发货,排除了动手学习的机会,要么让学生面对不熟悉的紧固件、不透明的布线和难以触及的维修点,从而削弱了参与度。针对性的机械重新设计能否在保持结构完整性的同时降低这一障碍,尚未得到验证。在这里,我们展示了面向装配的设计(DfA)和面向拆卸的设计(DfD)干预措施在缩短构建时间之前,先改变了构建的体验感受。与圭亚那和爱沙尼亚的大学生合作,我们应用双钻石框架共同创造了机器人学习伴侣(RSC)v4.1:映射痛点,然后围绕扭锁紧固件、卡扣接头和免工具维修锁扣重新设计其底盘。在两项涉及开发者和首次构建者的研究中,系统可用性从差提升至优(SUS 59.4到89.4),感知工作负荷呈下降趋势(NASA-TLX 4.29到4.00),平均组装时间呈下降趋势(21.4到13.7分钟,含初学者的学习效应),同时为首次构建者提供的方向提示和导航连续性成为下一个文档化前沿。感知工作负荷,而非完成时间,似乎是决定学生是否接受开源硬件的关键因素。

英文摘要

Open-source social robots offer accessibility, repairability, and student empowerment, yet the build itself often presents a barrier. Existing platforms either ship pre-assembled, foreclosing hands-on learning, or expose students to unfamiliar fasteners, opaque wiring, and inaccessible service points that erode engagement. Whether targeted mechanical redesign can lower this barrier whilst maintaining structural integrity remains untested. Here we show that Design for Assembly (DfA) and Design for Disassembly (DfD) interventions reshape how a build feels before they shorten how long it takes. Working with university students in Guyana and Estonia, we applied the Double Diamond framework to co-create the Robot Study Companion (RSC) v4.1: mapping pain points, then redesigning its chassis around twist-lock fasteners, snap-fit joints, and tool-free service latches. Across two studies with developers and first-time builders, system usability climbed from Poor to Excellent (SUS 59.4 to 89.4), perceived workload trended downward (NASA-TLX 4.29 to 4.00), and mean assembly time trended downward (21.4 to 13.7 minutes, with juniors' learning effect), whilst orientation cues and navigation continuity for first-time builders emerged as the next documentation frontier. Perceived workload, not completion time, appears to govern whether students take up open hardware.

2606.15232 2026-06-16 cs.RO 新提交

Rethinking Implicit Spatial Representation in Visuomotor Policy Learning

重新思考视觉运动策略学习中的隐式空间表示

Xiangyu Chen, Yuxuan Hu, Chuhao Zhou, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University(南洋理工大学MARS实验室)

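AI Summary Re-evaluates spatial softmax pooling for robotic manipulation, finds that it yields compact and stable spatial representations but is limited by a representation bottleneck, and proposes the PRISM encoder, which improves performance by fusing multiscale implicit spatial information.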

详情
AI中文摘要

基于生成模型的模仿学习已成为机器人操作广泛采用的范式,其中策略性能关键取决于条件视觉表示。尽管空间softmax表示已被用于先前的视觉运动策略,但其有效性和潜在机制仍未被充分理解。本文重新思考空间softmax池化的使用:这种隐式空间表示是否为机器人操作提供了有效且稳定的视觉特征?通过对视觉编码器中不同池化方法的系统研究,我们发现这种池化操作产生紧凑且稳定的空间表示,尽管使用更少的维度,但优于特征值表示。互补的显著性分析进一步表明,这些空间表示引导编码器更一致地关注任务相关区域。然而,这一优势受到当前视觉编码器中表示瓶颈的限制:重复的下采样操作在动作生成模块使用之前削弱了细粒度空间信息,尤其是在低分辨率观测下。受这些发现的启发,我们提出PRISM,一种通过自上而下的交叉注意力融合保留多尺度隐式空间信息的视觉编码器。跨多个任务和策略骨干的实验显示出一致的改进。特别是在低分辨率、高精度的ToolHang任务中,PRISM显示出明显的增益,将平均成功率从5.0%提高到13.4%,同时参数仅增加15.4%。这些结果支持将多尺度隐式空间表示作为机器人操作的有效且高效的设计原则。

英文摘要

Generative model-based imitation learning has become a widely adopted paradigm for robotic manipulation, where policy performance depends critically on the conditioned visual representations. Although spatial softmax-based representations have been adopted in prior visuomotor policies, their effectiveness and underlying mechanisms remain insufficiently understood. This work rethinks the use of spatial softmax pooling: do such implicit spatial representations provide effective and stable visual features for robotic manipulation? Through systematic studies of different pooling methods in visual encoders, we find that this pooling operation produces compact and stable spatial representations, which outperform feature-value representations, despite using substantially fewer dimensions. Complementary saliency analysis further suggests that these spatial representations guide the encoder to focus more consistently on task-relevant regions. However, this advantage is limited by a representation bottleneck in current visual encoders: repeated downsampling operations weaken fine-grained spatial information before the action-generation module can use it, especially under low-resolution observations. Motivated by these findings, we propose PRISM, a visual encoder that preserves multiscale implicit spatial information through top-down cross-attention fusion. Experiments across multiple tasks and policy backbones show consistent improvements. In particular, on the low-resolution, high-precision ToolHang task, PRISM shows clear gains, improving the average success rate from 5.0% to 13.4% while increasing parameters by only 15.4%. These results support the use of multiscale implicit spatial representations as an effective and efficient design principle for robotic manipulation.
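
For readers unfamiliar with spatial softmax pooling, the sketch below shows the standard operation for a single channel: a softmax over all pixel locations followed by the expected (x, y) coordinate. It illustrates the generic technique only, not PRISM or the paper's encoder; the `temperature` parameter and the [-1, 1] coordinate convention are common choices assumed here.

```python
import math

def spatial_softmax_keypoint(feature_map, temperature=1.0):
    """Return the expected (x, y) location of one channel, each in [-1, 1].

    `feature_map` is a 2D list with at least two rows and two columns.
    """
    h, w = len(feature_map), len(feature_map[0])
    peak = max(max(row) for row in feature_map)   # subtract the max for stability
    weights = [[math.exp((v - peak) / temperature) for v in row]
               for row in feature_map]
    total = sum(sum(row) for row in weights)
    x = y = 0.0
    for i, row in enumerate(weights):
        for j, wgt in enumerate(row):
            p = wgt / total                       # softmax probability of pixel (i, j)
            x += p * (2.0 * j / (w - 1) - 1.0)
            y += p * (2.0 * i / (h - 1) - 1.0)
    return x, y
```

Each channel collapses to two numbers, so a C-channel map yields 2C values regardless of resolution. This is the compactness the abstract contrasts with feature-value representations.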

2606.15231 2026-06-16 cs.AI 新提交

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Visual-Seeker:通过主动视觉推理实现视觉原生多模态智能搜索

Zhengbo Zhang, Changtao Miao, Jinbo Su, Zhaowen Zhou, Chunxia Zhang, Xukai Wang, Ruiqi Liu, Kaiyuan Zheng, Jiansheng Cai, Bo Zhang, Zhe Li, Shiming Xiang, Ying Yan

发表机构 * School of Artificial Intelligence UCAS(中国科学院大学人工智能学院) Institute of Automation CAS(中国科学院自动化研究所) Ant Digital Technologies Ant Group(蚂蚁数字科技蚂蚁集团) RUC(中国人民大学) BIT(北京理工大学)

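AI Summary Proposes Visual-Seeker, an agent for visual-native multimodal deep search via active visual reasoning, which reaches state-of-the-art performance on five benchmarks and even surpasses several proprietary models.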

详情
AI中文摘要

多模态大语言模型(MLLMs)在许多视觉任务中展示了令人印象深刻的能力,但在面对复杂、开放世界场景时,它们常常在事实性基础上挣扎。尽管最近的多模态深度搜索智能体试图通过利用外部工具来解决这个问题,但视觉原生搜索范式仍未得到充分探索。现有方法主要依赖于具有显式语义的简单图像和纯文本证据轨迹,限制了智能体执行多跳、跨模态推理和搜索的能力。为了解决这些限制,我们提出了Visual-Seeker,一种通过主动视觉推理的视觉原生多模态深度搜索智能体。我们的智能体不是将视觉视为静态输入,而是主动关注细粒度的视觉细节,在搜索过程中动态收集视觉证据。为了释放其视觉原生潜力,我们设计了一个主动视觉推理数据管道,并合成了5K高质量的多模态轨迹用于模型训练。大量实验表明,在五个具有挑战性的多模态搜索基准上,我们的方法达到了最先进的性能,甚至超越了多个专有模型,验证了在真实网络环境中鲁棒的视觉原生推理和搜索能力。代码和数据可在 https://github.com/ZhengboZhang/Visual-Seeker 获取。

英文摘要

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent via active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, and validate robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.

2606.15225 2026-06-16 cs.LG cs.AI cs.IR 新提交

Edu-Theater: A Data-Efficient Agent Framework for Scalable Learner Behavior Simulation through Staging Roll-Call

Edu-Theater: 一种通过点名排演实现可扩展学习者行为模拟的数据高效智能体框架

Weibo Gao, Qi Liu, Linan Yue, Zheng Zhang, Yichao Du, Fangzhou Yao, Ao Yu, Zhenya Huang, Shijin Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室) Southeast University(东南大学) Alibaba Group(阿里巴巴集团) iFLYTEK Co., Ltd.(科大讯飞股份有限公司)

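AI Summary Proposes Edu-Theater, a framework in which LLM agents simulate learner behavior from cohort-level proficiency priors and a small number of diagnostic queries, raising simulation accuracy with less data and improving downstream applications such as adaptive testing.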

Comments LLM Agent, Educational Data Mining, Data Synthesis, Human Simulation

详情
AI中文摘要

大规模学习者-任务交互数据对智能教育系统至关重要,但收集成本高且受隐私和学习者参与度限制。学习模拟器在无需真实学习者持续参与的情况下,对模拟可扩展的学习者行为起着关键作用。然而,现有方法主要是以个体为中心,为每个学习者配对模拟器,从密集的交互历史中迭代推断潜在知识状态,这既数据密集又计算密集,且在冷启动场景中脆弱。我们提出一种群体感知的点名模拟范式,首先构建群体水平的能力先验,然后通过少量有针对性的诊断查询细化个体学习者状态。基于该范式,我们引入Edu-Theater,一个由LLM驱动的智能体系统,通过教师智能体和基于学习者日志的回顾性点名探测执行群体感知的学习者模拟。Edu-Theater无需每个学习者的密集历史即可实现可扩展的未来行为模拟。在两个真实世界数据集上的实验表明,Edu-Theater以显著更少的LLM调用实现了更高的模拟精度,生成的合成数据增强了自适应测试等下游应用。

英文摘要

Large-scale learner-task interaction data are crucial for intelligent educational systems but are costly to collect and constrained by privacy and learner engagement. Learner simulators play a critical role in simulating scalable learner behavior without the need for continuous involvement of real learners. However, existing methods are predominantly individual-centric, pairing a simulator with each learner to iteratively infer latent knowledge states from dense interaction histories, which is both data- and computation-intensive, and fragile in cold-start scenarios. We propose a cohort-aware roll-call simulation paradigm that first constructs cohort-level proficiency priors and refines individual learner states through a small number of targeted diagnostic queries. Based on this paradigm, we introduce Edu-Theater, an LLM-powered agent system that performs cohort-aware learner simulation via a teacher agent and retrospective roll-call probing over learner logs. Edu-Theater enables scalable future behavior simulation without the need for dense per-learner histories. Experiments on two real-world datasets demonstrate that Edu-Theater achieves higher simulation accuracy with significantly fewer LLM calls, producing synthetic data that enhances downstream applications such as adaptive testing.

2606.15219 2026-06-16 cs.LG cs.DS math.ST stat.ML stat.TH 新提交

Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model

神经网络能否实现最优计算-统计权衡?基于单指标模型的分析

Siyu Chen, Beining Wu, Miao Lu, Zhuoran Yang, Tianhao Wang

发表机构 * Department of Statistics and Data Science, Yale University(耶鲁大学统计与数据科学系) Department of Statistics, University of Chicago(芝加哥大学统计系) Department of Management Science and Engineering, Stanford University(斯坦福大学管理科学与工程系) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所)

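AI Summary Proposes a unified gradient-based algorithm that trains a two-layer neural network to learn Gaussian single-index models in polynomial time, with sample complexity matching the SQ lower bound up to polylogarithmic factors, and extends it to the sparse setting.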

Comments 96 pages, 4 figures

详情
AI中文摘要

在这项工作中,我们解决以下问题:基于梯度的神经网络训练能否在学习高斯单指标模型时实现最优计算-统计权衡?先前研究表明,统计查询框架下的任何多项式时间算法需要$\Omega(d^{s^\star/2}\lor d)$个样本,其中$s^\star$是生成指数,代表学习潜在模型的内在难度。然而,神经网络能否达到这一样本复杂度尚不清楚。受先前学习单指标模型的技术(如标签变换和景观平滑)启发,我们提出了一种统一的梯度算法,用于在多项式时间内训练两层神经网络。我们的方法适用于多种损失函数和激活函数,涵盖了广泛现有方法。我们证明,该算法学习到的特征表示与未知信号$\theta^\star$高度对齐,样本复杂度为$\widetilde{O}(d^{s^\star/2} \lor d)$,对于所有生成指数$s^\star\geq 1$,与SQ下界仅差多对数因子。此外,我们通过引入一种利用稀疏结构的新型权重扰动技术,将方法扩展到$\theta^\star$为$k$-稀疏($k = o(\sqrt{d})$)的情形。我们推导出相应的SQ下界为$\widetilde{\Omega}(k^{s^\star})$,我们的方法与之匹配至多对数因子。我们的框架,特别是权重扰动技术,具有独立意义,并暗示了其他问题(如稀疏张量PCA)的潜在梯度解法。

英文摘要

In this work, we tackle the following question: Can neural networks trained with gradient-based methods achieve the optimal computational-statistical tradeoff in learning Gaussian single-index models? Prior research has shown that any polynomial-time algorithm under the statistical query (SQ) framework requires $\Omega(d^{s^\star/2}\lor d)$ samples, where $s^\star$ is the generative exponent representing the intrinsic difficulty of learning the underlying model. However, it remains unknown whether neural networks can achieve this sample complexity. Inspired by prior techniques such as label transformation and landscape smoothing for learning single-index models, we propose a unified gradient-based algorithm for training a two-layer neural network in polynomial time. Our method is adaptable to a variety of loss and activation functions, covering a broad class of existing approaches. We show that our algorithm learns a feature representation that strongly aligns with the unknown signal $\theta^\star$, with sample complexity $\widetilde{O}(d^{s^\star/2} \lor d)$, matching the SQ lower bound up to a polylogarithmic factor for all generative exponents $s^\star\geq 1$. Furthermore, we extend our approach to the setting where $\theta^\star$ is $k$-sparse for $k = o(\sqrt{d})$ by introducing a novel weight perturbation technique that leverages the sparsity structure. We derive a corresponding SQ lower bound of order $\widetilde{\Omega}(k^{s^\star})$, matched by our method up to a polylogarithmic factor. Our framework, especially the weight perturbation technique, is of independent interest, and suggests potential gradient-based solutions to other problems such as sparse tensor PCA.
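
To make the two rates in the abstract concrete, the helper below evaluates the dense scale $d^{s^\star/2} \lor d$ and the sparse scale $k^{s^\star}$ for given inputs. It is plain arithmetic on the stated bounds with constants and polylogarithmic factors dropped; the function names are hypothetical and the numbers say nothing about any particular link function.

```python
def dense_sample_scale(d, s_star):
    """max(d^(s*/2), d): the dense-case rate, ignoring constants and polylog factors."""
    return max(d ** (s_star / 2), d)

def sparse_sample_scale(k, s_star):
    """k^(s*): the sparse-case SQ lower-bound scale, ignoring polylog factors."""
    return k ** s_star

if __name__ == "__main__":
    d, k = 10_000, 50
    for s_star in (1, 2, 4):
        print(s_star, dense_sample_scale(d, s_star), sparse_sample_scale(k, s_star))
```

For $s^\star \le 2$ the dense rate is linear in $d$, and it grows as $d^{s^\star/2}$ beyond that. When $k$ is well below $\sqrt{d}$, the sparse scale $k^{s^\star}$ is smaller than $d^{s^\star/2}$.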

2606.15216 2026-06-16 cs.CL cs.AI 新提交

Spokes: Optimizing for Diverse Pretraining Data Selection

Spokes: 优化多样化预训练数据选择

Clarence Lee, Yejin Choi, Luke Zettlemoyer, Pang Wei Koh, Hai Leong Chieu

发表机构 * DSO National Laboratories(DSO国家实验室) Stanford University(斯坦福大学) University of Washington(华盛顿大学)

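AI Summary Proposes a probabilistic diversification framework based on the G-Vendi score that directly optimizes data diversity via exponentiated gradient descent; jointly optimizing quality and diversity improves average downstream performance by 1.5 points on DCLM and 1.4 points on FineWeb.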

Comments 9 pages, 4 figures

详情
AI中文摘要

多样性在数据选择中起着关键作用,通过减少冗余和重复,在固定数据预算下提高性能。然而,优化多样性本身具有挑战性,因为它是集合级属性,依赖于数据点之间的交互而非单个示例。因此,现有方法通常依赖代理或近似,往往无法确保足够多样化的子集。在这项工作中,我们通过引入基于G-Vendi分数的概率多样化框架,并利用指数梯度下降进行优化,直接优化多样性。我们的方法生成的子集比通过随机抽样获得的子集多样化得多,在50万样本子集上实现了G-Vendi分数增加489。我们在FineWeb和DCLM上评估了我们的方法,它持续优于现有方法。值得注意的是,SPOKES(仅多样性)在DCLM和FineWeb上分别比随机抽样提高了平均下游性能0.4和0.5个点。更重要的是,联合优化质量和多样性取得了最强结果:SPOKES在DCLM和FineWeb上分别取得了1.5和1.4个点的提升,优于所有基线,包括语义去重和质量过滤。

英文摘要

Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property that depends on interactions between data points rather than individual examples. As a result, existing approaches typically rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. In this work, we directly optimize diversity by introducing a probabilistic diversification framework based on the G-Vendi score, optimized via exponentiated gradient descent. Our method produces subsets that are substantially more diverse than those obtained via random sampling, achieving a +489 increase in G-Vendi score on a 500k-sample subset. We evaluate our approach on FineWeb and DCLM, where it consistently outperforms existing methods. Notably, SPOKES (diversity-only) improves average downstream performance by +0.4 and +0.5 points over random sampling on DCLM and FineWeb, respectively. More importantly, jointly optimizing for both quality and diversity yields the strongest results: SPOKES achieves gains of +1.5 and +1.4 points on DCLM and FineWeb, outperforming all baselines, including semantic deduplication and quality filtering.
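
The abstract combines a Vendi-style diversity score with exponentiated gradient updates over selection probabilities. The sketch below illustrates that combination on a toy problem. It is not the SPOKES objective: it substitutes the order-2 Vendi score (the inverse Simpson index of the kernel spectrum), chosen here because it needs no eigendecomposition, and it uses plain unit-norm embeddings rather than whatever representation G-Vendi is built on. All names are hypothetical.

```python
import math

def order2_vendi(weights, sim_sq):
    """Order-2 Vendi score 1 / (w^T S w), where S[i][j] = <x_i, x_j>^2."""
    q = sum(wi * wj * sim_sq[i][j]
            for i, wi in enumerate(weights)
            for j, wj in enumerate(weights))
    return 1.0 / q

def eg_step(weights, sim_sq, lr=0.5):
    """One exponentiated-gradient ascent step on log(order2_vendi)."""
    sw = [sum(s * w for s, w in zip(row, weights)) for row in sim_sq]
    q = sum(w * v for w, v in zip(weights, sw))
    grads = [-2.0 * v / q for v in sw]            # d/dw_i of -log(w^T S w)
    scaled = [w * math.exp(lr * g) for w, g in zip(weights, grads)]
    z = sum(scaled)
    return [v / z for v in scaled]                # renormalise onto the simplex

# Toy data: three duplicate points and one distinct point (unit-norm embeddings).
embeddings = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
sim_sq = [[sum(a * b for a, b in zip(u, v)) ** 2 for v in embeddings]
          for u in embeddings]
weights = [0.25] * len(embeddings)
before = order2_vendi(weights, sim_sq)
for _ in range(50):
    weights = eg_step(weights, sim_sq)
after = order2_vendi(weights, sim_sq)
```

On this toy set the update moves probability mass away from the duplicated point until the two distinct directions carry equal mass, raising the score from 1.6 under uniform sampling toward its maximum of 2. The multiplicative form keeps the weights positive and normalised without a projection step.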

2606.15209 2026-06-16 cs.AI cs.CR 新提交

Attribute Inference from Interactive Targeted Ads

从互动定向广告中进行属性推断

Peihao Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

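AI Summary Models interactive targeted ads as a noisy channel for user attribute inference, evaluates Bayesian, supervised, positive-unlabeled, and adaptive attacks on a synthetic benchmark, and finds disclosure policy to be the strongest control.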

详情
AI中文摘要

定向广告系统可以将广告主选择的受众与展示可见用户操作的广告单元配对。当互动仍然与引发它的广告活动相关联时,广告主可能会收到与用户相关的观察结果,而不仅仅是汇总报告。我们将该渠道建模为用于属性推断的噪声预言机。该模型区分了定向谓词、曝光、互动和披露。这些边界捕捉了资格与投放之间的差距,以及互动与广告主可见性之间的差距。我们使用公共数据校准的合成群体构建了一个可重复的基准,每个群体都有已知的敏感标签。生成的广告活动语义层提供了主题变体和响应先验。模拟器生成真实情况、事件轨迹、披露观察结果和指标。评估比较了在常见广告活动和披露定义下的贝叶斯、监督、正无标签和自适应攻击。最终评估使用了四个主题变体、七个模拟器种子和两种互动设置。具有身份曝光的重复广告活动产生了可测量但有界的推断信号。在160次广告活动中,贝叶斯和监督攻击在主要设置中达到约0.64 AUC,在更高互动设置中达到约0.65 AUC。披露政策是最强的控制手段。汇总报告消除了与用户相关的评估预言机输入。类型过滤和随机披露减少了释放的信号。结果是针对互动定向广告中隐私的模型、工件和防御评估方法。代码可在 https://github.com/P-HOW/Interactive-Ad-Oracle 获取。

英文摘要

Targeted advertising systems can pair audiences selected by advertisers with ad units that expose visible user actions. When an interaction remains linked to the campaign that elicited it, the advertiser may receive an observation tied to a user rather than only an aggregate report. We model that channel as a noisy oracle for attribute inference. The model separates targeting predicates, exposure, interaction, and disclosure. These boundaries capture the gap between eligibility and delivery, and the gap between interaction and advertiser visibility. We build a reproducible benchmark using synthetic populations calibrated with public data, each with known sensitive labels. A generated campaign semantics layer provides topic variants and response priors. The simulator generates the ground truth, event traces, disclosed observations, and metrics. The evaluation compares Bayesian, supervised, positive and unlabeled, and adaptive attacks under common campaign and disclosure definitions. The final evaluation uses four topic variants, seven simulator seeds, and two interaction settings. Repeated campaigns with identity exposure produce measurable but bounded inference signal. At 160 campaigns, Bayesian and supervised attacks reach about 0.64 AUC in the main setting and about 0.65 AUC in the higher-interaction setting. Disclosure policy is the strongest control. Aggregate reporting removes the evaluated oracle input tied to users. Type filtering and randomized disclosure reduce the released signal. The result is a model, artifact, and defense evaluation method for privacy in interactive targeted advertising. The code is available at https://github.com/P-HOW/Interactive-Ad-Oracle.
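
The "noisy oracle" framing can be illustrated with a textbook Bayesian update: each disclosed interaction (or non-interaction) shifts the advertiser's belief about a binary attribute. The sketch below is a simplification made for illustration, not the paper's attack. It assumes fixed interaction rates `p_pos` and `p_neg` for users with and without the attribute and independence across campaigns. The paper's simulator uses response priors and topic variants that are not modelled here.

```python
import math

def posterior_after_campaigns(prior, observations, p_pos, p_neg):
    """Posterior P(attribute = 1) after a sequence of 0/1 interaction outcomes.

    p_pos: assumed P(interaction | attribute = 1)
    p_neg: assumed P(interaction | attribute = 0)
    """
    log_odds = math.log(prior / (1.0 - prior))
    for interacted in observations:
        if interacted:
            log_odds += math.log(p_pos / p_neg)
        else:
            log_odds += math.log((1.0 - p_pos) / (1.0 - p_neg))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

When `p_pos` and `p_neg` are close, each campaign contributes little evidence, which fits the abstract's description of a measurable but bounded signal. If only aggregate reports are disclosed, `observations` is empty for every user and the posterior stays at the prior.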

2606.15207 2026-06-16 cs.LG cs.AI cs.NE 新提交

Controlled Dynamics Attractor Transformer

受控动力学吸引子Transformer

Cheng Zhang, Minnan Luo, Zesheng Yang, Ming Li, Yong-Jin Liu, Qinghua Zheng

发表机构 * Xi'an Jiaotong University(西安交通大学) Tsinghua University(清华大学)

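AI Summary Proposes the Controlled Dynamics Attractor Transformer (CDAT), which couples a mixture von Mises-Fisher attention energy with a Hopfield refinement energy and adds CANN-inspired excitation-inhibition modulation, yielding a topology-constrained dynamical system that reaches state-of-the-art performance on graph anomaly detection and graph classification.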

Comments 20 pages, 3 figures

Journal ref Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

Transformer架构通过自注意力机制在深度模型的表示学习和推理方面取得了显著进展。同时,联想记忆(AM)框架将表示映射到能量景观上,提供了可解释的检索机制。然而,其连续时间推理动力学缺乏经典连续吸引子神经网络(CANN)的生物合理性。为弥合这一差距,我们提出了受控动力学吸引子Transformer(CDAT),它将混合von Mises-Fisher(Mo-vMF)注意力能量与Hopfield精炼能量耦合,同时通过CANN启发的兴奋-抑制调制增强能量下降。CDAT实例化了一个拓扑约束的动力学系统,其耦合编码了标记之间的关系结构,从而将吸引子式动力学与现代基于能量的注意力联系起来。我们进一步提供了构造性的耗散分析,以正式建立其受控推理动力学。得益于这些鲁棒且结构化的动力学,CDAT在图异常检测和图分类的多个基准测试中达到了最先进的性能。

英文摘要

Transformer architectures have dramatically advanced representation learning and inference in deep models through self-attention mechanisms. In parallel, associative memory (AM) frameworks map representations onto energy landscapes, offering interpretable retrieval mechanisms. However, their continuous-time inference dynamics lack the biological plausibility of classical Continuous Attractor Neural Networks (CANNs). To bridge this gap, we propose the Controlled Dynamics Attractor Transformer (CDAT), which couples a mixture-of-von-Mises-Fisher (Mo-vMF) attention energy with a Hopfield refinement energy, while augmenting energy descent with a CANN-inspired excitation-inhibition modulation. CDAT instantiates a topology-constrained dynamical system whose couplings encode relational structure among tokens, thereby linking attractor-style dynamics to modern energy-based attention. We further provide a constructive dissipation analysis to formally establish its controlled inference dynamics. Benefiting from these robust and structured dynamics, CDAT achieves state-of-the-art performance across multiple benchmarks in graph anomaly detection and graph classification.

2606.15202 2026-06-16 cs.CV 新提交

Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

安全相关环境中的人类注视与视觉语言模型注意力的比较

Marta Vallejo, Siwen Wang

发表机构 * Heriot-Watt University(赫瑞-瓦特大学)

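AI Summary Compares human and model attention in safety-relevant scenes using eye-tracking experiments and vision-language models such as GPT-4o, finding that the models approximate human gaze patterns without eye-tracking training data.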

Comments 30 pages, 33 figures. Submitted as a preprint. Code and data available upon reasonable request

详情
AI中文摘要

人类视觉注意力在人们感知和响应包含潜在风险的环境时起着重要作用。本研究探讨大型视觉语言模型是否能识别安全相关环境中吸引人类注意力的相同场景区域。使用Pupil Invisible可穿戴眼镜收集了十名参与者观看33张代表不同潜在风险水平的环境场景图像的眼动数据。将注视坐标映射到刺激图像上,生成群体平均的人类注视热图。同时,通过OpenAI视觉应用程序接口(API)提示GPT-4o生成视觉注意力的空间预测,并将其转换为显著性图,以便与人类注视模式进行比较。使用四种互补指标评估人类注视热图与模型生成的显著性图之间的空间对齐:皮尔逊相关系数(r = 0.515 ± 0.117)、归一化扫描路径显著性(NSS = 0.988 ± 0.323)、Kullback-Leibler散度(KL = 1.766 ± 0.844)以及使用Judd公式的接收者操作特征曲线下面积(AUC-Judd = 0.806 ± 0.076)。与Gemini Pro、Gemini Flash和Claude的跨模型比较显示,所有模型均超过AUC-Judd的随机基线0.5,并获得了正的NSS分数。根据四项指标中的三项,Gemini Pro表现出最强的空间定位能力,而GPT-4o在KL散度上产生了与人类注意力最接近的分布匹配。这些发现表明,大型视觉语言模型能够识别与人类在安全相关场景中视觉注意力大致对应的区域,而无需眼动训练数据。结果凸显了视觉语言模型作为近似人类注意力模式的可扩展工具的潜力。

英文摘要

Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments. Eye-tracking data were collected with Pupil Invisible wearable glasses from ten participants viewing 33 scene images representing environments with varying levels of potential risk. Gaze coordinates were mapped onto stimulus images to generate population-averaged human gaze heatmaps. In parallel, GPT-4o was prompted through the OpenAI Vision Application Programming Interface (API) to generate spatial predictions of visual attention, which were converted into saliency maps for comparison with human gaze patterns. Spatial alignment between human gaze heatmaps and model-generated saliency maps was evaluated using four complementary metrics: Pearson correlation (r = 0.515 ± 0.117), Normalised Scanpath Saliency (NSS = 0.988 ± 0.323), Kullback-Leibler divergence (KL = 1.766 ± 0.844), and Area Under the Receiver Operating Characteristic Curve using the Judd formulation (AUC-Judd = 0.806 ± 0.076). A cross-model comparison with Gemini Pro, Gemini Flash, and Claude showed that all models exceeded the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro demonstrated the strongest spatial localisation according to three of the four metrics, whereas GPT-4o produced the closest distributional match to human attention as measured by KL divergence. These findings suggest that large vision-language models can identify regions that broadly correspond to where humans direct visual attention in safety-relevant scenes without requiring eye-tracking training data. The results highlight the potential of vision-language models as a scalable tool for approximating human attentional patterns.
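
Three of the four metrics named in the abstract have short closed forms. The sketch below implements them on flattened maps using conventions common in saliency evaluation. Whether the study used exactly these conventions (for example the `eps` smoothing in the KL term, or a binary fixation list for NSS) is an assumption, and AUC-Judd is omitted for brevity. The maps are assumed to be non-constant.

```python
import math
from statistics import mean, pstdev

def pearson_cc(a, b):
    """Pearson correlation between two flattened maps."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm

def nss(saliency, fixations):
    """Mean z-scored saliency at fixated pixels (`fixations` holds flat indices)."""
    mu, sigma = mean(saliency), pstdev(saliency)
    return mean((saliency[i] - mu) / sigma for i in fixations)

def kl_divergence(human, model, eps=1e-12):
    """KL(human || model) after normalising both maps to sum to 1."""
    sh, sm = sum(human), sum(model)
    return sum((h / sh) * math.log(eps + (h / sh) / (m / sm + eps))
               for h, m in zip(human, model))
```

Higher is better for the correlation and NSS, lower is better for KL, and an NSS of zero corresponds to chance. This is why the abstract treats positive NSS as evidence of alignment.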

2606.15200 2026-06-16 cs.CV 新提交

Keep It in Mind: User Centric Continual Spatial Intelligence Reasoning in Egocentric Video Streams

铭记于心:面向用户中心的持续空间智能推理在自我中心视频流中的应用

Yun Wang, Junbin Xiao, Han Lyu, Yifan Wang, Jing Zuo, Zhanjie Zhang, Hong Huang, Dapeng Wu, Angela Yao

发表机构 * University of Science and Technology of China(中国科学技术大学)

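AI Summary Proposes the UCS-Bench dataset and the DirectMe framework, which incrementally builds a structured spatial memory to support dynamic spatial reasoning, long-term memory, and alignment with the user's real-time location in egocentric video streams, significantly improving the spatial reasoning of multimodal LLMs.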

Comments 45 pages. https://icml.cc/virtual/2026/poster/63682

Journal ref ICML 2026

详情
AI中文摘要

我们介绍了UCS-Bench,一个涵盖170多小时自我中心视觉观察的数据集,包含8.1K+带时间戳的问题,用于诊断自我中心视频流中用户中心的持续空间智能。UCS-Bench针对一个新问题,强调动态空间推理、长期记忆及其与用户实时位置的对齐。我们提出了DirectMe,一个从流式自我中心观察中增量构建和维护结构化空间记忆的框架。DirectMe能够稳健地跟踪和回忆物体位置,这些位置始终相对于用户随时间的移动来表示。通过将视觉感知与记忆更新和空间推理紧密耦合,我们的方法支持需要回忆交互、解决视角引起的歧义以及适应动态场景的长时查询。实验表明,DirectMe显著提升了领先多模态大语言模型的空间推理能力;它还超越了许多具备空间感知能力的模型和长视频流模型。我们希望我们的基准和解决方案能够推进自我中心AI助手的空间智能研究。数据和代码可在 https://github.com/cocowy1/UCS-Bench 获取。

英文摘要

We introduce UCS-Bench, a dataset spanning 170+ hours of egocentric visual observations with 8.1K+ timestamped questions for diagnosing User-Centric Continual Spatial intelligence in egocentric video streams. UCS-Bench targets a new problem that emphasizes dynamic spatial reasoning, long-term memory, and their alignment with users' real-time locations. We propose DirectMe, a framework that incrementally constructs and maintains a structured spatial memory from streaming egocentric observations. DirectMe enables robust tracking and recall of object locations, all relative to the user's movement over time. By tightly coupling visual perception with memory updates and spatial reasoning, our approach supports long-horizon queries that require recalling interactions, resolving viewpoint-induced ambiguities, and adapting to dynamic scenes. Our experiments show that DirectMe significantly improves the spatial reasoning of leading multimodal LLMs; it also surpasses many spatially aware and long-form streaming video models. We hope our benchmark and solution will advance spatial intelligence research for egocentric AI assistants. Data and code are available at https://github.com/cocowy1/UCS-Bench.

2606.15199 2026-06-16 cs.AI 新提交

CogGuard: Cognitive and Operational Profiling for Proactive Warning in Edge Intelligent Services

CogGuard:边缘智能服务中基于认知与操作画像的主动预警

Zhi Yao, Weihao Chen, Zhiqing Tang, Hanshuai Cui, Qianli Ma, Weijia Jia, Wei Zhao

发表机构 * Beijing Normal-Hong Kong Baptist University(北京师范大学-香港浸会大学) Guangdong Key Lab of AI and Multi-modal Data Processing(广东省人工智能与多模态数据处理重点实验室) Institute of Artificial Intelligence and Future Networks(人工智能与未来网络研究院) Engineering Center of AI and Future Education(人工智能与未来教育工程中心) Guangdong Provincial Department of Science and Technology(广东省科学技术厅) Zhuhai Science-Tech Innovation Bureau(珠海市科技创新局) Beijing Normal University at Zhuhai(北京师范大学珠海校区)

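AI Summary Proposes CogGuard, which decouples offline LLM-based profile construction from online SLM-based score prediction and combines prefix-aligned KV-cache reuse with length-aware distributed fine-tuning for proactive warning in edge intelligent services, cutting profile construction time by up to 48% and distributed fine-tuning time by 19% on education and operation datasets.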

Comments Accepted to ICWS 2026

详情
AI中文摘要

主动预警是边缘智能服务的一项重要能力,系统需在严格的延迟和隐私约束下预测主体能否成功完成即将到来的任务。这种预测依赖于从历史交互日志中提取的长期静态属性和短期动态状态。近期的大语言模型(LLM)为从这些日志构建结构化画像提供了强大的长上下文推理能力,但现有解决方案在边缘部署时面临两个挑战:(1)画像方法通常具有领域特异性,缺乏跨服务场景的可复用抽象;(2)在异构边缘集群上微调对齐模型时,由于输入序列长度的差异,同步开销较高。为应对这些挑战,我们提出了CogGuard,一个面向边缘智能服务的主动预警框架。CogGuard通过共享的静态-动态画像到评分流水线,将离线基于LLM的画像构建与在线基于小语言模型(SLM)的评分预测解耦,并在两个代表性场景中实例化:教育表现预警和操作任务结果预警。为高效构建画像,我们设计了场景特定的画像方法,并采用前缀对齐的KV缓存重用以减少重复编码开销。为进行边缘端模型对齐,我们提出了一种具有对比正则化的长度感知分布式微调策略,以缓解异构集群上的工作负载不平衡。在教育和操作数据集上的实验表明,CogGuard将画像构建时间最多减少48%,分布式微调时间减少19%,同时在100分量表预警任务上分别达到13.4和5.9的MAE。在最大的教育场景中,与最强基线相比,CogGuard将预测误差降低了15.4%。

英文摘要

Proactive warning is an important capability for edge intelligent services, where the system predicts whether a subject will successfully complete an incoming task under strict latency and privacy constraints. Such prediction depends on both long-term static attributes and short-term dynamic states derived from historical interaction logs. Recent Large Language Models (LLMs) offer strong long-context reasoning for constructing structured profiles from these logs, but existing solutions face two challenges for edge deployment: (1) profiling methods are typically domain-specific and lack a reusable abstraction across service scenarios, and (2) fine-tuning alignment models on heterogeneous edge clusters incurs high synchronization overhead due to the variance in input sequence lengths. To address these challenges, we propose CogGuard, a proactive-warning framework for edge intelligent services. CogGuard decouples offline LLM-based profile construction from online Small Language Model (SLM)-based score prediction through a shared static-dynamic profile-to-score pipeline, and instantiates it in two representative scenarios: educational performance warning and operational task outcome warning. For efficient profile construction, we design scenario-specific profiling methods with prefix-aligned KV-cache reuse to reduce repeated encoding overhead. For edge-side model alignment, we propose a length-aware distributed fine-tuning strategy with contrastive regularization to mitigate workload imbalance on heterogeneous clusters. Experiments on education and operation datasets show that CogGuard reduces profile construction time by up to 48% and distributed fine-tuning time by 19%, while achieving MAEs of 13.4 and 5.9, respectively, on 100-point-scale warning tasks. In the largest educational setting, CogGuard reduces prediction error by 15.4% compared with the strongest baseline.
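
The abstract attributes synchronization overhead to variance in input sequence lengths: workers that draw long sequences finish late and the rest wait. One generic way to reduce that imbalance is to assign samples to workers by length so that total token counts are roughly equal. The sketch below uses a greedy longest-first heuristic as a stand-in. It is not CogGuard's strategy: it ignores worker heterogeneity and the contrastive regularizer entirely, and its names are hypothetical.

```python
import heapq

def assign_by_length(seq_lengths, num_workers):
    """Greedy longest-first assignment; returns one list of sample indices per worker."""
    heap = [(0, w) for w in range(num_workers)]   # (total tokens so far, worker id)
    buckets = [[] for _ in range(num_workers)]
    order = sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i])
    for i in order:
        load, w = heapq.heappop(heap)             # least-loaded worker so far
        buckets[w].append(i)
        heapq.heappush(heap, (load + seq_lengths[i], w))
    return buckets
```

For heterogeneous clusters, a natural extension would be to divide each worker's load by an assumed throughput before comparing, so that faster devices receive proportionally more tokens.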

2606.15198 2026-06-16 cs.CV cs.HC 新提交

City landscape in sight: A crowdsourced framework for unlocking urban-scale window view perceptions from real estate imagery

城市景观在望:一种从房地产图像解锁城市尺度窗景感知的众包框架

Chucai Peng, Sijie Yang, Ang Liu, Yang Xiang, Zhixiang Zhou, Filip Biljecki

发表机构 * National University of Singapore(新加坡国立大学)

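AI Summary Proposes a method for large-scale perception mapping from real window view images (WVIs) listed on real estate platforms: a hybrid neural network predicts six perceptual dimensions and maps their spatial distribution, and the analysis finds that floor level strongly influences perception while view composition (e.g., the ratios of sky and trees) has non-linear effects.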

详情
AI中文摘要

通过住宅窗户看到的城市景观影响生活质量,然而城市尺度上实际窗景的感知仍研究不足。本研究提出一种大规模感知映射方法,使用从中国武汉房地产平台收集的12,334张真实住宅窗景图像(WVI),这是一种罕见探索的城市景观图像形式,相比以往研究中常见的渲染或模拟窗景具有优势。通过非沉浸式虚拟现实平台,我们基于499张WVI从304名参与者收集了27,477对六维感知(如生动性)的比较。训练了一个混合神经网络模型来预测所有众包WVI的人类感知并绘制其空间分布。结果显示,整个城市存在显著的空间自相关,具有明显的热点和冷点。楼层高度强烈影响人类感知:较高楼层提供更受欢迎和更广阔的窗景,而较低楼层为居民提供安静和生动的视野。推理模型进一步表明,窗景组成至关重要:高比例的天空、树木和低层建筑增强人们的偏好和生动性感知,而高层建筑的高比例增加单调和压抑感。重要的是,这些影响是非线性的:某些元素的过度存在会改变其对人类感知的影响。这项工作推进了城市尺度上居民视觉体验的理解,并为以人为本的城市规划和房地产优化窗户视觉景观提供了基于证据的指导。

英文摘要

City landscapes viewed through home windows influence quality of life, yet perceptions of actual window views at the urban scale remain understudied. This study presents an approach for large-scale mapping of perceptions using 12,334 window view images (WVIs) collected from actual residential properties listed on real estate platforms in Wuhan, China, representing a rarely explored form of urban view imagery that offers advantages over the rendered or simulated window views commonly examined in previous studies. Through a non-immersive virtual reality platform, we collected 27,477 pairwise comparisons across six perceptual dimensions (e.g., Vivid) from 304 participants based on 499 WVIs. A hybrid neural network model was trained to predict human perceptions of all crowdsourced WVIs and map their spatial distribution. Results reveal significant spatial autocorrelation with distinct hot and cold spots across the whole city. Floor level strongly influences human perceptions: while higher floors offer more preferred and extensive window views, lower-floor windows provide residents with quiet and vivid views. An inference model further shows that window view composition matters considerably: high ratios of sky, trees, and low-rise buildings enhance people's preferences and perceptions of vividness, whereas high ratios of high-rise buildings increase perceptions of monotony and oppression. Importantly, these effects are non-linear: the excessive presence of certain elements can alter their impact on human perception. This work advances urban-scale understanding of residents' visual experiences and provides evidence-based guidance for human-centric urban planning and real estate to optimise visual landscapes from windows.

2606.15191 2026-06-16 cs.CL New submission

AmchiBias: Measuring Stereotypical Bias in Goan Identity Groups with a Minimal Pair Dataset in English and Konkani

Michelle Barbosa, Sebastian Padó, Franziska Weeber

Affiliations Institute for Natural Language Processing, University of Stuttgart

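AI summary Introduces the AmchiBias benchmark, which evaluates multilingual encoders for stereotypical bias toward Goan identity groups using 313 minimal pairs; models perform near chance in Konkani, and English queries reflect pan-Indian bias rather than local cultural knowledge.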

Comments The 1st Workshop on Stereotypes Across Cultures in Language Technologies

Details
Abstract

Socio-cultural stereotypical bias is an important consideration in the development and deployment of NLP systems. It is however often considered only at the national level, despite rich subnational socio-cultural structures. We present AmchiBias, the first benchmark for measuring socio-cultural stereotypical bias for the Indian state of Goa with its unique historically multicultural setting. It covers various Goan identity groups and comprises 313 minimal pairs across eight sociodemographic dimensions in both English and Devanagari Konkani. We then evaluate stereotypical bias in five multilingual encoder models on this benchmark. We find near-chance scores in Konkani, reflecting language incompetence for general multilingual models and a lack of Goan cultural competence for Indian language models. Queried in English, models with a stronger Indian language coverage show higher bias for pan-Indian groups than hyperlocal Goan groups. This suggests the English signal reflects pan-Indian pretraining associations rather than genuine Goan cultural knowledge. Our findings highlight a critical gap in low-resource multilingual NLP evaluation for hyperlocal community identities.
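To make the "near-chance" reading concrete, here is a minimal sketch of how a minimal-pair bias score is often computed. It assumes a CrowS-Pairs-style metric: the fraction of pairs in which a model scores the stereotypical sentence above its counterpart, so that 0.5 means no systematic preference. The abstract does not specify AmchiBias's scoring function, so `stereotype_preference_rate` and the toy scores are hypothetical.

```python
from typing import Sequence, Tuple


def stereotype_preference_rate(pair_scores: Sequence[Tuple[float, float]]) -> float:
    """Fraction of minimal pairs where the stereotypical sentence scores higher.

    Each tuple is (score_stereotypical, score_anti_stereotypical), for example
    sentence-level pseudo-log-likelihoods from an encoder. Ties count as half.
    """
    if not pair_scores:
        raise ValueError("need at least one minimal pair")
    wins = 0.0
    for stereo, anti in pair_scores:
        if stereo > anti:
            wins += 1.0
        elif stereo == anti:
            wins += 0.5
    return wins / len(pair_scores)


# Toy scores (hypothetical): the model prefers the stereotype in 2 of 4 pairs.
toy_scores = [(-12.1, -12.9), (-8.4, -8.0), (-15.0, -15.6), (-9.7, -9.2)]
rate = stereotype_preference_rate(toy_scores)
print(f"{rate:.2f}")  # 0.50
```

Under this assumed metric, a rate close to 0.5 can mean either an unbiased model or one that cannot process the language at all, which is why the abstract interprets the Konkani scores as a sign of language incompetence.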

2606.15188 2026-06-16 cs.CV New submission

Adaptive Inference-Time Scaling via Early-Step Latent Verification for Image Editing

Yue Yu, Yang Jiao, Jiayu Wang, Qi Dai, Jingjing Chen

Affiliations College of Computer Science and Artificial Intelligence, Fudan University; Microsoft Research Asia; Institute of Trustworthy Embodied AI, Fudan University

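AI summary Proposes the VeriLatent framework, which verifies initial noise via early-step latent-space editing activation maps, enabling adaptive inference-time scaling and improving image-editing quality and efficiency.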

Details
Abstract

Instruction-based image editing has made notable progress with recent advances in generative models. However, the quality of the edited result is still influenced by the randomly sampled initial noise, particularly in complex editing scenarios. An unsuitable initial noise may lead to unsatisfactory editing results. Recent inference-time scaling methods address this issue by sampling multiple initial noises and selecting better candidates. Nevertheless, most of them follow a decode-then-verify scheme which introduces an efficiency-accuracy trade-off. When decoding is performed after limited inference steps, the decoded images often remain too noisy for reliable assessment, whereas sufficiently denoised images require much higher computational cost. To address this issue, we propose VeriLatent, a plug-and-play adaptive inference-time scaling framework with early-step latent verification for image editing. Specifically, we propose a novel verifier that scores each initial noise through a latent-space editing activation map at an early stage. It identifies promising candidates by assessing whether they can induce an effective edit in the correct region. This enables efficient early pruning without decoding latents into images. Building on this, we further develop an adaptive search strategy for inference-time scaling. It allocates inference budgets according to editing difficulty, thereby reducing the number of function evaluations (NFE). Extensive experiments on multiple benchmarks and different base models demonstrate that VeriLatent consistently improves both editing performance and inference-time scaling efficiency.
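The search pattern described above (score each candidate noise after a few early steps, prune without decoding, and spend more budget only on hard edits) can be sketched roughly as follows. This is not VeriLatent's algorithm: the verifier is a stand-in callable, and `adaptive_search`, the acceptance threshold, and the NFE accounting are assumptions made purely for illustration.

```python
import random
from typing import Callable, Optional, Tuple


def adaptive_search(
    score_early: Callable[[int], float],  # hypothetical early-step latent verifier
    max_candidates: int,
    accept_threshold: float,
    early_steps: int,
    full_steps: int,
    seed: int = 0,
) -> Tuple[Optional[int], float, int]:
    """Return (best_noise_seed, best_score, nfe) for one edit request."""
    rng = random.Random(seed)
    best_seed: Optional[int] = None
    best_score = float("-inf")
    nfe = 0
    for _ in range(max_candidates):
        candidate = rng.randrange(2**31)  # stands in for sampling an initial noise
        nfe += early_steps                # cost of the early denoising steps
        score = score_early(candidate)    # scored in latent space, no decoding
        if score > best_score:
            best_seed, best_score = candidate, score
        if best_score >= accept_threshold:
            break                         # easy edit: stop sampling early
    nfe += full_steps - early_steps       # finish denoising only the winner
    return best_seed, best_score, nfe


def toy_verifier(noise_seed: int) -> float:
    """Hypothetical verifier: maps a seed to a score in [0, 1)."""
    return (noise_seed % 1000) / 1000.0


easy = adaptive_search(toy_verifier, 8, 0.0, early_steps=4, full_steps=28)
hard = adaptive_search(toy_verifier, 8, 2.0, early_steps=4, full_steps=28)
print(easy[2], hard[2])  # 28 56
```

In this toy accounting, an edit accepted on the first candidate costs one full run (28 NFE), while an edit that exhausts all eight candidates costs 56 NFE, compared with 224 NFE for fully denoising every candidate before verifying.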

2606.15186 2026-06-16 cs.SD cs.AI eess.AS New submission

FreeSonic: Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing

Yuxuan Jiang, Mingyang Han, Yusheng Dai, Andong Wang, Tianhong Zhou, Jiaxin Ye, Dongxiao Wang, Haoxiang Shi, Boyu Li, Jun Song, Cheng Yu, Bo Zheng, Weibei Dou, Zehua Chen, Jun Zhu

Affiliations Tsinghua University; Alibaba Group; Monash University; Renmin University of China; Fudan University

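AI summary Proposes FreeSonic, a training-free framework built on the Rectified Flow-based TangoFlux model that achieves precise and consistent audio editing while preserving background fidelity, using an optimized inversion-reverse process, joint text-audio attention maps, and scheduled attention decoupling.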

Comments Accepted at Interspeech 2026

Details
Abstract

Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge. However, existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free-sonic.github.io/