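arXivDaily daily academic digest: synchronized with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.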

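2606.15275 2026-06-16 cs.CV New submission

MamBOA: State-Space Architecture for Video Recognition

Mustafa Bora Çelik

Affiliations: Ankara Medipol University

AI summary: Proposes the MamBOA framework, which uses an interleaved scan structure to turn the selective state-space recurrence (S6) into a motion synthesizer that encodes motion from consecutive features extracted by a backbone, giving efficient temporal modeling for fine-grained action recognition.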

Comments 15 pages, 7 figures. Code available at https://github.com/BOA-clk/MamBOA

English abstract

Fine-grained action recognition demands temporal reasoning that general-purpose architectures address through different cost-accuracy tradeoffs: 3D dense operators couple computation to the input volume, while difference-based methods approximate motion through rigid, hand-crafted subtraction of uncontextualized features - each reflecting a deliberate design choice with corresponding limitations in expressiveness or flexibility. We present MamBOA, a backbone-agnostic temporal framework built upon a novel interleaved scan structure that recasts the selective state-space recurrence (S6) as a native motion synthesizer. By interleaving consecutive feature representations extracted from a pretrained backbone into a single alternating sequence, the proposed scan structurally drives the recurrence to encode both temporal observations of each position within a shared hidden state, separated by only a single decay step - rendering the inter-frame transition an intrinsic component of the state dynamics rather than an externally computed quantity. A cascade of dedicated alignment and decoding operations then distills this joint encoding into an explicit motion representation, which a dual-path pooling mechanism adaptively aggregates by balancing attention-driven selection with uniform temporal coverage. The framework interfaces seamlessly with CNN, Transformer, and Mamba backbone families, adding only ~2.1 GFLOPs per feature pair. On Diving48, MamBOA achieves 85.02% Top-1 accuracy with an image-pretrained backbone and 86.24% with a video-pretrained backbone processing the entire video in a single forward pass - demonstrating that structurally induced state-space dynamics constitute a principled and general foundation for motion modeling.
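The interleaved scan described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: the function names `interleave_pair` and `linear_scan`, the array shapes, and the fixed scalar decay are assumptions made for illustration, and the recurrence is a plain linear scan rather than the input-dependent S6 recurrence. It only shows how alternating the features of two consecutive frames places both observations of a position one decay step apart in a shared state.

```python
import numpy as np


def interleave_pair(feat_t, feat_next):
    """Interleave two (N, D) feature maps into one (2N, D) alternating sequence."""
    n, d = feat_t.shape
    seq = np.empty((2 * n, d), dtype=feat_t.dtype)
    seq[0::2] = feat_t      # even slots: position i at frame t
    seq[1::2] = feat_next   # odd slots: the same position i at frame t+1
    return seq


def linear_scan(seq, decay=0.9):
    """Toy recurrence h_k = decay * h_{k-1} + x_k (fixed decay, NOT selective S6)."""
    h = np.zeros(seq.shape[1], dtype=seq.dtype)
    states = np.empty_like(seq)
    for k, x in enumerate(seq):
        h = decay * h + x
        states[k] = h
    return states


# Hypothetical data: 3 spatial positions, 2 channels per position.
feat_t = np.arange(6, dtype=float).reshape(3, 2)
feat_next = feat_t + 10.0
states = linear_scan(interleave_pair(feat_t, feat_next), decay=0.5)
```

In this sketch the state at each odd index holds `feat_next[i] + decay * feat_t[i]` plus a decayed remainder from earlier positions, which is one way to read the abstract's claim that the inter-frame transition becomes part of the state dynamics.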

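2606.15273 2026-06-16 cs.AI New submission

Feature Attribution in Directed Acyclic Graphs Using Edge Intervention

Qiheng Sun, Junxu Liu, Xiaokai Mao, Haocheng Xia, Jinfei Liu, Kui Ren, Haibo Hu

Affiliations: Zhejiang University; Zhejiang Lab; Hong Kong Polytechnic University

AI summary: Existing feature attribution methods cannot capture feature externalities and exogenous influences at the same time. The paper proposes DAG-SHAP, an edge-intervention-based method that treats each feature edge as an attribution target, introduces an approximate computation method, and validates its effectiveness experimentally.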

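2606.15268 2026-06-16 cs.LG New submission

When to use what Schatten-$p$ norm in deep learning?

Thomas Pethick

Affiliations: Pethick et al. [2026]

AI summary: Through theoretical analysis, the paper resolves contradictory observations about the effectiveness of Schatten-∞ optimizers and finds that the conclusion depends on data dimensionality: in low-dimensional regimes (including Chinchilla scaling), geometries with smaller Schatten-p are preferable. Building on the SODA framework, it also gives new noise-robust acceleration results for p>2.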

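2606.15266 2026-06-16 cs.CL New submission

Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

Yuchen Song, Xi Chen, Mingze Li, Satoshi Nakamura

Affiliations: The Chinese University of Hong Kong, Shenzhen, China; Shenzhen Loop Area Institute, China

AI summary: To address the insufficient cross-lingual transfer of lexical stress in English-to-Chinese speech-to-speech translation, the authors build a stress-annotated dataset and a Mandarin stress detector, propose a cross-lingual stress evaluation metric, and fine-tune CosyVoice3 into a stress-aware S2ST system. Experiments show that it significantly outperforms existing systems at stress translation.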

Comments Accepted to Interspeech 2026

English abstract

Speech-to-speech translation (S2ST) systems have achieved impressive progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains heavily underexplored, compounded by a lack of reliable automatic evaluation metrics for tonal languages like Chinese. We investigate English-to-Chinese S2ST stress transfer by constructing a stress-annotated Chinese dataset and an XLS-R-based Mandarin stress detector. Integrating this with the English EmphAssess system, we propose a novel objective metric for cross-lingual stress evaluation. Furthermore, we fine-tune CosyVoice3 to build a stress-aware S2ST system. Experiments demonstrate that our proposed S2ST architecture significantly outperforms existing systems in stress translation capability while maintaining competitive translation quality. Furthermore, our evaluation metric exhibits a strong correlation with human subjective judgments.

2606.15260 2026-06-16 cs.LG cs.AI New submission

Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

Huy Le, Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann

Affiliations: University of Freiburg; Max Planck Institute for Intelligent Systems

AI summary: Proposes TruDi, which uses trust-region optimization to constrain the KL divergence over the diffusion trajectory, enabling stable training in massively parallel on-policy reinforcement learning. It outperforms or matches baselines across 73 tasks.

English abstract

Reinforcement learning with massively parallel simulations has become a standard framework for developing robust, deployable policies; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enables diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.

2606.15258 2026-06-16 cs.AI New submission

Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Jierui Zhang, Siyuan Tan, Xinhang Li, Longzhuangzhi Lin, Dailin Li, Chengfeng Gu, Xinping Li, Yaxian Hao, Shengjia Liang, Yuxiang Ren, Wenhao Liu

Affiliations: School of Computer Science, Beijing University of Posts and Telecommunications; Graduate College for Engineers, Beijing University of Posts and Telecommunications; School of Mathematical Sciences, Fudan University; School of Cyberspace Security, Beijing University of Posts and Telecommunications; School of Computer Science and Technology, Dalian University of Technology; Chu Kochen Honors College, Zhejiang University; Department of Psychological and Cognitive Sciences, Tsinghua University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; School of Intelligence Science and Technology, Nanjing University

AI summary: Proposes the Mask-Proof pipeline, which converts real proofs into automatically checkable masked-step tasks and evaluates model reasoning with an LLM-based equivalence judge. The resulting benchmark contains 292 problems, and reasoning-enhanced models outperform standard models by 12% to 27%.

English abstract

Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available at https://github.com/weating/Mask-Proof.
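The "repeated votes for stability" step can be sketched as a simple majority vote over several calls to a judge. This is only an illustration under stated assumptions: `judge` is a hypothetical callable standing in for the LLM-based equivalence judge, the default of five votes and the tie-breaking rule (ties count as not equivalent) are choices made here, not details taken from the paper.

```python
from collections import Counter


def majority_verdict(votes):
    """Return True only if strictly more votes say 'equivalent' than not."""
    counts = Counter(bool(v) for v in votes)
    return counts[True] > counts[False]


def judge_with_votes(judge, reference, candidate, n_votes=5):
    """Query a (possibly noisy) judge n_votes times and aggregate by majority.

    `judge` is a hypothetical callable judge(reference, candidate) -> bool.
    """
    votes = [judge(reference, candidate) for _ in range(n_votes)]
    return majority_verdict(votes)


# Hypothetical noisy judge that answers from a fixed script of verdicts.
script = iter([True, False, True, True, False])
noisy_judge = lambda ref, cand: next(script)
verdict = judge_with_votes(noisy_judge, "x^2 - 1", "(x - 1)(x + 1)")
```

An odd number of votes avoids ties; with an even number, this sketch deliberately falls back to the conservative "not equivalent" verdict.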

2606.15257 2026-06-16 cs.LG New submission

AI for Social Good: An Investigation of the Causal Relationship Between Environmental Regulations and Their Effects on Air Pollution in London, UK

Yang Han, Jacqueline CK Lam, Victor OK Li, Yiu-Wai Man

Affiliations: Department of Electrical and Electronic Engineering, The University of Hong Kong

AI summary: Proposes an uncertainty-aware Bayesian deep learning framework to estimate the causal effect of London's air pollution regulations on PM2.5 from 2010 to 2020, finding that the regulations reduced PM2.5 by 1.88 μg/m³ (12.35%) on average.

English abstract

Air pollution regulation is central to urban public health governance, but estimating its effects is difficult because policies are implemented non-randomly and pollution trajectories are shaped by meteorology, socioeconomic change, temporal trends, and overlapping interventions. This study develops an uncertainty-aware Bayesian deep learning framework to estimate the aggregate effect of air pollution regulations on PM$_{2.5}$ concentrations in London from 2010 to 2020. The framework integrates daily PM$_{2.5}$ observations from Inner London monitoring stations, meteorological covariates, annual socioeconomic indicators, month-of-year and day-of-week indicators, and daily regulation status data for 32 policy measures. A Bayesian LSTM captures temporal dependencies in environmental and socioeconomic covariates, Bayesian embedding layers represent temporal and regulation status inputs, and a regulation status prediction branch supports propensity score-based adjustment for non-random policy implementation. Regulatory effects are estimated by comparing observed PM$_{2.5}$ concentrations with counterfactual predictions under a hypothetical no-regulation scenario, with uncertainty summarized across repeated Bayesian training runs and bootstrap resampling. Results show that London's regulations were associated with an average PM$_{2.5}$ reduction of 1.88 $\mu$g/m$^3$, a relative reduction of 12.35%, with a 95% confidence interval of 1.64-2.12 $\mu$g/m$^3$. Estimated effects were limited before 2013, became clearer from 2013 to 2017, and were strongest in 2018 and 2019. The findings suggest that sustained and cumulative regulatory interventions contributed to measurable improvements in London's air quality. This study demonstrates how uncertainty-aware causal AI can support environmental accountability, public health protection, and evidence-based governance for environmental decision-making.

2606.15255 2026-06-16 cs.RO New submission

OSDAG: Online Scheduling for Efficient Multi-Robot Collaboration

Thanh Nguyen Canh, Thang Tran Viet, Phuc Van Dinh, Xiem HoangVan, Nak Young Chong

Affiliations: Japan Advanced Institute of Science and Technology; University of Engineering and Technology, Vietnam National University; Hanyang University

AI summary: Proposes the OSDAG framework, which combines LLM-based task reasoning with DAG-based online scheduling: an instruction is decomposed once into a dependency graph, and ready tasks are assigned to agents in real time. Reasoning is 5-15x faster than with dialogue-based methods, and makespan is reduced by up to 38% over sequential baselines.

English abstract

Coordinating heterogeneous multi-robot systems (MRS) for complex, long-horizon tasks requires both flexible high-level reasoning and efficient low-level scheduling. Existing LLM-based approaches address the reasoning side but introduce two critical bottlenecks: (1) repeated LLM inference during execution, which inflates latency with agent count, and (2) offline, pre-committed scheduling, which forces robots to idle while waiting for sequentially ordered predecessors even when independent work is available. This paper presents OSDAG, a novel framework that integrates LLM-based task reasoning with Directed Acyclic Graph (DAG) representation and constraint-aware online scheduling. The LLM is invoked once to decompose a natural-language instruction into a dependency-annotated task graph, and a lightweight online scheduler then allocates ready tasks to idle agents in real time. The DAG representation encodes both precedence and resource constraints, ensuring correctness while exposing all available parallelism. Experiments across five benchmark scenarios demonstrate that OSDAG achieves 5-15x faster reasoning time compared to dialogue-based methods, reduces makespan by up to 38% over sequential baselines, and maintains competitive success rates. Both simulation and real-world experiments on dual-arm manipulation tasks validate the effectiveness and practicality of the proposed approach for efficient multi-robot coordination. The website and resources are available at http://thanhnguyencanh.github.io/LLM_DAG4MultiRobot
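The idea of handing ready tasks to idle agents as soon as their predecessors finish can be sketched as a small event-driven simulation. This is a generic greedy list scheduler written for illustration, not the OSDAG scheduler itself: the function name `online_schedule`, the task names, and the durations are all hypothetical, and resource constraints are omitted so that only precedence is modeled.

```python
import heapq


def online_schedule(durations, deps, n_agents):
    """Greedy online scheduling over a task DAG; returns (makespan, start_times)."""
    remaining = {t: set(deps.get(t, ())) for t in durations}
    ready = sorted(t for t, d in remaining.items() if not d)
    running = []                      # min-heap of (finish_time, task)
    start, done = {}, set()
    now, idle = 0.0, n_agents
    while len(done) < len(durations):
        # Hand every ready task to an idle agent, if one is available.
        while ready and idle > 0:
            task = ready.pop(0)
            start[task] = now
            heapq.heappush(running, (now + durations[task], task))
            idle -= 1
        if not running:
            raise ValueError("no runnable task: cycle, unknown predecessor, or no agents")
        # Advance the clock to the next task completion.
        now, finished = heapq.heappop(running)
        idle += 1
        done.add(finished)
        # Release tasks whose predecessors have all completed.
        ready.extend(sorted(
            t for t, d in remaining.items()
            if t not in start and t not in ready and d <= done
        ))
    return now, start


# Hypothetical dual-arm job: two independent picks, then assemble, then inspect.
durations = {"pick_a": 2, "pick_b": 2, "assemble": 3, "inspect": 1}
deps = {"assemble": ["pick_a", "pick_b"], "inspect": ["assemble"]}
makespan_two_agents, starts = online_schedule(durations, deps, n_agents=2)
makespan_one_agent, _ = online_schedule(durations, deps, n_agents=1)
```

In this toy job the two picks run in parallel with two agents, so the makespan is shorter than the sequential sum of durations obtained with a single agent. The numbers are illustrative only and are unrelated to the results reported in the abstract.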

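2606.15253 2026-06-16 cs.CV New submission

Focus, Align, and Sustain: Counteracting Gradient Dilution in Incremental Object Detection

Aoting Zhang, Dongbao Yang, Chang Liu, Xiaopeng Hong, Yu Zhou

Affiliations: University of Science and Technology of China

AI summary: Proposes the FAS framework, which uses prior-injected queries to focus discriminative signals, deterministic anchor distillation to align assignments, and manifold-support replay to sustain old-class distributions, addressing the performance degradation caused by gradient dilution in incremental object detection.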

Comments Accepted by ICML 2026

English abstract

Adapting Detection Transformers to Incremental Object Detection (IOD) poses a systemic challenge, as set-based optimization is inherently destabilized by sequential learning. In this work, we identify Gradient Dilution as the root cause of performance degradation, wherein optimization signals required to preserve old knowledge are progressively weakened. This phenomenon manifests as a cascading erosion of preservation gradients in magnitude, direction, and support coverage, driven by three tightly coupled factors: Signal Dispersion, where foreground gradients are overwhelmed by background noise; Assignment Drift, where stochastic query-target matching induces inconsistent gradient trajectories; and Support Attrition, where gradients from retained samples insufficiently cover the old-class feature space, weakening decision boundaries under interference from new classes. To counteract this, we propose FAS, a unified framework that Focuses, Aligns, and Sustains gradient flow throughout incremental learning. Specifically, we introduce prior-injected queries to focus discriminative signals by filtering background interference at the source. We further propose deterministic anchor distillation to align query-target assignments and enforce semantic consistency across stages under unstable matching. Finally, we devise manifold-support replay to sustain distributional support of old classes, counteracting representational erosion induced by continual updates. Extensive experiments show that FAS restores robust optimization dynamics and outperforms state-of-the-art methods, achieving over 5.0 AP improvement in the challenging 40+10x4 incremental setting.

2606.15251 2026-06-16 cs.RO cs.AI cs.LG New submission

Driving, Fast or Slow? Neuro-Symbolic Guidance for Motion Prediction in Multi-Modal Ground Mobility

Simon Kohaut, Felix Divo, Julius Hahnewald, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

Affiliations: Artificial Intelligence and Machine Learning Lab, TU Darmstadt; Honda Research Institute; Hessian Center for AI (hessian.AI); Centre for Cognitive Science; German Center for AI (DFKI); Uncertainty in Artificial Intelligence Lab, TU Eindhoven

AI summary: Proposes the TraCS framework, a neuro-symbolic approach that encodes traffic regulations as probabilistic first-order logic to improve the interpretability and compliance of black-box motion prediction models, and consistently improves state-of-the-art backbones on Argoverse 2.

English abstract

Accurate and interpretable motion prediction for heterogeneous traffic spaces, including pedestrians, bicycles, cars, and trucks, is essential for safe autonomous navigation. Nevertheless, state-of-the-art approaches remain predominantly black-box, lacking explicit encoding of the regulatory and behavioral constraints of real-world mobility. We propose Trajectory Compliance-Shaping (TraCS), a neuro-symbolic framework that augments existing black-box motion prediction backbones with interpretable and probabilistic first-order logic. To do so, TraCS employs an agentic code-generation pipeline to bridge the gap between natural-language descriptions of traffic regulations and probabilistic motion prediction. Furthermore, TraCS employs a reactive data-streaming inference engine that maintains and efficiently updates compliance landscapes as scenes evolve. To prevent TraCS from overconfidently steering the backbone's predictions in the wrong direction, we propose a neural confidence rating learned as a context-aware attenuation of the compliance signal. We demonstrate on the Argoverse 2 benchmark how TraCS consistently improves state-of-the-art prediction backbones, showing that probabilistic and symbolic compliance reasoning is a broadly applicable and computationally efficient complement to purely neural motion predictors.

2606.15250 2026-06-16 cs.CV cs.AI New submission

Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs

Zhisen Hu, Antti Kemppainen, David Johnson, Egor Panfilov, Huy Hoang Nguyen, Timothy Cootes, Claudia Lindner, Aleksei Tiulpin

Affiliations: Division of Informatics, Imaging and Data Sciences, The University of Manchester; Research Unit of Health Sciences and Technology, University of Oulu; Medical Research Center Oulu, University of Oulu and Oulu University Hospital; Department of Trauma and Orthopaedics, Stockport NHS Foundation Trust, Stepping Hill Hospital; School of Health and Society, University of Salford; School of Biological Sciences, The University of Manchester; Weill Cornell Medicine, Cornell University

AI summary: Proposes an Implicit Neural Shape Function (INSF) approach that needs no explicit landmarks: anatomical shape is encoded into a latent space and clinical alignment measurements are regressed directly, automating lower-limb alignment assessment with performance comparable to existing methods and easy extensibility.

Comments Accepted to MICCAI 2026

English abstract

Radiographic assessment of lower-limb alignment (LLA) is important for predicting joint health and surgical outcomes in total knee arthroplasty. Traditional measurement methods are manual and time-consuming, while recent machine learning approaches typically rely on locating a fixed set of anatomical landmarks. This dependence limits flexibility and may require re-annotation when clinical definitions change. To address this, we propose an automated workflow using Implicit Neural Shape Functions (INSF). Rather than relying on explicit landmark coordinates, we encode the anatomy into a compact latent space and regress clinical alignment measurements directly from these latent codes. This architecture allows for rapid extendability to new tasks without altering the backbone representation. We trained our method on an internal dataset of 566 knee radiographs, each annotated with the outline of the femur and tibia. We evaluated it on both an internal test dataset of 50 patients and a separate external set of 402 preoperative cases from the MRKR dataset. Manual clinical measurements are available for these data, and the MRKR measurements will be made publicly accessible. Performance was comparable to state-of-the-art landmark-based methods and manual agreement, while offering a flexible shape representation that can be extended to additional measurement tasks.

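2606.15247 2026-06-16 cs.LG cs.AI New submission

Rethinking Implicit Spatial Representations in Visuomotor Policy Learning

Xiangyu Chen, Yuxuan Hu, Chuhao Zhou, Jianfei Yang

Affiliations: MARS Lab, Nanyang Technological University

AI summary: Re-evaluates the effectiveness of spatial softmax pooling in robotic manipulation, finds that it yields compact and stable spatial representations but is limited by a representation bottleneck, and proposes the PRISM encoder, which improves performance by fusing multiscale implicit spatial information.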

English abstract

Generative model-based imitation learning has become a widely adopted paradigm for robotic manipulation, where policy performance depends critically on the conditioned visual representations. Although spatial softmax-based representations have been adopted in prior visuomotor policies, their effectiveness and underlying mechanisms remain insufficiently understood. This work rethinks the use of spatial softmax pooling: do such implicit spatial representations provide effective and stable visual features for robotic manipulation? Through systematic studies of different pooling methods in visual encoders, we find that this pooling operation produces compact and stable spatial representations, which outperform feature-value representations, despite using substantially fewer dimensions. Complementary saliency analysis further suggests that these spatial representations guide the encoder to focus more consistently on task-relevant regions. However, this advantage is limited by a representation bottleneck in current visual encoders: repeated downsampling operations weaken fine-grained spatial information before the action-generation module can use it, especially under low-resolution observations. Motivated by these findings, we propose PRISM, a visual encoder that preserves multiscale implicit spatial information through top-down cross-attention fusion. Experiments across multiple tasks and policy backbones show consistent improvements. In particular, on the low-resolution, high-precision ToolHang task, PRISM shows clear gains, improving the average success rate from 5.0% to 13.4% while increasing parameters by only 15.4%. These results support the use of multiscale implicit spatial representations as an effective and efficient design principle for robotic manipulation.
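For readers unfamiliar with spatial softmax pooling, the standard operation can be sketched in a few lines of NumPy. This is the generic textbook form, not the paper's encoder or PRISM: the function name `spatial_softmax`, the `temperature` parameter, and the choice of a [-1, 1] coordinate range are assumptions made for this example. It shows why the representation is compact: each (H, W) feature map collapses to a single expected (x, y) location.

```python
import numpy as np


def spatial_softmax(features, temperature=1.0):
    """Map (C, H, W) feature maps to (C, 2) expected (x, y) coordinates in [-1, 1]."""
    c, h, w = features.shape
    logits = features.reshape(c, h * w) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)             # softmax over locations
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    expected_x = probs @ xs.ravel()
    expected_y = probs @ ys.ravel()
    return np.stack([expected_x, expected_y], axis=1)


# Hypothetical input: channel 0 peaks in the top-right corner, channel 1 is flat.
features = np.zeros((2, 5, 5))
features[0, 0, 4] = 50.0
keypoints = spatial_softmax(features)
```

A (C, H, W) tensor becomes only 2C numbers, whereas flattening the feature values would keep C·H·W of them. A flat channel yields the image center, and a sharply peaked channel yields roughly the peak location.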

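2606.15231 2026-06-16 cs.AI New submission

Attribute Inference from Interactive Targeted Advertising

Peihao Li

Affiliations: University of Science and Technology of China

AI summary: Models the noisy channel for user attribute inference in interactive targeted advertising, evaluates Bayesian, supervised, positive-unlabeled, and adaptive attacks on a synthetic benchmark, and finds that disclosure policy is the most effective control.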

English abstract

Targeted advertising systems can pair audiences selected by advertisers with ad units that expose visible user actions. When an interaction remains linked to the campaign that elicited it, the advertiser may receive an observation tied to a user rather than only an aggregate report. We model that channel as a noisy oracle for attribute inference. The model separates targeting predicates, exposure, interaction, and disclosure. These boundaries capture the gap between eligibility and delivery, and the gap between interaction and advertiser visibility. We build a reproducible benchmark using synthetic populations calibrated with public data, each with known sensitive labels. A generated campaign semantics layer provides topic variants and response priors. The simulator generates the ground truth, event traces, disclosed observations, and metrics. The evaluation compares Bayesian, supervised, positive and unlabeled, and adaptive attacks under common campaign and disclosure definitions. The final evaluation uses four topic variants, seven simulator seeds, and two interaction settings. Repeated campaigns with identity exposure produce measurable but bounded inference signal. At $160$ campaigns, Bayesian and supervised attacks reach about $0.64$ AUC in the main setting and about $0.65$ AUC in the higher interaction setting. Disclosure policy is the strongest control. Aggregate reporting removes the evaluated oracle input tied to users. Type filtering and randomized disclosure reduce the released signal. The result is a model, artifact, and defense evaluation method for privacy in interactive targeted advertising. The code is available at https://github.com/P-HOW/Interactive-Ad-Oracle.
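The "noisy oracle" view can be made concrete with a small Bayesian update. The sketch below is an illustration under simplifying assumptions, not the attack evaluated in the paper: it assumes a binary sensitive attribute, campaigns whose observations are conditionally independent given that attribute, and known interaction probabilities. The function name and all probabilities are hypothetical.

```python
import math


def posterior_after_campaigns(prior, observations):
    """Posterior P(attribute = 1) after a series of disclosed campaign observations.

    observations: list of (interacted, p_if_positive, p_if_negative), where the
    last two entries are assumed interaction probabilities for users with and
    without the attribute.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for interacted, p_pos, p_neg in observations:
        if interacted:
            log_odds += math.log(p_pos / p_neg)
        else:
            log_odds += math.log((1.0 - p_pos) / (1.0 - p_neg))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical campaign: users with the attribute click 20% of the time, others 10%.
after_click = posterior_after_campaigns(0.5, [(True, 0.2, 0.1)])
after_no_click = posterior_after_campaigns(0.5, [(False, 0.2, 0.1)])
```

Each disclosed observation shifts the log-odds by a bounded amount, and without any disclosed user-level observation the posterior stays at the prior. That is consistent with the abstract's point that aggregate reporting removes the oracle input tied to users.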

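2606.15207 2026-06-16 cs.LG cs.AI cs.NE New submission

Controlled Dynamics Attractor Transformer

Cheng Zhang, Minnan Luo, Zesheng Yang, Ming Li, Yong-Jin Liu, Qinghua Zheng

Affiliations: Xi'an Jiaotong University; Tsinghua University

AI summary: Proposes the Controlled Dynamics Attractor Transformer (CDAT), which couples a mixture von Mises-Fisher attention energy with a Hopfield refinement energy and adds CANN-inspired excitation-inhibition modulation to form a topology-constrained dynamical system, achieving state-of-the-art performance on graph anomaly detection and graph classification.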

Comments 20 pages, 3 figures

Journal ref Forty-Third International Conference on Machine Learning (ICML 2026)

English abstract

Transformer architectures have dramatically advanced representation learning and inference in deep models through self-attention mechanisms. In parallel, associative memory (AM) frameworks map representations onto energy landscapes, offering interpretable retrieval mechanisms. However, their continuous-time inference dynamics lack the biological plausibility of classical Continuous Attractor Neural Networks (CANNs). To bridge this gap, we propose Controlled Dynamics Attractor Transformer (CDAT), which couples a mixture von Mises-Fisher (Mo-vMF) attention energy with a Hopfield refinement energy, while augmenting energy descent with a CANN-inspired excitation-inhibition modulation. CDAT instantiates a topology-constrained dynamical system whose couplings encode relational structure among tokens, thereby linking attractor-style dynamics to modern energy-based attention. We further provide a constructive dissipation analysis to formally establish their controlled inference dynamics. Benefiting from these robust and structured dynamics, CDAT achieves state-of-the-art performance across multiple benchmarks in graph anomaly detection and graph classification.

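2606.15202 2026-06-16 cs.CV New submission

Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

Marta Vallejo, Siwen Wang

Affiliations: Heriot-Watt University

AI summary: Using an eye-tracking experiment and vision-language models such as GPT-4o, the study compares human and model attention distributions in safety-relevant scenes and finds that the models can approximate human gaze patterns without eye-tracking training data.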

Comments 30 pages, 33 figures. Submitted as a preprint. Code and data available upon reasonable request

English abstract

Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments. Eye-tracking data were collected from ten participants viewing 33 scene images representing environments with varying levels of potential risk using Pupil Invisible wearable glasses. Gaze coordinates were mapped onto stimulus images to generate population-averaged human gaze heatmaps. In parallel, GPT-4o was prompted through the OpenAI Vision Application Programming Interface (API) to generate spatial predictions of visual attention, which were converted into saliency maps for comparison with human gaze patterns. Spatial alignment between human gaze heatmaps and model-generated saliency maps was evaluated using four complementary metrics: Pearson correlation (r = 0.515 ± 0.117), Normalised Scanpath Saliency (NSS = 0.988 ± 0.323), Kullback-Leibler divergence (KL = 1.766 ± 0.844), and Area Under the Receiver Operating Characteristic Curve using the Judd formulation (AUC-Judd = 0.806 ± 0.076). A cross-model comparison with Gemini Pro, Gemini Flash, and Claude showed that all models exceeded the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro demonstrated the strongest spatial localisation according to three of the four metrics, whereas GPT-4o produced the closest distributional match to human attention as measured by KL divergence. These findings suggest that large vision-language models can identify regions that broadly correspond to where humans direct visual attention in safety-relevant scenes without requiring eye-tracking training data. The results highlight the potential of vision-language models as a scalable tool for approximating human attentional patterns.
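The four alignment metrics named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' evaluation code: the function names, the `EPS` constant, and the use of a boolean fixation mask for NSS and AUC-Judd are assumptions made for illustration, following commonly used saliency-benchmark definitions.

```python
import numpy as np

EPS = 1e-12  # assumed small constant guarding against division by zero and log(0)


def pearson_cc(sal, gaze):
    """Pearson correlation between a saliency map and a gaze heatmap of equal shape."""
    s = (sal - sal.mean()) / (sal.std() + EPS)
    g = (gaze - gaze.mean()) / (gaze.std() + EPS)
    return float((s * g).mean())


def nss(sal, fixations):
    """Mean z-scored saliency at fixated pixels (fixations is a boolean mask)."""
    z = (sal - sal.mean()) / (sal.std() + EPS)
    return float(z[fixations].mean())


def kl_divergence(sal, gaze):
    """KL(gaze || sal) after normalising both maps to sum to 1; lower is better."""
    p = gaze / (gaze.sum() + EPS)
    q = sal / (sal.sum() + EPS)
    return float(np.sum(p * np.log(EPS + p / (q + EPS))))


def auc_judd(sal, fixations):
    """ROC area with thresholds placed at the saliency values of fixated pixels."""
    s = sal.ravel()
    f = fixations.ravel().astype(bool)
    n_fix, n_pix = int(f.sum()), s.size
    thresholds = np.sort(s[f])[::-1]  # highest fixated saliency first
    tp, fp = [0.0], [0.0]
    for k, t in enumerate(thresholds, start=1):
        above = int(np.sum(s >= t))            # all pixels at or above the threshold
        tp.append(k / n_fix)                   # fraction of fixations recovered
        fp.append((above - k) / (n_pix - n_fix))  # fraction of non-fixated pixels flagged
    tp.append(1.0)
    fp.append(1.0)
    tp, fp = np.array(tp), np.array(fp)
    # Trapezoidal integration of the ROC curve.
    return float(np.sum(np.diff(fp) * (tp[:-1] + tp[1:]) / 2.0))
```

Under these definitions, an AUC-Judd of 0.5 corresponds to chance and an NSS above 0 means that fixated pixels receive above-average saliency, which is how the cross-model comparison in the abstract is read.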

2606.15200 2026-06-16 cs.CV New submission

Keep It in Mind: User Centric Continual Spatial Intelligence Reasoning in Egocentric Video Streams

Yun Wang, Junbin Xiao, Han Lyu, Yifan Wang, Jing Zuo, Zhanjie Zhang, Hong Huang, Dapeng Wu, Angela Yao

Affiliation University of Science and Technology of China

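AI summary Proposes the UCS-Bench dataset and the DirectMe framework, which incrementally builds a structured spatial memory to support dynamic spatial reasoning, long-term memory, and alignment with the user's real-time location in egocentric video streams, significantly improving the spatial reasoning of multimodal large language models.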

Comments 45 pages. https://icml.cc/virtual/2026/poster/63682

Journal ref ICML 2026

English abstract

We introduce UCS-Bench, a dataset spanning 170+ hours of egocentric visual observations with 8.1K+ timestamped questions for diagnosing User-Centric Continual Spatial intelligence in egocentric video streams. UCS-Bench targets a new problem that emphasizes dynamic spatial reasoning, long-term memory, and their alignment with users' real-time locations. We propose DirectMe, a framework that incrementally constructs and maintains a structured spatial memory from streaming egocentric observations. DirectMe enables robust tracking and recall of object locations, all relative to the user's movement over time. By tightly coupling visual perception with memory updates and spatial reasoning, our approach supports long-horizon queries that require recalling interactions, resolving viewpoint-induced ambiguities, and adapting to dynamic scenes. Our experiments show that DirectMe significantly improves the spatial reasoning of leading multimodal LLMs; it also surpasses many spatially aware and long-form streaming video models. We hope our benchmark and solution will advance spatial intelligence research for egocentric AI assistants. Data and code are available at https://github.com/cocowy1/UCS-Bench.

2606.15199 2026-06-16 cs.AI New submission

CogGuard: Cognitive and Operational Profiling for Proactive Warning in Edge Intelligent Services

Zhi Yao, Weihao Chen, Zhiqing Tang, Hanshuai Cui, Qianli Ma, Weijia Jia, Wei Zhao

Affiliations Beijing Normal-Hong Kong Baptist University; Guangdong Key Lab of AI and Multi-modal Data Processing; Institute of Artificial Intelligence and Future Networks; Engineering Center of AI and Future Education; Guangdong Provincial Department of Science and Technology; Zhuhai Science-Tech Innovation Bureau; Beijing Normal University at Zhuhai

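AI summary Proposes the CogGuard framework, which decouples offline LLM-based profile construction from online SLM-based score prediction and combines prefix-aligned KV-cache reuse with length-aware distributed fine-tuning to deliver proactive warning for edge intelligent services; on education and operation tasks it cuts profile construction time by up to 48% and distributed fine-tuning time by 19%.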

Comments Accepted to ICWS 2026

English abstract

Proactive warning is an important capability for edge intelligent services, where the system predicts whether a subject will successfully complete an incoming task under strict latency and privacy constraints. Such prediction depends on both long-term static attributes and short-term dynamic states derived from historical interaction logs. Recent Large Language Models (LLMs) offer strong long-context reasoning for constructing structured profiles from these logs, but existing solutions face two challenges for edge deployment: (1) profiling methods are typically domain-specific and lack a reusable abstraction across service scenarios, and (2) fine-tuning alignment models on heterogeneous edge clusters incurs high synchronization overhead due to the variance in input sequence lengths. To address these challenges, we propose CogGuard, a proactive-warning framework for edge intelligent services. CogGuard decouples offline LLM-based profile construction from online Small Language Model (SLM)-based score prediction through a shared static-dynamic profile-to-score pipeline, and instantiates it in two representative scenarios: educational performance warning and operational task outcome warning. For efficient profile construction, we design scenario-specific profiling methods with prefix-aligned KV-cache reuse to reduce repeated encoding overhead. For edge-side model alignment, we propose a length-aware distributed fine-tuning strategy with contrastive regularization to mitigate workload imbalance on heterogeneous clusters. Experiments on education and operation datasets show that CogGuard reduces profile construction time by up to 48% and distributed fine-tuning time by 19%, while achieving MAEs of 13.4 and 5.9, respectively, on 100-point-scale warning tasks. In the largest educational setting, CogGuard reduces prediction error by 15.4% compared with the strongest baseline.
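The abstract does not describe how the length-aware strategy distributes work, so the following is only a generic illustration of why sequence length matters on heterogeneous workers. It uses a greedy longest-first assignment, which is a stand-in technique and not CogGuard's method; `assign_by_length`, `seq_lengths`, and `worker_speeds` (tokens processed per unit time) are hypothetical names.

```python
def assign_by_length(seq_lengths, worker_speeds):
    """Greedy longest-first assignment of sequences to workers of differing speed.

    Each sequence goes to the worker that would finish earliest after taking it.
    Returns (assignment, finish_times), where assignment[w] lists sequence indices.
    """
    loads = [0.0] * len(worker_speeds)          # tokens assigned to each worker
    assignment = [[] for _ in worker_speeds]

    # Place the longest sequences first so short ones can fill the remaining gaps.
    order = sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i])
    for idx in order:
        best = min(
            range(len(worker_speeds)),
            key=lambda w: (loads[w] + seq_lengths[idx]) / worker_speeds[w],
        )
        assignment[best].append(idx)
        loads[best] += seq_lengths[idx]

    finish_times = [load / speed for load, speed in zip(loads, worker_speeds)]
    return assignment, finish_times
```

In a synchronous step, every worker waits for the slowest one, so the gap between the largest and smallest finish time is a rough proxy for the synchronization overhead the abstract refers to.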

2606.15198 2026-06-16 cs.CV cs.HC New submission

City landscape in sight: A crowdsourced framework for unlocking urban-scale window view perceptions from real estate imagery

Chucai Peng, Sijie Yang, Ang Liu, Yang Xiang, Zhixiang Zhou, Filip Biljecki

Affiliation National University of Singapore

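AI summary Proposes a method for large-scale perception mapping using real window view images (WVIs) from real estate platforms; a hybrid neural network predicts six perceptual dimensions and their spatial distribution is analysed, showing that floor level strongly influences perception and that window view composition (e.g. the ratios of sky and trees) has non-linear effects.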

English abstract

City landscapes viewed through home windows influence quality of life, yet perceptions of actual window views at the urban scale remain understudied. This study presents an approach for large-scale mapping of perceptions using 12,334 window view images (WVIs) collected from actual residential properties listed on real estate platforms in Wuhan, China, representing a rarely explored form of urban view imagery that offers advantages over the rendered or simulated window views commonly examined in previous studies. Through a non-immersive virtual reality platform, we collected 27,477 pairwise comparisons across six perceptual dimensions (e.g. Vivid) from 304 participants based on 499 WVIs. A hybrid neural network model was trained to predict human perceptions of all crowdsourced WVIs and map their spatial distribution. Results reveal significant spatial autocorrelation with distinct hot and cold spots across the whole city. Floor level strongly influences human perceptions: while higher floors offer more preferred and extensive window views, lower-floor windows provide residents with quiet and vivid views. An inference model further shows that window view composition matters considerably: high ratios of sky, trees, and low-rise buildings enhance people's preferences and perceptions of vividness, whereas high ratios of high-rise buildings increase perceptions of monotony and oppression. Importantly, these effects are non-linear: the excessive presence of certain elements can alter their impact on human perception. This work advances urban-scale understanding of residents' visual experiences and provides evidence-based guidance for human-centric urban planning and real estate to optimise visual landscapes from windows.

2606.15191 2026-06-16 cs.CL New submission

AmchiBias: Measuring Stereotypical Bias in Goan Identity Groups with a Minimal Pair Dataset in English and Konkani

Michelle Barbosa, Sebastian Padó, Franziska Weeber

Affiliation Institute for Natural Language Processing, University of Stuttgart

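AI summary Proposes the AmchiBias benchmark, which uses 313 minimal pairs to evaluate stereotypical bias toward Goan identity groups in multilingual encoders; the models score near chance in Konkani, and English queries reflect pan-Indian bias rather than local cultural knowledge.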

Comments The 1st Workshop on Stereotypes Across Cultures in Language Technologies

English abstract

Socio-cultural stereotypical bias is an important consideration in the development and deployment of NLP systems. However, it is often considered only at the national level, despite rich subnational socio-cultural structures. We present AmchiBias, the first benchmark for measuring socio-cultural stereotypical bias for the Indian state of Goa with its unique historically multicultural setting. It covers various Goan identity groups and comprises 313 minimal pairs across eight sociodemographic dimensions in both English and Devanagari Konkani. We then evaluate stereotypical bias in five multilingual encoder models on this benchmark. We find near-chance scores in Konkani, reflecting language incompetence for general multilingual models and a lack of Goan cultural competence for Indian language models. When queried in English, models with stronger Indian language coverage show higher bias for pan-Indian groups than for hyperlocal Goan groups. This suggests the English signal reflects pan-Indian pretraining associations rather than genuine Goan cultural knowledge. Our findings highlight a critical gap in low-resource multilingual NLP evaluation for hyperlocal community identities.
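The abstract does not state how a minimal pair is scored, so the sketch below assumes a common convention for minimal-pair bias benchmarks: the score is the share of pairs in which the model rates the stereotypical sentence higher than its counterpart, so that 0.5 corresponds to chance. The function name and the input format (a pair of model scores per minimal pair, for example pseudo-log-likelihoods) are hypothetical.

```python
def stereotype_preference_rate(pair_scores):
    """Share of minimal pairs where the stereotypical sentence scores higher.

    pair_scores: iterable of (stereotypical_score, counterpart_score) tuples.
    Ties count as half a preference; 0.5 means no systematic preference.
    """
    pairs = list(pair_scores)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    preferred = sum(1 for stereo, other in pairs if stereo > other)
    ties = sum(1 for stereo, other in pairs if stereo == other)
    return (preferred + 0.5 * ties) / len(pairs)
```

Read this way, a "near-chance" result means the rate stays close to 0.5, which can indicate either an absence of bias or, as the abstract argues for Konkani, a model that does not understand the sentences well enough to prefer either one.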

2606.15188 2026-06-16 cs.CV New submission

Adaptive Inference-Time Scaling via Early-Step Latent Verification for Image Editing

Yue Yu, Yang Jiao, Jiayu Wang, Qi Dai, Jingjing Chen

Affiliations College of Computer Science and Artificial Intelligence, Fudan University; Microsoft Research Asia; Institute of Trustworthy Embodied AI, Fudan University

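AI summary Proposes the VeriLatent framework, which verifies initial noises through early-step latent-space editing activation maps to enable adaptive inference-time scaling, improving image-editing quality and efficiency.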

English abstract

Instruction-based image editing has made notable progress with recent advances in generative models. However, the quality of the edited result is still influenced by the randomly sampled initial noise, particularly in complex editing scenarios. An unsuitable initial noise may lead to unsatisfactory editing results. Recent inference-time scaling methods address this issue by sampling multiple initial noises and selecting better candidates. Nevertheless, most of them follow a decode-then-verify scheme which introduces an efficiency-accuracy trade-off. When decoding is performed after limited inference steps, the decoded images often remain too noisy for reliable assessment, whereas sufficiently denoised images require much higher computational cost. To address this issue, we propose VeriLatent, a plug-and-play adaptive inference-time scaling framework with early-step latent verification for image editing. Specifically, we propose a novel verifier that scores each initial noise through a latent-space editing activation map at an early stage. It identifies promising candidates by assessing whether they can induce an effective edit in the correct region. This enables efficient early pruning without decoding latents into images. Building on this, we further develop an adaptive search strategy for inference-time scaling. It allocates inference budgets according to editing difficulty, thereby reducing the number of function evaluations (NFE). Extensive experiments on multiple benchmarks and different base models demonstrate that VeriLatent consistently improves both editing performance and inference-time scaling efficiency.
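The efficiency argument for early pruning can be made concrete with simple NFE accounting. This is a back-of-the-envelope sketch under stated assumptions, not VeriLatent's adaptive search: it assumes one function evaluation per denoising step per candidate, a fixed verification step, and a fixed number of survivors, and it ignores decoder and verifier cost. All function and parameter names are hypothetical.

```python
def nfe_full_denoising(num_candidates, total_steps):
    """NFE when every candidate noise is denoised to the end before selection."""
    return num_candidates * total_steps


def nfe_early_pruning(num_candidates, total_steps, verify_step, keep):
    """NFE when candidates are scored at verify_step and only `keep` continue."""
    if not 0 < verify_step <= total_steps:
        raise ValueError("verify_step must lie in 1..total_steps")
    if not 0 < keep <= num_candidates:
        raise ValueError("keep must lie in 1..num_candidates")
    shared = num_candidates * verify_step        # every candidate runs the early steps
    remaining = keep * (total_steps - verify_step)  # only survivors finish denoising
    return shared + remaining
```

For example, with 8 candidates, 28 steps, verification after step 4, and a single survivor, these assumptions give 56 evaluations instead of 224. The numbers are illustrative and do not come from the paper.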

2606.15186 2026-06-16 cs.SD cs.AI eess.AS New submission

FreeSonic: Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing

Yuxuan Jiang, Mingyang Han, Yusheng Dai, Andong Wang, Tianhong Zhou, Jiaxin Ye, Dongxiao Wang, Haoxiang Shi, Boyu Li, Jun Song, Cheng Yu, Bo Zheng, Weibei Dou, Zehua Chen, Jun Zhu

Affiliations Tsinghua University; Alibaba Group; Monash University; Renmin University of China; Fudan University

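AI summary Proposes FreeSonic, a training-free framework that builds on the Rectified Flow-based TangoFlux model and uses an optimized inversion-reverse process, joint text-audio attention maps, and scheduled attention decoupling to achieve precise and consistent audio editing while preserving background fidelity.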

Comments Accepted at Interspeech 2026

English abstract

Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge: existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free-sonic.github.io/
