arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2083
2605.13370 2026-05-14 cs.LG cs.CL

Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

Sungwoo Goo, Hwi-yeol Yun, Sangkeun Jung

AI总结 本文提出了一种名为“Phasor Memory Network(PMNet)”的新架构,旨在解决显式记忆模型在语言建模中因反向传播时梯度不稳定而导致的训练困难问题。该方法通过引入单位相位动力学和分层可学习锚点,结构化地稳定了记忆模块的更新过程,从而在无需特殊初始化的情况下保持梯度稳定性。实验表明,PMNet在合成复制粘贴任务中能够实现几乎100%的精确记忆检索,并在参数规模仅为Mamba模型三分之一的情况下,展现出相当的长上下文处理能力,为可扩展序列建模提供了理论支撑。

详情
英文摘要

For over a decade, explicit memory architectures like the Neural Turing Machine have remained theoretically appealing yet practically intractable for language modeling due to catastrophic gradient instability during Backpropagation Through Time. In this work, we break this stalemate with \textit{Phasor Memory Network} (PMNet), a novel architecture that structurally resolves memory volatility through \textit{Unitary Phasor Dynamics} and \textit{Hierarchical Learnable Anchors}. Rather than relying on brute-force scaling, we present a mechanistic proof-of-concept in a controlled byte-level setting. By constraining recurrent state updates to phase rotations on a complex unit circle, PMNet preserves gradient norms and inherently prevents divergence without the need for specialized initialization. We empirically demonstrate the active actuation of the memory module through a synthetic Copy-Paste task, where PMNet utilizes an expansive \textit{85-slot hierarchical memory tree} ($=\sum^{4}_{h=1}4^{h-1}$) to achieve near 100\% exact retrieval across temporal distances that completely exceed the local sliding window attention's receptive field. Furthermore, despite being a compact 119M parameter model trained on 18.8B tokens, PMNet matches the zero-shot long-context robustness of a Mamba model that is three times larger. Our ablation studies and gradient analyses confirm that the historical failure of explicit memory was a structural alignment problem, which PMNet effectively overcomes, providing a theoretically grounded foundation for scalable sequence modeling.

2605.13368 2026-05-14 cs.CL

What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation

Shaomu Tan, Dawei Zhu, Ke Tran, Michael Denkowski, Sony Trenous, Bill Byrne, Leonardo Ribeiro, Felix Hieber

AI总结 本文系统研究了迭代自修正策略在文学翻译中的实际效果,探讨了不同粒度和策略对翻译质量的影响。研究发现,先进行文档级机器翻译,再进行片段级修正能带来稳定且显著的提升,而文档级修正效果较弱且不可靠。实验还表明,通用的修正提示优于特定错误修正和评估后修正方法,且修正主要提升了流畅性、风格和术语,对内容准确性提升有限。这些发现揭示了当前修正方法的机制及其局限性。

详情
英文摘要

Iterative self-refinement is a simple inference-time strategy for machine translation: an LLM revises its own translation over multiple inference-time passes. Yet document-scale refinement remains poorly understood: 1) which pipelines work best, 2) what quality dimensions improve, and 3) how refiners behave. In this paper, we present a systematic study of document-level literary translation, covering nine LLMs and seven language pairs. Across nine translation-refinement granularity combinations and five refinement strategies, we find a robust recipe: document-level MT followed by segment-level refinement yields strong and stable improvements. In contrast, document-level refinement often makes fewer edits and leads to smaller or less reliable gains. Beyond granularity, A simple general refinement prompt consistently outperforms error-specific prompting and evaluate-then-refine schemes. Our large-scale human evaluation shows that refinement gains come primarily from fluency, style, and terminology, with limited and less consistent improvements in adequacy. Experiments varying model strength reveal refinement projects outputs toward the refiner's distribution rather than performing targeted error repair. These findings clarify the mechanisms and limitations of current refinement approaches.

2605.13366 2026-05-14 cs.CV cs.LG

Neural Surrogate Forward Modelling For Electrocardiology Without Explicit Intracellular Conductivity Tensor

Shaheim Ogbomo-Harmitt, Cesare Magnetti, Jakub Grzelak, Oleg Aslanidi

AI总结 该研究针对无创心脏电生理学中的正向建模问题,提出了一种无需显式输入细胞内导电张量的深度学习方法,用于直接从左心房细胞内电位预测远场心电图。该方法通过深度学习模型学习电位与心电图之间的映射关系,避免了传统物理模型中难以测量的导电张量带来的结构误差。实验表明,该模型在仅使用74个受试者数据训练的情况下,取得了较高的预测精度(R²为0.949 ± 0.037),展示了其在改善房颤无创评估中的潜力。

Comments Accepted into the 9th International Conference on Computational and Mathematical Biomedical Engineering (CMBE2026)

详情
英文摘要

Accurate forward modelling is essential for non-invasive cardiac electrophysiology, particularly in atrial fibrillation, where electrical activation is highly disorganised. Conventional physics-based forward models require explicit specification of intracellular conductivity tensors, which are not directly measurable in clinical practice and introduce structural modelling errors. This proof-of-concept study presents a deep learning approach that learns a direct mapping from left atrial intracellular electrical potentials to far-field ECGs without requiring explicit intracellular conductivity inputs at inference time. Despite training only on 74 subjects, the model achieved an R2 of 0.949 \pm 0.037, highlighting potential to reduce structural uncertainty and improve non-invasive AF assessment.

2605.13352 2026-05-14 cs.LG

GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

Mayank Nautiyal, Li Ju, Andreas Hellander, Ekta Vats, Prashant Singh

AI总结 GeoFlowVLM 是一种后处理方法,旨在为冻结的视觉-语言嵌入模型引入几何感知的联合不确定性估计。该方法通过黎曼流匹配在超球面乘积空间上学习配对嵌入的联合分布,从而同时捕捉跨模态的模糊性(aleatoric uncertainty)和训练分布外的不确定性(epistemic uncertainty)。该模型能够生成条件检索熵和边际典型性分数,分别用于衡量模糊性和知识不确定性,并在多个检索和零样本分类任务中表现出良好的校准性能。

详情
英文摘要

Standard dual-encoder vision-language models that map images and text to deterministic points on a shared unit hypersphere through $\ell_2$ normalization typically expose neither \emph{aleatoric} uncertainty (cross-modal ambiguity) nor \emph{epistemic} uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components, or ignore the hyperspherical geometry of these models' embeddings. We propose \textbf{GeoFlowVLM} as a post-hoc adapter that learns the joint distribution of paired $\ell_2$-normalised dual-encoder VLM embeddings on the product hypersphere $\mathbb{S}^{d-1} \times \mathbb{S}^{d-1}$ via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid Riemannian flow-matching velocity fields on their respective domains. We derive two quantities from this single model: a conditional retrieval entropy that quantifies aleatoric ambiguity with a decision-theoretic interpretation via a Fano-type bound, and a marginal-typicality epistemic score justified by an exact chain-rule decomposition of the joint NLL. This decomposition isolates a cross-modal pointwise-mutual-information term that is structurally discriminative rather than epistemic, and is empirically the only consistently uninformative standalone component. Empirically, the entropy tracks Recall@1 with near-ideal monotonic calibration across three retrieval benchmarks in both directions, and the marginal-typicality sum yields consistently calibrated selective accuracy across four zero-shot classification benchmarks.

2605.13349 2026-05-14 cs.CV

Drag within Prior Distribution: Text-Conditioned Point-Based Image Editing within Distribution Constraints

Haoyang Hu, Masataka Seo, Yen-Wei Chen

AI总结 本文研究了在扩散模型框架下,如何在保持图像语义一致性和分布约束的前提下,实现基于文本条件的点编辑。为了解决传统点编辑方法中轨迹模糊、编辑范围过大导致的不自然伪影等问题,作者引入了基于CLIP的引导机制和先验保持损失函数,确保编辑过程在扩散先验分布范围内进行。同时,提出了一种方向加权的点追踪机制,提升了细粒度编辑的准确性和生成质量。

Comments ICASSP 2026 oral

详情
英文摘要

Diffusion-based point editing methods have gained significant traction in image editing tasks due to their ability to manipulate image semantics and fine details by applying localized perturbations on the manifold of noise latent. However, these approaches face several limitations. Traditional point-based editing relies on pairs of handle and target points to define motion trajectories, which can introduce ambiguity or unnecessary alterations. Furthermore, when the distance between the handle and target points is large, the accumulated perturbations often cause the noise latent deviation from inversion score trajectory, resulting in unnatural artifacts. To address these issues in global editing tasks, we introduce a CLIP-based model to evaluate and guide intermediate editing steps, ensuring that the generated results remain both semantically aligned. Additionally, we propose a prior-preservation loss that constrains the optimized latent code to stay within the sampling space of the diffusion prior, improving consistency with the original data distribution, to ensure the model generates images along a familiar score trajectory. For fine-grained tasks, we present a directionally-weighted point tracking mechanism that steers the editing process toward the target direction within similar feature regions. This improves both the tracking accuracy and generation quality, while also reducing the editing time.

2605.13346 2026-05-14 cs.LG

Contextual Bandits for Resource-Constrained Devices using Probabilistic Learning

Marco Angioli, Kevin Johansson, Antonello Rosato, Amy Loutfi, Denis Kleyko

AI总结 本文研究了在资源受限设备上高效部署上下文多臂老虎机算法的问题,提出了一种基于概率更新规则的高维上下文多臂老虎机方法(probabilistic HD-CB)。该方法通过随机更新部分向量分量并结合时间衰减更新概率,避免了传统高维方法中因累积操作导致的精度问题和溢出风险,同时降低了计算和存储开销。实验表明,该方法在相同精度下性能优于二值化高维方法,且在少量比特数下接近原高维方法的性能。

详情
英文摘要

Contextual bandits (CB) are online sequential decision-making problems under partial feedback that underpin many adaptive services. There is a growing demand to deploy CB agents directly on-device, under strict constraints on memory, compute, and energy. However, standard linear CB algorithms are often impractical for resource-constrained devices with their unfavorable scaling in computational and memory costs. Recently, HD-CB, a CB approach based on hyperdimensional computing principles, has been proposed to model and solve CB problems by moving into high-dimensional spaces. HD-CB offers faster convergence, favorable scalability, and improves memory efficiency compared to linear CB algorithms. However, its learning rule is accumulation-based: the values of action vectors grow over time, requiring high precision. While periodic binarization can prevent overflow in low-precision components, it may discard important information about magnitudes and degrade decision quality. This paper introduces probabilistic HD-CB, a low-precision variant that replaces deterministic accumulation with a probabilistic update rule. At each step, only a random subset of vector components is updated, with a time-decaying update probability, and component values are constrained to a predefined range [-k,+k]. This approach enables low-precision components, prevents overflow without periodic binarization, and reduces the expected update cost in proportion to the fraction of updated components. Off-policy evaluation on standardized synthetic CB benchmarks using the Open Bandit Pipeline shows that probabilistic HD-CB consistently outperforms binarized HD-CB at equal precision, while approaching the performance of HD-CB with as few as 3 bits per component.

2605.13345 2026-05-14 cs.AI cs.MA

Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin

Markus Wenzel, Tobias Strapatsas, Jessika Kress, Dorothea Sauer, Nele Gessler, Horst K. Hahn

AI总结 该研究针对急诊科在患者护理和资源管理方面面临的挑战,提出了一种结合离散事件仿真(DES)和基于代理的模型(ABM)的混合仿真方法,用于构建高度可配置的急诊科数字孪生系统。通过验证模型在不同规模、患者流量和人员配置下的表现,并与实际数据对比,证明了该模型能够有效模拟真实急诊环境下的运行动态。此外,研究还引入了一个基于时间事件记录的多智能体系统,可自主探索资源分配策略,为急诊科资源优化提供了有力的仿真工具。

详情
英文摘要

Emergency departments (ED) face challenges in patient care and resource management. We propose to explore optimization strategies in a realistic and flexible model and develop a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) simulating highly configurable ED environments. We specifically focus on the validation of the modeling approach. We derive configurations for ED sizes, patient load, and staffing from real-world studies. We then validate the model expressivity by matching its key performance indicators and metrics with their values known from literature. We proceed by implementing scientifically established and practice-proven resource optimization strategies. Comparing the documented real-world outcomes with our model's results demonstrates that the DES-ABM based simulation can effectively replicate real-world ER dynamics under interventions. We lastly integrate a Proof-of-Concept multi-agent system (MAS) that can autonomously explore resource allocation strategies within the simulated ER environment based on a temporal ledger of ED event records. This modular DES-ABM-MAS framework offers a powerful tool to explore resource optimization strategies in emergency departments.

2605.12088 2026-05-14 cs.CV

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

Yiyan Xu, Qiulin Wang, Wenjie Wang, Yunyao Mao, Xintao Wang, Pengfei Wan, Kun Gai, Fuli Feng

AI总结 本文研究了多参考图像生成问题,即在文本指令引导下生成图像并忠实保留多个参考图像中的主体身份和外观细节。现有方法通常将语义和外观特征分离处理,导致模型难以正确关联主体与对应参考图像的细节,从而引发属性泄露和跨参考混淆。为此,作者提出UniCustom框架,在视觉语言模型编码前融合ViT和VAE特征,使模型能够同时学习主体语义和外观信息,并通过两阶段训练策略和槽位绑定正则化进一步提升生成质量。实验表明,UniCustom在多个基准上显著优于现有方法。

详情
英文摘要

Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.

2605.12072 2026-05-14 cs.CV

PairDropGS: Paired Dropout-Induced Consistency Regularization for Sparse-View Gaussian Splatting

Hantang Li, Qiang Zhu, Xiandong Meng, Xingtao Wang, Debin Zhao, Xiaopeng Fan

AI总结 PairDropGS 是一种基于配对 dropout 的一致性正则化方法,旨在提升稀疏视角下高斯溅射(Gaussian Splatting)的重建稳定性与质量。该方法通过从共享高斯场中构造配对的 dropout 子集,并引入低频一致性正则化,以保持场景布局和粗略几何结构的稳定性,同时避免对高频细节的过度约束。此外,PairDropGS 还采用渐进式一致性调度策略,增强训练过程中的鲁棒性,实验表明其在多个基准数据集上均取得了优于现有方法的重建效果。

Comments 11 pages,8 figures

详情
英文摘要

Dropout-based sparse-view 3D Gaussian Splatting (3DGS) methods alleviate overfitting by randomly suppressing Gaussian primitives during training. Existing methods mainly focus on designing increasingly sophisticated dropout strategies, while they overlook the resulting inconsistencies among different dropped Gaussian subsets. This oversight often leads to unstable reconstruction and suboptimal Gaussian representation learning.In this paper, we revisit dropout-based sparse-view 3DGS from a consistency regularization perspective and propose PairDropGS, a Paired Dropout-induced Consistency Regularization framework for sparse-view Gaussian splatting. Specifically, PairDropGS first constructs a pair of the dropped Gaussian subsets from a shared Gaussian field and designs a low-frequency consistency regularization to constrain their low-frequency rendered structures. This design encourages the shared Gaussian field to preserve stable scene layout and coarse geometry under different random dropouts, while avoiding excessive constraints on ambiguous high-frequency details. Moreover, we introduce a progressive consistency scheduling strategy to gradually strengthen the consistency regularization during training for stability and robustness of reconstruction. Extensive experiments on widely-used sparse-view benchmarks demonstrate that PairDropGS achieves superior training stability, significantly outperforms existing dropout-based 3DGS methods in reconstruction quality, while exhibiting the simplicity and plug-and-play nature for improving dropout-based optimization.

2605.11726 2026-05-14 cs.LG

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

Yan Jiang, Ruihong Qiu, Zi Huang

AI总结 本文研究了在多领域扩散大语言模型(dLLMs)的强化学习(RL)后训练中,块大小对性能的影响,并从领域冲突的角度出发,提出了块大小冲突的概念。研究构建了一个新的数据集Block-R1-41K和基准Block-R1,用于支持单领域和跨领域的RL后训练,并提出了一种基于样本级最优块大小的跨领域训练方法。实验覆盖了13个数据集、7种最新RL算法和多种dLLM模型,验证了方法的有效性。

详情
英文摘要

Recently, reinforcement learning (RL) has been widely applied during post-training for diffusion large language models (dLLMs) to enhance reasoning with block-wise semi-autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the parallel decoding granularity and affects the rollout trajectories during RL optimisation, e.g., GRPO. Instead of investigating the effect of block size during inference on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post-training in multi-domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi-domain RL for dLLMs, which will largely affect the post-training effectiveness for rollout-based RL methods; (2) a novel dataset, Block-R1-41K is constructed with a best-improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure the domain conflict; (3) a new benchmark, Block-R1, for flexible RL post-training for dLLMs in both single and cross domain; and (4) a simple yet powerful cross-domain post-training method with sample-level best-improved training block sizes. Extensive experiments on 13 distinct datasets, 7 latest RL algorithms and diverse dLLM backbones are comprehensively covered in Block-R1. The benchmark is open-sourced at https://github.com/YanJiangJerry/Block-R1 with the dataset released at https://huggingface.co/datasets/YanJiangJerry/Block-R1-41K.

2605.11231 2026-05-14 cs.LG cs.AI

LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection

Abhishek Moturu, Anna Goldenberg, Babak Taati

AI总结 本文提出了一种名为LiBaGS的轻量级合成数据选择方法,旨在针对特定任务选择具有代表性的合成样本以补充训练分布的不足。该方法结合了决策边界距离、预测不确定性、真实数据密度和支撑有效性等多个指标,以筛选出信息量大且贴近真实数据流形的样本。通过边界间隙分配规则和边际价值停止准则,LiBaGS能够高效地选择稀疏但真实的边界邻域样本,提升模型在下游任务中的准确率。

详情
英文摘要

Synthetic data is useful only when the added samples fill missing parts of the training distribution that matter for the downstream task. We introduce LiBaGS, a lightweight, generator-agnostic method for targeted synthetic training data selection. LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity, so that selected samples are both informative and likely to remain on the real data manifold. We then use a boundary-gap allocation rule that targets sparse but realistic decision-boundary neighborhoods, rather than simply adding more data or selecting only the most uncertain candidates. LiBaGS also learns when enough synthetic samples have been added through a marginal-value stopping rule, assigns softer labels near ambiguous boundaries, and uses a diversity objective to avoid redundant near-duplicate selections. Experiments show that LiBaGS improves accuracy over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.

2605.10556 2026-05-14 cs.CV cs.LG

EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving

Vittorio Palladino, Gianluca Palermo, Michael E. Papka, Zhiling Lan

AI总结 随着大语言模型架构日益多样化,并在异构加速器上处理多模态工作负载,优化推理能耗已成为与延迟和吞吐量同样关键的问题。现有方法要么将延迟作为能耗代理,要么依赖数据密集的黑箱模型,均难以适应不同的并行策略。本文提出EnergyLens,通过符号回归从性能剖析数据中推导出一个包含12个参数的闭式能耗模型,能够准确描述系统特性如并行度、批大小和序列长度对能耗的影响,其预测结果具有物理可解释性,并且仅需少量的剖析样本即可实现高精度的配置选择和跨硬件平台的泛化能力。

Comments 10 pages

详情
英文摘要

As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimising inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.

2605.10415 2026-05-14 cs.CL

Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis

Junyu Lu, Deyi Ji, Xuanyi Liu, Lanyun Zhu, Bo Xu, Liang Yang, Xian-Sheng Hua, Hongfei Lin

AI总结 本文研究了在主观性分析任务中如何使大语言模型的不确定性与人类判断的分歧相一致。传统方法使用聚合标签训练模型,忽略了低一致性样本的内在不确定性,导致模型预测过于自信。为此,作者提出了一个两阶段的DPUA框架,通过感知人类分歧并调整模型不确定性,既保持了任务性能,又提升了模型在边界样本上的可靠性与分布外泛化能力。

详情
英文摘要

Large language models for subjectivity analysis are typically trained with aggregated labels, which compress variations in human judgment into a single supervision signal. This paradigm overlooks the intrinsic uncertainty of low-agreement samples and often induces overconfident predictions, undermining reliability and generalization in complex subjective settings. In this work, we advocate uncertainty-aware subjectivity analysis, where models are expected to make predictions while expressing uncertainty that reflects human disagreement. To operationalize this perspective, we propose a two-phase Disagreement Perception and Uncertainty Alignment (DPUA) framework. Specifically, DPUA jointly models label prediction, rationale generation, and uncertainty expression under an uncertainty-aware setting. In the disagreement perception phase, adaptive decoupled learning enhances the model's sensitivity to disagreement-related cues while preserving task performance. In the uncertainty alignment phase, GRPO-based reward optimization further improves uncertainty-aware reasoning and aligns the model's confidence expression with the human disagreement distribution. Experiments on three subjectivity analysis tasks show that DPUA preserves task performance while better aligning model uncertainty with human disagreement, mitigating overconfidence on boundary samples, and improving out-of-distribution generalization.

2605.09505 2026-05-14 cs.AI

EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

Yuyang Dai, Zheng Chen, Jathurshan Pradeepkumar, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai, Yushun Dong

AI总结 该研究提出了EpiGraph,一个大规模的癫痫知识图谱和评估基准,旨在提升基于证据的临床推理能力。EpiGraph整合了48,166篇同行评审论文和七项临床资源,构建了一个包含24,324个实体和32,009个证据支持三元组的异构图谱,并基于此定义了五个临床任务用于评估模型性能。实验表明,结合EpiGraph的大型语言模型在多项任务中表现显著提升,尤其在药理基因组学推理方面提升了30%至41%,验证了结构化知识对增强临床推理能力的有效性。

详情
英文摘要

Epilepsy diagnosis and treatment require evidence-intensive reasoning across heterogeneous clinical knowledge, including biosignal patterns, genetic mechanisms, pharmacogenomics, treatment strategies, and patient outcomes. In this work, we present \textsc{EpiGraph}, a large-scale epilepsy knowledge graph and benchmark for evaluating knowledge-augmented clinical reasoning. \textsc{EpiGraph} integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. Built upon this graph, \textsc{EpiBench} defines five clinically motivated tasks spanning clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. We evaluate six LLMs under both standard and Graph-RAG settings. Results show that integrating \textsc{EpiGraph} consistently improves performance across all tasks, with the largest gains observed in pharmacogenomic reasoning (+30--41\%). Our findings demonstrate that structured epilepsy knowledge substantially enhances evidence-grounded clinical reasoning and provides a practical benchmark framework for evaluating knowledge-augmented LLMs in real-world neurological settings. Our code is available at: https://github.com/LabRAI/EEG-KG.

2605.09415 2026-05-14 cs.AI

Strategic commitments shape collective cybersecurity under AI inequality

Adeela Bashir, Zia Ush Shamszaman, Zhao Song, Matjaz Perc, The Anh Han

AI总结 随着人工智能在网络安全中的广泛应用,攻防双方的力量对比正在发生变化。本文研究了在AI防御工具获取不均的情况下,资源有限的防御者难以有效保护系统所带来的安全风险,并提出通过引入有承诺的防御者和针对性补贴,可以显著提升整体防御能力并增强系统韧性。研究还表明,这种策略不仅能提高防御者的安全收益,还能有效抑制攻击者的获利空间,为AI驱动环境下的网络安全政策制定提供了理论支持。

Comments 26 pages, 16 figures

详情
英文摘要

The growing integration of AI into cybersecurity is reshaping the balance between attackers and defenders. When access to advanced AI-enabled defence tools is uneven, resource-limited defenders may be unable to adopt effective protection, creating persistent system vulnerabilities. We study the impact of differential AI access using an evolutionary game-theoretic model in a finite population. We first show that when high-capability defence is costly, the population is driven toward low-cost, weak-defence behaviour, sustaining attacks and weakening long-run security. To address this problem, we introduce differential access to AI defence tools by allowing defenders to choose between low- and high-capability protection based on their resources. We then examine the role of a small group of committed defenders who always adopt strong defence and influence others through social learning. Although commitment increases the prevalence of strong defence, it alone cannot stabilise secure outcomes due to high defence costs. We therefore incorporate a targeted subsidy to remove the cost disadvantage from committed defenders. Our analysis shows that subsidised commitment significantly increases strong defence adoption, suppresses successful attacks, and improves overall system resilience. Simulations across a broad parameter space confirm that subsidies consistently outperform commitment alone. In addition, social-welfare analysis shows improved defender outcomes while keeping attacker gains low. These findings suggest that targeted support for key defenders can be an effective mechanism for stabilising cybersecurity in AI-driven environments and provide a theoretical bridge between cybersecurity policy, AI governance, and strategic allocation of defensive AI capabilities.

2605.09134 2026-05-14 cs.AI cs.SE

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

Yuanhao Li, Hongbo Wang, Xiaotang Shang, Xunzhu Tang, Yiming Cao, Xuhong Chen

AI总结 BoostAPR 是一种基于执行引导的强化学习框架,旨在解决程序修复中反馈稀疏和奖励粒度过粗的问题。该方法通过三个阶段进行优化:首先在带有推理轨迹的执行验证演示上进行监督微调,然后从执行结果中训练双奖励模型,分别用于评估序列级和行级的修复效果,最后通过PPO算法进行优化,将行级奖励重新分配给关键的代码修改区域。实验表明,BoostAPR 在多个基准测试中取得了优异的修复效果,并展现出良好的跨语言泛化能力。

Comments 21 pages, 2 figures. Accepted at ICML 2026

详情
英文摘要

Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models--a sequence-level assessor and a line-level credit allocator--from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, achieving competitive results among open-source models with strong cross-language generalization.

2605.09020 2026-05-14 cs.CV

The Direct Integration Theorem: A Rigorous Framework for Consistent Discrete Solutions of the Inverse Radon Problem

Mikhail G. Mozerov

AI总结 本文提出了一种新的直接积分定理(DIT),作为经典中心切片定理(CST)的非平凡推论,为连续域到离散域的数学一致转换提供了严谨的框架,解决了计算断层成像中的根本性难题。该方法无需传统 ramp 滤波和频率域插值,避免了零频奇点和谱失真等问题,并实现了基于采样参数和网格几何的准精确重建。实验表明,该方法在图像方差保持、重建质量及重投影保真度方面优于传统滤波反投影(FBP)方法,显著提升了图像的统计特性还原能力。

Comments Submitted to IEEE TPAMI. Code and data available at https://github.com/Mozerov-iitp/radon-dit/

详情
英文摘要

This paper presents a novel Direct Integration Theorem (DIT), derived as a non-trivial corollary of the classical Central Slice Theorem (CST). The DIT provides a mathematically consistent transition from the continuous to the discrete domain - a fundamental challenge in computed tomography - thereby eliminating the need for frequency-domain interpolation without resorting to conventional ramp-filtering. The proposed approach circumvents two principal limitations inherent in traditional methods: (i) the zero-frequency singularity and spectral distortions introduced by the mandatory ramp-filtering step, and (ii) discretization inaccuracies associated with frequency-domain interpolation. Based on the DIT, we develop a rigorous framework for consistent discrete solutions of the inverse Radon problem. Mathematical modeling demonstrates that this approach achieves quasi-exact reconstruction, with errors constrained solely by sampling parameters and grid geometry. Furthermore, while Filtered Back Projection (FBP) inherently distorts the variance of the reconstructed image, the DIT-based algorithm preserves it. Comparative simulations confirm that the proposed method eliminates common artifacts, such as intensity cupping, and consistently outperforms FBP in terms of PSNR, SSIM, and reprojection fidelity, faithfully restoring the original image's statistical characteristics.

2605.07653 2026-05-14 cs.CV eess.IV

Aquatic Neuromorphic Optical Flow

Pei Zhang, Yunkai Liang, Kaiqiang Wang

AI总结 本文研究了水下环境中基于神经形态视觉的光流估计问题,提出了一种基于脉冲神经网络的自监督框架,能够从异步事件流中高效估计逐像素光流,有效克服了水下数据稀缺的瓶颈。该方法在保证视觉和定量性能的同时,显著提升了计算效率,为资源受限的水下边缘平台提供了轻量、实时且低成本的感知解决方案。

Comments This work is under review. Project page: https://github.com/pz-even/event_underwater_optical_flow

详情
英文摘要

Underwater environments impose severe constraints on conventional imaging systems and demand solutions that balance high-quality sensing with strict resource efficiency. While emerging event cameras offer a promising alternative, their potential in aquatic scenarios remains largely unexplored. Through the lens of neuromorphic vision, this work pioneers the investigation of motion fields that serve as key media for agile underwater perception. Built upon spiking neural networks, we introduce a self-supervised framework to estimate per-pixel optical flow from asynchronous event streams, elegantly bypassing the long-standing bottleneck of underwater data scarcity. Extensive evaluations demonstrate that our method achieves competitive visual and quantitative results against leading techniques while operating with superior computational efficiency. By bridging neuromorphic sensing and aquatic intelligence, this work opens new frontiers for lightweight, real-time, and low-cost perception on resource-constrained underwater edge platforms.

2605.06651 2026-05-14 cs.AI

AI co-mathematician: Accelerating mathematicians with agentic AI

Daniel Zheng, Ingrid von Glehn, Yori Zwols, Iuliya Beloshapka, Lars Buesing, Daniel M. Roy, Martin Wattenberg, Bogdan Georgiev, Tatiana Schmidt, Andrew Cowie, Fernanda Viegas, Dimitri Kanevsky, Vineet Kahlon, Hartmut Maennel, Sophia Alj, George Holland, Alex Davies, Pushmeet Kohli

AI总结 本文介绍了“AI co-mathematician”,一个辅助数学家进行开放式研究的智能工作平台。该系统通过异步、状态化的交互方式,支持数学研究中的各个环节,如想法生成、文献检索、计算探索和定理证明,并能有效管理不确定性、追踪失败假设并生成原生数学成果。实验表明,该系统不仅提升了数学研究效率,还在多个难题求解基准测试中取得了优异成绩。

Comments 23 pages; several citations added

详情
英文摘要

We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building. By providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts, the system mirrors human collaborative workflows. In early tests, the AI co-mathematician helped researchers solve open problems, identify new research directions, and uncover overlooked literature references. Besides demonstrating a highly interactive paradigm for AI-assisted mathematical discovery, the AI co-mathematician also achieves state of the art results on hard problem-solving benchmarks, including scoring 48% on FrontierMath Tier 4, a new high score among all AI systems evaluated.

2605.04557 2026-05-14 cs.CV cs.AI

Efficient Geometry-Controlled High-Resolution Satellite Image Synthesis

Vlad Vasilescu, Daniela Faur, Teodor Costachioiu

AI总结 本文研究了如何高效生成受几何控制的高分辨率卫星图像,以解决该类图像稀缺且成本高昂的问题,这对土地覆盖分类、变化检测和灾害监测等任务的模型开发与测试造成阻碍。作者提出了一种基于现有预训练扩散模型的方法,通过引入窗口交叉注意力模块,仅利用跳跃连接特征实现对生成过程的控制,方法简洁高效。实验表明,该方法在性能上与现有控制技术相当,且在几何控制图对齐方面表现更优,同时指出现有评估方法的局限性,强调了对齐评估一致性的重要性。

Comments 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)

详情
英文摘要

High-resolution satellite images are often scarce and costly, especially for remote areas or infrequent events. This shortage hampers the development and testing of machine learning models for land-cover classification, change detection, and disaster monitoring. In this paper, we tackle the problem of geometry-controlled high-resolution satellite image synthesis by adding control over existing pre-trained diffusion models. We propose a simple yet efficient method for controlling the synthesis process by leveraging only skip connection features using windowed cross-attention modules. Several previously established control techniques are compared, indicating that our method achieves comparable performance while leading to a better alignment with the geometry control map. We also discuss the limitations in current evaluation approaches, amplifying the necessity of a consistent alignment assessment.

2605.02752 2026-05-14 cs.CV

Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

Giacomo Pacini, Luca Ciampi, Nicola Messina, Nicola Tonellotto, Giuseppe Amato, Fabrizio Falchi

AI总结 本文研究了开放世界文本引导的类别无关计数(CAC)任务中语义对齐的问题,指出当前模型在理解文本提示与视觉场景之间关系时存在不足,导致计数结果不可靠。为此,作者提出了一种新的评估框架PrACo++,包含负标签测试和干扰项测试等新协议,并构建了包含多类别标注的MUCCA数据集。实验表明,尽管现有模型在标准指标上表现良好,但在语义理解与对齐方面仍存在明显缺陷,突显了构建更具语义感知能力模型的重要性。

Comments Code available at https://github.com/ciampluca/PrACo

详情
英文摘要

Open-world text-guided class-agnostic counting (CAC) has emerged as a flexible paradigm for counting arbitrary object classes by using natural language prompts. However, current evaluation protocols primarily focus on standard counting errors within single-category images, overlooking a fundamental requirement: the ability to correctly ground the textual prompt in the visual scene. In this paper, we show that several state-of-the-art CAC models often struggle to determine which object class should be counted based on the given prompt, revealing a misalignment between textual semantics and visual object representations. This limitation leads to spurious counting responses and reduced reliability in real-world scenarios. To systematically address these limitations, we propose a new evaluation framework focused on model robustness and trustworthiness. Our contribution is two-fold: (i) we introduce PrACo++ (Prompt-Aware Counting++), a novel test suite featuring two dedicated evaluation protocols -- the negative-label test and the distractor test -- paired with new specialized metrics; and (ii) we present the MUCCA (MUlti-Category Class-Agnostic counting) evaluation dataset, a new collection of real-world images featuring multiple annotated object categories per scene, unlike existing CAC benchmarks that typically include a single category per image. Our extensive experimental evaluation of 10 state-of-the-art methods shows that, despite strong performance under standard counting metrics, current models exhibit significant weaknesses in understanding and grounding object class descriptions. Finally, we provide a quantitative analysis of how semantic similarity between prompts influences these failures. Overall, our results underscore the need for more semantically grounded architectures and offer a reliable framework for future assessment in open-world text-guided CAC methods.

2605.02521 2026-05-14 cs.CV

MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling

Xinyi Yin, Yiduo Wang, Tingqi Hu, Meicong Si, Yunyun Shi, Shi Chen, Hao Wang, Junxiao Xue, Xuecheng Wu

AI总结 本文提出MooD,一种基于连续愉悦-唤醒(Valence-Arousal)模型的感知增强型高效情感图像编辑框架,旨在解决现有情感图像编辑方法在推理效率和连续情感建模方面的不足。MooD通过引入VA感知检索策略和融合视觉迁移与感知增强语义引导,实现了细粒度且高效的可控情感编辑。同时,为弥补现有数据集对自然场景覆盖不足的问题,研究者构建了涵盖多场景的AffectSet数据集,进一步提升了模型的性能与泛化能力。

详情
英文摘要

Affective Image Editing (AIE) aims to modify visual content to evoke targeted emotions. Although current approaches achieve impressive editing quality, they often overlook inference efficiency, which limits their applicability in computational social scenarios. Moreover, most methods depend on discrete emotion representations, which hinder the continuous modeling of complex human emotions and constrain expressive capabilities in interactive scenarios. To tackle these gaps, we propose MooD, the first framework that directly leverages continuous Valence-Arousal (VA) values as editing instruction for fine-grained and efficient AIE in computational social systems. Specifically, we first introduce a VA-Aware retrieval strategy to bridge vague affective values and detailed visual semantics. Building upon this, MooD integrates visual transfer and perception-enhanced semantic guidance to achieve controllable AIE. Furthermore, considering that existing VA-annotated datasets mainly focus on social scenarios and largely overlook natural scenes, we therefore construct AffectSet, a comprehensive VA-annotated dataset covering diverse scenarios, to support model optimization and evaluation. Extensive qualitative and quantitative experimental results demonstrate that our MooD achieves superior performance in both affective controllability and visual fidelity while maintaining high efficiency. A series of ablation studies further reveal the crucial factors of our design.

2605.02350 2026-05-14 cs.LG

A Near-optimal SQ Lower Bound for Smoothed Agnostic Learning of Boolean Halfspaces

Tim Sinen

AI总结 本文研究了在均匀边际分布下,对布尔半空间进行平滑无误学习的复杂度问题。作者在输入坐标独立翻转的概率为 $σ$ 的模型下,证明了 $L^1$ 多项式回归的运行时间和样本复杂度为 $\tilde{O}(n^{O(\log(1/\varepsilon)/σ)})$,并给出了几乎匹配的统计查询复杂度下界 $n^{Ω(\log(1+σ/\varepsilon^2)/σ)}$。该结果补充了近期在高斯边际分布下连续情况的相关研究。

Comments Fixed several typos and minor proof issues

详情
英文摘要

We study the complexity of smoothed agnostic learning of halfspaces on $\{\pm 1\}^n$ under uniform marginals in the model of~\cite{KM25}, where each input coordinate is independently flipped with probability $σ\in (0, {1}/{2})$. We show that $L^1$ polynomial regression achieves runtime and sample complexity $\tilde{O}(n^{O(\log(1/\varepsilon)/σ)})$, and prove a nearly matching Statistical Query complexity lower bound of $n^{Ω(\log(1+σ/\varepsilon^2)/σ)}$. This complements the recent work of~\cite{DK26}, which established analogous bounds in the continuous setting under Gaussian marginals.

2604.28045 2026-05-14 cs.CV

TAFA-GSGC: Group-wise Scalable Point Cloud Geometry Compression with Progressive Residual Refinement

Xiumei Li, Alexander Kopte, André Kaup

AI总结 本文提出了一种名为TAFA-GSGC的可扩展点云几何压缩方法,能够在单一比特流和单一训练模型下实现多质量解码。该方法结合了分层残差细化与通道组熵编码,并引入了目标对齐特征聚合模块以减少增强残差中的跨层冗余。实验表明,TAFA-GSGC在保持良好压缩效率的同时,支持多达9个解码质量等级,并在D1-PSNR和D2-PSNR指标上分别实现了4.99%和5.92%的比特率降低。

Comments Accepted at IEEE International Conference on Image Processing (ICIP) 2026

详情
英文摘要

Scalable compression is essential for bandwidth-adaptive transmission, yet most learned codecs are optimized for a fixed rate-distortion point, making rate adaptation costly due to re-encoding or maintaining multiple bitstreams. In this work, we propose TAFA-GSGC, a scalable learned point cloud geometry codec that enables multi-quality decoding from a single bitstream and a single trained model. TAFA-GSGC combines layered residual refinement with channel-group entropy coding, and introduces a Target-Aligned Feature Aggregation module to reduce cross-layer redundancy in enhancement residuals. Our framework supports up to 9 decodable quality levels with monotonic quality improvement as more subbitstreams are received, while maintaining strong compression efficiency. Compared with the PCGCv2 baseline, TAFA-GSGC demonstrates improved RD performance, achieving average BD-rate reductions of 4.99% and 5.92% in terms of D1-PSNR and D2-PSNR, respectively.

2604.26070 2026-05-14 cs.LG math.OC math.ST q-bio.QM stat.TH

Observable Neural ODEs for Identifiable Causal Forecasting in Continuous Time

Jennifer Wendland, Nicolas Freitag, Maik Kschischo

AI总结 该论文研究了连续时间因果推理中的可识别性问题,针对存在隐藏混杂因素的动态决策场景,提出了可观测神经ODE(ObsNODE)模型。通过将控制理论中的可观测性概念与因果可识别性联系起来,论文推导出一种连续时间调整公式,并设计了能够从观测数据中重构潜在状态的神经ODE模型,从而实现对不同干预路径下结果的预测。实验表明,该方法在合成癌症数据、基于MIMIC-IV的半合成数据和真实脓毒症数据上均表现出优越的性能。

Comments 20 pages, 5 figures

详情
英文摘要

Causal inference in continuous-time sequential decision problems is challenged by hidden confounders. We show that, in latent state-space models with time-varying interventions, observability of the latent dynamics from observed data is necessary for identifying dynamic treatment effects, linking control-theoretic observability to causal identifiability, even when hidden confounders affect both treatments and outcomes. We derive a continuous-time adjustment formula expressing potential outcome distributions under treatment trajectories via the measurement model, latent dynamics, and the filtering distribution over latent states given observed histories. We propose Observable Neural ODEs (ObsNODEs), Neural ODE models in observable normal form for causal forecasting. ObsNODEs learn continuous-time dynamics with states reconstructible from observations, enabling outcome prediction under alternative treatment paths. Experiments on synthetic cancer data, semi-synthetic data based on MIMIC-IV, and real-world sepsis data show strong performance over recent sequence models.

2604.25774 2026-05-14 cs.CL cs.AI

CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation

Wei-Chun Chen, Yu-Xuan Chen, I-Fang Chung, Ying-Jia Lin

AI总结 本文研究了如何从非结构化的菜谱文本中准确估计营养成分这一挑战性问题,比较了基于传统方法和大语言模型(LLM)的多种技术。研究发现,传统方法如TF-IDF在推理速度上有优势,但效果有限;而基于LLM的少样本推理和混合方法在营养估计准确性上表现最佳,主要得益于其对模糊术语和非标准单位的处理能力。然而,这类方法也带来了更高的计算延迟,突显了实时性与精度之间的实际部署权衡。

Comments Accepted by the Third Workshop on Patient-oriented Language Processing (CL4Health) at LREC 2026

详情
Journal ref
http://lrec-conf.org/proceedings/lrec2026/workshops/cl4health/2026.cl4health-1.0.pdf
英文摘要

Accurate nutrient estimation from unstructured recipe text is an important yet challenging problem in dietary monitoring, due to ambiguous ingredient terminology and highly variable quantity expressions. We systematically evaluate models spanning a wide range of representational capacity, from lexical matching methods (TF-IDF with Ridge Regression), to deep semantic encoders (DeBERTa-v3), to generative reasoning with large language models (LLMs). Under the strict tolerance criteria defined by EU Regulation 1169/2011, our empirical results reveal a clear trade-off between predictive accuracy and computational efficiency. The TF-IDF baseline achieves moderate nutrient estimation performance with near-instantaneous inference, whereas the DeBERTa-v3 encoder performs poorly under task-specific data scarcity. In contrast, few-shot LLM inference (e.g., Gemini 2.5 Flash) and a hybrid LLM refinement pipeline (TF-IDF combined with Gemini 2.5 Flash) deliver the highest validation accuracy across all nutrient categories. These improvements likely arise from the ability of LLMs to leverage pre-trained world knowledge to resolve ambiguous terminology and normalize non-standard units, which remain difficult for purely lexical approaches. However, these gains come at the cost of substantially higher inference latency, highlighting a practical deployment trade-off between real-time efficiency and nutritional precision in dietary monitoring systems.

2604.23018 2026-05-14 cs.CV cs.AI cs.LG

AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

Mohammad Sadegh Salehi, Alex Perkins, Igor Maurell, Ashkan Dabbagh, Raymond Wong

AI总结 该研究提出了一个名为 AmaraSpatial-10K 的三维数据集,旨在解决现有大规模三维资产在空间计算和具身人工智能应用中的部署难题。该数据集包含超过 10,000 个经过优化的合成三维资产,每个资产都具备精确的度量尺度、确定的锚点、分离的物理材质贴图以及多句文本元数据,便于直接使用。研究还引入了一套可复用的评估体系,显著提升了三维资产在图像检索、物理模拟和跨模态对齐等方面的性能。

详情
英文摘要

Web-scale 3D asset collections are abundant but rarely deployment-ready, suffering from arbitrary metric scaling, incorrect pivots, brittle geometry, and incomplete textures, defects that limit their use in embodied AI, robotics, and spatial computing. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets optimised for zero-shot deployment. Each asset ships as a metric-scaled, deterministically anchored .glb with separated PBR maps, a convex collision hull, a paired reference image, and multi-sentence text metadata. Alongside the dataset we introduce a reusable evaluation suite for 3D asset banks, a continuous Scale Plausibility Score (SPS), an LLM Concept Density metric, anchor-error auditing, and a cross-modal CLIP coherence protocol, and apply it to AmaraSpatial-10K alongside matched subsets of Objaverse, HSSD, ABO, and GSO. AmaraSpatial-10K improves CLIP Recall@5 by $3.4\times$ over Objaverse ($0.612$ vs. $0.181$, median rank $267 \rightarrow 3$), achieves a $99.1\%$ physics-stability rate under Habitat-Sim with $\sim 20\times$ wall-time speed-up, and produces zero-overlap scenes when used as a drop-in asset bank for Holodeck. Controlled ablations on the same asset bank attribute the retrieval gain to description richness.

2604.22686 2026-05-14 cs.CV

SS3D: End2End Self-Supervised 3D from Web Videos

Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera

AI总结 本文提出 SS3D,一种基于 SfM 的大规模自监督预训练方法,用于从单目视频中进行端到端的三维估计。该方法在一个前向传播过程中联合预测深度、相机运动和内参,并通过统一的单检查点评估协议进行训练和评估。为了解决网络视频中多视角可观测性弱和数据异构性强的问题,作者引入了多视角信号代理(MVS)用于过滤和课程采样,并通过专家训练蒸馏到单一学生模型中,显著提升了模型性能。

详情
英文摘要

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.

2604.21496 2026-05-14 cs.AI cs.CL cs.CY

How English Print Media Frames Human-Elephant Conflicts in India

Bonala Sai Punith, Salveru Jayati, Garima Shakya, Shubham Kumar Nigam

AI总结 本文研究了印度英语印刷媒体如何报道人象冲突(HEC),通过分析2022年1月至2025年9月期间1968篇新闻文章中的28986个句子,揭示了媒体在报道中普遍使用恐惧和攻击性语言,可能加剧公众对大象的敌意,影响人与野生动物的共存努力。研究采用结合长上下文变换器、大语言模型和领域特定词典的多模型情感分析框架,量化情感倾向、提取关键语句并识别语言模式,为负责任的野生动物报道提供了可扩展的方法支持。

详情
英文摘要

Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.

2604.21360 2026-05-14 cs.CV

Prototype-Based Test-Time Adaptation of Vision-Language Models

Zhaohong Huang, Yuxin Zhang, Wenjing Liu, Fei Chao, Rongrong Ji

AI总结 本文提出了一种基于原型的测试时适配(PTA)方法,用于提升视觉-语言模型在测试阶段的性能。该方法通过构建类特定的知识原型来累积测试样本的信息,并根据每个样本的零样本分类置信度对原型进行自适应加权,从而提升模型对新数据的适应能力。与基于缓存的适配方法相比,PTA无需维护和检索缓存,显著提高了推理效率,同时在多个图像识别和点云分析基准测试中取得了优于现有方法的性能。

详情
英文摘要

Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.