arXivDaily: daily arXiv digest, updated Monday through Friday
2605.07630 2026-05-11 cs.CL cs.AI cs.LG

Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

Zhengyang Tang, Yi Zhang, Chenxin Li, Xin Lai, Pengyuan Lyu, Yiduo Guo, Weinong Wang, Junyi Li, Yang Ding, Huawen Shen, Zhengyao Fang, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Chengquan Zhang, Han Hu

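AI summary: This paper asks whether a phone-use agent that avoids harm is actually safe or simply unable to act. Because existing evaluations cannot separate the two cases, the authors build PhoneSafety, a benchmark of 700 real safety-critical moments drawn from more than 130 apps. Evaluating eight representative agents, they find that stronger general ability does not reliably imply safer behavior, and that failures to take any useful action reflect a capability gap rather than a safety property, which points to a new direction for evaluating the safety of phone-use agents.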

Comments work in progress

Abstract

When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
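The three-way outcome described in the abstract can be sketched as a small tally. This is an illustrative sketch only: the label names and the sample records are hypothetical and do not reflect the benchmark's actual schema.

```python
from collections import Counter

# Hypothetical labels for the three outcomes named in the abstract.
OUTCOMES = ("safe", "unsafe", "no_useful_action")

def outcome_rates(labels):
    """Return the share of each outcome among the judged instances."""
    counts = Counter(labels)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    if total == 0:
        return {o: 0.0 for o in OUTCOMES}
    return {o: counts[o] / total for o in OUTCOMES}

def harmless_rate(labels):
    """Share of instances without a harmful outcome: safe choices plus failures to act."""
    rates = outcome_rates(labels)
    return rates["safe"] + rates["no_useful_action"]

# Hypothetical judgments for four risky moments.
labels = ["safe", "unsafe", "no_useful_action", "no_useful_action"]
rates = outcome_rates(labels)
```

In this made-up sample the harmless rate is three times the safe rate, which is the gap the abstract warns about: counting only final harm would credit the agent for moments where it did nothing useful.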

2605.07622 2026-05-11 cs.CL

Is She Even Relevant? When BERT Ignores Explicit Gender Cues

Jonas Klein, Chiara Manna, Eva Vanmassenhove

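AI summary: This study examines how and when a BERT model picks up gender information in Dutch, a language that combines overt morphological gender marking with generic forms. By analyzing contextual embeddings throughout training, the authors construct dynamic gender subspaces and find that, although gender becomes linearly separable at around epoch 20, the model struggles to update its internal gender representation when short sentence templates contain explicit gender cues, and it keeps defaulting to a male reading. The result challenges the common assumption about contextualization: the representations are not dynamic enough along the gender direction to reflect anti-stereotypical cues.

Abstract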

Gender bias in large language models has primarily been investigated for English, while languages with grammatical or morphological gender remain comparatively understudied. This paper investigates how and when gender information emerges in a Dutch BERT model trained from scratch, offering one of the first checkpoint-level analyses of bias formation in a Transformer architecture for a language combining overt morphological gender marking and generic forms. By extracting contextual embeddings throughout training, we construct dynamic gender subspaces using linear SVMs to trace when gender becomes linearly encoded and how this encoding evolves over time. Contextual embeddings are often assumed to integrate contextual cues robustly, allowing models to adjust the representation of a word depending on its more local usage. We therefore test whether explicit gender cues in controlled sentence templates (e.g., Zij is een loodgieter ('She is a plumber')) can override learned statistical associations (plumber -> male). Our findings challenge this assumption: although gender becomes clearly linearly separable around epoch 20 and is distributed across multiple embedding dimensions, the model struggles to update its internal gender representation in light of explicit contextual cues in short sentence templates. Stereotypical gender-profession pairings are predicted far more accurately than anti-stereotypical ones, and generic forms in Dutch systematically default to a male interpretation, even when the context explicitly denotes a female referent. Together, our results seem to indicate that contextualization in the representations learned by our Dutch BERT model is not sufficiently dynamic along the probed gender direction: explicit gender cues in anti-stereotypical contexts are not reliably reflected in the resulting representations, resulting in persistent male-default behaviour.

2605.07613 2026-05-11 cs.CL

Intent-Driven Semantic ID Generation for Grounded Conversational News Recommendation

Hongyang Su, Beibei Kong, Lei Cheng, Chengxiang Zhuo, Zang Li, Chenyun Yu

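AI summary: This work addresses how conversational news recommendation can combine implicit user intent with a constantly changing news corpus, and proposes intent-driven Semantic ID (SID) generation. Under a Generate-then-Match paradigm, the model maps user intent to hierarchical SID prefixes and fuzzy-matches them against the news pool, so that every recommendation is grounded in an existing article. Experiments on a mainstream Chinese news platform report 0% hallucination and competitive match rates, with the clearest advantage for cold-start users, where existing baselines score 0%.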

Comments Accepted at ACL 2026 Industry Track (Oral)

Abstract

Conversational news recommendation requires grounding each suggestion in a rapidly evolving article corpus while addressing implicit user intents that lack explicit retrievable keywords. To characterize this scenario, we identify 6 intent types from production dialogues: five are implicit and pose fundamental challenges to standard RAG pipelines, forming a critical retrieve-first bottleneck. To address these issues, we introduce intent-driven Semantic ID (SID) generation under a Generate-then-Match paradigm. With two-stage training that consists of multi-task SID alignment and GPT-4 Chain-of-Thought distillation, an LLM maps diverse intents to hierarchical SID prefixes, which are then fuzzy-matched to the current news pool to guarantee fully grounded recommendations. Profile-Aware Dual-Signal Reasoning (PADR) further enables cold-start users to obtain valid recommendations using only profiles. On a mainstream Chinese news platform, our 7B model achieves 0% hallucination and 12.4% L1 match in the 152K open-generation SID space (4x random baseline). It matches GPT-4+Hybrid RAG on L1 while surpassing it on finer-grained metrics (L2 2x, Category +1.2pp) at ~100x lower cost. Cold-start users, where existing baselines score 0%, achieve 18.0% L1 (6x random), the highest among all user groups.
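The Generate-then-Match step can be pictured as matching a generated hierarchical ID against the IDs of articles that currently exist. The sketch below is a minimal illustration under stated assumptions: SIDs are modeled as tuples of integer codes, "fuzzy matching" is approximated by the longest shared prefix, and the pool contents are invented. The abstract does not specify the actual matching rule.

```python
def common_prefix_len(a, b):
    """Number of leading levels on which two SIDs agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match_to_pool(generated_sid, pool):
    """Return (article_ids, depth) for the deepest prefix match in the pool.

    pool maps article_id -> SID tuple. Only articles in the pool can be
    returned, so a result is always grounded in an existing item.
    """
    best_depth = 0
    best_ids = []
    for article_id, sid in pool.items():
        depth = common_prefix_len(generated_sid, sid)
        if depth > best_depth:
            best_depth, best_ids = depth, [article_id]
        elif depth == best_depth and depth > 0:
            best_ids.append(article_id)
    return best_ids, best_depth

# Hypothetical news pool with three-level SIDs.
pool = {
    "a1": (3, 17, 5),
    "a2": (3, 17, 9),
    "a3": (3, 2, 1),
    "a4": (8, 1, 1),
}
hits = match_to_pool((3, 17, 6), pool)
misses = match_to_pool((9, 9, 9), pool)
```

Here the generated ID agrees with two articles on its first two levels, so both are returned at depth 2; an ID that shares no first level with the pool returns nothing rather than an invented article.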

2605.07606 2026-05-11 cs.CL cs.AI

Nürnberg NLP at PsyDefDetect: Multi-Axis Voter Ensembles for Psychological Defence Mechanism Classification

Philipp Steigerwald, Eric Rudolph, Jens Albrecht

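AI summary: This work tackles psychological defence mechanism classification, where categories are hard to separate because they share surface language. It proposes a multi-axis voter ensemble spanning three orthogonal axes (class granularity, training method, and base model) to make classification more robust. The system reaches an F1 score of 0.420 on the hidden test set, ranking first among 21 registered teams.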

Comments Accepted at the BioNLP 2026 PsyDefDetect Shared Task @ ACL 2026 (1st place, 21 registered teams)

Abstract

Detecting levels of psychological defence mechanisms in supportive conversations is inherently ambiguous. In the PsyDefDetect shared task at BioNLP 2026, the eight positive defence categories share surface language and differ only in pragmatic function, and trained raters reach only moderate inter-annotator agreement. On such a task, the decisive lever is not a stronger single model but error independence, since any single representation will waver on the overlapping defence boundaries. We translate this insight into a 9-voter ensemble spanning three orthogonal axes: class granularity (all nine classes for the gatekeeper, only the eight defence classes for the specialists), training method (generative and discriminative), and base model. The system reaches $F1_{\text{test}} = 0.420$ on the hidden test set, placing first among 21 registered teams.

2605.07605 2026-05-11 cs.RO

BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly

Jichuan Yu, Bowei Li, Zhenran Tang, Guanxing Lu, Chuxiong Hu, Ruixuan Liu, Changliu Liu

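AI summary: This paper presents BrickCraft, a framework for autonomous long-horizon assembly of interlocking bricks. It uses a relative formulation to decompose complex tasks into reusable primitive skills, and uses situated manuals that project assembly intent onto real-time observations to connect high-level assembly plans with physical execution. Experiments show that BrickCraft learns proficient assembly skills from a limited set of demonstrations and generalizes well to unseen structures.

Abstract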

Autonomous robotic assembly of interlocking bricks demands seamless integration of long-horizon task reasoning, spatial grounding, and fine-grained manipulation. This paper presents BrickCraft, a compositional framework designed for long-horizon and generalizable interlocking brick assembly. BrickCraft models the assembly process using a relative formulation, where each step is anchored to a reference brick within the partial structure, thereby decomposing complex tasks into a finite set of reusable primitive skills. BrickCraft bridges the gap between high-level assembly plans and physical execution through situated manuals, which provide explicit spatial guidance for learned visuomotor skills by projecting the assembly intent onto real-time robot observations. Finally, BrickCraft employs a compositional execution pipeline that chains these spatially grounded skills to accomplish long-horizon assembly tasks. Extensive experimental validations demonstrate that BrickCraft acquires proficient assembly skills from a limited set of demonstrations and exhibits strong compositional generalization to unseen structures. The project website is available at https://intelligent-control-lab.github.io/BrickCraft.

2605.07604 2026-05-11 cs.CV cs.AI

SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild

Xuyi Hu, Jin Lyu, Jiuming Liu, Yebin Liu, Silvia Zuffi, Liang An, Stefan Goetz

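AI summary: This paper presents SAM 3D Animal, the first promptable framework for multi-animal 3D reconstruction, which reconstructs multiple animals from a single in-the-wild image. Built on the SMAL+ parametric animal model, it accepts keypoint and mask prompts to handle occluded and crowded scenes. The authors also introduce Herd3D, a dataset of more than 5,000 images, to improve generalization. Experiments on several datasets show that the method outperforms existing model-based and model-free approaches.

Abstract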

3D animal reconstruction in the wild remains challenging due to large species variation, frequent occlusions, and the prevalence of multi-animal scenes, while existing methods predominantly focus on single-animal settings. We present SAM 3D Animal, the first promptable framework for multi-animal 3D reconstruction from a single image. Built on the SMAL+ parametric animal model, our method jointly reconstructs multiple instances and supports flexible prompts in the form of keypoints and masks which enable more reliable disambiguation in crowded and occluded scenes. To train such a model, we further introduce Herd3D, a multi-animal 3D dataset containing over 5K images, designed to increase diversity in species, interactions, and occlusion patterns. Experiments on the Animal3D, APTv2, and Animal Kingdom datasets show that our framework achieves state-of-the-art results over both existing model-based and model-free methods, demonstrating a scalable and effective solution for prompt-driven animal 3D reconstruction in the wild.

2605.07600 2026-05-11 cs.LG cs.AI cs.CL

Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators

Tsuyoshi Okita

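AI summary: This work proposes CIKA, a causal-intervention framework that uses a large language model (LLM) as a simulator of concept mastery to identify which mathematical concepts causally contribute to solving a problem. By setting a concept's state to "mastered" and measuring the change in correctness, it defines the Interventional Capability Probe (ICP), which distinguishes whether the model can use a concept from whether it merely possesses the relevant knowledge. Experiments on several mathematical benchmarks support the probe's ability to predict problem-solving success and the effectiveness of knowledge activation.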

Comments 17 pages, 0 figures

Abstract

Recent methods for improving LLM mathematical reasoning, whether through MCTS-based test-time search or causal graph-guided knowledge injection, cannot identify which concepts causally contribute to a correct answer, as the observed association may be spurious, driven by confounders such as problem difficulty. We propose CIKA (Causal Intervention for Knowledge Activation), a framework that uses the LLM itself as an interventional simulator: a prompt sets the concept state to "mastered" and the correctness change estimates the causal effect. We formalize this quantity as an Interventional Capability Probe (ICP), which diagnoses whether the LLM can use a given concept -- distinct from merely possessing knowledge. Because the intervention exogenously sets the concept state independently of problem difficulty, ICP separates out confounding that observational methods cannot. On 67 screened problems, the ICP of the top-ranked concept (+0.219) is significantly larger than that of the negative control (+0.039; paired $t$-test, $p < 10^{-6}$, Cohen's $d = 0.86$), confirming that the probe discriminates causally relevant concepts from irrelevant ones. Analysis of 601 Omni-MATH problems further shows that solved problems have 6.1$\times$ higher ATE than unsolved ones (0.338 vs. 0.055), confirming that ICP is predictive of problem-solving success. With a 7B-parameter LLM whose weights are entirely frozen, CIKA achieves 69.7% on the contamination-free Omni-MATH-Rule benchmark and 64.0% overall, compared to 60.5% for o1-mini, and 97.2% on GSM8K, 46--50% on AIME 2024--2026, and 46.2% on MathArena. The Causal Knowledge Activation component contributes 33.8% of correct answers on problems where the base model alone fails, demonstrating that the LLM already possessed but had not activated the requisite knowledge.
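The probe described above amounts to a difference in correctness with and without the "mastered" prompt. The following is a minimal sketch of that arithmetic, assuming per-problem correctness rates are already available. The function names and the sample numbers are hypothetical, and the paired effect size shown is one common definition (mean difference over the standard deviation of the differences), not necessarily the one used in the paper.

```python
from statistics import mean, stdev

def icp(correct_with_intervention, correct_baseline):
    """Mean correctness change when the prompt sets the concept to 'mastered'."""
    return mean(correct_with_intervention) - mean(correct_baseline)

def paired_cohens_d(a, b):
    """Paired effect size: mean of per-problem differences over their sample std."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)

# Hypothetical per-problem correctness rates over repeated samples.
baseline = [0.2, 0.4, 0.5, 0.1]
intervened = [0.5, 0.6, 0.7, 0.4]

probe = icp(intervened, baseline)
effect = paired_cohens_d(intervened, baseline)
```

Running the same computation with an irrelevant concept in the prompt would give the negative-control value that the abstract compares against.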

2605.07593 2026-05-11 cs.CV

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

Hengyi Feng, Hao Liang, Mingrui Chen, Bohan Zeng, Meiyi Qiang, Zhengyang Zhao, Zimo Meng, Zeang Sheng, Wentao Zhang

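AI summary: TraceAV-Bench is the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and robustness to multimodal hallucination, addressing the inability of existing datasets to test sparse, cross-modal evidence chains spread over long time spans. It contains 578 long videos and 2,200 rigorously validated multiple-choice questions, each grounded in an explicit reasoning chain that averages 3.68 hops over roughly 15 minutes. Experiments show that current OmniLLMs achieve limited performance on the benchmark, exposing major challenges in understanding long audio-visual content.

Abstract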

Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams, whereas existing benchmarks largely fail to evaluate this capability. They restrict videos to short clips, isolate modalities, or reduce questions to one-hop perception. We introduce TraceAV-Bench, the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. TraceAV-Bench comprises 2,200 rigorously validated multiple-choice questions over 578 long videos, totaling 339.5 hours, spanning 4 evaluation dimensions and 15 sub-tasks. Each question is grounded in an explicit reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. The dataset is built by a three-step semi-automated pipeline followed by a strict quality assurance process. Evaluation of multiple representative OmniLLMs on TraceAV-Bench reveals that the benchmark poses a persistent challenge across all models, with the strongest closed-source model (Gemini 3.1 Pro) reaching only 68.29% on general tasks, and the best open-source model (Ming-Flash-Omni-2.0) reaching 51.70%, leaving substantial headroom. Moreover, we find that robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance. We anticipate that TraceAV-Bench will stimulate further research toward OmniLLMs that can reason coherently and faithfully over long-form audio-visual content.

2605.07588 2026-05-11 cs.LG cs.AI stat.ML

Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

Jin Xu, Camille Couturier, Victor Rühle, Saravan Rajmohan, James Hensman

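AI summary: This paper introduces Causal Energy Minimization (CEM), a framework for revisiting how Transformer layers are parameterized. By treating Transformer layers as optimization steps on conditional energy functions, CEM explains the parameterization of modules such as multi-head attention and gated MLPs from an energy perspective, and identifies a design space that includes weight sharing, low-rank interactions, and recursive updates. Experiments show that CEM-derived layers train stably despite their constrained parameterizations and can match standard Transformer baselines, offering a new lens for understanding and improving Transformer architectures.

Abstract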

Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Transformer layers as optimization steps on conditional energy functions while explicitly accounting for layer parameterization. Extending prior energy-based interpretations of attention, CEM shows that weight-tied MHA can be derived as a gradient update on an interaction energy, and that a gated MLP with shared up/down projections can be viewed through an element-wise energy. This perspective identifies a design space for Transformer layers that includes within-layer weight sharing, diagonal-plus-low-rank interactions, lightweight preconditioners, and recursive updates. We evaluate CEM-derived layers in language-modeling experiments at the moderate hundred-million-parameter scale. Despite their constrained parameterizations, these layers train stably and can match corresponding Transformer baselines. Overall, our results suggest that CEM provides a useful lens for understanding Transformer layer parameterization, connecting Transformer architectures to energy-based models and motivating further exploration of energy-guided layer designs.

2605.07584 2026-05-11 cs.AI

Parallel Lifted Planning via Semi-Naive Datalog Evaluation

Dominik Drexler, Oliver Joergensen, Jendrik Seipp

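AI summary: This paper studies how semi-naive Datalog evaluation can make lifted classical planning more efficient, and proposes an execution model with two levels of parallelism: rule-level and grounding-level. The authors design a grounder based on clique enumeration and extend it to support semi-naive Datalog evaluation. Experiments show that the implementation already outperforms the baselines on a single core and that the gap widens as cores are added; on hard-to-ground tasks it exhibits an average parallel fraction of 92.4% and up to a 6-fold speedup on 8 cores.

Abstract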

Lifted classical planners operate directly on first-order planning tasks to avoid the computationally demanding grounding step. However, lifted planning is typically slower, as planners must repeatedly instantiate ground structures during search. Many core components of lifted classical planning, such as successor generation, axiom evaluation, task grounding, and delete-relaxed heuristics, have previously been studied through the lens of Datalog evaluation. We build upon this line of work and extend it by developing and analyzing an execution model with two levels of parallelism: rule-level parallelism and grounding parallelism. We further specialize this solver for planning-specific workloads with a grounder based on clique enumeration, which we extend to support semi-naive Datalog evaluation. Our experimental evaluation using greedy best-first search with the FF heuristic shows that our implementation already solves more tasks than the baselines on a single core, and the gap widens as additional cores are used. Moreover, on hard-to-ground tasks where on average 97.6% of the total runtime is spent in Datalog execution, the proposed execution model exhibits an average parallel fraction of 92.4%, while achieving up to a 6-fold speedup on 8 cores in practice.
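For readers unfamiliar with semi-naive evaluation, the idea is to join each rule only against facts derived in the previous round (the delta) instead of against all known facts. Below is a minimal sequential sketch on the textbook transitive-closure program. It illustrates the general technique only and is not the paper's parallel execution model or its clique-enumeration grounder.

```python
def transitive_closure_semi_naive(edges):
    """Evaluate the Datalog program
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    joining only the newly derived path facts in each round."""
    # Index edge(y, z) by y so the join is a dictionary lookup.
    succ = {}
    for y, z in edges:
        succ.setdefault(y, set()).add(z)

    path = set(edges)   # all facts derived so far
    delta = set(edges)  # facts that were new in the previous round
    while delta:
        derived = {(x, z) for (x, y) in delta for z in succ.get(y, ())}
        delta = derived - path  # keep only facts that are actually new
        path |= delta
    return path

closure = transitive_closure_semi_naive({(1, 2), (2, 3), (3, 4)})
```

A naive evaluator would re-join every known `path` fact in every round; restricting the join to `delta` avoids rederiving old facts and reaches the same fixpoint.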

2605.07577 2026-05-11 cs.LG

Bilevel Graph Structure Learning, Revisited: Inner-Channel Origins of the Reported Gain

Minkyoung Kim, Beakcheol Jang

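AI summary: This paper re-examines where the performance gain of bilevel graph structure learning comes from and finds that a substantial share of it stems from training-dynamics effects in the inner loop rather than from the graph rewiring itself. The authors introduce frozen-ϕ, a control that decomposes the bilevel gain into an inner training channel with implicit gradient regularization and a graph-rewiring channel. On spatio-temporal flow forecasting the inner channel accounts for 78-101% of the gain, and on node classification for 37-44%. The paper also proposes frozen-ϕ as a standardized diagnostic, with graph distillation as a complement, giving bilevel graph structure learning a new framework for analysis and evaluation.

Abstract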

Bilevel graph structure learning is widely understood to improve graph neural networks by jointly optimizing model parameters and a learned graph structure, with the resulting performance gain attributed to the rewired adjacency. We find that this attribution may be overstated: training-dynamics effects in the inner loop, rather than the rewiring itself, capture a substantial share of the gain. To establish this, we introduce frozen-$ϕ$, a control that freezes the graph while retaining the inner-loop training schedule. This decomposes the bilevel gain into an inner channel of $T$-step training dynamics with implicit gradient regularization and a graph channel of the graph rewiring itself. On spatio-temporal flow forecasting the inner channel matches or exceeds the full bilevel pipeline, accounting for 78-101% of the gain; on node classification it accounts for 37-44% under a Bernoulli edge-level parameterization. We also verify that classical spectral diagnostics can dissociate from task gain. We propose frozen-$ϕ$ as a standardized diagnostic for bilevel graph structure learning, with graph distillation as a method-agnostic complement. A three-precondition framework further predicts the sign of the bilevel gain on all six benchmarks.
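The decomposition can be read as simple arithmetic over three runs: a baseline, the frozen-graph control, and the full bilevel pipeline. The sketch below assumes an additive split on a higher-is-better score. The function name and the numbers are hypothetical, and the paper's exact estimator is not given in the abstract.

```python
def decompose_gain(baseline, frozen_phi, bilevel):
    """Split the bilevel gain into inner-channel and graph-channel shares.

    All three arguments are scores where higher is better. The inner share
    is the part of the gain already obtained with the graph frozen.
    """
    total = bilevel - baseline
    if total == 0:
        raise ValueError("no bilevel gain to decompose")
    inner_share = (frozen_phi - baseline) / total
    graph_share = (bilevel - frozen_phi) / total
    return inner_share, graph_share

# Hypothetical scores: the frozen-graph control recovers most of the gain.
inner, graph = decompose_gain(baseline=80.0, frozen_phi=84.0, bilevel=85.0)
```

Under this reading, an inner share above 100% simply means the frozen-graph control scored higher than the full pipeline, which makes the graph share negative.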

2605.07572 2026-05-11 cs.AI stat.ML

Open-Ended Task Discovery via Bayesian Optimization

Masaki Adachi, Yuta Suzuki, Juliusz Ziomek

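AI summary: This paper proposes Generate-Select-Refine (GSR), an open-ended task discovery framework that alternates between generating tasks and optimizing them, addressing scientific workflows in which the task itself is uncertain. Starting from a user-provided seed task, the method progressively generates and refines new tasks and eventually concentrates evaluations on the best one, incurring only logarithmic regret overhead relative to single-task Bayesian optimization. Experiments show that GSR outperforms existing LLM-based optimizers on new product development, chemical synthesis scaling, algorithm analysis, and patent repurposing.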

Comments 60 pages, 11 figures

Abstract

When applying Bayesian optimization (BO) to scientific workflows, a major yet often overlooked source of uncertainty is the task itself -- namely, what to optimize and how to evaluate it -- which can evolve as evidence accumulates. We introduce Generate-Select-Refine (GSR), an open-ended BO framework that alternates between task generation and task optimization. Starting from a user-provided seed task, GSR generates new tasks in a coarse-to-fine manner while a task-acquisition function schedules optimization. Asymptotically, it concentrates evaluations on the best task, incurring only logarithmic regret overhead relative to single-task BO. We apply GSR to new product development, chemical synthesis scaling, algorithm analysis, and patent repurposing, where it outperforms existing LLM-based optimizers.

2605.07568 2026-05-11 cs.CV cs.CL

Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

Peitao Han, Fei Cheng, Lis K. Pereira, Qianying Liu, Shigeru Kitazawa

AI summary: This paper studies where temporal information is lost in video large language models (Video-LLMs) by tracing it across the vision encoder, projector, and language model. Frame-centric encoders fail to capture temporal features, while video-centric encoders encode strong temporal signals that often collapse once passed through a standard projector. The study shows that projector design is critical to temporal information flow and that a time-preserving projection substantially improves performance; combining a temporally aware encoder with Arrow-of-Time (AoT) supervision yields a Video-LLM that surpasses human performance on the AoT task.

Abstract

The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does the information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck of temporal information flow. We identify projector design as a key factor: Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to such information. Our layer-wise analysis further shows temporal representation dynamics across encoder layers. Guided by these findings, we build a Video-LLM with a temporal-aware video-centric encoder, a time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1% accuracy, and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.

2605.07565 2026-05-11 cs.LG cs.AI stat.ML

Ensemble Distributionally Robust Bayesian Optimisation

Tigran Ramazyan, Denis Derkach

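AI summary: This paper studies zeroth-order optimization under uncertainty in the context distribution and proposes an ensemble-based distributionally robust Bayesian optimization algorithm. Using an ensemble as the surrogate model makes the method more robust to complex, noisy data, and the algorithm handles continuous contexts while remaining computationally tractable. The theoretical analysis gives sublinear regret bounds that improve on the current state of the art, and experiments show empirical behaviour consistent with these guarantees.

Abstract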

We study zeroth-order optimisation under context distributional uncertainty, a setting commonly tackled using Bayesian optimisation (BO). A prevailing strategy to make BO more robust to the complex and noisy nature of data is to employ an ensemble as the surrogate model, thereby mitigating the weaknesses of any single model. In this study, we propose a novel algorithm for Ensemble Distributionally Robust Bayesian Optimisation that remains computationally tractable while managing continuous context. We obtain theoretical sublinear regret bounds, improving current state-of-the-art results. We show that our method's empirical behaviour aligns with its theoretical guarantees.

2605.07562 2026-05-11 cs.CV

Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

Song Zhang, Yanlong Chen, Yilin Li, Yining Chen, Zili Yi, Xiaowei Zhang, Yawei Li

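AI summary: Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image models because the same object looks very different across ground sampling distances (GSDs). This paper proposes ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable and uses CS-HLoRA to adjust the model's computation path dynamically for imagery at different scales. It pairs this with SSE-U, a module that predicts GSD and its uncertainty from visual features, and with the GeoScale-VQA dataset, closing the loop between method and training data and improving performance on remote-sensing benchmarks that span multiple tasks.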

Comments Under review. 30 pages, 16 figures, 7 tables

Abstract

Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model's computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.

2605.07561 2026-05-11 cs.CV

Multimodal Stepwise Clinically-Guided Attention Learning for Pathological Complete Response Prediction in Breast Cancer

Alice Natalina Caragliano, Valerio Guarrasi, Michela Gravina, Carlo Sansone, Paolo Soda

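AI summary: This study proposes a multimodal, stepwise, clinically guided attention learning framework for predicting pathological complete response (pCR) after neoadjuvant therapy in breast cancer. Medically grounded spatial attention and multimodal fusion address class imbalance and poor generalization across clinical settings. The model is trained in stages: it first learns global imaging features, then focuses on tumor regions, and finally integrates clinical variables to refine its decision. This improves sensitivity while keeping specificity competitive, and yields anatomically coherent attention maps that support clinical interpretation.

Abstract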

Pathological complete response (pCR) is a key prognostic factor in breast cancer patients undergoing neoadjuvant therapy, strongly associated with long-term survival and treatment personalization. However, accurate pre-treatment pCR prediction remains challenging due to severe class imbalance and limited generalizability across diverse clinical settings. In this work, we propose a multimodal stepwise clinically-guided attention learning framework for pCR prediction from breast magnetic resonance imaging (MRI), designed to address these limitations through medically grounded spatial guidance and multimodal integration. The approach follows a stepwise training strategy inspired by physician reasoning: the model first learns global discriminative imaging patterns, then attention mechanisms are introduced to constrain the network toward tumor regions, and finally clinical variables are integrated to refine decision-making. This guidance strategy encourages prioritization of task-relevant features, improving identification of responders despite their limited representation in the dataset. Moreover, grounding attention in anatomically consistent tumor regions reduces reliance on dataset-specific patterns, thereby enhancing cross-institutional generalization. The framework is evaluated through external validation across heterogeneous MRI cohorts. Compared to non-guided single-stage baselines, the proposed approach improves sensitivity while maintaining competitive specificity, and produces anatomically coherent attention maps that support interpretation of the model's predictions. These findings highlight the potential of clinically-guided multimodal attention learning for robust and generalizable pCR prediction in breast cancer.

2605.07556 2026-05-11 cs.CV

Dynamic Mode Decomposition along Depth in Vision Transformers

Nishant Suresh Aswani, Saif Eddin Jabari

AI summary: This paper applies Dynamic Mode Decomposition (DMD) along the depth of vision transformers (ViTs) to test whether contiguous ViT blocks can be approximated by an autonomous linear dynamical system. DMD fits a linear operator $K$ from consecutive hidden states, and the authors study the conditions for stable and accurate fitting on four pretrained DINO ViTs. For short depth spans, $K^p$ predicts later states well and recovers intermediate activations, but this local linearity does not carry over to downstream performance.

Abstract

Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately autonomous linear dynamics, admitting a single operator $K$ applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits $K$ from selected, consecutive hidden-state pairs and predicts $p$ steps ahead via $K^p$. On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans ($p \leq 4$), $K^p$ tracks an unconstrained endpoint map to within $0.02$ cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank $\ll d$ with minimal calibration data, and across tokens, the cls token is most amenable to linearization; both properties decay monotonically with depth. Yet this local fidelity does not transfer downstream. At the final hidden state, after propagating through the remaining blocks, an identity baseline becomes competitive.
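The fitting step can be illustrated with a ridge-regularized least-squares estimate of $K$ from state pairs, followed by a $p$-step prediction with $K^p$. This is a minimal NumPy sketch on synthetic two-dimensional data, not the paper's pipeline: the function names, the ridge value, and the ground-truth matrix `A` are assumptions, and rank truncation is omitted.

```python
import numpy as np

def fit_operator(X, Y, ridge=1e-6):
    """Fit K minimizing ||Y - K X||_F^2 + ridge * ||K||_F^2.

    Columns of X are states; columns of Y are the corresponding next states.
    """
    gram = X @ X.T + ridge * np.eye(X.shape[0])
    # Solve K @ gram = Y @ X.T using the symmetry of gram.
    return np.linalg.solve(gram, X @ Y.T).T

def predict_ahead(K, x, p):
    """Predict the state p steps ahead by applying K recurrently."""
    return np.linalg.matrix_power(K, p) @ x

# Synthetic data from a known linear system (stands in for hidden states).
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
X = rng.normal(size=(2, 50))
Y = A @ X

K = fit_operator(X, Y)
x0 = np.array([1.0, -1.0])
three_steps = predict_ahead(K, x0, 3)
```

On real hidden states the pairs would come from consecutive blocks, and the quality of `predict_ahead` for growing `p` is what tests whether a single operator describes the span.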

2605.07554 2026-05-11 cs.LG cs.AI q-bio.BM stat.ML

ProteinJEPA: Latent prediction complements protein language models

Dan Ofer, Dafna Shahaf, Michal Linial

AI summary: This paper asks whether adding latent-space prediction (JEPA) to protein language models improves them, comparing against standard masked language modeling (MLM) under matched wall-clock budgets. Across pretrained and randomly initialized protein sequence encoders, the best design predicts latent targets only at masked positions while retaining the MLM cross-entropy loss (masked-position MLM+JEPA). For pretrained encoders this recipe wins more downstream tasks than it loses against MLM-only continuation, while from-scratch results are mixed and JEPA-only training collapses. Gains appear on tasks including protein stability prediction, enzyme classification, and fold retrieval.

Abstract

Protein language models are trained primarily with masked language modeling (MLM), which predicts amino-acid identities at masked positions. We ask whether latent-space prediction can complement these token-level objectives under a matched wall-clock budget. Across pretrained and random-init protein sequence encoders at 35--150M parameters, we find that the best protein-JEPA design is not all-position latent prediction but a variant: predicting latent targets only at masked positions, and retaining the MLM cross-entropy. We call this recipe masked-position MLM+JEPA. On a 16-task downstream suite (15 frozen linear probes plus SCOPe-40 zero-shot fold retrieval), under matched wall-clock budgets, this recipe wins more tasks than it loses against MLM-only continuation: 10 wins / 3 losses / 3 ties (hereafter W/L/T) on pretrained ESM2-35M and 11/2/3 on ESM2-150M, while results in pretraining from scratch are mixed (6/8/2). Gains are seen for multiple models on 11 of 16 tasks, including stability, β-lactamase fitness, variant effect, intrinsic disorder, remote homology, enzyme classification, and SCOPe-40 fold retrieval. Tasks with more losses than wins are Fluorescence (TAPE) and Peptide-HLA Binding. All-position MLM+JEPA matches MLM-only overall but does not reproduce the masked-position gains. JEPA-only (no MLM) collapses in nearly every experiment. We conclude that JEPA, when combined with MLM, is competitive and can outperform pure MLM in pretraining and continued training, even under matched wall-clock budgets.

2605.07551 2026-05-11 cs.LG

Disagreement-Regularized Importance Sampling for Adversarial Label Corruption

Csongor Horváth, Ida-Maria Sintorn, Prashant Singh

AI总结 本文研究了在对抗性标签污染环境下重要性采样(IS)方法的失效问题,提出了一种基于损失排名分歧的正则化重要性采样方法(DR-IS)。该方法通过引入独立代理集成模型,利用样本间损失排名的不一致性来筛选数据,有效抑制了高范数对抗样本的影响。理论分析表明,DR-IS 在有限样本下具有严格的浓度界,能够保证污染样本与干净样本的分离,并在多个基准数据集上表现出对高范数攻击的鲁棒性。

Details
English abstract

Standard Importance Sampling (IS) collapses under label corruption because high-norm examples, prioritized for variance reduction, are often adversarial outliers. We formalize this misalignment using an $\varepsilon$-contamination model and propose Disagreement-Regularized Importance Sampling (DR-IS), a sub-sampling method based on loss rank-disagreement across an independent proxy ensemble. We prove finite-sample concentration bounds showing that the empirical rank disagreement of bulk corrupted examples is bounded above, and that of boundary-clean examples bounded below, both at rate $O(\sqrt{\log(N/\delta)/K})$ with probability $1-\delta$; when the structural expectation gap $\Delta'$ between the two groups is positive and the boundary-clean set is at least as large as the selected subset, these bounds certify strict separation and control the contamination rate of the selected subset. Empirically, DR-IS remains robust under targeted high-norm attacks that break magnitude-based methods such as the Error $L_2$-norm (EL2N) on benchmark datasets. DR-IS complements training-dynamics approaches like Area Under the Margin ranking (AUM), offering improved robustness in the loss-aligned regime alongside explicit finite-sample concentration certificates and a contamination bound limiting noise leakage from the statistical tail of corrupted points.
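
To make "loss rank-disagreement across a proxy ensemble" concrete, here is a minimal sketch in plain Python. The abstract does not give the exact statistic, so everything below is an assumption: disagreement is taken to be the standard deviation of an example's normalized loss rank across K proxies, and the subset keeps the examples with the highest disagreement (following the abstract's statement that corrupted bulk examples are bounded above and boundary-clean examples bounded below). The function names `ranks`, `rank_disagreement`, and `select_subset` are hypothetical.

```python
def ranks(values):
    """Rank of each value within the list (0 = smallest); ties broken by index."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def rank_disagreement(loss_matrix):
    """loss_matrix[j][i] is the loss of example i under proxy model j.

    Returns, per example, the standard deviation of its normalized rank
    across proxies (an assumed stand-in for the paper's statistic).
    Requires at least two examples.
    """
    k = len(loss_matrix)
    n = len(loss_matrix[0])
    rank_rows = [ranks(row) for row in loss_matrix]
    scores = []
    for i in range(n):
        col = [rank_rows[j][i] / (n - 1) for j in range(k)]
        mean = sum(col) / k
        scores.append((sum((r - mean) ** 2 for r in col) / k) ** 0.5)
    return scores


def select_subset(loss_matrix, m):
    """Indices of the m examples with the highest rank disagreement."""
    scores = rank_disagreement(loss_matrix)
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:m])


# Three proxies, three examples: example 0 is ranked identically by every proxy.
losses = [
    [0.1, 0.5, 0.9],
    [0.2, 0.9, 0.5],
    [0.1, 0.4, 0.8],
]
disagreement = rank_disagreement(losses)
subset = select_subset(losses, 2)
```

In this toy input, example 0 has zero disagreement because every proxy ranks it lowest, so it is the one left out of the selected subset.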

2605.07550 2026-05-11 cs.CV

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

Grzegorz Wilczynski, Mikołaj Zielinski, Bartosz Świrta, Dominik Belter, Przemysław Spurek

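AI summary This paper proposes a new paradigm, geometrically accurate generative reconstruction from non-overlapping views, to remove the dependence of conventional 3D vision systems on view overlap. Targeting settings such as distributed robotics or crowd-sourced data collection, where overlapping views are hard to obtain, the authors introduce the GLADOS framework, which reconstructs geometry without overlap through three stages: generating intermediate views, robust coarse reconstruction, and iterative optimization. The framework is general and modular, so future advances in generation, reconstruction, and inpainting can be integrated.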

Details
English abstract

3D vision systems are fundamentally constrained by their reliance on visual overlap: reconstruction methods require it for geometric alignment, while generative models use it to enforce multi-view consistency. This limitation is particularly acute in real-world scenarios such as distributed swarm robotics or crowd-sourced data collection, where capturing overlapping perspectives, in terms of both spatial and appearance overlap, is often impossible. We introduce Generative Reconstruction from Disjoint Views as a new paradigm, establish a comprehensive dataset, and propose specialized evaluation metrics for zero-overlap scenarios. Our benchmarking demonstrates that existing state-of-the-art methods fail catastrophically on this task, producing disconnected geometries or semantically incoherent reconstructions. To address these limitations, we propose GLADOS, a general, modular framework that operates through three stages: (1) Generative Bridging, where foundation models synthesize intermediate perspectives to connect disjoint inputs; (2) Robust Coarse 3D Reconstruction, which establishes a coarse geometric scaffold via global alignment that absorbs local contradictions from the generative process; and (3) Iterative Context Expansion and Consistency Optimization to fill missing regions and unify the reconstruction. As an architecture-agnostic framework, GLADOS enables seamless integration of future advances in generation, reconstruction, and inpainting. The source code is available at: https://github.com/gwilczynski95/GLADOS.

2605.07549 2026-05-11 cs.CV cs.LG

Probabilistic Object Detection with Conformal Prediction

Christopher Ries, Moussa Kassem Sbeyti, Nicolas Bianco, Nadja Klein

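AI summary This paper studies reliable uncertainty quantification for object detection and proposes an improved approach based on conformal prediction (CP). To handle the structured multi-output nature of object detection, the authors apply CP to bounding-box coordinates with a Bonferroni correction to guarantee box-level coverage. Scaling the prediction intervals with uncertainty estimates from a probabilistic object detector, and conditioning them on the predicted class set, makes the intervals sharper and more useful. Experiments on several autonomous driving datasets show improved interval quality while maintaining coverage.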

Comments Code is available at https://github.com/mos-ks/OD-CP

Details
English abstract

Conformal Prediction (CP) is a distribution-free method for constructing prediction sets with marginal finite-sample coverage guarantees, making it a suitable framework for reliable uncertainty quantification in safety-critical object detection. However, object detection introduces structured multi-output predictions, complicating the application of classical CP theory developed for single outputs. In addition, standard, unscaled CP produces fixed-width prediction intervals across inputs, leading to unnecessary width for low-uncertainty predictions. While scaled CP addresses this by adapting the interval width to an input-dependent uncertainty estimate, prior work has neither systematically compared unscaled and scaled CP for multi-class object detection, nor integrated CP with a complementary uncertainty quantification method in this setting. We fill this gap by: (i) applying CP coordinate-wise to bounding box corners with a Bonferroni correction for box-level guarantees; (ii) scaling the resulting intervals using per-prediction aleatoric uncertainty estimates derived from a probabilistic object detector trained with loss attenuation, evaluated in uncalibrated and two calibrated variants; (iii) extending to a two-step pipeline that constructs prediction sets for the class using RAPS and conditions the conformalized bounding boxes on the predicted class set. Across three autonomous driving datasets (KITTI, BDD, CODA), including a cross-domain setting under distribution shift, scaled CP consistently improves interval sharpness over unscaled CP, achieving up to 19% higher IoU and 39% lower interval scores, without sacrificing coverage. Class-wise calibration further improves coverage for both variants with a negligible effect on sharpness. Together, these improvements yield more actionable uncertainty estimates for real-time, real-world object detection.
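
Step (i) of the abstract, coordinate-wise CP with a Bonferroni correction, can be sketched as follows. This is a minimal illustration of standard unscaled split conformal prediction with absolute-residual scores, not the authors' implementation (see their repository for that). The function names and the toy calibration data are assumptions.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for this alpha.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(scores)[k - 1]


def box_intervals(cal_pred, cal_true, test_pred, alpha=0.1):
    """Per-coordinate intervals for a box (x1, y1, x2, y2).

    Bonferroni: each of the 4 coordinates is calibrated at alpha / 4, so the
    whole box is covered with probability at least 1 - alpha (marginally).
    """
    n_coords = 4
    alpha_per_coord = alpha / n_coords
    intervals = []
    for c in range(n_coords):
        # Nonconformity score: absolute residual on this coordinate.
        scores = [abs(t[c] - p[c]) for p, t in zip(cal_pred, cal_true)]
        q = conformal_quantile(scores, alpha_per_coord)
        intervals.append((test_pred[c] - q, test_pred[c] + q))
    return intervals


# Toy calibration set: 100 boxes whose residuals are 1, 2, ..., 100 pixels.
cal_pred = [(0, 0, 0, 0) for _ in range(100)]
cal_true = [(i, i, i, i) for i in range(1, 101)]
intervals = box_intervals(cal_pred, cal_true, (10, 20, 30, 40), alpha=0.2)
```

With `alpha=0.2`, each coordinate is calibrated at level 0.05, so the quantile is the 96th smallest of the 100 residuals. The interval width is the same for every input, which is the fixed-width behavior the abstract attributes to unscaled CP. A scaled variant would divide each residual by a per-prediction uncertainty estimate before taking the quantile.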

2605.07546 2026-05-11 cs.LG

On the Invariance and Generality of Neural Scaling Laws

Xing Han, Ziyin Liu, Suchi Saria, Paul Pu Liang

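AI summary This paper studies the invariance and generality of neural scaling laws, asking how such laws can be transferred across domains to save compute. The authors characterize how data transformations affect scaling laws: the laws are invariant under information-preserving transformations and change predictably under transformations that lower information resolution. The theory is validated on language, vision, and speech tasks, with applications to cross-domain prediction and to analyzing the effect of noise.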

Comments 23 pages, 6 figures, 11 tables

Details
English abstract

Neural scaling laws establish a predictable relationship between model performance and data or compute, offering crucial guidance for resource allocation in new domains and tasks. Yet such laws are most needed precisely where they are hardest to obtain: fitting one for a new model-task pair demands expensive sweeps that typically exhaust the very compute budget the law is meant to economize. This paper poses the research question of how to develop generalizable scaling laws: laws fit once on a well-resourced source domain and reliably transported to new domains where running a full sweep is infeasible, which requires a fundamental understanding of when and why scaling properties change. We address this by identifying the right invariants: scaling laws are preserved under bijective (information-preserving) transformations of the data and modified in predictable, information-theoretically grounded ways under non-bijective transformations that lower its information resolution $\rho$, a single axis along which a law fit in one domain can be transported to another. We validate this across language, vision, and speech, and demonstrate two cross-domain applications: predicting scaling for language models trained on electronic health records from laws fit on general text, and predicting time-series classification scaling under varying levels of noise injection, recovering the data-scaling exponents to within $3\%$ error.
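
For readers unfamiliar with what "fitting a scaling law" and "data-scaling exponent" mean here, the sketch below fits a pure power law $L = a \cdot D^{-b}$ by least squares in log-log space. It assumes no irreducible-loss term and uses synthetic data, so it only illustrates the fitting step and not the paper's transport method. The function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size ** (-b) by ordinary least squares on logs.

    Returns (a, b), where b is the data-scaling exponent.
    """
    xs = [math.log(d) for d in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic sweep generated from a known law, so the fit can be checked.
sizes = [1e3, 1e4, 1e5, 1e6]
losses = [5.0 * d ** -0.3 for d in sizes]
a, b = fit_power_law(sizes, losses)
```

On this noiseless data the fit recovers the generating parameters (a close to 5.0, b close to 0.3). The abstract's "within 3% error" claim refers to how closely a transported exponent matches one obtained from a fit of this general kind.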

2605.07545 2026-05-11 cs.CV cs.AI

Implicit Preference Alignment for Human Image Animation

Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Kai Yu, Tianxiang Zheng, Qinglin Lu, Zhen Cui

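AI summary This paper studies how Implicit Preference Alignment (IPA) can improve the quality of generated hand motion in human image animation. The method needs no strict preference pairs: it aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviation from the pretrained prior. A hand-aware local optimization mechanism focuses the alignment on hand regions. The approach lowers the cost of constructing preference data, and experiments support its effectiveness.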

Comments Accepted by ICML 2026

Details
English abstract

Human image animation has witnessed significant advancements, yet generating high-fidelity hand motions remains a persistent challenge due to their high degrees of freedom and motion complexity. While reinforcement learning from human feedback, particularly direct preference optimization, offers a potential solution, it necessitates the construction of strict preference pairs. However, curating such pairs for dynamic hand regions is prohibitively expensive and often impractical due to frame-wise inconsistencies. In this paper, we propose Implicit Preference Alignment (IPA), a data-efficient post-training framework that eliminates the need for paired preference data. Theoretically grounded in implicit reward maximization, IPA aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviations from the pretrained prior. Furthermore, we introduce a Hand-Aware Local Optimization mechanism to explicitly steer the alignment process toward hand regions. Experiments demonstrate that our method achieves effective preference optimization to enhance hand generation quality, while significantly lowering the barrier for constructing preference data. Code is released at https://github.com/mdswyz/IPA.

2605.07537 2026-05-11 cs.AI

Multi-Environment POMDPs with Finite-Horizon Objectives

Léonard Brice, Filip Cano, Krishnendu Chatterjee, Thomas A. Henzinger, Stefanie Muroya

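AI summary This paper studies computing the optimal value and policy in multi-environment partially observable Markov decision processes (MEPOMDPs) with finite-horizon objectives. The problem is known to be PSPACE-complete for standard POMDPs, and the authors prove it remains PSPACE-complete in the more general MEPOMDP setting. They also present a practical algorithm and evaluate it on classical benchmarks, where it significantly outperforms the only previously known algorithm.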

Details
English abstract

Partially Observable Markov Decision Processes (POMDPs) are systems in which one agent interacts with a stochastic environment, and receives only partial information about the current state. In a multi-environment POMDP (MEPOMDP), the initial state is unknown, and assumed to be adversarially chosen. In this work we focus on computing the optimal value and policy in MEPOMDPs with finite-horizon objectives. That problem is known to be PSPACE-complete in POMDPs. Our main results are as follows: (1) we establish that it is also PSPACE-complete in the more general setting of MEPOMDPs; (2) we present a practical algorithm and evaluate it on classical benchmarks, significantly outperforming the only previously known algorithm.

2605.07533 2026-05-11 cs.CL

Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation

Shenbin Qian, Yves Scherrer

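AI summary This paper systematically analyzes the failure modes of large language models (LLMs) in translation for low-resource languages, finding that non-English-centric language pairs yield markedly lower translation quality than English-centric pairs. It introduces Token Activation Rate (TAR), a metric of how effectively a model uses language-specific tokens during generation, and shows that lower TAR is strongly associated with poorer translation performance. Reasoning LLMs also tend to generate more tokens when translating into low-TAR languages, although the effect on translation quality varies across models.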

Comments Accepted to the 26th Annual Conference of the European Association for Machine Translation (EAMT2026)

Details
English abstract

Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is strongly associated with poorer translation performance. Furthermore, reasoning LLMs tend to generate more tokens when translating into low-TAR languages, suggesting a compensatory mechanism, although its impact on translation quality varies across models. Overall, our findings emphasize the importance of token-level dynamics in understanding MT performance of LLMs.

2605.07531 2026-05-11 cs.LG math.OC

SGD for Variational Inference: Tackling Unbounded Variance via Preconditioning and Dynamic Batching

Hippolyte Labarrière, Cesare Molinari, Silvia Villa, Lorenzo Rosasco

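AI summary This paper studies the unbounded-variance problem that arises when stochastic gradient descent (SGD) is used for black-box variational inference (BBVI), and addresses it with preconditioning and dynamic batching. For distributions parameterized in the elliptic location-scale family, the authors prove the existence of an ELBO solution, a property usually assumed a priori, and establish convergence guarantees for minibatch projected SGD under the Blum-Gladyshev condition. The analysis shows that combining dynamic batching with preconditioning yields rigorous guarantees even in complex settings.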

详情
英文摘要

Black-Box Variational Inference (BBVI) typically relies on Stochastic Gradient Descent (SGD) to optimize the Evidence Lower Bound (ELBO). However, the stochastic gradients in BBVI inherently exhibit unbounded variance, violating standard assumptions and instead satisfying the weaker Blum-Gladyshev (BG) condition, where variance grows quadratically with distance from the optimum. In this paper, we bridge the gap between stochastic optimization theory and the practical instances of BBVI. Focusing on the broad elliptic location-scale family of parameterized distributions, we offer two main contributions. First, we prove the existence of an ELBO solution, a foundational property usually assumed a priori in the literature. Second, we establish comprehensive convergence guarantees spanning finite-time and asymptotic regimes for Minibatch Projected SGD (PSGD) equipped with dynamic batching and preconditioning under the BG condition. Our theoretical framework demonstrates that dynamic batching combined with preconditioning systematically enables rigorous guarantees even in complex settings. We illustrate our theoretical findings with numerical results, highlighting the efficacy of our approach for modern inference tasks.

2605.07530 2026-05-11 cs.RO cs.SE

Search-based Robustness Testing of Laptop Refurbishing Robotic Software

Erblin Isaku, Hassan Sartaj, Shaukat Ali, Malaika Din Hashmi, Francois Picard

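AI summary This paper studies robustness testing for the software of laptop-refurbishing robots, focusing on failures of object detection models under small input variations. It proposes PROBE, a search-based robustness testing approach that uses the multi-objective optimization algorithm NSGA-II to explore the perturbation space and find minimal, localized perturbations that trigger model failures. Experiments show that PROBE is more effective than random search at generating failure-inducing perturbations, needs smaller perturbation magnitudes, and produces perturbations that transfer across models.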

Comments 15 pages, 4 figures, 5 tables

Details
English abstract

The Danish Technological Institute (DTI) focuses on transferring advanced technologies (including robots) to the industry and the public sector. One key application is laptop refurbishment using specialized robots, aimed at promoting reuse, reducing electronic waste, and supporting the European Circular Economy Action Plan. The software of such robots often includes features that use object detection models to detect objects for various purposes, such as identifying screws for laptop disassembly or detecting stickers to remove them. Ensuring the robustness of such models to small input variations remains a critical challenge, and addressing it is important to avoid potential damage to laptops during refurbishment. In this paper, we propose PROBE, a search-based robustness testing approach that leverages multi-objective optimization to identify minimal, localized perturbations that expose failures in object detection models used in the software of laptop refurbishing robots. PROBE employs NSGA-II to systematically explore the perturbation space, optimizing for failure induction considering both localization and confidence, and perturbation magnitude, while enabling the discovery of diverse failure cases. Results show that PROBE is 3$\times$ to 7$\times$ more effective than random search in generating failure-inducing perturbations, while requiring smaller perturbation magnitudes, and that the generated perturbations transfer across models. We further show that metamorphic relations provide additional insights into model robustness, enabling the assessment of stability even in non-failing cases.
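
The multi-objective search described above trades failure induction against perturbation magnitude. The core notion NSGA-II relies on is Pareto dominance, sketched below for two minimized objectives. The objective pair (detector confidence after perturbation, perturbation magnitude) and the candidate values are assumptions for illustration, and this shows only the non-dominated filter, not the full NSGA-II loop or PROBE itself.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better
    on at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Points not dominated by any other point (the first non-dominated front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates: (detector confidence after perturbation, magnitude).
candidates = [
    (0.9, 0.1),  # tiny perturbation, detector still confident
    (0.5, 0.5),
    (0.2, 0.9),  # large perturbation, detector nearly fails
    (0.6, 0.6),  # worse than (0.5, 0.5) on both objectives
    (0.9, 0.9),  # worse than every other candidate
]
front = pareto_front(candidates)
```

The front keeps the first three candidates, each representing a different trade-off. This is how a multi-objective search can surface diverse failure cases instead of a single optimum.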

2605.07522 2026-05-11 cs.CL

WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation

Zinan Zheng, Yang Liu, Nuo Chen, Juepeng Zheng, Hong Cheng, Jia Li

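AI summary This paper proposes WeatherSyn, an instruction-tuned multimodal large language model for generating weather forecast reports. The authors build the first instruction-tuning dataset for this task, covering 31 cities in the United States and 8 weather aspects, and use it to develop the first model specialized in weather forecast report generation. Experiments show the model outperforms leading closed-source models on multiple metrics, particularly on structurally complex weather aspects, and generalizes well across regions.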

Comments ICML 2026

Details
English abstract

Accurate weather forecast reporting enables individuals and communities to better plan daily activities and agricultural operations. However, the current reporting process primarily relies on manual analysis of multi-source data, which leads to information overload and reduced efficiency. Despite the development of multimodal large language models (MLLMs), leveraging data-driven models to analyze and generate reports in the weather forecasting domain remains largely underexplored. In this work, we propose the Weather Forecasting Report (WFR) task and construct the first instruction-tuning dataset for this task, which covers 31 cities in America and 8 weather aspects. Based on this corpus, we develop the first model, WeatherSyn, specialized in generating weather forecast reports. Evaluation across multiple metrics on our dataset shows that WeatherSyn consistently outperforms leading closed-source MLLMs, particularly on structurally complex weather aspects. We further analyze its performance across diverse geographic regions and weather aspects. WeatherSyn demonstrates strong transferability across different regions, highlighting its zero-shot generalization capability. WeatherSyn offers valuable insight for developing MLLMs specialized in weather report generation.

2605.07520 2026-05-11 cs.AI

Model-Driven Policy Optimization in Differentiable Simulators via Stochastic Exploration

Yuval Aroosh, Ayal Taitler

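AI summary This paper proposes Model-Driven Policy Optimization (MDPO), a framework for policy optimization in differentiable simulators that injects noise into the action space to add stochastic exploration, addressing the ill-conditioned optimization landscapes of highly nonlinear and hybrid discrete-continuous systems. Using the system model, the method adapts the noise magnitude to the gradient-derived sensitivity of the trajectory objective, producing a time-dependent exploration profile that explores the objective landscape more effectively and helps escape poor local optima. Experiments show MDPO outperforms deterministic methods and model-free reinforcement learning baselines on several benchmark domains.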

Details
English abstract

Differentiable planning enables gradient-based optimization of decision-making problems by leveraging differentiable models of system dynamics. However, in highly nonlinear and hybrid discrete-continuous domains, the resulting optimization landscapes are often ill-conditioned, with flat regions and sharp transitions that hinder effective optimization. We propose Model-Driven Policy Optimization (MDPO), a framework that introduces stochastic exploration into differentiable planning by injecting noise into the action space during optimization. Leveraging access to the model, MDPO further adapts the noise magnitude based on gradient-derived sensitivity of the trajectory objective, yielding a time-dependent exploration profile. This enables improved exploration of the objective landscape and helps escape poor local optima via dynamic allocation of exploration across timesteps and iterations. Experiments on benchmark domains demonstrate that MDPO consistently outperforms deterministic differentiable planning, including both the noise-free variant of our method and available state-of-the-art implementations, as well as model-free baselines such as PPO, significantly improving solution quality across challenging nonlinear and hybrid settings. We further analyze the evolution of the adaptive noise magnitude across both time steps and optimization iterations, providing insight into how exploration is allocated during learning.

2605.07378 2026-05-11 cs.LG

Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns

Yameng Peng, Andy Song, Haytham M. Fayek, Vic Ciesielski, Xiaojun Chang

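AI summary This paper proposes SWAP-Score, a zero-shot metric for evaluating neural networks efficiently without training. The metric measures network expressivity through sample-wise activation patterns and correlates strongly with true performance across tasks and architectures, including CNNs and Transformers. Experiments show SWAP-Score outperforms existing zero-shot metrics on multiple benchmarks, can estimate language-model performance at the pre-training stage, and substantially speeds up neural architecture search.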

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence. This article is a journal extension of arXiv:2403.04161

Details
English abstract

Zero-shot proxies, also known as training-free metrics, are widely adopted to reduce the computational overhead in neural network evaluation for scenarios such as Neural Architecture Search (NAS), as they do not require any training. Existing zero-shot metrics have several limitations, including weak correlation with the true performance and poor generalisation across different networks or downstream tasks. For example, most of these metrics apply only to either convolutional neural networks (CNNs) or Transformers, but not both. To address these limitations, we propose Sample-Wise Activation Patterns (SWAP), and its derivative, SWAP-Score, a novel and highly effective zero-shot metric. SWAP-Score is broadly applicable across both architecture families and task domains, demonstrating strong predictive performance in the majority of tasks. This metric measures the expressivity of neural networks over a mini-batch of samples, showing a high correlation with the neural networks' ground-truth performance. For both CNNs and Transformers, the SWAP-Score outperforms existing zero-shot metrics across computer vision and natural language processing tasks. For instance, Spearman's correlation coefficient between the SWAP-Score and CIFAR-10 validation accuracy for DARTS CNNs is 0.93, and 0.71 for FlexiBERT Transformers on GLUE tasks. Moreover, SWAP-Score is label-independent, hence can be applied at the pre-training stage of language models to estimate their performance for downstream tasks. When applied to NAS, the SWAP-empowered search, SWAP-NAS, achieves competitive performance using only approximately 6 and 9 minutes of GPU time on CIFAR-10 and ImageNet, respectively. Our code is available at: https://github.com/pym1024/SWAP_Universal
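
The abstract does not define the score, so the sketch below is one plausible reading, stated as an assumption: binarize each unit's pre-activation (positive or not) for every sample in a mini-batch, treat the resulting vector over samples as that unit's pattern, and count the distinct patterns. The function name `swap_score` and the input layout are hypothetical. The authors' repository is the reference for the actual metric.

```python
def swap_score(pre_activations):
    """Count distinct sample-wise activation patterns.

    pre_activations[s][j] is the pre-activation of unit j for sample s.
    Each unit contributes one binary pattern: its on/off state across all
    samples in the mini-batch. No labels are needed.
    """
    n_units = len(pre_activations[0])
    patterns = {
        tuple(row[j] > 0 for row in pre_activations) for j in range(n_units)
    }
    return len(patterns)


# Two samples, three units: units 0 and 2 share the pattern (on, off).
batch = [
    [1.0, -1.0, 1.0],
    [-1.0, 1.0, -1.0],
]
score = swap_score(batch)
```

Under this reading, a larger count means the units respond to the mini-batch in more distinct ways, which is one way to quantify expressivity without training or labels.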