arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.14403 2026-05-15 cs.CV

DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

Yize Liu, Siyuan Yan, Ming Hu, Lie Ju, Xieji Li, Feilong Tang, Wei Feng, Zongyuan Ge

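AI summary: DermAgent is a self-reflective agentic system for dermatological image analysis that targets the insufficient domain knowledge and hallucinations of existing multimodal large language models in skin disease diagnosis. It integrates seven specialized vision and language modules within a Plan-Execute-Reflect framework to produce traceable diagnostic reasoning, and combines multi-tool reasoning with external evidence retrieval to improve diagnostic accuracy and reliability. Experiments show that DermAgent performs strongly on multiple dermatology benchmarks and significantly outperforms existing state-of-the-art models.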

Comments MICCAI2026 early acceptance

Abstract

Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. To further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at https://github.com/YizeezLiu/DermAgent.
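
As a rough illustration of the deterministic critic described in the abstract above, the sketch below audits a set of tool outputs with confidence, coverage, and conflict gates. The data layout, the thresholds, and the names `ToolOutput` and `audit` are hypothetical; they are not taken from the DermAgent code.

```python
from dataclasses import dataclass


@dataclass
class ToolOutput:
    source: str        # hypothetical tool name, e.g. "classifier"
    diagnosis: str     # label proposed by this tool
    confidence: float  # score in [0, 1]


def audit(outputs, min_confidence=0.6, min_sources=2):
    """Return the list of failed gates; an empty list means the audit passed."""
    failed = []
    # Confidence gate: every source must be reasonably sure of its label.
    if any(o.confidence < min_confidence for o in outputs):
        failed.append("confidence")
    # Coverage gate: enough independent sources must have contributed.
    if len({o.source for o in outputs}) < min_sources:
        failed.append("coverage")
    # Conflict gate: sources must agree on a single diagnosis.
    if len({o.diagnosis for o in outputs}) > 1:
        failed.append("conflict")
    return failed


outputs = [
    ToolOutput("classifier", "melanoma", 0.82),
    ToolOutput("image_retrieval", "nevus", 0.71),
]
print(audit(outputs))  # ['conflict']
```

In a plan-execute-reflect loop, a non-empty result would presumably be what triggers a targeted re-run of the disagreeing tools; how DermAgent actually does this is not specified in the abstract.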

2605.14399 2026-05-15 cs.CV cs.GR

SceneForge: Structured World Supervision from 3D Interventions

Jizhizi Li, Jiayang Ao, Danny Wicks, Petru-Daniel Tudosiu

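AI summary: SceneForge is an intervention-driven framework built on editable 3D world states, designed to generate structured supervision that stays consistent under scene edits, viewpoint changes, and scene-level interventions. It applies explicit interventions (such as object removal or camera variation) and propagates their effects on scene structure and physical properties, producing aligned outputs that include counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections. Experiments show that SceneForge supervision improves object removal and scene removal performance, providing a scalable supervision foundation for intervention-consistent multimodal learning.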

Abstract

Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene removal performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.

2605.14396 2026-05-15 cs.CV cs.CR cs.LG cs.RO

Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

Chenyi Wang, Ruoyu Song, Raymond Muller, Jean-Philippe Monteuuis, Jonathan Petit, Z. Berkay Celik, Ryan Gerdes, Ming F. Li

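AI summary: Autonomous vehicles rely on online high-definition map construction to perceive lane boundaries, dividers, and pedestrian crossings, key road elements that directly affect the safety of motion planning. This paper proposes MIRAGE, a framework that uses conditional diffusion models to systematically discover semantic attacks that bypass adversarial defenses and degrade map predictions through plausible environmental variation such as shadows or wet roads. Experiments show that MIRAGE attacks remain potent under multiple defenses and that the generated scenes are judged realistic 80-84% of the time, far above traditional pixel-level attack methods.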

Abstract

Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings, safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g., shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes that neighbor the ground truth and share its road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism, where MIRAGE passes as realistic 80-84% of the time (vs. 97-99% for clean nuScenes), while AdvPatch passes only 0-9% of the time. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.

2605.14393 2026-05-15 cs.CV

Analogical Trajectory Transfer

Junho Kim, Eun Sun Lee, Gwangtak Bae, Seunggu Kang, Young Min Kim

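AI summary: This paper studies analogical trajectory transfer, which aims to translate a motion trajectory from one 3D environment into another that is semantically similar but laid out differently in space, enabling analogical spatial reasoning in machines. To handle the collisions and geometric distortions caused by differences in object placement, scale, and layout across scenes, the authors propose a method based on scene clustering and hierarchical map prediction that decomposes the problem and combines the sub-solutions into semantically consistent and spatially coherent transfers. The method requires no training, runs fast, and outperforms baselines based on large language models and scene graph matching across several applications.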

Abstract

We study analogical trajectory transfer, where the goal is to translate motion trajectories in one 3D environment to a semantically analogous location in another. Such a capacity would enable machines to perform analogical spatial reasoning, with applications in AR/VR co-presence, content creation, and robotics. However, even semantically similar scenes can still differ substantially in object placement, scale, and layout, so naively matching semantics leads to collisions or geometric distortions. Furthermore, finding where each trajectory point should transfer to has a large search space, as the mapping must preserve semantics and functionality without tearing the trajectory apart or causing collisions. Our key insight is to decompose the problem into spatially segregated subproblems and merge their solutions to produce semantically consistent and spatially coherent transfers. Specifically, we partition scenes into object-centric clusters and estimate cross-scene mappings via hierarchical smooth map prediction, using 3D foundation model features that encode contextual information from object and open-space arrangements. We then combinatorially assemble the per-cluster maps into an initial transfer and refine the result to remove collisions and distortions, yielding a spatially coherent trajectory. Our method does not require training, attains a fast runtime around 0.6 seconds, and outperforms baselines based on LLMs, VLMs, and scene graph matching. We further showcase applications in virtual co-presence, multi-trajectory transfer, camera transfer, and human-to-robot motion transfer, which indicates the broad applicability of our work to AR/VR and robotics.

2605.14392 2026-05-15 cs.AI

Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

Yucheng Shi, Zhenwen Liang, Kishan Panaganti, Dian Yu, Wenhao Yu, Haitao Mi

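AI summary: This work proposes a self-evolving reinforcement learning approach based on verifiable environment synthesis, in which a language model not only generates problems but also builds the environments used to train itself. The core idea is to generate executable environment objects that sample problems, compute reference solutions, and score responses, while ensuring that each environment has a stable solve-verify asymmetry so that the reward signal stays informative. The approach is validated with the EvoEnv framework, which improves benchmark performance, indicating that self-improvement depends on building environments whose difficulty stays beyond the model's own ability rather than on simply producing more synthetic data.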

Comments Tech report, work in progress

Abstract

We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry. That is, the model must be able to write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
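
To make the "reusable executable object that samples instances, computes references, and scores responses" concrete, here is a minimal sketch of a planted subset-sum environment, one of the hard-to-solve, easy-to-verify tasks the abstract mentions. The class name, its interface, and the instance format are assumptions for illustration, not EvoEnv's actual API.

```python
import random


class PlantedSubsetSum:
    """Toy environment: sample an instance, keep a reference, score responses."""

    def __init__(self, n=8, seed=0):
        self.n = n
        self.rng = random.Random(seed)

    def sample(self):
        numbers = [self.rng.randint(1, 100) for _ in range(self.n)]
        # Plant a solution by choosing a non-empty subset and summing it.
        k = self.rng.randint(1, self.n)
        planted = sorted(self.rng.sample(range(self.n), k))
        target = sum(numbers[i] for i in planted)
        return {"numbers": numbers, "target": target, "reference": planted}

    @staticmethod
    def score(instance, indices):
        """Verification is cheap: check index validity and recompute the sum."""
        numbers = instance["numbers"]
        if len(set(indices)) != len(indices):
            return 0.0
        if any(not 0 <= i < len(numbers) for i in indices):
            return 0.0
        hit = sum(numbers[i] for i in indices) == instance["target"]
        return 1.0 if hit else 0.0


env = PlantedSubsetSum(seed=0)
inst = env.sample()
print(env.score(inst, inst["reference"]))  # 1.0
print(env.score(inst, []))                 # 0.0
```

The scorer accepts any subset that hits the target, not only the planted one, so the reference exists to guarantee solvability rather than to define the unique answer.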

2605.14391 2026-05-15 cs.CV

Dual-Latent Collaborative Decoding for Fidelity-Perception Balanced Image Compression

Qi Mao, Zijian Wang, Zhengxue Cheng, Lingyu Zhu, Siwei Ma

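AI summary: This paper studies how to balance reconstruction fidelity and perceptual quality in image compression. Existing methods usually rely on a single latent representation to carry structural details, semantic information, and perceptual priors at once, which creates conflicts between these roles. The authors propose MoDE, a dual-latent collaborative decoding framework that treats the scalar-quantized and vector-quantized latents as a fidelity expert and a perception expert respectively, and coordinates them through Expert-Specific Enhancement and Cross-Expert Modulation modules. Experiments show that the method achieves a better fidelity-perception balance across a wide bitrate range.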

Abstract

Learned image compression (LIC) increasingly requires reconstructions that balance distortion fidelity and perceptual realism across a wide range of bitrates. However, most existing methods still rely on a single compressed latent representation to simultaneously carry structural details, semantic cues, and perceptual priors, requiring the same latent representation to serve multiple, potentially conflicting roles. This tension becomes evident across different latent paradigms: scalar-quantized (SQ) continuous latents provide rate-scalable fidelity but tend to lose perceptual details at low rates, while vector-quantized (VQ) discrete tokens preserve compact semantic cues but suffer from limited structural fidelity and bitrate scalability. To address this issue, we propose Mixture of Decoder Experts (MoDE), a dual-latent collaborative decoding framework that decomposes reconstruction responsibilities across complementary latent paradigms. Specifically, MoDE treats the SQ branch as a fidelity-oriented expert and the VQ branch as a perception-oriented expert, and coordinates them through two decoder-side modules: Expert-Specific Enhancement (ESE), which preserves branch-specific expert references, and Cross-Expert Modulation (CEM), which enables selective complementary transfer during reconstruction. The resulting framework supports selective cross-latent collaboration under a shared dual-stream bitstream and enables both fidelity-anchored and perception-anchored decoding. Extensive experiments demonstrate that MoDE achieves a more favorable fidelity-perception balance than representative distortion-oriented, perception-oriented, generative, and dual-latent baselines across a wide bitrate range, highlighting decoder-side expert collaboration as an effective design for wide-range fidelity-perception balanced LIC.

2605.14389 2026-05-15 cs.AI cs.CL cs.LG

Nexus: An Agentic Framework for Time Series Forecasting

Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Nanyun Peng, Vishy Tirumalashetty, Chun-Liang Li, Rui Zhang, Jinsung Yoon, Tomas Pfister

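AI summary: Time series forecasting involves not only numerical extrapolation but also reasoning over unstructured text such as news and events. To address the insensitivity of existing time series foundation models (TSFMs) to textual signals and the uneven performance of large language models (LLMs) across domains, this paper proposes Nexus, a multi-agent forecasting framework that decomposes prediction into stages such as identifying macro-level and micro-level temporal fluctuations and integrating contextual information, enabling more flexible forecasting. Experiments show that Nexus matches or outperforms existing state-of-the-art models on data from multiple domains while producing high-quality reasoning traces that reveal the drivers behind each forecast, supporting the view that real-world time series forecasting is an agentic reasoning problem that goes beyond pure sequence modeling.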

Comments 30 Pages, 3 figures, 5 Tables

Abstract

Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware of real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond sequence modeling alone.

2605.14380 2026-05-15 cs.CL

Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation

Hoang-Thuy-Duong Vu, Quoc-Cuong Pham, Huy-Hieu Pham

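AI summary: This study addresses the data scarcity and class imbalance that hinder classification of psychological defense mechanisms (PDMs) by combining context-aware synthetic augmentation with a hybrid classification model. By integrating contextual language representations, basic clinical features, and 150 annotated defense items, the method substantially improves classification performance in the PsyDefDetect shared task, reaching an accuracy of 58.26% and a macro-F1 of 24.62%, outperforming existing methods and establishing a strong baseline for psychological defense classification in low-resource settings.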

Abstract

Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context-aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co-Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low-resource settings. Source code is available at: https://github.com/htdgv/CASA-PDC.

2605.14379 2026-05-15 cs.LG cs.AI cs.GT cs.MA

Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games

JB Lanier, Nathan Monette, Pierre Baldi, Roy Fox

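AI summary: In imperfect-information games, sparse rewards and long-horizon exploration make finding approximate equilibria for large-scale competitive games (such as StarCraft and Dota) computationally very challenging. This paper proposes Data-Augmented Game Starts (DAGS), a multi-agent starting-state sampling strategy that uses intermediate states sampled from offline human expert demonstrations as starting points for reinforcement learning, accelerating exploration for policy-gradient methods in two-player zero-sum games. Experiments show that DAGS achieves lower exploitability under fixed compute budgets. The paper also shows that augmenting the starting-state distribution can bias the resulting equilibria and proposes a simple, effective mitigation.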

Comments 17 pages, 4 figures. JB Lanier and Nathan Monette contributed equally

Abstract

Finding approximate equilibria for large-scale imperfect-information competitive games such as StarCraft, Dota, and Counter-Strike remains computationally infeasible due to sparse rewards and challenging exploration over long horizons. In this paper, we propose a multi-agent starting-state sampling strategy designed to substantially accelerate online exploration in regularized policy-gradient game methods for two-player zero-sum (2p0s) games. Motivated by an assumption that offline demonstrations from skilled humans can provide good coverage of high-level strategies relevant to equilibrium play, we propose the initialization of reinforcement learning data collection at intermediate states sampled from offline data to facilitate exploration of strategically relevant subgames. Referring to this method as Data-Augmented Game Starts (DAGS), we perform experiments using synthetic datasets and analytically tractable, long-horizon control variants of two-player Kuhn Poker, Goofspiel, and a counterexample game designed to penalize biased beliefs over hidden information. Under fixed computational budgets, DAGS enables regularized policy gradient methods to achieve lower exploitability in games with significantly more challenging exploration. We show that augmenting starting state distributions when solving imperfect information games can lead to biased equilibria, and we provide a straightforward mitigation to this in the form of multi-task observation flags. Finally, we release a new set of benchmark environments that drastically increase exploration challenges and state counts in existing OpenSpiel games while keeping exploitability measurements analytically tractable.

2605.14374 2026-05-15 cs.LG cs.AI math.OC

Optimal Pattern Detection Tree for Symbolic Rule-Based Classification

Young-Chae Hong, Yangho Chen

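AI summary: This paper proposes the Optimal Pattern Detection Tree (OPDT), a symbolic rule-based classification model built on mixed-integer programming that discovers a single optimal pattern in data for binary classification tasks. To incorporate prior knowledge and compliance requirements, the authors further introduce the Branching Structure Constraints (BSC) framework, which lets decision makers embed domain knowledge directly into the model. By maximizing coverage while minimizing the false positive rate from misclassification, the method discovers hidden patterns with optimality guarantees on moderately sized datasets within reasonable runtime.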

Comments Published in Transactions on Machine Learning Research (TMLR). 26 pages, 4 figures. OpenReview URL: https://openreview.net/forum?id=RJ6eMDcDCv

Journal ref
Transactions on Machine Learning Research (2026)
Abstract

Pattern discovery in data plays a crucial role across diverse domains, including healthcare, risk assessment, and machinery maintenance. In contrast to black-box deep learning models, symbolic rule discovery emerges as a key data mining task, generating human-interpretable rules that offer both transparency and intuitive explainability. This paper introduces the Optimal Pattern Detection Tree (OPDT), a rule-based machine learning model based on novel mixed-integer programming to discover a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, we further introduce the Branching Structure Constraints (BSC) framework, which enables decision makers to encode domain knowledge and constraints directly into the model. This optimization-based approach discovers a hidden underlying pattern in datasets, when it exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. Our computational experiments show that OPDT discovers a pattern with optimality guarantees on moderately sized datasets within reasonable runtime.

2605.14368 2026-05-15 cs.CL cs.AI

Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement

Injin Kong, Hyoungjoon Lee, Yohan Jo

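AI summary: This paper studies how to introduce diffusion effectively into a pretrained language model and proposes DiHAL, a geometry-guided diffusion-transformer hybrid. The method scores each layer's suitability using geometric features, selects an appropriate hidden-state interface, and replaces the lower transformer layers with a diffusion bridge while keeping the upper layers and the LM head. Experiments show that geometry-scored hidden-state recovery outperforms conventional continuous diffusion baselines under a matched training budget, demonstrating the feasibility of diffusion-based replacement inside language models.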

Abstract

Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.

2605.14366 2026-05-15 cs.CL cs.LG

Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

Zeli Su, Ziyin Zhang, Zhou Liu, Xuexian Song, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Rong Fu, Guixian Xu, Wentao Zhang

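AI summary: This study examines how to avoid the "alignment tax" incurred when fine-tuning large language models for low-resource language expansion. The authors propose reinforcement learning with semantic rewards, using Group Relative Policy Optimization (GRPO) to align at the embedding level rather than through conventional likelihood maximization, which improves low-resource language performance while preserving general capabilities. Experiments on Tibetan-Chinese machine translation and Tibetan headline generation show that the method mitigates the alignment tax and yields higher-quality, more transferable generation.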

Comments ACL 2026 Findings

Abstract

Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
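
The idea of optimizing with embedding-level semantic rewards under GRPO can be sketched as follows: each sampled output is rewarded by its cosine similarity to a reference embedding, and rewards are then standardized within the group to give advantages. This is a generic sketch of a common form of the group-relative advantage, not the paper's implementation; the toy embedding vectors stand in for a real sentence encoder and are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy embeddings standing in for a real sentence encoder (hypothetical values).
reference = [1.0, 0.0, 1.0]
candidates = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

# Semantic reward: similarity in embedding space, not token overlap.
rewards = [cosine(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because the reward depends only on meaning-level similarity, two differently worded outputs with the same embedding would receive the same advantage, which is the flexibility the abstract contrasts with token-level imitation in SFT.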

2605.14365 2026-05-15 cs.LG cs.AI

LoMETab: Beyond Rank-1 Ensembles for Tabular Deep Learning

Changryeol Choi, Hyewon Park, Yujin Kwon, Gowun Jeong

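AI summary: In tabular deep learning, leading methods are converging in performance, making a clear ranking hard to establish. This paper proposes LoMETab, a rank-$r$ implicit ensemble model that introduces an adjustable rank and initialization scale to increase diversity and expressiveness. Experiments show that LoMETab increases predictive diversity among ensemble members and offers good controllability and performance on classification and regression tasks.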

Abstract

Recent tabular learning benchmarks increasingly show a tight performance cluster rather than a clear hierarchy among leading methods, spanning gradient boosted decision trees, attention-based architectures, and implicit ensembles such as TabM. As benchmark gains plateau, a complementary goal is to understand and control the mechanisms that make simple neural tabular models competitive. We propose LoMETab, a rank-$r$ generalization of multiplicative implicit ensembles. LoMETab lifts the rank-1 BatchEnsemble/TabM modulation to a rank-$r$ identity-residual Hadamard family by parameterizing each member weight as $W_k = W \odot (1 + A_k B_k^\top)$, where $W$ is shared and $(A_k, B_k)$ are member-specific low-rank factors. This exposes two practical diversity-control axes: the adapter rank $r$ and the initialization scale $\sigma_{\mathrm{init}}$, and we prove that for $r \ge 2$ this generalization strictly enlarges BatchEnsemble's hypothesis class. Empirically, we show that this added capacity manifests as measurable predictive diversity after training: on representative classification datasets, LoMETab sustains higher pairwise KL than an additive low-rank ablation, and $(r, \sigma_{\mathrm{init}})$ provides broad control over pairwise KL, varying by up to several orders of magnitude across configurations. The induced diversity is reflected in task-appropriate output-level measures: argmax disagreement for classification and ambiguity for regression, indicating that the control extends beyond pairwise KL to decision- and output-level member variation. Finally, experiments sweeping over adapter rank $r$ and initialization scale $\sigma_{\mathrm{init}}$ reveal that predictive performance is dataset-dependent over the $(r, \sigma_{\mathrm{init}})$ grid, supporting LoMETab as a controllable family of implicit ensembles rather than a fixed rank-1 construction.
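
The member-weight formula $W_k = W \odot (1 + A_k B_k^\top)$ from the abstract can be evaluated directly. The sketch below does so with plain nested lists; the shapes ($W$ as $d_{out} \times d_{in}$, $A_k$ as $d_{out} \times r$, $B_k$ as $d_{in} \times r$) and the function name `member_weight` are assumptions chosen so that the product is well defined, not details confirmed by the paper.

```python
def member_weight(W, A, B):
    """Compute W * (1 + A @ B.T) elementwise, using plain nested lists.

    W: d_out x d_in shared weight, A: d_out x r, B: d_in x r (assumed shapes).
    """
    d_out, d_in, r = len(W), len(W[0]), len(A[0])
    return [
        [
            W[i][j] * (1.0 + sum(A[i][t] * B[j][t] for t in range(r)))
            for j in range(d_in)
        ]
        for i in range(d_out)
    ]


W = [[1.0, 2.0], [3.0, 4.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With zero adapters the member reduces to the shared weight.
print(member_weight(W, zeros, zeros) == W)  # True

A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
# Here A @ B.T is the identity matrix, so only diagonal entries are doubled.
print(member_weight(W, A, B))  # [[2.0, 2.0], [3.0, 8.0]]
```

The first call shows the identity-residual property: when the low-rank factors vanish, every member coincides with the shared weight, so the factors only encode each member's deviation from it.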

2605.14359 2026-05-15 cs.LG cs.AI

RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression

Zhengjia Zhong, Shuyan Ke, Zaizhou Lin, Jiaqi Song, Hongyi Lan, Hui Li

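AI summary: This paper proposes RQ-MoE, a residual quantization framework that combines a mixture of experts with a dual-stream quantization mechanism to achieve efficient vector compression that adapts dynamically to the input. The method addresses the decoding bottleneck of existing dynamic quantization methods, supports parallel decoding, and improves expressiveness. Experiments show that RQ-MoE achieves state-of-the-art or near state-of-the-art performance on reconstruction and retrieval tasks while decoding 6 to 14 times faster than prior methods.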

Comments To appear at ICML 2026

Abstract

Vector quantization is a fundamental tool for compressing high-dimensional embeddings, yet existing multi-codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework combining a two-level MoE with dual-stream quantization to enable input-dependent codebook adaptation for efficient vector quantization. RQ-MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ-MoE, and derive a guideline for setting expert dimensionality in RQ-MoE. Extensive experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval, while providing 6x-14x faster decoding than prior vector quantization methods. The implementation is available at https://github.com/KDEGroup/RQ-MoE.
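
For readers unfamiliar with the baseline that the abstract generalizes, the following is a minimal sketch of standard residual quantization with static codebooks. It is not RQ-MoE or QINCo: there is no input-dependent codebook adaptation here, and the codebook values are hypothetical.

```python
def nearest(codebook, x):
    """Index of the codeword closest to x in squared Euclidean distance."""
    dists = [sum((c - v) ** 2 for c, v in zip(code, x)) for code in codebook]
    return dists.index(min(dists))


def rq_encode(x, codebooks):
    residual, codes = list(x), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Each stage quantizes what the previous stages left unexplained.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rq_decode(codes, codebooks):
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two tiny static codebooks (hypothetical values).
codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, -1.0]],
]
codes = rq_encode([5.0, 3.0], codebooks)
print(codes, rq_decode(codes, codebooks))  # [1, 1] [5.0, 3.0]
```

With static codebooks, decoding is a plain sum of looked-up codewords. Making a stage's codebook depend on the output of earlier stages is what introduces the sequential dependency that the abstract identifies as a decoding bottleneck.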

2605.14358 2026-05-15 cs.AI cs.LG

Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

Sanjoy Chowdhury, Dinesh Manocha

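AI summary: This study examines how many steps of a language model's long chain-of-thought are actually necessary for the final prediction. It defines the "minimal core" as the smallest subset of steps that preserves the final answer or predictive distribution, and introduces metrics such as compression ratio, redundancy, and step necessity. The study finds that reasoning traces are broadly redundant: on average 46% of steps can be removed while the original answer is preserved in 86% of cases, and necessity is concentrated in a few steps. Minimal cores also expose the geometric structure of reasoning more clearly and transfer well across models, offering a new perspective on the nature of language-model reasoning.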

Abstract

Language models often generate long chain-of-thought traces, but it remains unclear how much of this reasoning is necessary for preserving the final prediction. We study this through the lens of overcomplete reasoning traces: generated traces that contain more intermediate steps than are needed to support the model's answer. We define the minimal core as the smallest subset of steps that preserves either the final answer or predictive distribution, and introduce metrics for compression ratio, redundancy mass, step necessity, and necessity concentration. Across six deliberative reasoning benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, we find substantial overcompleteness: on average, 46% of steps are removable under greedy minimal-core extraction while preserving the original answer in 86% of cases. We also find that predictive support is concentrated: the top three steps account for 65% of measured necessity mass on average. Beyond compression, minimal cores expose a cleaner geometry of reasoning: compared with full traces, they improve correct-incorrect trace separation by 11 points, reduce estimated intrinsic dimensionality by 34%, and transfer across model families with 85% off-diagonal answer retention. Theoretically, we establish existence of minimal sufficient subsets, local irreducibility guarantees for greedy elimination, and certificates of overcompleteness and sparse necessity. Together, these results suggest that full reasoning traces are often verbose and overcomplete, while minimal cores isolate the effective support underlying language-model predictions.
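
Greedy minimal-core extraction, as described in the abstract, can be sketched as a single pass that drops a step whenever the answer survives without it. The predicate `preserves_answer` below is a hypothetical stand-in for re-querying a model on the truncated trace; the paper's actual procedure and step ordering may differ.

```python
def greedy_minimal_core(steps, preserves_answer):
    """Drop steps one at a time whenever the answer survives without them."""
    core = list(steps)
    i = 0
    while i < len(core):
        candidate = core[:i] + core[i + 1:]
        if preserves_answer(candidate):
            core = candidate  # step i was removable; re-test the same index
        else:
            i += 1            # step i is necessary given the current core
    return core


# Hypothetical stand-in for a model call: the answer is preserved
# as long as both "key" steps are still present in the trace.
def preserves_answer(trace):
    return {"a = 3", "a * 4 = 12"} <= set(trace)


trace = ["restate problem", "a = 3", "double-check", "a * 4 = 12", "summary"]
core = greedy_minimal_core(trace, preserves_answer)
print(core)                        # ['a = 3', 'a * 4 = 12']
print(1 - len(core) / len(trace))  # 0.6
```

The second printed value is the fraction of steps removed for this toy trace. In this example no remaining step can be dropped without losing the answer, which is the kind of local irreducibility the abstract refers to.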

2605.14352 2026-05-15 cs.CL

Ideology Prediction of German Political Texts

Sinclair Schneider, Florian Steuber, Joao A. G. Schneider, Gabi Dreo Rodosek

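AI summary: This paper studies how transformer-based models can predict the ideology of German political texts, mapping a text's political stance onto a continuous spectrum from -1 to 1. The authors build four corpora from different sources: plenary records from the German Bundestag, the online decision-making tool Wahl-O-Mat, 33 newspapers of differing political orientation, and tweets from members of parliament. Comparing several pretrained models, they find that DeBERTa-large and Gemma2-2B perform best on different datasets. The results indicate that model architecture and the availability of domain-specific data strongly affect political bias estimation.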

Comments This paper has been accepted for the upcoming 20th International AAAI Conference on Web and Social Media (ICWSM 2026)

Abstract

Elections represent a crucial milestone in a nation's ongoing development. To better understand the political rhetoric from various movements, ranging from left to right, we propose a transformer-based model capable of projecting the political orientation of a text on a continuous left-to-right spectrum, represented by a normalized scalar d between -1 and 1. This approach enables analysts to focus on specific segments of the political landscape, such as conservatives, while excluding liberal and far-right movements. Such a task can be achieved with multiclass classifiers only if the desired orientation is incorporated within one of their predefined classes. To determine the most suitable foundation model among 13 candidate transformers for this task, we constructed four distinct corpora. One corpus comprised annotated plenary notes from the German Bundestag, while another was based on an official online decision-making tool, Wahl-O-Mat. The third corpus consisted of articles from 33 newspapers, each identified by its political orientation, and the fourth included 535,200 tweets from 597 members of the 20th and 21st German Bundestag. To mitigate overfitting, we used two distinct corpora for training and two for testing. For in-domain performance, DeBERTa-large achieved the highest F1 score (F1 = 0.844), and it also led on the X (Twitter) out-of-domain test (ACC = 0.864). On the newspaper out-of-domain test, Gemma2-2B excelled (MAE = 0.172). This study demonstrates that transformer models can recognize political framing in German news at the level of public opinion polls. Our findings suggest that both the model architecture and the availability of domain-specific training data can be as influential as model size for estimating political bias. We discuss methodological limitations and outline directions for improving the robustness of bias measurement.

2605.14350 2026-05-15 cs.LG

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

Nicholas E. Corrado, Wenyuan Huang, Josiah P. Hanna

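AI summary Multi-task reinforcement learning aims to train a single agent to efficiently optimize performance across multiple tasks at once, but jointly optimizing all tasks often leads to imbalanced learning: easy tasks are learned quickly while hard tasks progress slowly. This paper proposes DRATS, a new adaptive task sampling method that dynamically prioritizes sampling the hardest tasks in order to address imbalanced data allocation. The method formulates multi-task learning as a feasibility problem and optimizes a minimax objective that minimizes the worst-case task return gap. It shows higher data efficiency and better worst-task performance on several benchmarks.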

English abstract

Multi-task reinforcement learning (MTRL) aims to train a single agent to efficiently optimize performance across multiple tasks simultaneously. However, jointly optimizing all tasks often yields imbalanced learning: agents quickly solve easy tasks but learn slowly on harder ones. While prior work primarily attributes this imbalance to conflicting task gradients and proposes gradient manipulation or specialized architectures to address it, we instead focus on a distinct and under-explored challenge: imbalanced data allocation. Standard MTRL allocates an equal number of environment interactions to each task, which over-allocates data to easy tasks that require relatively few interactions to solve and under-allocates data to hard tasks that require substantially more experience to solve. To address this challenge, we introduce Distributionally Robust Adaptive Task Sampling (DRATS), an algorithm that adaptively prioritizes sampling tasks furthest from being solved. We derive DRATS by formalizing MTRL as a feasibility problem from which we derive a minimax objective for minimizing the worst-case return gap, the difference between a desired target return and the agent's return on a task. In benchmarks like MetaWorld-MT10 and MT50, DRATS improves data efficiency and increases worst-task performance compared to existing task sampling algorithms.
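The return-gap idea in the abstract above can be illustrated with a small sketch. This is not the DRATS algorithm itself: the softmax weighting, the `temperature` parameter, and all function names are assumptions chosen for illustration. It only shows how sampling probability can be shifted toward tasks whose current return is furthest below a target return.

```python
import math
import random


def return_gaps(target_returns, current_returns):
    """Gap between the desired target return and the agent's return, per task.

    Gaps are clipped at zero so that solved tasks carry no extra weight.
    """
    return [max(t - r, 0.0) for t, r in zip(target_returns, current_returns)]


def sampling_probs(gaps, temperature=1.0):
    """Softmax over gaps (an illustrative choice, not the paper's update rule).

    A larger gap yields a higher sampling probability; a lower temperature
    concentrates more mass on the worst task.
    """
    largest = max(gaps)
    weights = [math.exp((g - largest) / temperature) for g in gaps]
    total = sum(weights)
    return [w / total for w in weights]


def sample_task(probs, rng):
    """Draw one task index according to the sampling probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical numbers: three tasks sharing a target return of 1.0.
gaps = return_gaps([1.0, 1.0, 1.0], [0.9, 0.5, 1.2])
probs = sampling_probs(gaps)
task = sample_task(probs, random.Random(0))
```

With these hypothetical returns, the second task has the largest gap and therefore receives the largest share of environment interactions, while the third task, which already exceeds its target, receives the smallest.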

2605.14346 2026-05-15 cs.CV

Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation

Yuanhang Yao, Ping Qian, Zhu Liu, Long Ma, Weimin Wang

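AI summary This paper studies how to stabilize infrared small target detection under point supervision. To address the pseudo-label noise and training instability caused by insufficient semantic information in lightweight detectors, it proposes a knowledge distillation framework based on a hierarchical vision foundation model (VFM). Through a bilevel optimization process combined with Semantic-Conditioned Affine Modulation (SCAM) and a dynamic collaborative learning strategy, the method improves detection accuracy and training stability. Experiments show significant improvements across a range of infrared small target detection models.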

English abstract

Single-frame Infrared Small Target Detection (ISTD) aims to localize weak targets under heavy background clutter, yet dense pixel-wise annotations are expensive. Point supervision with online label evolution reduces annotation cost; however, lightweight CNN detectors often lack sufficient semantics, leading to noisy pseudo-masks and unstable optimization. To address this, we propose a hierarchical VFM-driven knowledge distillation framework that uses a frozen Vision Foundation Model (VFM) during training. We formulate point-supervised learning as a bilevel optimization process: the inner loop adapts a VFM-embedded teacher on reweighted training samples, while the outer loop transfers validation-guided knowledge to a lightweight student to mitigate pseudo-label noise and training-set bias. We further introduce Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features at multiple layers. In addition, a dynamic collaborative learning strategy with cluster-level sample reweighting enhances robustness to imperfect pseudo-masks. Experiments on diverse challenging cases across multiple ISTD backbones demonstrate consistent improvements in detection accuracy and training stability. Our code is available at https://github.com/yuanhang-yao/semantic-prior.

2605.14343 2026-05-15 cs.LG math.ST stat.ML stat.TH

Nearest-Neighbor Radii under Dependent Sampling

Yuanyuan Gao, Yilong Hou, Zhexiao Lin

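AI summary This paper studies the properties of nearest-neighbor radii under dependent sampling, going beyond the traditional independent-sampling assumption. By analyzing strongly mixing dependent observations, it establishes almost sure convergence under polynomial mixing and sharp non-asymptotic moment bounds under geometric mixing. These bounds depend on the local intrinsic dimension rather than the ambient dimension, so they apply to high-dimensional manifold data. Experiments support the theory, showing that nearest-neighbor geometry remains informative even under dependent sampling.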

Comments 33 pages

English abstract

Nearest-neighbor methods are fundamental to classical and modern machine learning, yet their geometric properties are typically analyzed under independent sampling. In this paper, we study the nearest-neighbor radii under dependent sampling. We consider strong mixing dependent observations and ask whether dependence changes the scale of nearest-neighbor neighborhoods. We establish distribution-free almost sure convergence under polynomial mixing and sharp non-asymptotic moment bounds under geometric mixing. The moment bounds depend on the local intrinsic dimension rather than the ambient dimension, making the results applicable to high-dimensional data concentrated near lower-dimensional manifolds. Synthetic experiments and real-world time-series benchmarks support the theory, showing that nearest-neighbor geometry remains informative under dependent sampling.
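As a concrete reference for the quantity studied above, the k-nearest-neighbor radius of a query point is the distance from the query to its k-th closest sample point. The sketch below computes it on a delay-embedded AR(1) series, used here only as a common example of a dependent sequence. The helper names, the AR(1) setup, and the parameter values are assumptions for illustration and are not taken from the paper's experiments.

```python
import math
import random


def knn_radius(query, points, k):
    """Distance from `query` to its k-th nearest neighbor among `points`."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    distances = sorted(math.dist(query, p) for p in points)
    return distances[k - 1]


def ar1_series(n, phi, rng):
    """AR(1) sequence x_t = phi * x_{t-1} + noise, started at zero."""
    x = 0.0
    series = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        series.append(x)
    return series


# Hypothetical setup: embed consecutive pairs of the series as 2-D points,
# so neighboring samples are dependent rather than independent draws.
rng = random.Random(0)
series = ar1_series(500, 0.8, rng)
points = [(series[i], series[i + 1]) for i in range(len(series) - 1)]
radius = knn_radius((0.0, 0.0), points, k=5)
```

Repeating the computation for growing sample sizes gives an empirical view of how the radius shrinks as more dependent observations accumulate.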

2605.14341 2026-05-15 cs.CV

AnyBand-Diff: A Unified Remote Sensing Image Generation and Band Repair Framework with Spectral Priors

Zuopeng Zhao, Ying Liu, Xiaoyu Li, Su Luo, Lu Li, Wenwen Liu

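AI summary This paper proposes AnyBand-Diff, a unified remote sensing image generation and band repair framework, to address the spectral distortion and radiometric inconsistency that arise when existing diffusion models ignore physical laws in generating remote sensing imagery. The method introduces a spectral-prior-guided diffusion architecture that combines a dual stochastic masking strategy with a physics-guided sampling mechanism, allowing it to recover complete spectral information from arbitrary band subsets while keeping the generated images radiometrically consistent. Experiments show that AnyBand-Diff performs well at generating reliable remote sensing imagery and at accurate spectral reconstruction, offering a new direction for physics-aware generative models in Earth observation.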

English abstract

Existing diffusion models have made significant progress in generating realistic images. However, their direct adaptation to remote sensing imagery often disregards intrinsic physical laws. This oversight frequently leads to spectral distortion and radiometric inconsistency, severely limiting the scientific utility of generated data. To address this issue, this paper introduces AnyBand-Diff, a novel spectral-prior-guided diffusion framework tailored for robust spectral reconstruction. Specifically, we design a Masked Conditional Diffusion backbone integrated with a dual stochastic masking strategy, empowering the model to recover complete spectral information from arbitrary band subsets. Subsequently, to ensure radiometric fidelity, a Physics-Guided Sampling mechanism is proposed, leveraging gradients from a differentiable physical model to explicitly steer the denoising trajectory toward the manifold of physically plausible solutions. Furthermore, a Multi-Scale Physical Loss is formulated to enforce rigorous constraints across pixel, region, and global levels in a joint manner. Extensive experiments confirm the effectiveness of AnyBand-Diff in generating reliable imagery and achieving accurate spectral reconstruction, contributing to the advancement of physics-aware generative methods for Earth observation.

2605.14340 2026-05-15 cs.SD

Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara

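AI summary LLM-based automatic speech recognition systems achieve good performance by connecting an audio encoder to an LLM, but their ability to adapt to new domains is limited by the lack of paired speech and text data. This paper proposes a new framework that explicitly models speech-text alignment to generate more expressive pseudo-audio prompts, bridging the modality gap and improving adaptation to the target domain. Experiments show that the method outperforms existing text-only adaptation methods in both overall error rate and out-of-vocabulary coverage.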

Comments Submitted to Interspeech 2026

English abstract

LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, the scarcity of paired speech and transcription data often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions or yields inexpressive prompts by leveraging only textual features and ignoring the audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridge the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.

2605.14337 2026-05-15 cs.CV

IG-Diff: Complex Night Scene Restoration with Illumination-Guided Diffusion Model

Yifan Chen, Fei Yin, Chunle Guo, Chongyi Li, Yujiu Yang

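AI summary In complex nighttime scenes, image restoration is difficult because insufficient illumination coexists with multiple other degradations. This paper proposes an illumination-guided diffusion model (IG-Diff) whose illumination-guided module improves restoration in low-light scenes where several degradations coexist. The authors also build complex nighttime scene datasets covering multiple degradation types, providing a resource for related research.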

Comments Accepted by CGI-2025

English abstract

In nighttime circumstances, it is challenging for individuals and machines to perceive their surroundings. While prevailing image restoration methods adeptly handle singular forms of degradation, they falter when confronted with intricate nocturnal scenes, such as the concurrent presence of weather and low-light conditions. Compounding this challenge, the lack of paired data that encapsulates the coexistence of low-light situations and other forms of degradation hinders the development of a comprehensive end-to-end solution. In this work, we contribute complex nighttime scene datasets that simulate both illumination degradation and other forms of deterioration. To address the complexity of night degradation, we propose an illumination-guided module embedded in the diffusion model to guide the illumination restoration process. Our model can preserve texture fidelity while contending with the adversities posed by various degradations in low-light scenarios.

2605.14333 2026-05-15 cs.CV

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Yang Yue, Fangyun Wei, Tianyu He, Jinjing Zhao, Zanlin Ni, Zeyu Liu, Jiayi Guo, Lei Shi, Yue Dong, Li Chen, Ji Li, Gao Huang, Dong Chen

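AI summary This paper studies how to improve the quality of generated text and faces in autoregressive image generation built on discrete tokenization. The authors point out that conventional tokenizers lose fine-grained structure through aggressive downsampling and quantization, making it hard to preserve readable text and clear facial features. They propose InsightTok, which introduces localized, content-aware perceptual losses to improve text and face fidelity and significantly outperforms existing tokenizers without sacrificing overall reconstruction quality. The method performs well in the autoregressive image generation model InsightAR, producing images with clearer text and more faithful facial details.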

Comments Code and checkpoints are available at https://github.com/LeapLabTHU/InsightTok

English abstract

Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.

2605.14327 2026-05-15 cs.LG cs.AI

AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction

Yerin Park, Sangseon Lee

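AI summary Drug-drug interaction (DDI) prediction is important in computational biomedicine, but making accurate predictions for drugs not seen during training remains a key challenge. This paper proposes AIM-DDI, a model-agnostic multimodal integration module that maps heterogeneous drug information, including structural, chemical, and semantic signals, into a shared latent space and models cross-modality dependencies through a unified fusion module, enabling general integration across different DDI prediction architectures. Experiments show that AIM-DDI improves prediction performance across a variety of DDI models and DrugBank-based settings, with the most pronounced gains in the hardest scenario, where neither drug in a pair appears during training.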

English abstract

Drug-drug interaction (DDI) prediction is a critical task in computational biomedicine, as adverse interactions between co-administered drugs can cause severe side effects and clinical risks. A key challenge is unseen-drug generalization, where interactions must be predicted for drugs not observed during training. Although multimodal DDI models exploit diverse drug-related information, their fusion mechanisms are often tied to specific prediction architectures, limiting their reuse across models. To address this, we propose AIM-DDI, an architecture-independent multimodal integration module that represents heterogeneous modality information as tokens in a shared latent space. By modeling dependencies across modality tokens through a unified fusion module, AIM-DDI enables model-agnostic integration of structural, chemical, and semantic drug signals across different DDI prediction architectures. Extensive evaluations across diverse DDI models and DrugBank-based settings show that AIM-DDI consistently improves prediction performance, with the strongest gains under the most challenging both-unseen setting where neither drug in a test pair is observed during training. These results suggest that treating multimodal integration as a reusable module, rather than a model-specific fusion component, is an effective strategy for robust unseen-drug DDI prediction.

2605.14326 2026-05-15 cs.CV

D2-CDIG: Controlled Diffusion Remote Sensing Image Generation with Dual Priors of DEM and Cloud-Fog

Zuopeng Zhao, Ying Liu, Kanyaphakphachsorn Pharksuwan, Su Luo, Xiaoyu Li, Maocai Ning

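AI summary This paper proposes D2-CDIG, a controllable diffusion framework for remote sensing image generation, to address the limited accuracy and naturalness of existing methods under complex terrain and atmospheric conditions. By fusing a Digital Elevation Model (DEM) and cloud-fog information as dual priors, the method gives precise control over ground features and atmospheric phenomena, and it introduces an adjustable cloud-fog slider to flexibly control cloud thickness and distribution. Experiments show that D2-CDIG significantly improves image quality, detail richness, and realism over traditional methods, providing high-quality data for training large remote sensing models and for downstream tasks.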

English abstract

Remote sensing image generation provides a reliable data foundation for remote sensing large models and downstream tasks. However, existing controllable remote sensing image generation methods typically rely on traditional techniques such as segmentation and edge detection, which do not fully leverage terrain or atmospheric conditions. As a result, the generated images often lack accuracy and naturalness when dealing with complex terrains and atmospheric phenomena. In this paper, we propose a novel remote sensing image generation framework, D2-CDIG, which integrates diffusion models with a dual-prior control mechanism. By incorporating both Digital Elevation Model (DEM) and cloud-fog information as dual prior knowledge, D2-CDIG precisely controls ground features and atmospheric phenomena within the generated images. Specifically, D2-CDIG decouples the terrain and atmospheric generation processes through independent control of ground and atmospheric branches. Additionally, a refined cloud-fog slider is introduced to flexibly adjust cloud thickness and distribution. During training, ground and atmospheric control signals are injected in layers to ensure a seamless transition within the images. Compared to traditional methods based on segmentation or edge detection, D2-CDIG shows significant improvements in image quality, detail richness, and realism. D2-CDIG offers a flexible and precise solution for remote sensing image generation, providing high-quality data for training large remote sensing models and downstream tasks.

2605.14323 2026-05-15 cs.LG cs.AI cs.CL

Dynamic Latent Routing

Fangyuan Yu, Xin Su, Amir Abdullah

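AI summary This paper studies the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. The authors introduce General Dijkstra Search (GDS) and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Building on the "search, select, update" principle of GDS, they further propose Dynamic Latent Routing (DLR), which jointly learns discrete latent codes, routing policies, and model parameters in a single training stage. Experiments show that in low-data fine-tuning settings DLR performs well across multiple datasets and models, outperforming conventional supervised fine-tuning.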

English abstract

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

2605.14318 2026-05-15 cs.AI cs.LG

Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems

Emilio Mastriani, Alessandro Costa, Federico Incardona, Kevin Munari, Sebastiano Spinello

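AI summary This paper studies interpretable predictive maintenance in complex systems. To address the heterogeneity and redundancy of monitored variables, which obscure fault information and reduce model interpretability, it proposes a semantic feature segmentation framework. The method decomposes the monitored feature space into a canonical component that retains the dominant predictive information and a residual component containing structurally peripheral signals, and it defines functional groups from domain knowledge to reflect the system's operational mechanisms. Experiments show that the canonical component achieves lower predictive risk than the residual component and performance comparable to the full feature space and a PCA representation, combining predictive performance with semantic interpretability.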

Comments 18 pages, 7 figures. Under review at Neural Computing and Applications. Keywords: semantic segmentation, change point detection, fault anticipation

English abstract

Predictive maintenance in complex systems is often complicated by the heterogeneity and redundancy of monitored variables, which can obscure fault-relevant information and reduce model interpretability. This work proposes a semantic feature segmentation framework that decomposes the monitored feature space into a canonical component, expected to retain the dominant predictive information, and a residual component containing structurally peripheral signals. The segmentation is defined through domain-informed criteria and organizes monitoring variables into functional groups reflecting operational mechanisms such as throughput, latency, pressure, network activity, and structural state. To evaluate the effectiveness of this decomposition, we adopt a predictive perspective in which expected predictive risk is used as an operational proxy for task-relevant information. Experimental results obtained through time-aware cross-validation show that the canonical space consistently achieves lower predictive risk than the residual space across multiple temporal configurations, indicating that the semantic segmentation concentrates the most relevant information for fault anticipation. In addition, the canonical segments exhibit significantly stronger intra-segment coherence than inter-segment dependence, and this structural organization remains stable after redundancy reduction. When compared with the full feature space and with a Principal Component Analysis (PCA) representation, the canonical space achieves comparable predictive performance while preserving the semantic meaning of the original variables. These findings suggest that semantic feature segmentation provides an interpretable and information-preserving decomposition of monitoring signals, enabling competitive predictive performance without sacrificing the operational interpretability required in predictive maintenance applications.

2605.14317 2026-05-15 cs.LG physics.ao-ph

Guided Diffusion Sampling for Precipitation Forecast Interventions

Ayumu Ueyama, Kazuhiko Kawamoto, Hiroshi Kera

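AI summary This paper studies how data-driven weather forecasting models can be used to intervene on extreme precipitation and reduce its negative impact. The authors propose a gradient-guided diffusion sampling method that steers the sampling trajectory in diffusion-based weather forecasting models, reducing precipitation while staying consistent with the distribution of atmospheric states. The physical plausibility of the interventions is assessed from three perspectives: vertical structure, latent-space trajectory deviation, and cross-model transferability. Experiments show that the method outperforms adversarial perturbations at reducing extreme precipitation.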

Comments 12+7 pages, 7+2 figures

English abstract

Extreme precipitation causes severe societal and economic damage, and weather control has long been discussed as a potential mitigation strategy. However, to the best of our knowledge, perturbation-based interventions for weather control using data-driven weather forecasting models have not yet been explored. While adversarial attacks also generate perturbations that alter forecasts, they aim to exploit model artifacts and do not account for physical plausibility. In this paper, we propose a gradient-based guidance framework for precipitation-reduction interventions through diffusion sampling in diffusion-based weather forecasting models. Instead of directly perturbing atmospheric states, our method steers the diffusion sampling trajectory, enabling precipitation reduction while maintaining consistency with the atmospheric distribution. To assess physical plausibility, we evaluate from three perspectives: (i) vertical and variable-wise perturbation profiles, (ii) latent-space trajectory deviation, and (iii) cross-model transferability. Experiments on extreme precipitation events from WeatherBench2 demonstrate that our method achieves effective precipitation reduction while yielding more physically plausible interventions than adversarial perturbations.

2605.14315 2026-05-15 cs.CV

TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

David Huang, Guile Wu, Chengjie Huang, Bingbing Liu, Dongfeng Bai

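AI summary This paper proposes TurboVGGT, a new method for fast multi-view 3D reconstruction. It uses a visual geometry transformer with an adaptive alternating attention mechanism that substantially improves computational efficiency while maintaining reconstruction quality. By combining adaptive sparse global attention with frame attention, TurboVGGT captures global relationships across frames and local details within each frame. Experiments show strong results on multiple 3D reconstruction benchmarks, balancing speed and accuracy.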

Comments Technical Report

English abstract

Recent feed-forward 3D reconstruction methods, such as visual geometry transformers, have substantially advanced the traditional per-scene optimization paradigm by enabling effective multi-view reconstruction in a single forward pass. However, most existing methods struggle to achieve a balance between reconstruction quality and computational efficiency, which limits their scalability. Although some efficient visual geometry transformers have recently emerged, they typically use the same sparsity ratio across layers and frames and lack mechanisms to adaptively learn representative tokens to capture global relationships, leading to suboptimal performance. In this work, we propose TurboVGGT, a novel approach that employs an efficient visual geometry transformer with adaptive alternating attention for fast multi-view 3D reconstruction. Specifically, TurboVGGT employs an end-to-end trainable framework with adaptive sparse global attention guided by adaptive sparsity selection to capture global relationships across frames and frame attention to aggregate local details within each frame. In the adaptive sparse global attention, TurboVGGT adaptively learns representative tokens with varying sparsity levels for global geometry modeling, considering that token importance varies across frames, attention layers operate on tokens at different levels of abstraction, and global dependencies rely on structurally informative regions. Extensive experiments on multiple 3D reconstruction benchmarks demonstrate that TurboVGGT achieves fast multi-view reconstruction while maintaining competitive reconstruction quality compared with state-of-the-art methods. Project page: https://turbovggt.github.io/.

2605.14310 2026-05-15 cs.CV

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

Ailar Mahdizadeh, Puria Azadi, Muchen Li, Xiangteng He, Leonid Sigal

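AI summary In streaming video understanding, an important problem is how to efficiently compress the key-value (KV) cache of vision-language models to support long-horizon reasoning. This paper treats KV-cache compression as a coreset selection problem and proposes a method based on geometric coverage and diversity optimization that jointly optimizes representations in key and value spaces, preserving both retrieval structure and output-relevant information. The method introduces an orthogonality-driven diversity criterion to increase the diversity of the retained cache subset. Experiments show that it outperforms conventional heuristic compression methods across multiple open-source models and video benchmarks.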

English abstract

Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results highlight that representative coreset selection offers a more effective principle than token-wise pruning for memory-constrained streaming video understanding.
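To make the coverage view of coreset selection concrete, the sketch below uses greedy k-center (farthest-point) selection, a classical technique swapped in for illustration. It is not the CoRDS objective: it ignores the bicriteria key/value balance and the orthogonality-driven criterion described above, and the function name and toy vectors are assumptions.

```python
import math


def greedy_k_center(points, budget):
    """Pick up to `budget` indices so every point lies near a selected one.

    Starts from the first point, then repeatedly adds the point that is
    farthest from the current selection (classical farthest-point rule).
    """
    if budget <= 0 or not points:
        return []
    selected = [0]
    # Distance from each point to its nearest selected point so far.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(selected) < min(budget, len(points)):
        farthest = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(farthest)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[farthest]))
    return selected


# Hypothetical cache of five 2-D token vectors forming three loose clusters.
cache = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (5.0, 5.0)]
kept = greedy_k_center(cache, budget=3)
```

With a budget of three, the selection keeps one vector from each cluster instead of several near-duplicates, which is the behavior a representative cache subset is meant to have under a fixed budget.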