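arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.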

2606.14049 2026-06-15 cs.SD cs.CV New submission

FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

Shiyao Wang, Xijuan Zeng, Hui Wang, Shiwan Zhao, Feng Deng, Chen Zhang, Yong Qin

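Affiliations: Academy for Advanced Interdisciplinary Studies, Nankai University; Kling Team, Kuaishou Technology

AI summary: Proposes FoleyGenEx, a unified framework that uses condition injection, multi-modal dynamic masking, and adverb-based data augmentation to jointly achieve multi-modal control, frame-level temporal alignment, and fine-grained semantics in video-to-audio generation.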

Comments Accepted by INTERSPEECH 2026

Journal ref INTERSPEECH 2026

2606.14048 2026-06-15 cs.CV cs.RO New submission

WAM4D: Fast 4D World Action Model via Spatial Register Tokens

Ying Li, Xiaobao Wei, Jiajun Cao, Hao Wang, Xiaowei Chi, Chengyu Bai, Qianpu Sun, Jiajun Li, Xiaojie Zhang, Jian Tang, Sirui Han, Shanghang Zhang

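Affiliations: Peking University; The Hong Kong University of Science and Technology; Beijing Innovation Center of Humanoid Robotics

AI summary: Proposes WAM4D, which uses lightweight spatial register tokens to transfer pretrained geometric priors into a causal video-action transformer for efficient 4D world action modeling, improving spatial consistency on RoboTwin 2.0 and real-world manipulation tasks while keeping inference fast.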

Comments 15 pages, 7 figures, 9 tables

Abstract

World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation. To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens. Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.

2606.14042 2026-06-15 cs.CV New submission

Rethinking One-Step Image Editing through ChordEdit: Reproduction, Simplification, and New Insights

Minghan Li, Jeremy Moebel, Mengyu Wang

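Affiliations: Harvard AI and Robotics Lab

AI summary: Reproduces, ablates, and simplifies ChordEdit to expose its mechanism: the chord window acts as a timestep shift, chord transport performs low-frequency semantic editing, and proximal alignment restores high-frequency detail. Editing thus decomposes into a coarse low-frequency transport stage and a fine high-frequency alignment stage, which opens a path toward adaptive editing.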

Comments 9 pages

2606.14040 2026-06-15 cs.LG New submission

Decompose Sparsely Where You Should, Absorb Densely Where You Should Not

Ruixuan Deng, Zehao Jin, Zekun Wang, Zihan Dong

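Affiliations: Georgia Institute of Technology

AI summary: Targets the assumption in sparse autoencoders that all activation content can be sparsely decomposed by adding a low-rank linear bottleneck alongside a standard SAE to absorb the dense component. On Gemma-2-2B layer 12, a rank-24 bottleneck cuts dense latents by up to 84%, and the absorbed component turns out to be a computational scaffold that is structurally identifiable, causally necessary, and redundantly encoded by the sparse dictionary.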

Abstract

Sparse autoencoders (SAEs) are typically trained to reconstruct the entire residual stream through a sparse dictionary, implicitly assuming that all activation content is amenable to sparse, monosemantic decomposition. We question this assumption and hypothesize that activations contain a low-rank, dense component that is computationally important to the model yet inherently unsuitable for sparse representation, which serves as a major source of the persistent dense latents widely observed in trained SAEs. To test this, we add a small rank-r linear bottleneck in parallel with standard SAEs (BatchTopK and Matryoshka), allowing dense structure to be absorbed before sparse reconstruction. On Gemma-2-2B layer 12, a rank-24 bottleneck reduces dense latent count by up to 84% while improving sparse probing and targeted probe perturbation on both architectures at matched sparsity. The absorbed component is (i) structurally identifiable as the top principal components and outlier dimensions; (ii) causally necessary, with removing it raising next-token cross-entropy by 7.5×, far exceeding the 2.8× from removing the geometrically near-identical top-24 PCA directions; and (iii) redundantly encoded by sparse dictionaries, with ablating 787 maximally aligned sparse features raising cross-entropy by only 2.9× and ablating 2,048 topic-aligned features leaving MMLU topic classification virtually unchanged, whereas removing the scaffold drops it from 98.7% to chance. Together, our findings identify a compact, semantically informative and causally important component of residual stream activations (which we term a computational scaffold) that standard sparse dictionaries represent inefficiently, suggesting that the scope of sparsity-based interpretability methods warrants careful re-examination.
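
The parallel bottleneck described in this abstract can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the weights are random rather than trained, per-sample top-k stands in for BatchTopK, the sparse path is assumed to encode the residual left after the dense path, and every name (`forward`, `A`, `B`, `W_enc`, `W_dec`) and size is hypothetical.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def top_k(acts, k):
    """Keep the k largest activations and zero the rest."""
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def forward(x, A, B, W_enc, W_dec, k):
    # Dense path: project down to rank r, then back up to model width.
    dense = matvec(B, matvec(A, x))
    # Assumption: the sparse path only sees what the dense path left over.
    residual = [xi - di for xi, di in zip(x, dense)]
    acts = [max(0.0, a) for a in matvec(W_enc, residual)]  # ReLU
    code = top_k(acts, k)
    # Reconstruction is the sum of the dense and sparse contributions.
    recon = [d + s for d, s in zip(dense, matvec(W_dec, code))]
    return recon, code

rng = random.Random(0)
d_model, rank, n_latents, k = 8, 2, 32, 4   # toy sizes, not the paper's
A = rand_matrix(rank, d_model, rng)          # d_model -> rank
B = rand_matrix(d_model, rank, rng)          # rank -> d_model
W_enc = rand_matrix(n_latents, d_model, rng)
W_dec = rand_matrix(d_model, n_latents, rng)

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
recon, code = forward(x, A, B, W_enc, W_dec, k)
active = sum(1 for c in code if c != 0.0)    # at most k latents fire
```

The point of the layout is that the low-rank path has no sparsity constraint, so dense structure has somewhere to go other than always-on latents.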

2606.14037 2026-06-15 cs.CL New submission

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

Jihye Kim, Jeffrey Flanigan

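Affiliations: University of California, Santa Cruz

AI summary: Introduces Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic, and finds that large language models follow helpful nudges more than misleading ones on factual judgments (A = 1.58) but follow both almost equally on moral judgments (A = 1.04), exposing direction-blind compliance as an alignment failure mode.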

Abstract

As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models resist pressure but not whether they resist it selectively. We introduce Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares beneficial output change under helpful nudges with harmful change under misleading nudges. Across 9 models and 972,000 nudge-condition responses, we find that this selectivity differs in factual and moral judgments: models follow helpful nudges more than harmful ones on factual questions (A = 1.58), but follow both directions at nearly identical rates on moral questions (A = 1.04). This phenomenon persists across model families, capability levels, and nudging types. Interestingly, we also find that chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by nearly identical margins. These results identify direction-blind moral compliance as a distinct failure mode in current LLMs and suggest that alignment should target directionally calibrated updating rather than lower compliance alone.
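
The ratio A = BCR/HCR is simple enough to compute by hand; the sketch below shows one way it might be coded. The abstract does not spell out BCR and HCR, so this assumes they are the rates of beneficial change under helpful nudges and harmful change under misleading nudges; the function names and the outcome lists are invented for illustration and are not the paper's data.

```python
def change_rate(changed):
    """Fraction of trials where the answer moved in the nudged direction."""
    if not changed:
        raise ValueError("need at least one trial")
    return sum(changed) / len(changed)

def compliance_asymmetry(helpful_changed, misleading_changed):
    """A = BCR / HCR; A near 1 means both directions are followed equally."""
    bcr = change_rate(helpful_changed)      # beneficial change rate (assumed meaning)
    hcr = change_rate(misleading_changed)   # harmful change rate (assumed meaning)
    return float("inf") if hcr == 0 else bcr / hcr

# Invented outcomes (True = the model changed its answer as nudged).
factual = compliance_asymmetry([True] * 60 + [False] * 40,
                               [True] * 40 + [False] * 60)
moral = compliance_asymmetry([True] * 50 + [False] * 50,
                             [True] * 50 + [False] * 50)
```

With these toy counts the factual ratio comes out at about 1.5 and the moral ratio at 1.0, which is the kind of contrast the abstract calls direction-blind compliance.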

2606.14035 2026-06-15 cs.CV New submission

Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention

Dinh-Khoi Vo, Nhut-Thanh Le-Hinh, Viet-Tham Huynh, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

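Affiliations: arXiv

AI summary: Proposes FocusDiff, a tuning-free framework that uses refocusing cross-attention for precise region-specific editing, extends it to 360-degree indoor panorama editing, and outperforms existing zero-shot methods on the localized editing benchmark LIMB.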

Comments ICCCI 2026. Project page: https://vdkhoi20.github.io/FocusDiff

Abstract

Zero-shot text-guided diffusion has significantly advanced image editing; however, its practical usability remains constrained by three persistent challenges: prompt brittleness that requires meticulous prompt engineering, spillover edits that unintentionally affect non-target regions, and failures on small or cluttered objects caused by limited fine-grained supervision in training data. We propose FocusDiff (Target-Aware Refocusing for Tuning-Free Diffusion Editing), a tuning-free framework for precise and region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while accurately transferring the object's identity, structure, and appearance to the edited output. Integrated context-preserving modules further ensure background fidelity and global coherence, enabling accurate edits from simple text prompts in a single pass. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments. Extensive experiments on our localized editing benchmark LIMB, comprising 30 multi-object images and 100 annotated examples including challenging small-object cases, show that FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation, achieving superior precision, photorealism, and usability. The project page is available at https://vdkhoi20.github.io/FocusDiff.

2606.14032 2026-06-15 cs.RO New submission

From Attacks to Curricula: Learnability-Guided Adversarial Training for Safe Autonomous Driving

Yuewen Mei, Tong Nie, Jie Sun, Haotian Shi, Wei Ma, Jian Sun

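Affiliations: College of Transportation & Key Laboratory of Road and Traffic Engineering of Ministry of Education, Tongji University; Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University

AI summary: Proposes AlignADV, a framework that generates resolvable scenarios through preference alignment, predicts policy capability from behavioral fingerprints, and samples a dynamic curriculum to improve the convergence efficiency and safety of adversarial training for autonomous driving.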

Abstract

Closed-loop adversarial training improves autonomous driving safety by exposing policies to rare safety-critical scenarios. Standard pipelines first generate adversarial scenarios and then sample them for policy optimization. However, most existing frameworks remain attack-oriented: collision-driven generators often synthesize unsolvable extreme situations, which can degrade learning, while heuristic samplers ignore the evolving capability of the driving policy, causing sample inefficiency and delayed convergence. We propose AlignADV, a learnability-guided closed-loop adversarial training framework that converts adversarial scenarios into resolvable and capability-aligned curricula. First, we reformulate adversarial scenario generation as a preference alignment problem and employ direct preference optimization to guide the generator toward critical yet resolvable scenarios. Second, we introduce behavioral fingerprints to capture the intrinsic characteristics of the evolving policy and construct a multi-modal capability prediction model that estimates policy performance without expensive closed-loop simulations. By combining resolvability-aligned scenarios with capability predictions, AlignADV develops a dynamic curriculum sampling mechanism that prioritizes scenarios targeting the current policy's vulnerabilities. Experiments on the Waymo Open Motion Dataset demonstrate that AlignADV improves convergence efficiency and final performance, reducing training steps by up to 40.6 percent compared with baseline methods while lowering collision rate and improving route completion under both normal and adversarial traffic conditions. These results highlight a shift from attack-oriented scenario generation to learnability-guided policy improvement, offering a principled direction for safer and more efficient autonomous driving training. Project page: https://meiyuewen.github.io/AlignADV/.
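
The abstract names direct preference optimization (DPO) as the tool for steering the scenario generator. The sketch below shows the standard per-pair DPO loss only; reading the "preferred" sample as a critical yet resolvable scenario and the "rejected" one as an unsolvable scenario is an interpretation of the abstract, and the function name, arguments, and log-probability values are hypothetical.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_* are log-probabilities under the generator being trained,
    ref_logp_* under a frozen reference generator.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), written to stay stable for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Generator identical to the reference: no preference learned yet.
no_preference = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Generator now favors the preferred scenario and disfavors the rejected one.
learned = dpo_loss(-3.0, -7.0, -5.0, -5.0)
```

The loss starts at log 2 when the generator matches the reference and falls as probability mass shifts toward the preferred scenario.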

2606.14030 2026-06-15 cs.SD cs.CL New submission

Efficiency-Performance Trade-offs in Neural Speaker Diarization via Structured Pruning and Low-Bit Quantization

Rishit Chatterjee, Tahiya Chowdhury

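Affiliations: Department of Computer Science, Colby College

AI summary: Compresses segmentation models with structured pruning and low-bit quantization for streaming speaker diarization on resource-constrained hardware, studies the performance trade-offs under different latency budgets, and finds that FP16 halves model size but increases DER by 40%.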

Comments 6 pages, 3 figures, preprint
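
The storage half of the FP16 claim in the summary above is easy to check with the standard library. This sketch uses made-up numbers as a stand-in for model weights and only shows the size and rounding effect of half precision; it says nothing about diarization error, which depends on the actual model.

```python
import struct

weights = [0.1 * i for i in range(1000)]   # stand-in for model weights

# "<" selects standard sizes: "f" is 4-byte float32, "e" is 2-byte float16.
fp32_bytes = struct.pack(f"<{len(weights)}f", *weights)
fp16_bytes = struct.pack(f"<{len(weights)}e", *weights)

# Round-trip through float16 to see how much precision is lost.
restored = struct.unpack(f"<{len(weights)}e", fp16_bytes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
size_ratio = len(fp16_bytes) / len(fp32_bytes)   # 0.5
```

The rounding error grows with the magnitude of the stored value, which is one reason low-bit formats can cost accuracy even when the size saving is exact.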

2606.14029 2026-06-15 cs.LG New submission

Utility-Constrained Policy Optimization

Mehrdad Moghimi, Bernardo Avila Pires

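Affiliations: York University; Google DeepMind

AI summary: Proposes a simple yet powerful method for utility-constrained MDPs that supports risk-sensitive constraints, does not require fixing constraint limits in advance, and matches or outperforms existing baselines on several safety benchmark tasks.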

Abstract

Constrained MDPs (CMDPs) are a widely adopted framework for incorporating safety into RL agents; however, the framework does not support risk-sensitive constraints. This can be problematic: For example, CMDPs allow for optimal solutions that, in order to satisfy the risk-neutral constraints, mix infrequent catastrophic behaviors and frequent, overly conservative ones. Moreover, prior empirical results suggest that enforcing stricter, risk-sensitive constraints can improve performance even under risk-neutral evaluation. The natural framework to incorporate risk-sensitive constraints is utility-constrained MDPs (UCMDPs), but no practical solutions for this problem existed. In this work, we introduce a simple yet powerful methodology for UCMDPs and constrained RL. Besides allowing for risk-sensitive constraints, our framework does not require us to fix constraint limits in advance of training the agent, provided that a sensible range is known. This increases policy flexibility and, in practice, allows for adjustments to these limits at no extra training cost. Besides benefiting from the generality of the framework, our agent shows strong performance in practice, consistently matching or outperforming existing baselines in several Safety Gymnasium benchmark tasks.
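
The failure of risk-neutral constraints mentioned in this abstract can be seen with two toy cost distributions. This is a sketch under assumptions: the squared-cost utility, the limit of 10, and both distributions are invented for illustration, and the paper's utility functions and method may differ.

```python
def expectation(dist, f=lambda c: c):
    """Expected value of f(cost) for a list of (cost, probability) pairs."""
    return sum(p * f(c) for c, p in dist)

limit = 10.0
steady = [(10.0, 1.0)]                  # always moderately costly
mixed = [(0.0, 0.99), (1000.0, 0.01)]   # mostly overly safe, rarely catastrophic

def squared(c):
    """Hypothetical convex utility that penalizes large costs more."""
    return c * c

def meets_limit(dist, utility=lambda c: c, tol=1e-9):
    """Check E[utility(cost)] <= utility(limit), up to a float tolerance."""
    return expectation(dist, utility) <= utility(limit) + tol

# Both policies have expected cost 10, so the risk-neutral check passes twice.
risk_neutral = (meets_limit(steady), meets_limit(mixed))                     # (True, True)
# Under the convex utility the rare catastrophe dominates and is rejected.
risk_sensitive = (meets_limit(steady, squared), meets_limit(mixed, squared)) # (True, False)
```

The mixed policy is exactly the pattern the abstract warns about: it satisfies the average-cost constraint while hiding a 1% chance of a very large cost.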

2606.14025 2026-06-15 cs.CV New submission

GarmentSketch: Large-scale Sketch-to-Fashion Benchmark

Duong-Duy-Khang Bui, Minh-Tan Pham, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

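Affiliations: Kangbdd.github.io

AI summary: Addresses the lack of large-scale paired data for fashion sketch-to-image synthesis by building GarmentSketch, a dataset of 26,249 sketch-caption pairs with captions generated by multimodal large language models and refined by humans, and evaluates existing generative models on it.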

Comments ICCCI 2026. Project page: https://khangbdd.github.io/garmentsketch

Abstract

Fashion sketching is a cornerstone of design workflows, allowing rapid visualization of creative concepts prior to physical prototyping. Yet, progress in sketch-based fashion image synthesis has been hindered by the absence of large-scale, high-quality paired resources. To bridge this gap, we present GarmentSketch, a novel dataset comprising 26,249 fashion sketches across 21 garment categories, each paired with detailed textual descriptions. Captions were produced through a multi-stage pipeline that integrates multiple multimodal large language models (MLLMs) with human-in-the-loop refinement, ensuring both semantic accuracy and descriptive richness. We benchmark GarmentSketch on state-of-the-art generative models, providing baseline performance for sketch-guided text-to-image generation. Our experiments reveal both the promise and the current limitations of existing methods. By offering a comprehensive and richly annotated resource, GarmentSketch establishes a foundation for advancing sketch understanding, fine-grained fashion image generation, and creative human-AI collaboration in design. The dataset will be available at: https://khangbdd.github.io/garmentsketch.

2606.14024 2026-06-15 cs.CV New submission

ViT-Up: Faithful Feature Upsampling for Vision Transformers

Krispin Wandel, Jingchuan Wang, Hesheng Wang

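Affiliations: Shanghai Jiao Tong University

AI summary: Proposes ViT-Up, an implicit feature upsampling framework that builds layer-wise queries from intermediate ViT hidden states to predict features at arbitrary continuous coordinates, avoiding the feature leakage and blur introduced by image guidance and outperforming existing methods on dense prediction and semantic correspondence.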

Comments Code is available at: https://github.com/krispinwandel/vit-up

Abstract

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.

2606.14022 2026-06-15 cs.LG New submission

PostDeg: Placement Beats Parameterization in LayerNorm GNNs

Yash Tomar, Aryav Das

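Affiliations: Purdue University; Park Tudor High School

AI summary: Finds that LayerNorm erases topology signals while the post-LayerNorm position preserves them, and proposes PostDeg, a parameter-free post-LayerNorm inverse-degree scaling that yields clear gains on three combinatorial optimization tasks, with none of the four falsification tests firing.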

Comments Yash Tomar and Aryav Das contributed equally to this work

Abstract

LayerNorm-based GNNs routinely erase the topology signals (degree, centrality, k-core) that node-selection policies should depend on, but the literature has not located where in the residual block the erasure happens. We answer that question: a positive per-node scalar inserted before LayerNorm is divided out up to a stabilizer term, while the same scalar inserted after LayerNorm reaches the score head as representation magnitude. The surviving slot is the post-LayerNorm position. We instantiate it with PostDeg, a parameter-free post-LayerNorm inverse-degree scale, and pre-register four falsifiers (graphwise scalars, extra LayerNorm, expressive same-slot capacity, backbone-agnostic source) that would reject the rule. PostDeg gains +3.5%/+2.5%/+5.6% over the LN backbone on influence maximization, network dismantling, and maximum independent set, with 10/10 paired-seed wins per task; none of the four falsifiers fires. The takeaway is that placement, not parameterization, carries the gain: a small invariance check that generalizes to any positive topology scalar in any normalized residual stack.
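
The invariance claim in this abstract can be checked numerically with a few lines of plain Python. This is a minimal sketch, not the authors' code: the LayerNorm here has no learned gain or bias, and the toy vector `h` and the degree-4 node are made up for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gain or bias; eps is the stabilizer term."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

h = [0.5, -1.2, 2.0, 0.3]   # toy node representation
scale = 1.0 / 4             # inverse degree of a hypothetical degree-4 node

baseline = layer_norm(h)
pre_ln = layer_norm([scale * v for v in h])    # scalar applied before LayerNorm
post_ln = [scale * v for v in layer_norm(h)]   # scalar applied after LayerNorm

# Before LayerNorm the scalar is normalized away, up to the effect of eps.
pre_gap = max(abs(a - b) for a, b in zip(pre_ln, baseline))
# After LayerNorm the scalar survives as the magnitude of the representation.
post_ratio = l2_norm(post_ln) / l2_norm(baseline)
```

Here `pre_gap` is tiny and comes only from `eps`, while `post_ratio` equals the inverse degree, which is the placement effect the abstract describes.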

2606.14010 2026-06-15 cs.CV cs.LG cs.RO New submission

RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation

Xiangyu Huang, Zhenlin Hua, Han Zhou, Shounak Sural, Ragunathan Rajkumar

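Affiliations: Carnegie Mellon University

AI summary: Proposes RT-VLA, which compresses the capabilities of the SimLingo model into a lightweight student through multi-level supervised distillation, keeping performance competitive while cutting inference time by 44.8x in vision-only mode and 7.9x in vision+language mode for real-time, explainable VLA autonomous driving.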

Abstract

Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency and thereby prevent their deployment in the unforgiving reality of the road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8X in vision-only mode and 7.9X in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.

2606.14006 2026-06-15 cs.CV cs.ET New submission

HARBOR: Heading Analysis and Reconstruction from Behavioral Observation and Radar

Joao P. A. Dantas, Paulo F. Silva Filho, Jelton A. Cunha, Gabriel Dietzsch

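Affiliations: Institute for Advanced Studies (IEAv)

AI summary: Proposes the HARBOR pipeline, which predicts vessel motion from a single SAR image with no auxiliary data at inference time, estimating heading from skeleton geometry and local intensity and using AIS only for offline calibration of the motion parameters behind its probabilistic heatmaps.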

Abstract

Maritime situational awareness often relies on Automatic Identification System (AIS) transmissions to track vessel movements. However, in operational or conflict scenarios, these data may be unavailable due to signal loss, deliberate deactivation, or intentional spoofing. In such conditions, synthetic aperture radar (SAR) imagery becomes a critical sensing alternative for wide-area maritime monitoring, despite providing only static scene snapshots. This work introduces HARBOR (Heading Analysis and Reconstruction from Behavioral Observation and Radar), a complete pipeline for transforming a single SAR image into predictive motion information without requiring any auxiliary data source at inference time. The method begins with SAR image preprocessing to enhance and segment vessel candidates, followed by automatic detection, size-based classification, and heading estimation using skeleton geometry and local intensity patterns. AIS data are used exclusively during an offline calibration phase to derive vessel-type-dependent motion parameters, which are then applied to generate probabilistic heatmaps of candidate future vessel positions. A case study using real COSMO-SkyMed SAR imagery demonstrates the pipeline on a maritime scene in southern Brazil, showing its ability to extract motion tendencies and generate probabilistic projections of vessel positions in data-denied environments.

2606.14005 2026-06-15 cs.CV New submission

Context-Guided Semantic Alignment for Feature Fusion Networks

Hyungseop Lee, Jiho Lee, Woochul Kang

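Affiliations: Department of Embedded Systems Engineering, Incheon National University

AI summary: Proposes FINE, a lightweight semantic alignment module that uses cross-level attention so high-level context guides low-level feature fusion, and introduces alignment-aware token sampling to reduce computational complexity, improving object detection accuracy.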

Comments 26 pages, 12 figures, 8 tables

Abstract

Feature fusion networks are fundamental components in modern object detectors, aggregating multi-scale features to detect objects of varying sizes. However, directly fusing features from different pyramid levels often introduces semantic inconsistency due to their heterogeneous representations. In this paper, we propose Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that refines low-level features via high-level contextual guidance using cross-level attention prior to fusion. To bridge the structural gap and ensure computational efficiency, we introduce an Alignment-Aware Token Sampling that aligns corresponding spatial regions across scales, reducing the attention complexity by an order of magnitude. The resulting attention weights generate a spatial-channel modulation map that is upsampled and applied to the low-level features via residual element-wise modulation. This mechanism ensures that the network selectively enhances semantically relevant pixels while preserving the sub-pixel localization accuracy necessary for dense prediction tasks. FINE is generally applicable to various detectors and consistently improves detection accuracy without compromising efficiency.

2606.14000 2026-06-15 cs.AI New submission

Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance

Theodore Meek, Siyuan Ge, Di Qiu Xiang, Simon Chess, Vasily Ilin

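Affiliations: University of California, Berkeley

AI summary: Presents a coding-agent pipeline that formalizes a numerical analysis textbook in Lean 4 and introduces a three-dimensional quality evaluation framework (semantic correctness, Mathlib reuse, cross-file reuse), finding that successful compilation masks unfaithful formalization patterns.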

Abstract

Recent work has demonstrated that coding agents can formalize entire advanced mathematics textbooks in Lean 4, yet existing efforts concentrate on branches of mathematics already well-represented in mathlib and measure success solely through kernel acceptance. We address both limitations by applying a coding agent to formalize Numerical Methods for Ordinary Differential Equations, a textbook in numerical analysis that is largely absent from mathlib, stressing the agent's capacity to develop new theory from scratch. We further introduce a systematic, reproducible three-dimensional framework for evaluating the quality of agent-produced formalizations beyond compilation: semantic correctness, Mathlib reuse, and cross-file reuse via LLM-as-judge methods. Applying this framework to our own formalization and to the released outputs of RepoProver and M2F, we uncover recurring unfaithful formalization patterns, including incomplete multi-part statements, added weakening hypotheses, and parameter restrictions, that kernel acceptance entirely obscures. Our results suggest that compilation-based metrics substantially overstate formalization quality, and we provide a reproducible audit methodology to support more rigorous evaluation of future autoformalization systems.

2606.13995 2026-06-15 cs.CL New submission

Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

Brendan King, Jeffrey Flanigan

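Affiliations: University of California, Santa Cruz

AI summary: Introduces the Dialogue SWE-Bench benchmark, which uses a user simulator to evaluate how well coding agents resolve software engineering problems through dialogue, and proposes a schema-guided agent that improves dialogue performance by 3-14%.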

Comments 22 pages, 13 figures

Abstract

AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user. We design a novel, persona-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.

2606.13993 2026-06-15 cs.CL New submission

The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models

Zachary Nicholas Houghton, Yu Zhou, Dan Pluth, Vijay K. Gurbani

Affiliations * University of Oregon; Vail Systems, Inc

AI summary: Studies the holistic storage of verb+up phrases in text-based and audio-based language models, finding that frequency and predictability drive independent representations, in support of usage-based theories of language.

2606.13990 2026-06-15 cs.RO New submission

SplatlessDF: Continuous Distance Field Mapping with Non-Splatting Gaussians

Monisha Mushtary Uttsha, Lan Wu, Teresa Vidal-Calleja

Affiliations * UTS Robotics Institute, Faculty of Engineering and IT, University of Technology Sydney; School of Engineering, University of Western Australia

AI summary: Proposes the SplatlessDF framework, which uses anisotropic Gaussian elements from a spatial perspective to build a continuous distance field, supports distance and gradient queries, and can be combined with 2D Gaussian splatting for unified modeling, making it suitable for robot navigation.

Abstract

Recent Gaussian splatting (GS) methods have shown that scenes can be represented efficiently with optimisable Gaussians for high-quality reconstruction and rendering. In this paper, building on this principle, we introduce SplatlessDF, a continuous distance field (DF) mapping framework that uses anisotropic Gaussian elements from a spatial rather than photometric perspective. SplatlessDF directly parameterises the Gaussians and optimises to recover a differentiable DF, enabling distances and gradients to be queried in the spatial domain for downstream robotic tasks such as navigation. Furthermore, SplatlessDF can be coupled with 2D Gaussian splatting (2DGS), providing a unified framework based solely on Gaussian primitives that can learn continuous DF and surface models and supports photometric rendering. We consider two settings: a standalone DF-only formulation and a joint DF-rendering formulation coupled with 2DGS. Experiments show that the standalone formulation provides efficient and accurate distance and gradient queries, while the joint formulation improves rendering geometry and simultaneously models a continuous DF. These results highlight the potential of GS-style representations not only for surface modelling and rendering but also for mapping representations suited to robotic navigation.

2606.13989 2026-06-15 cs.SD cs.AI New submission

Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech

Alef Iury Siqueira Ferreira, Lucas Rafael Stefanel Gris, Luiz Fernando de Araújo Vidal, Frederico Santos de Oliveira, Christopher Dane Shulby, Anderson da Silva Soares, Arlindo Rodrigues Galvão Filho

Affiliations * Federal University of Goiás; Federal University of Uberlândia; University of São Paulo; University of Brasília; University of California, Berkeley

AI summary: Proposes the Mask, Sample, Revise inference stack, which combines predictor-free guidance, prompt-matched conditional coupling, and a schedule-constrained remasking mechanism to improve the robustness and intelligibility of discrete flow matching TTS at low step counts.

Abstract

Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a Continuous-Time Markov Chain (CTMC) framework for discrete generation, a natural fit. However, inference-time control for stable low-step conditional infilling remains underexplored. We propose Mask, Sample, Revise, an inference-time CTMC stack for alignment-free DFM-TTS. The stack combines predictor-free guidance to strengthen text conditioning, prompt-matched conditional coupling to align the probability path with the acoustic prompt, and SC-ReMask, a schedule-constrained remasking mechanism that introduces token-to-mask transitions so early de-masking decisions can be revised. These components require no post-hoc fine-tuning and operate in a single tau-leaping sampler. Controlled ablations show that this stack improves intelligibility and robustness in the low-NFE prompted setting, outperforming unguided and guidance-only samplers with substantially more steps.

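2606.13977 2026-06-15 cs.CL New submission

Creative Integration: A Decidable Criterion of Creativity

Yoshinori Nomura

Affiliations * Mirage Mountain Technologies

AI summary: Proposes a decidable criterion for creative integration based on description-length compression, made decidable through four binary gates and a taxonomy of pseudo-integration, and validated on a multi-domain corpus through four falsifiable tests.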

Comments 18 pages, 1 figure

Abstract

"Integrative" solutions are widely praised but rarely defined: we lack an operational way to tell a genuine integration -- one that makes the world cheaper to describe -- from a tidy re-description. Building on the lineage that treats creativity and intelligence as compression, we give such a criterion for creative integration (CI): the resolution of a real conflict between A and B is CI if and only if, under a fixed description language, the description length strictly shrinks (C = L_pre/L_post > 1), with the reduction located in the conflict itself. We make the judgment decidable through four binary, conjunctive gates, and we fix its extension through a taxonomy of pseudo-integration that names and rejects the look-alikes. We back the criterion with a curated, multi-domain corpus and -- crucially -- validate it not by human inter-rater agreement but by four falsifiable tests it could fail: an independent computational check, discrimination against hard negatives, out-of-sample prediction, and description-language robustness; all pass with margin. The contribution is not "creativity is compression" but its decidability, discrimination, and corpus: on this account, what makes a move genuinely creative -- rather than merely novel -- is that it compresses a conflict, with novelty and value as downstream symptoms; whether all creativity is so constituted we state as an explicit conjecture. We claim only the sign of C-1; we judge, not generate. The result is a citable primitive for a broader program.

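As a rough illustration of the ratio C = L_pre/L_post described in the abstract above, the sketch below uses zlib-compressed byte length as a stand-in for description length. That proxy and the function names are assumptions made for illustration; the abstract does not say how description length is measured, and the four gates are not modeled here.

```python
import zlib


def description_length(text: str) -> int:
    """Proxy for description length: bytes of zlib-compressed UTF-8 text (an assumption)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compression_ratio(pre: str, post: str) -> float:
    """C = L_pre / L_post, with both lengths measured under the same fixed proxy."""
    return description_length(pre) / description_length(post)


def sign_of_c_minus_1(pre: str, post: str) -> int:
    """Return +1 if the description shrinks (C > 1), -1 if it grows, 0 if unchanged."""
    l_pre, l_post = description_length(pre), description_length(post)
    return (l_pre > l_post) - (l_pre < l_post)
```

Only the sign is reported by `sign_of_c_minus_1`, mirroring the abstract's statement that the claim concerns the sign of C-1 rather than its magnitude. A general-purpose compressor is a crude proxy on short strings, so the numeric ratio should not be read as meaningful.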
2606.13971 2026-06-15 cs.CV New submission

Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation

Xiaomeng Yang, Yanyu Li, Gordon Guocheng Qian, Ivan Skorokhodov, Viacheslav Ivanov, Avalon Vinella, Xuan Zhang, Yanzhi Wang, Sergey Tulyakov, Anil Kag

Affiliations * Northeastern University; Snap Inc.

AI summary: Proposes Prompt2Effect, a weight-driven hypernetwork that directly synthesizes effect-specific LoRA weights in a single forward pass with no training, preserving video quality while cutting computational cost from 56 GPU hours to 3.3 seconds.

Abstract

Personalizing Image-to-Video (I2V) diffusion models with specific visual effects is increasingly demanded for high-end video generation. Current practice requires training a separate Low-Rank Adaptation (LoRA) module for each effect, incurring substantial data curation and iterative optimization costs that hinder interactive control. We present Prompt2Effect, a weight-driven hypernetwork that amortizes per-effect training by directly synthesizing effect-specific LoRA weights in a single forward pass. Unlike prior hypernetworks that regress adapter weights purely from semantics, Prompt2Effect is explicitly conditioned on the frozen base model weights, grounding weight prediction in the structural geometry of each layer. Furthermore, instead of predicting raw LoRA matrices, we introduce an SVD-canonicalized parameterization that resolves factorization ambiguity and stabilizes large-scale weight synthesis. Together, these design principles enable accurate and scalable LoRA prediction for high-dimensional I2V diffusion models. Extensive experiments demonstrate that Prompt2Effect achieves on-par or superior video quality and effect alignment compared to conventional LoRA fine-tuning, while reducing the computational cost from 56 GPU training hours to 3.3 seconds of hypernetwork inference. When used as initialization for subsequent fine-tuning, our predicted weights further improve final performance and accelerate optimization by approximately 10x.
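The factorization ambiguity mentioned in the abstract can be made concrete: a LoRA update B A is unchanged when B is multiplied by any invertible matrix G and A by its inverse, so the raw factors are not a unique prediction target. The sketch below shows one generic way to pick a canonical pair through a truncated SVD. It is an assumption about what such a canonicalization could look like, not the paper's parameterization, and the function name is hypothetical.

```python
import numpy as np


def canonicalize_lora(B: np.ndarray, A: np.ndarray):
    """Return a canonical pair (B_c, A_c) with B_c @ A_c equal to B @ A.

    B has shape (d_out, r) and A has shape (r, d_in).
    """
    r = B.shape[1]
    # Thin SVD of the low-rank update; only the top-r components are non-zero.
    U, S, Vt = np.linalg.svd(B @ A, full_matrices=False)
    U, S, Vt = U[:, :r], S[:r], Vt[:r, :]
    # Singular vectors are unique only up to a joint sign flip, so fix the sign
    # by making the largest-magnitude entry of each left vector positive.
    rows = np.argmax(np.abs(U), axis=0)
    signs = np.sign(U[rows, np.arange(r)])
    signs[signs == 0] = 1.0
    U = U * signs
    Vt = Vt * signs[:, None]
    # Split the singular values evenly between the two factors.
    root = np.sqrt(S)
    return U * root, root[:, None] * Vt
```

With distinct singular values, two factor pairs that encode the same update map to the same canonical pair. Forming the full product B @ A is affordable only for small layers; a production version would more likely work from QR factors of B and A.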

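2606.13970 2026-06-15 cs.RO cs.LG New submission

An Attention-based Model for Robust Forecasting with Missing Modality

Zhitian Zhang, Wenjie Zi, Yunduz Rakhmangulova, Saghar Irandoust, Hossein Hajimirsadeghi, Thibaut Durand

Affiliations * Simon Fraser University; RBC Borealis

AI summary: Proposes a multimodal model based on a conditional variational autoencoder and a Transformer that uses attention to learn a unified, fixed-dimensional representation, handles missing modalities during training and inference, and outperforms existing methods on human trajectory prediction and robot manipulation forecasting.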

Comments Work originally done in 2023

Abstract

Learning with missing modalities is a fundamental challenge in multimodal robot learning, as real-world robotic systems often operate in environments with incomplete sensor data. Attention-based models are appealing for processing multimodal data because they can handle multiple modalities with a single backbone network. However, most multimodal models assume that all modalities are available during both training and inference, limiting their applicability in robotic perception and decision-making. In this paper, we introduce a multimodal model designed to handle missing modalities during both training and inference. The model is formulated as a conditional variational autoencoder (CVAE) and incorporates a transformer-based architecture that leverages attention mechanisms to learn a unified, fixed-dimensional representation, even when some modalities are missing. We show that our proposed model can be trained with missing modalities while approximating a robust representation of all modalities. We evaluate our approach on five multimodal datasets across two robot learning tasks: human trajectory prediction and robot manipulation forecasting. Experimental results demonstrate that our model effectively learns from incomplete data and is superior to prior multimodal fusion approaches.
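One common way attention can yield a fixed-dimensional representation when some inputs are absent is to mask the missing modality tokens before the softmax. The sketch below shows that idea with a single query vector pooling over one token per modality. The function name, shapes, and single-query design are assumptions for illustration; the CVAE and the paper's actual architecture are not reproduced.

```python
import numpy as np


def masked_attention_pool(tokens: np.ndarray, present: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Pool modality tokens into one vector, ignoring missing modalities.

    tokens:  (M, d) array, one embedding per modality (rows for missing ones are ignored).
    present: (M,) boolean array, True where the modality was observed.
    query:   (d,) query vector (learned in a real model, fixed here).
    """
    if not present.any():
        raise ValueError("at least one modality must be present")
    scores = tokens @ query / np.sqrt(tokens.shape[1])
    # Missing modalities get a score of -inf, which becomes weight 0 after softmax.
    scores = np.where(present, scores, -np.inf)
    scores = scores - scores[present].max()  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ tokens
```

The output has shape `(d,)` regardless of how many modalities are present, and the contents of a masked row do not affect it as long as they are finite.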

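2606.13964 2026-06-15 cs.CV New submission

CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis

Dongyu Wang, Dar-Yen Chen, Yi-Zhe Song

Affiliations * SketchX, CVSSP, University of Surrey

AI summary: Proposes CaricHarmony, a training-free method that resolves contamination between identity and shape conditioning signals through parallel uncontaminated diffusion paths, achieving balanced caricature synthesis with the best shape fidelity while maintaining identity consistency.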

Abstract

Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in diffusion models, they create destructive interference that causes inevitable collapse toward either bland portraits or unrecognizable distortions. We identify the root cause as \emph{condition signal contamination} -- competing probability distributions in the denoising trajectory that make balanced generation impossible. We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: $\mathcal{P}^{\mathrm{i}}$ (pure identity), $\mathcal{P}^{\mathrm{s}}$ (pure shape), and $\mathcal{P}^{\mathrm{i+s}}$ (harmonized output). Novel energy functions operating on cross-attention features provide gradient guidance that steers $\mathcal{P}^{\mathrm{i+s}}$ toward optimal balance: $\mathcal{E}_{\mathrm{shape}}$ ensures sketch fidelity through layout and semantic alignment, while $\mathcal{E}_{\mathrm{id}}$ employs token-level correspondence matching robust to extreme distortions. Unlike DemoCaricature requiring 70 seconds per-identity fine-tuning or CaricatureBooth constrained to Bezier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: 0.8615 shape CLIP score (vs. 0.8450) under comparable identity consistency score, with 7.81 overall user preference score (vs. 6.06). Our method fundamentally reconceptualizes the ID-shape conflict as conditioning signal contamination for diffusion models, enabling unprecedented creative control while preserving recognition.

2606.13959 2026-06-15 cs.LG New submission

Can Machine Learning Forecast Rice Yields in Data-Constrained Settings? Satellite Climate Data, National Crop Statistics, and Lessons from Sierra Leone

Ibrahim Denis Fofanah

Affiliations * Seidenberg School of Computer Science & Information Systems, Pace University, New York, USA; RiseAfrica Foundation for STEM and Innovation, Sierra Leone, West Africa

AI summary: Using 25 years of Sierra Leone crop statistics and free satellite climate data, trains machine learning models under a strict anti-leakage protocol and finds that a climate-only XGBoost cuts rice yield forecast error by one third, that early-season rainfall is the key predictor, and translates the findings into policy recommendations.

Comments 32 pages, 7 figures. Code and data: https://github.com/Denis060/sierraleone-agri-ml

Abstract

Sierra Leone's agriculture operates with almost no data-driven decision support, and no published machine learning study has examined the country's crop yields. We ask whether rice yield can be forecast from data Sierra Leone currently has. Using 25 years of FAOSTAT production data (2000-2024) for nine major crops, we train XGBoost, Gradient Boosting, and Random Forest under a strict anti-leakage protocol with expanding-window walk-forward evaluation across seven held-out years, benchmarked against naive persistence. No model trained on crop statistics alone outperforms persistence. Augmenting with free satellite climate data (CHIRPS rainfall, NASA POWER temperature) reverses this result: a climate-only XGBoost reduces forecast error by one third (RMSE 284 vs 428 kg/ha), a gain that holds for a linear model and is robust to excluding the anomalous 2018 season. Early-season (May-June) rainfall is the dominant predictor, implying seasonal yield risk is observable months before harvest. No model anticipated the 2018 collapse, whose origins were institutional rather than climatic. We translate the findings into policy recommendations for Sierra Leone's Feed Salone Strategy, with a fully open-source pipeline.
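The evaluation protocol named in the abstract, expanding-window walk-forward evaluation against naive persistence, can be sketched as follows. The least-squares model stands in for the paper's XGBoost and other learners, and the function names are hypothetical; none of this is taken from the released pipeline.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def walk_forward(X: np.ndarray, y: np.ndarray, n_test: int):
    """Expanding-window walk-forward evaluation.

    X is (n_years, n_features) and y is (n_years,). For each of the last
    n_test years t, the model is fit on years 0..t-1 only, so no future data
    leaks into the forecast. Returns (model_rmse, persistence_rmse).
    """
    n = len(y)
    model_preds, naive_preds, truth = [], [], []
    for t in range(n - n_test, n):
        # Fit a linear model with intercept on all years strictly before t.
        design = np.column_stack([np.ones(t), X[:t]])
        coef, *_ = np.linalg.lstsq(design, y[:t], rcond=None)
        model_preds.append(np.concatenate([[1.0], X[t]]) @ coef)
        # Naive persistence: next year's yield equals last year's yield.
        naive_preds.append(y[t - 1])
        truth.append(y[t])
    return rmse(truth, model_preds), rmse(truth, naive_preds)
```

Called with `n_test=7` on a 25-row series, this mirrors the seven held-out years described above. The training window grows by one year per step, and the persistence baseline uses only the previous year's observed value.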

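2606.13955 2026-06-15 cs.LG New submission

Smoothing Dark Areas in Molecular Latent Diffusion

Xi Wang, Jiahan Li, Yuxuan Xia, Yingcheng Wu, Shaoyi Zheng, Shengjie Wang

Affiliations * New York University; Stanford University

AI summary: To address dark areas in molecular latent diffusion, proposes a topology-optimized VAE (TopVAE) that internalizes structural and chemical constraints during training, reducing dark areas and improving off-posterior robustness, with significant improvements on QM9 and GEOM-Drugs.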

Abstract

Latent diffusion is a promising framework for scalable 3D molecular generation, but it requires a latent space that remains smooth, valid, and navigable beyond posterior samples. Existing molecular VAEs, however, are typically learned through reconstruction-based objectives, which do not guarantee such a latent space. We show that this leads to dark areas: regions of latent space that are reachable during diffusion sampling but decode to disconnected or chemically invalid molecules. Unlike in image generation, molecular decoding requires strict structural and chemical precision, so even small latent perturbations can produce catastrophic failures. We therefore propose TopVAE, a topology-optimized VAE that reduces dark areas by making the decoder internalize structural and chemical constraints during training, eliminating the need for test-time chemical correction. TopVAE greatly improves off-posterior robustness, and when paired with a standard DiT, achieves $77\%$ lower FCD-3D on QM9, the highest V&C, $52\%$ lower FCD-3D on GEOM-Drugs, and $1.29{\times}$ more stable and connected molecules on zero-shot scaffold inpainting.

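2606.13949 2026-06-15 cs.AI New submission

Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization

Hexuan Yu, Chaoyu Zhang, Heng Jin, Shanghao Shi, Ning Zhang, Y. Thomas Hou, Wenjing Lou

Affiliations * University of Science and Technology of China

AI summary: To address the privacy leakage caused by LLM agents transmitting the full UI state, proposes the MINIM framework, which learns a dual score (sensitivity and necessity) on the client side grounded in contextual integrity and applies a ternary disclosure policy to produce a privacy-aware minimal view, reducing sensitive leakage while preserving task-critical information.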

Comments Accepted at ICML 2026 (43rd International Conference on Machine Learning, Seoul, South Korea). Code available at https://github.com/yyyyhx/MINIM

Abstract

Modern LLM-powered autonomous agents increasingly rely on rich user interface (UI) state observations to achieve reliable action grounding in complex digital environments. However, many deployments transmit the full UI state to remote inference servers even when most elements are irrelevant to the current task, which can leak sensitive but unnecessary context such as authentication codes, private notifications, and background application states. We propose MINIM, a trusted local broker that performs privacy-aware minimization on the client side before any observation leaves the device. Grounded in Contextual Integrity (CI), MINIM learns a dual-score representation for each UI element by predicting an inherent sensitivity score (s) and a task-conditioned necessity score (n). These scores drive a ternary disclosure policy that keeps essential elements, abstracts sensitive attributes when needed, and removes task-irrelevant content. We optimize a CI-aware objective that penalizes necessity errors more strongly on high-risk content, enabling aggressive pruning while preserving task-critical information. Experiments on real-world UI observations derived from WebArena show that MINIM substantially reduces task-irrelevant sensitive leakage while preserving task-critical semantic context and the interactive affordances required for reliable agent actions.
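The ternary disclosure policy can be pictured as a simple mapping from the two scores to one of three actions. In the sketch below the thresholds, the field names, and the exact rule (remove when necessity is low, abstract when a necessary element is sensitive, keep otherwise) are all assumptions chosen for illustration; the abstract states only that the sensitivity score (s) and necessity score (n) drive a keep, abstract, or remove decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Action(Enum):
    KEEP = "keep"
    ABSTRACT = "abstract"
    REMOVE = "remove"


@dataclass(frozen=True)
class UIElement:
    role: str           # e.g. "button", "textbox" (hypothetical field)
    text: str           # visible content of the element
    sensitivity: float  # s in [0, 1], predicted elsewhere
    necessity: float    # n in [0, 1], conditioned on the current task


def disclose(el: UIElement, s_thresh: float = 0.5, n_thresh: float = 0.5) -> Action:
    """Map the (s, n) scores of one element to a disclosure action."""
    if el.necessity < n_thresh:
        return Action.REMOVE    # task-irrelevant: never leaves the device
    if el.sensitivity >= s_thresh:
        return Action.ABSTRACT  # needed, but the raw value is hidden
    return Action.KEEP


def minimal_view(elements: List[UIElement]) -> List[Dict[str, str]]:
    """Build the observation that would be sent to the remote model."""
    view = []
    for el in elements:
        action = disclose(el)
        if action is Action.REMOVE:
            continue
        text = el.text if action is Action.KEEP else "<redacted>"
        view.append({"role": el.role, "text": text})
    return view
```

In this sketch an abstracted element keeps its role so the agent can still act on it, while its text is replaced by a placeholder.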

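2606.13945 2026-06-15 cs.CL New submission

MedLatentDx: Latent Multi-Agent Communication for Cross-Hospital Rare-Disease Diagnosis

Ziqing Wang, Lili Zhao, Kaize Ding

Affiliations * Northwestern University

AI summary: Proposes the MedLatentDx framework, which enables cross-hospital rare-disease diagnosis through latent multi-agent communication, transmits latent KV blocks to protect privacy, supports same-backbone and cross-family LLM deployments, and improves diagnostic performance on CrossRare-Bench while reducing reconstructable clinical content.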

Abstract

Rare diseases affect over $300$ million patients across more than $7{,}000$ conditions, yet no single hospital encounters enough cases of any one condition for reliable diagnosis. Cross-hospital collaboration could help by allowing a diagnosing institution to use distributed, case-specific diagnostic evidence, but privacy regulations restrict the transmission of identifiable clinical text across institutional boundaries. This setting raises two challenges: existing medical agent systems often rely on textual evidence exchange, while raw latent states such as hidden states and KV caches may still reveal prompt-derived clinical content. We introduce MedLatentDx, a latent multi-agent communication framework in which hospital agents keep private clinical records and retrieved cases local, and send compact latent KV blocks to a host agent for rare-disease diagnosis. MedLatentDx supports two deployment settings: same-backbone hospital agents use latent KV distillation, while hospitals with different LLM backbones use cross-family latent alignment. On CrossRare-Bench, a self-built large-scale rare-disease benchmark with hospital-level partitions, MedLatentDx improves cross-hospital diagnostic performance while reducing reconstructable clinical content relative to raw-latent communication baselines.

2606.13944 2026-06-15 cs.CL New submission

LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

Filip Trhlik, Aoife O'Flynn, Angela Yu, Arduin Findeis, Paula Buttery

Affiliations * University of Cambridge; ALTA Institute; Leverhulme Centre for the Future of Intelligence; Microsoft UK

AI summary: Using two pairwise comparison paradigms (country preference ranking and utility judgements), this study finds that deployment context (such as writing a Reddit post or a news article) affects LLM preferences and values far more than prompt paraphrasing and temperature controls, indicating that model-level preferences are context-dependent.

Comments 68 pages, 54 figures, 54 tables

Abstract

Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt perturbations such as syntax variation and option reordering. This leaves open whether the measured properties survive when the surrounding task context changes, as it does in most real deployments. We test this directly across two established pairwise paradigms: ranking country preferences and eliciting utility judgements. In both, we make the deployment context -- the high-level task the model is performing while making concrete value-dependent choices -- our controlled variable, varied across framings such as writing a Reddit post or a news article. Across five LLMs and over 1.2M pairwise decisions, deployment context produces variation far larger than prompt paraphrasing and temperature controls. In country preference rankings over 15 countries, context induces widespread, statistically significant rank shifts; the aggregate Global North favouritism reported in prior work is itself context-dependent, with each model's bias shifting systematically across contexts. In utility elicitation over 50 outcomes, broad cross-category ordering is preserved, but fine-grained rankings within domains vary substantially, and cardinal exchange rates between outcomes (e.g. how many lives in one region equal one in another) shift by a factor of 2.47 at the median. Reported model-level preferences and utilities are therefore better understood as context-conditioned measurements than fixed model-level properties: safety guarantees obtained under one framing provide limited assurance in another.

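2606.13940 2026-06-15 cs.CL New submission

Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

Ziqing Wang, Weihao Li, Shijie Chen, Yuan Luo, Kaize Ding

Affiliations * Northwestern University

AI summary: Compares prompting, supervised fine-tuning, and reinforcement learning (GRPO and the newly proposed PHI curriculum) on ICD coding and finds that post-training (especially SFT and GRPO) substantially improves the coding performance of generative LLMs, suggesting that the bottleneck lies in model adaptation rather than in the generative paradigm itself.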

Abstract

Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding mainly comes from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning, and reinforcement learning under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at https://github.com/AlexandreWANG915/LLM4ICD.
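The distinction between code-set prediction and macro-level performance is easier to see with a small calculation. The sketch below computes micro-F1 and macro-F1 over predicted ICD code sets. Which metrics the paper actually reports is not stated in the abstract, so the choice of F1, the averaging over codes seen in gold or predictions, and the function name are assumptions.

```python
from collections import Counter
from typing import List, Set, Tuple


def micro_macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float]:
    """Micro- and macro-averaged F1 for multi-label code prediction."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for code in g & p:
            tp[code] += 1  # predicted and correct
        for code in p - g:
            fp[code] += 1  # predicted but not in gold
        for code in g - p:
            fn[code] += 1  # in gold but missed

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    codes = set(tp) | set(fp) | set(fn)
    # Micro: pool counts over all codes, so frequent codes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-code F1, so every code counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in codes) / len(codes) if codes else 0.0
    return micro, macro


# Hypothetical example: a coder that only ever predicts the common code I10.
gold = [{"I10", "E11.9"}, {"I10"}, {"I10", "N18.3"}]
pred = [{"I10"}, {"I10"}, {"I10"}]
micro, macro = micro_macro_f1(gold, pred)
```

In this example micro-F1 is 0.75 while macro-F1 is 1/3, because the two rarer codes are never predicted. The gap shows why recall over the full taxonomy matters for macro-level scores.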
