arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28214 2026-05-28 cs.CR cs.LG cs.MA

Out of Sight, Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems

眼不见,心不烦:揭示基于潜在的多智能体系统中的潜在攻击

Chenxi Wang, Ruiyang Huang, Jiayan Sun, Lei Wei, Yifan Wu

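AI summary: Studies whether latent representations can carry attack information and proposes a framework that activates attack effects through latent interventions; experiments show that latent attacks substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs.

Details
Comments
27 pages, 7 figures, 3 tables. Preprint
AI Chinese abstract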

基于潜在的多智能体系统用隐藏表示替代部分显式智能体间通信,为高效灵活的智能体协作提供了新方向。然而,将协调移至潜在空间也可能将攻击移至可见文本检查范围之外。本文研究潜在状态能否携带在清洁执行期间仍然有效的攻击相关信息。为探究此问题,我们引入了一个潜在攻击框架,通过潜在干预重新激活攻击诱导的效果,而无需重用对抗性文本。大量实验表明,由此产生的纯潜在攻击在清洁执行中能显著降低任务性能,尤其当应用于智能体间KV缓存传递而非局部隐藏状态时。进一步的控制分析表明,这种性能下降不能归因于任意扰动或无效生成。总体而言,我们的发现表明基于潜在的协作并未消除攻击风险,而是将部分风险转移至较不可见的执行状态,这要求超越可见文本检查的安全防护措施。

English abstract

Latent-based multi-agent systems replace parts of explicit inter-agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent space may also move attacks beyond the reach of visible-text inspection. In this paper, we study whether latent states can carry attack-associated information that remains effective during clean executions. To examine this question, we introduce a latent attack framework that reactivates attack-induced effects through latent interventions without reusing adversarial text. Extensive experiments show that the resulting latent-only attacks can substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs rather than local hidden states. Further control analyses indicate that this degradation cannot be reduced to arbitrary perturbations or invalid generation. Overall, our findings suggest that latent-based collaboration does not remove attack risk. It shifts part of the risk into less observable execution states, calling for safeguards beyond visible-text inspection.

2605.28213 2026-05-28 cs.AI

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

学习何时优化:来自专家GPU内核谱系的验证优化技能

Shuoming Zhang, Qiuchu Yu, Yangyu Zhang, Ruiyuan Xu, Xiyu Shi, Guangli Li, Xiaobing Feng, Huimin Cui, Jiacheng Zhao

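AI summary: Proposes KLineage, which walks expert GPU-kernel implementations backward and extracts reusable optimization skills, learning the conditions under which each optimization applies, to improve the quality and efficiency of kernels generated by LLM agents.

Details
Comments
Preprint, Under Review
AI Chinese abstract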

基于LLM的代理越来越多地被用于生成GPU内核,但它们通常知道尝试哪些优化,却不知道这些优化何时是合理的。我们引入了KLineage,它从专家内核中学习这种缺失的“何时”知识:KLineage不是依赖前向展开,而是通过验证门控简化反向遍历专家实现,并将每个接受的步骤逆转为可重用的优化技能。每个技能不仅记录了优化意图,还记录了它在代码中的适用位置、使其有效的条件、产生的效果以及其假设避免了哪些失败。下游LLM在相同的编译/正确性/性能分析门控下,将这些技能应用到新的代码表面上。在两个NVIDIA架构上的五个专家工作负载中,这些谱系衍生的技能作为有效的优化课程,在相同的固定预算下,在最终内核质量和优化效率方面均超过了近期基于内存的LLM内核基线。此外,我们使用一个单独的22实例保留检查作为对源案例记忆的合理性测试。

English abstract

LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert kernels: instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent, but also where it applies in code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, these lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. We additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.

2605.28211 2026-05-28 cs.CL

When Helpful Context Leaks: Privacy Risks in Domain-Adapted ASR

当有用上下文泄露:领域自适应ASR中的隐私风险

Maike Züfle, Jan Niehues

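AI summary: Identifies and systematically studies the privacy risk that context prompts or fine-tuning cause domain-adapted ASR models to leak private information; builds a controlled dataset to measure leakage rates, and evaluates a prompt-level mitigation strategy and the accuracy-leakage trade-off.

Details
AI Chinese abstract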

语音大语言模型越来越多地部署在专业环境中,领域定制是标准做法:用户在提示中提供包含敏感信息的上下文,在专有录音上进行微调,或两者兼有。我们识别并系统研究了这种定制的一个被忽视的隐私风险:适应于识别领域特定术语的模型可以被诱导转录其上下文或训练数据中一个语音相似的词,即使说的是不同的词,从而泄露私人信息。为了评估这一风险,我们构建了一个控制数据集,并测量了两种定制机制(提示和微调)下的泄露率。两种机制都会导致可测量的泄露,且组合时加剧。我们评估了一种提示级缓解策略,并分析了不同定制方法下的精度-泄露权衡,发现无上下文提示的微调提供了最佳平衡。我们公开了代码和数据集。

English abstract

SpeechLLMs are increasingly deployed in professional settings where domain customisation is standard practice: users supply context in prompts with sensitive information, fine-tune on proprietary recordings, or both. We identify and systematically investigate an overlooked privacy risk of such customisation: a model adapted to recognise domain-specific terminology can be nudged into transcribing a phonetically similar word from its context or training data, even when a different word is spoken, thereby leaking private information. To evaluate this risk, we construct a controlled dataset and measure leakage rates across two customisation mechanisms, prompting and fine-tuning. Both mechanisms cause measurable leakage, compounding when combined. We evaluate a prompt-level mitigation strategy and analyse the accuracy-leakage trade-off across customisation approaches, finding that fine-tuning without context prompts offers the best balance. We release our code and dataset publicly.

2605.28203 2026-05-28 cs.LG

Refining Multidimensional Video Reward Models via Disentangled Influence Functions

通过解耦影响函数优化多维视频奖励模型

Muyao Wang, Zeke Xie, Hideki Nakayama

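AI summary: Addresses the inconsistent reliability of training samples across evaluation dimensions in text-to-video generation; proposes a disentangled influence framework that estimates dimension-specific supervision risk, along with dimension-disentangled pruning and reweighting strategies, significantly improving the alignment of multidimensional video reward models with ground-truth annotations.

Details
AI Chinese abstract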

随着文本到视频(T2V)生成模型的不断发展,视频评估的复杂性要求跨多个轴进行细粒度评估。为此,近期工作致力于开发多维视频奖励模型(MVRMs),将评估过程分解以更好地适应人类视觉感知的多面性。然而,训练有效的MVRMs从根本上受到视频数据复杂性的挑战。在本工作中,我们识别出一个关键现象,称为维度异质性:训练样本的可靠性在不同评估维度上可能显著不同,这意味着一个样本可能为一个目标提供可靠的监督,同时为另一个目标引入高监督风险。因此,基于全局标量指标进行过滤的流行数据驱动方法对于T2V任务是不适定的。为解决此问题,我们提出一个解耦影响框架,能够高效估计维度特定的监督风险。利用该框架,我们引入两种维度解耦优化策略:维度解耦剪枝(移除极端高风险样本)和维度解耦重加权(对高风险监督进行软降权)。大量实验表明,我们的解耦策略显著优于全局过滤基线,得到的奖励模型与真实标注的对齐效果更优。

English abstract

As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional Video Reward Models (MVRMs), which decompose the evaluation process to better align with the multifaceted nature of human visual perception. However, training effective MVRMs is fundamentally challenged by the complex nature of video data. In this work, we identify a critical phenomenon termed Dimensional Heterogeneity: the reliability of a training sample can vary substantially across evaluation dimensions, meaning that a sample may provide reliable supervision for one objective while inducing high supervision risk for another. Consequently, prevailing data-centric methods that filter based on global scalar metrics are ill-posed for T2V tasks. To address this, we propose a disentangled influence framework that efficiently estimates dimension-specific supervision risk. Leveraging this framework, we introduce two dimension-disentangled refinement strategies: Dimension-Disentangled Pruning, which removes extreme high-risk samples, and Dimension-Disentangled Reweighting, which softly down-weights high-risk supervision. Extensive experiments demonstrate that our disentangled strategies significantly outperform global filtering baselines, yielding reward models with superior alignment to ground truth.

2605.28202 2026-05-28 cs.RO

Natural Functional Gradients for Smooth Trajectory Optimization

平滑轨迹优化的自然函数梯度

Kisang Park, Chanwoo Kim, Kyungjae Lee, Sungjoon Choi

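AI summary: Proposes a trajectory optimization framework based on natural functional gradients that uses geometry-aware updates in function space and Monte Carlo estimation to generate smoother, more feasible motion trajectories when analytic gradients are unavailable.

Details
AI Chinese abstract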

生成无碰撞且平滑的运动仍然是机器人操作中的一个核心挑战,尤其是在杂乱环境和狭窄通道中,可行区域高度受限且碎片化。我们提出了一种轨迹优化框架,该框架使用自然函数梯度直接在函数空间中进行几何感知更新。该方法优化了一个高斯平滑的替代目标,通过平滑轨迹扰动正则化优化景观,同时保留轨迹级结构。由于更新在函数空间内固有定义,轨迹规则性可以独立于特定时间离散化进行控制。我们推导了自然函数梯度的实用蒙特卡洛估计器,仅需黑盒轨迹评估,使得该方法在由于碰撞检测和接触丰富的仿真导致解析梯度不可用或不可靠时适用。在受限机器人操作任务上的实验表明,与代表性的规划和轨迹优化基线相比,所提出的方法在几何间隙狭窄的环境中提高了轨迹可行性并生成了更平滑的运动。更多结果、视频和实现细节可在项目页面获取:https://kisangpark.github.io/natural-functional-gradient/

English abstract

Generating collision-free and smooth motions remains a central challenge in robotic manipulation, particularly in cluttered environments and narrow passages where feasible regions are highly constrained and fragmented. We propose a trajectory optimization framework that performs geometry-aware updates directly in function space using natural functional gradients. The method optimizes a Gaussian-smoothed surrogate objective that regularizes the optimization landscape through smooth trajectory perturbations while preserving trajectory-level structure. Because the updates are defined intrinsically in function space, trajectory regularity can be controlled independently of a particular time discretization. We derive a practical Monte-Carlo estimator of the natural functional gradient that requires only black-box trajectory evaluations, making the method applicable when analytic gradients are unavailable or unreliable due to collision checking and contact-rich simulation. Experiments on constrained robotic manipulation tasks demonstrate that the proposed method improves trajectory feasibility and produces smoother motions than representative planning and trajectory optimization baselines in environments with narrow geometric clearances. Additional results, videos, and implementation details are available at the project page: https://kisangpark.github.io/natural-functional-gradient/
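
The idea of optimizing a Gaussian-smoothed objective from black-box trajectory evaluations can be illustrated with a minimal sketch. This is a generic smoothing estimator under stated assumptions, not the paper's natural functional gradient: the sinusoidal perturbation basis, the antithetic sampling, the toy cost, and all function names and hyperparameters are hypothetical choices made for illustration only.

```python
import math
import random

def smooth_perturbation(num_steps, num_modes, rng):
    """Sample a smooth 1-D perturbation that vanishes at both endpoints.

    Assumption: a few low-frequency sine modes with N(0, 1) coefficients
    stand in for whatever smooth perturbation distribution is used in practice.
    """
    coeffs = [rng.gauss(0.0, 1.0) for _ in range(num_modes)]
    return [
        sum(c * math.sin((k + 1) * math.pi * t / (num_steps - 1))
            for k, c in enumerate(coeffs))
        for t in range(num_steps)
    ]

def smoothed_gradient(cost, traj, sigma, num_samples, num_modes, rng):
    """Antithetic Monte Carlo estimate of a descent direction.

    In expectation this is the gradient of the Gaussian-smoothed cost,
    preconditioned by the covariance of the perturbations. Only black-box
    evaluations of `cost` are needed.
    """
    grad = [0.0] * len(traj)
    for _ in range(num_samples):
        eps = smooth_perturbation(len(traj), num_modes, rng)
        plus = cost([x + sigma * e for x, e in zip(traj, eps)])
        minus = cost([x - sigma * e for x, e in zip(traj, eps)])
        scale = (plus - minus) / (2.0 * sigma * num_samples)
        grad = [g + scale * e for g, e in zip(grad, eps)]
    return grad

def optimize(cost, traj, steps=50, lr=0.002, sigma=0.1,
             num_samples=8, num_modes=3, seed=0):
    """Plain gradient descent using the Monte Carlo estimate above."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = smoothed_gradient(cost, traj, sigma, num_samples, num_modes, rng)
        traj = [x - lr * g for x, g in zip(traj, grad)]
    return traj

# Toy black-box cost (hypothetical): squared distance to a smooth reference path.
TARGET = [math.sin(math.pi * t / 19) for t in range(20)]

def toy_cost(traj):
    return sum((x - y) ** 2 for x, y in zip(traj, TARGET))

start = [0.0] * 20
result = optimize(toy_cost, start)
```

Because every update is a combination of smooth basis functions, the smoothness of the iterate is governed by the perturbation family rather than by the number of waypoints, which loosely mirrors the abstract's point that regularity is controlled independently of the time discretization.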

2605.28201 2026-05-28 cs.AI

Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

种植、持久化、触发:针对大语言模型智能体的潜伏攻击

Yongxiang Li, Moxin Li, Zhixin Ma, Fengbin Zhu, Dongrui Liu, Wenjie Wang, Fuli Feng

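AI summary: Proposes the Sleeper Attack, in which an attacker injects adversarial content into the agent state, where it persists until a benign user query triggers it in a later interaction and causes harmful behavior; builds a benchmark of 1,896 instances, and experiments show that even the strongest current LLM agents remain vulnerable.

Details
AI Chinese abstract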

大语言模型(LLM)智能体仍然容易受到来自外部环境的安全威胁,攻击者将对抗性内容注入外部观察(如工具返回的数据、网页或MCP上下文),导致有害的智能体行为,例如不安全的操作或错误的输出。现有研究通常关注单次交互攻击,即智能体观察到对抗性内容后立即在单次用户请求中表现出有害行为。然而,我们表明对抗性内容也可以在同一智能体服务的多次交互中持久化,使得此类威胁更难检测和缓解。具体来说,对抗性内容可能持久化在智能体状态中,在多次交互中保持休眠,随后被良性用户查询激活。我们将此类安全威胁形式化为潜伏攻击(Sleeper Attack)。为了评估它,我们构建了一个包含1896个实例的基准测试,涵盖六种真实世界的有害结果、三种攻击策略和三种智能体状态目标:会话上下文、记忆和可复用技能。在七个强大的开源和闭源LLM上的实验表明,最先进的LLM智能体仍然容易受到潜伏攻击,即使在单次交互基线中它们实现了较低的攻击成功率。我们的代码和数据可在https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef获取。

English abstract

Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.

2605.28200 2026-05-28 cs.LG q-bio.GN

Geometry-First Generative Spatial Single-Cell Reconstruction

几何优先的生成式空间单细胞重建

Ehtesamul Azim, Muhtasim Noor Alif, Tae Hyun Hwang, Yanjie Fu, Wei Zhang

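AI summary: Proposes GEARS, a geometry-first framework that combines a diffusion model with a permutation-equivariant generator to reconstruct spatial geometry from single-cell RNA sequencing data, without cell-type labels or histological images.

Details
Comments
32nd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
AI Chinese abstract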

单细胞RNA测序(scRNA-seq)可分析大量细胞但丢失空间背景,而空间转录组学(ST)以较低分辨率保留部分空间结构。现有大多数整合方法要么解卷积斑点混合物,要么将细胞映射到测量的斑点网格上,这使重建受限于固定网格和切片特定坐标系,在非配对设置中尤其成问题。我们提出GEARS,一种几何优先框架,在ST引导下重建内在的单细胞空间几何,无需依赖细胞类型标签、组织学图像或细胞-斑点分配。GEARS首先学习一个域不变的表达编码器,对齐ST斑点和解离细胞,然后训练一个置换等变生成器,配合基于扩散的细化器(采用EDM风格预处理),在来自ST坐标的姿态不变监督下生成局部空间几何。在推理时,GEARS在scRNA-seq细胞的多个重叠子集上重建几何,聚合跨子集的预测成对距离,并解决全局距离几何问题以获得规范二维坐标和密集距离矩阵。大量定量和定性实验(包括横截面泛化)表明,与强空间映射和解卷积基线相比,GEARS在全局距离保持、局部邻域保真度和空间分布对齐方面持续改进。

English abstract

Single-cell RNA sequencing (scRNA-seq) profiles large numbers of cells but loses spatial context, whereas spatial transcriptomics (ST) preserves partial spatial structure at lower resolution. Most existing integration methods either deconvolve spot mixtures or map cells onto a measured spot lattice, which ties reconstructions to a fixed grid and slide-specific coordinate systems, a limitation that is especially problematic in unpaired settings. We propose GEARS, a geometry-first framework that reconstructs an intrinsic single-cell spatial geometry guided by ST, without relying on cell-type labels, histological images, or cell-to-spot assignment. GEARS first learns a domain-invariant expression encoder that aligns ST spots and dissociated cells, and then trains a permutation-equivariant generator with a diffusion-based refiner with EDM-style preconditioning to generate local spatial geometries under pose-invariant supervision derived from ST coordinates. At inference, GEARS reconstructs geometry on many overlapping subsets of scRNA-seq cells, aggregates predicted pairwise distances across subsets, and solves a global distance-geometry problem to obtain canonical two-dimensional coordinates and a dense distance matrix. Extensive quantitative and qualitative experiments, including cross-section generalization, show that GEARS consistently improves global distance preservation, local neighborhood fidelity, and spatial distribution alignment compared to strong spatial mapping and deconvolution baselines.

2605.28198 2026-05-28 cs.LG

Hierarchical Synthetic Tabular Data Generation: A Hybrid Top-Down and Bottom-Up Framework

层次化合成表格数据生成:一种自上而下与自下而上混合框架

Junfeng Nie, Alvin Jin, Xiaohui Chen

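AI summary: Proposes a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structure from stochastic texture, combining structure-driven logical constraints with lightweight tabular generators to improve the semantic consistency and statistical fidelity of synthetic data on weak multimodal financial benchmarks.

Details
Comments
Accepted as a poster at FMSD @ ICML 2026. 9 pages, 6 figures
AI Chinese abstract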

现有的合成表格数据生成方法要么基于纯生成模型,要么基于大语言模型,两者在处理数据异质性、逻辑一致性、罕见事件覆盖以及低数据场景下的鲁棒性方面都存在困难。在本文中,我们提出了一种层次化混合自上而下和自下而上(H-TDBU)框架,该框架将语义结构与随机纹理解耦。在自上而下的路径中,构建了结构驱动的逻辑约束和跨模态对齐规则;而在自下而上的路径中,使用轻量级表格生成器从真实数据中学习局部统计模式。两条路径通过带有迭代反馈循环的统一合成引擎进行整合。我们在结合表格数据和情感文本数据的弱多模态金融基准上评估了该框架。实验结果表明,我们的H-TDBU方法在保持语义一致性的同时,相比神经基线方法提升了训练-合成-测试-真实性能。我们的结果表明,层次化规则引导的合成为在合成数据生成中结合可控性、语义连贯性和统计保真度提供了一种有效机制。

English abstract

Existing approaches for synthetic tabular data generation are based on either purely generative models or LLMs, both of which struggle with data heterogeneity, logical consistency, rare-event coverage, and robustness in low-data regimes. In this paper, we propose a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structures from stochastic texture. In the top-down path, structure-driven logical constraints and cross-modal alignment rules are constructed, while in the bottom-up path, lightweight tabular generators are used to learn local statistical patterns from real data. The two paths are consolidated in a unified synthesis engine with an iterative feedback loop. We evaluate the framework on weak multimodal financial benchmarks combining tabular and sentiment-text data. Experimental results show that our H-TDBU approach improves train-synthetic-test-real performance over neural baseline methods while preserving semantic consistency. Our results suggest that hierarchical rule-guided synthesis provides an effective mechanism for combining controllability, semantic coherence, and statistical fidelity in synthetic data generation.

2605.28192 2026-05-28 cs.AI

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

面向多跳音视频推理的主动全模态感知代理

Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu, Yanfeng Wang, Yu Wang

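AI summary: To address evidence that is sparse and spread across modalities in multi-hop audio-visual reasoning, proposes the MOV-Bench benchmark and the AOP-Agent agentic framework, which achieves active perception through a hierarchical omni-modal memory and an observe-reflect-replan loop, significantly improving the performance of open-source omni-modal LLMs on long videos and reasoning-intensive questions.

Details
AI Chinese abstract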

多跳音视频推理对全模态大语言模型(Omni-LLMs)仍然具有挑战性,因为相关证据通常稀疏、时间上分散,并且分布在音频和视频流中。现有基准对此设置的研究有限,通常仅涉及有限数量的模态、相关时间片段或推理步骤。在这项工作中,我们引入了MOV-Bench,一个包含519个精心设计问题的基准,这些问题需要对时间上分散的音视频证据进行多跳推理。在MOV-Bench上的评估表明,当前的全模态大语言模型在多跳跨模态推理方面仍然存在困难。为了解决这一挑战,我们进一步提出了AOP-Agent,一个基于开源全模态大语言模型的高效代理框架,用于主动全模态感知。通过将分层全模态记忆与协作的观察-反思-重规划循环相结合,AOP-Agent使开源全模态大语言模型能够进行主动感知,而无需额外训练或专有模型。在MOV-Bench和OmniVideoBench上的实验表明,AOP-Agent持续提升了推理性能,在长视频和推理密集型问题上尤其显著。

English abstract

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.

2605.28190 2026-05-28 cs.CL

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

更难文本嵌入基准(HTEB):超越一维静态鲁棒性

Manuel Frank, Haithem Afli

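AI summary: Proposes HTEB, a dynamic evaluation framework that stochastically transforms inputs with an LLM to challenge the robustness of text embedding models along three axes (lexical/stylistic, length, and language); finds that models have specific, partly decoupled robustness profiles, that scale raises absolute scores without narrowing the gap between original and transformed evaluations, and that English datasets are more sensitive to the transformations.

Details
Comments
29 pages, 11 figures
AI Chinese abstract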

像MTEB这样的嵌入基准为每个模型报告单一分数,隐含地将鲁棒性视为静态的标量属性。我们认为嵌入鲁棒性是多维的,因为模型对不同类型的变化有不同的响应,并且需要动态评估来暴露静态基准隐藏的失败。我们引入了更难文本嵌入基准(HTEB),这是一个动态评估框架,通过LLM在评估时随机变换输入,沿着三个实际可解释的轴(词汇/风格、长度和语言)挑战模型鲁棒性。在32个数据集(覆盖42种语言)上评估16个开源嵌入模型,变换通过英语子样本上的4800个人类评分验证,我们发现三种模式:(1)模型在各个轴上表现出特定的、部分解耦的鲁棒性轮廓。(2)在三个模型家族中,规模提升绝对分数,但未缩小原始评估与变换评估之间的差距。在这里,缩放倾向于特别改善语言轴。(3)英语数据集对HTEB变换比多语言数据集更敏感。这表明HTEB识别了模型在部署相关轴上的优缺点,挑战了当前的嵌入基准,并主张进行多维、动态的鲁棒性评估。

English abstract

Embedding benchmarks like MTEB report a single score per model, implicitly treating robustness as a static, scalar property. We argue that embedding robustness is multidimensional, since models respond differently to different types of variation, and requires dynamic evaluation to expose failures hidden by static benchmarks. We introduce the Harder Text Embedding Benchmark (HTEB), a dynamic evaluation framework that challenges model robustness along three practically interpretable axes (Lexical/Stylistic, Length and Language) by stochastically transforming inputs at evaluation time with an LLM. Evaluating 16 open-weight embedding models on 32 datasets covering 42 languages under transformations validated by 4,800 human ratings on an English subsample, we find three patterns: (1) Models exhibit specific, partly decoupled robustness profiles across axes. (2) Across three model families, scale increases absolute scores but does not close the gap between original and transformed evaluations. Here, scaling tends to improve specifically the Language axis. (3) English datasets are more sensitive to HTEB transformations than multilingual datasets. This demonstrates that HTEB identifies strengths and weaknesses of models along deployment-relevant axes, challenging current embedding benchmarks and arguing for multidimensional, dynamic robustness evaluation.

2605.28188 2026-05-28 cs.CL

Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment

框架至关重要:通过基于行为的价值对齐解决决策中的框架敏感性

Seojin Hwang, Minju Kim, Junhyuk Choi, JeongHyun Park, Hwanhee Lee

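AI summary: Proposes the Fragile benchmark, which systematically evaluates the decision stability of LLMs under factually equivalent but differently framed inputs, and designs Valign, a representation-level intervention that effectively reduces framing-induced decision flips.

Details
Comments
29 pages, 7 figures, 31 tables
AI Chinese abstract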

大语言模型(LLMs)越来越多地部署在高风险决策场景中,例如法律推理,其中在事实上等价的输入下保持一致性至关重要。然而,我们发现,事实保持不变但框架不同的输入会显著破坏LLM决策的稳定性。为了系统研究这一问题,我们引入了Fragile,一个大规模基准测试,它在三个受控维度上隔离了事实保持的语义框架:价值倾向叙述、时间切片和叙述生动性。我们的实验揭示了LLM对框架的高度敏感性,平均决策翻转率为28.6%。我们发现,简单的先验提示级和激活级干预不仅无法抑制框架敏感性,反而会主动放大它。因此,我们提出了Valign,一种表示级方法,通过将决策锚定到稳定的价值先验、将隐藏状态引导至模型的价值一致方向,并从模型隐藏状态中投影出时间-生动性敏感方向,显式地针对这些框架维度。Valign持续减少了框架引起的决策翻转,表明稳健的缓解需要直接针对框架操作的内部路径。

English abstract

Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but differently framed inputs can significantly destabilize LLM decisions. To systematically investigate this problem, we introduce Fragile, a large-scale benchmark that isolates fact-preserving semantic framing across three controlled dimensions: value-tinted narration, temporal slice, and narrative vividness. Our experiments reveal a high susceptibility of LLMs to framing, with an average decision flip rate of 28.6%. We find that simple prior prompt-level and activation-level interventions not only fail to suppress framing sensitivity but actively amplify it. We therefore propose Valign, a representation-level method that explicitly targets these framing dimensions by anchoring decisions to a stable value prior, steering hidden states toward the model's value-consistent direction, and projecting out temporal-vividness-sensitive directions from the model's hidden states. Valign consistently reduces framing-induced decision flips, demonstrating that robust mitigation requires directly targeting the internal pathways in which framing operates.
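
Two quantities mentioned in the abstract, the decision flip rate and "projecting out" a direction from a hidden state, can be made concrete with a short sketch. The definitions below are assumptions for illustration: the flip rate is taken as the fraction of cases whose decision changes between a reference framing and an alternative framing, and the projection is the standard orthogonal projection that removes one direction. The function names are hypothetical and this is not the Valign implementation.

```python
def flip_rate(base_decisions, framed_decisions):
    """Fraction of cases whose decision changes under a different framing.

    Assumed definition: each case has one decision under the reference
    framing and one under a fact-preserving alternative framing.
    """
    if len(base_decisions) != len(framed_decisions):
        raise ValueError("decision lists must have the same length")
    flips = sum(b != f for b, f in zip(base_decisions, framed_decisions))
    return flips / len(base_decisions)

def project_out(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`.

    Computes h - (h . d / d . d) * d, so the result is orthogonal
    to `direction`.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        raise ValueError("direction must be non-zero")
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]

# Hypothetical decisions for four cases under two framings.
rate = flip_rate(["grant", "deny", "grant", "grant"],
                 ["grant", "grant", "grant", "deny"])

# Hypothetical 2-D hidden state with a framing-sensitive direction along axis 0.
cleaned = project_out([3.0, 4.0], [1.0, 0.0])
```

After the projection, the hidden state carries no component along the removed direction, which is the sense in which a framing-sensitive direction is "projected out".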

2605.28187 2026-05-28 cs.IR cs.AI cs.CY cs.SI

Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

谁的名字会出现?III:基于LLM的学者推荐中的人设提示效应

Annabella Sánchez-Guzmán, Lukas Eberhard, Denis Helic, Lisette Espín-Noboa

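AI summary: Builds a benchmark that separates the effects of model choice and prompt design on LLM-based scholar recommendation, and finds that prompt design (language, location, role-and-task) significantly affects recommendation quality (factuality, coverage) and social representativeness (diversity, parity).

Details
Comments
25 pages (10 main, 2 references, 13 appendix), 6 figures in main, 13 figures in appendix (under-review)
AI Chinese abstract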

大型语言模型(LLM)越来越多地被用作学者推荐系统,塑造了学术界中被视为专家的人选。现有的审计仍然以英语为中心、单一学科且忽略人设,导致输出变异性的来源尚不明确。为此,我们提出了一个基准测试,以分离模型选择和提示设计对推荐的影响。我们通过改变人设提示(语言、地点、角色与任务)和上下文(领域、资历、k)审计了43个LLM。将推荐的学者与Semantic Scholar在六个科学学科上进行比较,以衡量技术质量(事实性、覆盖度)和社会代表性(多样性、均等性)。基本技术质量由模型选择驱动,事实性和均等性由上下文驱动,多样性由地点驱动。南非提示产生的事实性较低的列表,而日本提示产生的事实性高但同质化的列表,偏向高产的学者。因此,提示设计是基于LLM的学者发现中一个不可忽视的维度,应与模型选择一起系统审计。

English abstract

Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits remain English-centric, single discipline, and persona-agnostic, leaving the source of output variability poorly understood. To this end, we propose a benchmark that disentangles the effects of model choice and prompt design on recommendations. We audit 43 LLMs by varying persona prompts (language, location, role-and-task) and context (field, seniority, k). Recommended scholars are compared against Semantic Scholar over six scientific disciplines to measure technical quality (factuality, coverage) and social representativeness (diversity, parity). Basic technical quality is driven by model choice, factuality and parity by context, and diversity by location. South Africa prompts yield less factual lists, while Japan prompts yield highly factual but homogeneous lists skewed toward highly productive scholars. Prompt design is thus a non-trivial axis of LLM-based scholar discovery and should be systematically audited alongside model choice.

2605.28186 2026-05-28 cs.RO cs.AI

Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

可视化运动策略中的潜在相位结构:基于时间特征扩展的多环境研究

Daisuke Yasui, Toshitaka Matuki, Hiroshi Sato

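AI summary: Proposes a framework that extends the clustering features (to include actions, next states, and next actions) and introduces a self-transition-suppressing method for choosing the number of clusters, revealing clearer and more regular latent motion phase structures in deep reinforcement learning locomotion policies.

Details
AI Chinese abstract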

深度强化学习(DRL)已被证明在MuJoCo基准测试(如HalfCheetah、Ant和Walker2D)的运动控制任务中表现出高性能。然而,可视化由深度神经网络实现的训练策略函数内部获得的运动结构仍然具有挑战性。从生物力学及相关领域可知,运动控制是通过重复运动相位(如站立相和摆动相)实现的。在本研究中,我们提出一个框架,用于从运动控制策略通过与环境交互生成的轨迹中揭示潜在的相位结构。所提出的方法将聚类特征从仅状态观测扩展到包括动作、下一状态和下一动作的增强特征,并引入一种抑制自转移的聚类数确定方法。将所提出的方法应用于三个环境——Ant-v5、HalfCheetah-v5和Walker2D-v5,我们成功识别出比现有方法具有更清晰和更规则转换规则的相位结构。

English abstract

Deep reinforcement learning (DRL) has been shown to achieve high performance on locomotion control tasks in MuJoCo benchmarks such as HalfCheetah, Ant, and Walker2D. However, visualizing the motion structures internally obtained by a trained policy function implemented as a deep neural network remains challenging. It is known from biomechanics and related fields that locomotion control is realized through the repetition of motion phases such as the stance phase and swing phase. In this study, we propose a framework for uncovering latent motion phase structures from trajectories generated by locomotion control policies through interaction with the environment. The proposed method extends the clustering features from state observations alone to augmented features including actions, next states, and next actions, and introduces a method for determining the number of clusters that suppresses self-transitions. Applying the proposed method to three environments -- Ant-v5, HalfCheetah-v5, and Walker2D-v5 -- we successfully identified phase structures with clearer and more regular transition rules than those obtained by the existing method.
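
The feature extension described above can be sketched in a few lines: each time step's clustering feature concatenates the state, the action, the next state, and the next action. The sketch also shows one possible way to score a clustering by how often consecutive steps stay in the same cluster. The selection rule in `choose_num_clusters` (pick the cluster count with the lowest self-transition rate) is an assumption made for illustration; the abstract does not specify the paper's actual criterion, and all names here are hypothetical.

```python
def augmented_features(states, actions):
    """Build (s_t, a_t, s_{t+1}, a_{t+1}) feature vectors from one trajectory."""
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    return [
        list(states[t]) + list(actions[t]) + list(states[t + 1]) + list(actions[t + 1])
        for t in range(len(states) - 1)
    ]

def self_transition_rate(labels):
    """Fraction of consecutive time steps that remain in the same cluster."""
    pairs = list(zip(labels, labels[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)

def choose_num_clusters(labels_by_k):
    """Hypothetical rule: prefer the cluster count with the fewest self-transitions.

    `labels_by_k` maps a candidate cluster count to the label sequence that
    some clustering algorithm produced for it.
    """
    return min(labels_by_k, key=lambda k: self_transition_rate(labels_by_k[k]))

# Tiny hypothetical trajectory: 1-D states and 1-D actions.
feats = augmented_features([[0.0], [1.0], [2.0]], [[10.0], [11.0], [12.0]])
best_k = choose_num_clusters({2: [0, 0, 0, 1], 3: [0, 1, 2, 0]})
```

Any clustering algorithm can be run on the rows of `feats`; the resulting label sequence is what `self_transition_rate` consumes.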

2605.28184 2026-05-28 cs.LG

Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

通过最优系数校准在强化学习中联合训练多令牌预测

Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin

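AI summary: Analyzes from an optimization perspective why joint training of multi-token prediction and reinforcement learning fails, and proposes Optimal Coefficient Calibration, which tracks the optimal coefficient online to improve performance.

Details
AI Chinese abstract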

基于可验证奖励的强化学习已成为提升大语言模型推理能力的标准范式,而多令牌预测是预训练中广泛采用的模块。将两者结合是自然的方法,但当前的强化学习实践会分离多令牌梯度,因为联合训练会降低性能。我们从优化角度重新审视这一失败。我们表明,多令牌对强化学习目标的每步影响可分解为两项:一阶相关性和二阶扰动惩罚。这种分解统一了三种多令牌训练模式:分离、交叉熵损失和策略损失,并解释了每种模式成功或失败的原因。对策略损失的进一步分析揭示,尽管它符合直觉,但性能仍然下降:相关性项衰减而二次惩罚持续存在。在分析指导下,我们提出最优系数校准,一种自适应方案,通过对数概率代理在线跟踪最优系数,且成本可忽略。在六个竞赛级数学推理基准上,最优系数校准一致达到或超过分离基线,实现了改进的联合多令牌-强化学习训练性能。

English abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining. Combining them is a natural approach, yet current RL practices detach MTP gradients because joint training degrades the performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation and a second-order perturbation penalty. This decomposition unifies three MTP training regimes: Detach, Cross-Entropy loss, and Policy loss, and explains why each succeeds or fails. Further analysis of policy loss reveals that, although it aligns with intuition, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by the analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.

2605.28181 2026-05-28 cs.CL

When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models

当置信度误导时:扩散语言模型的后缀锚定与锚邻近置信度调制

Jungwon Park, Jimyeong Kim, Jungmin Ko, Nojun Kwak, Wonjong Rhee

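AI summary: To address misleading confidence in diffusion language models, which causes incomplete generation or premature decoding, proposes suffix anchoring with anchor-proximity confidence modulation, a training-free method that improves fully non-autoregressive decoding.

Details
Comments
Preprint
AI Chinese abstract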

扩散语言模型通过对掩码标记序列进行迭代去噪来解码文本,使得选择解码位置成为推理时的核心决策。大多数无训练解码策略使用模型置信度进行位置选择,假设高置信度位置已准备好解码。本文通过研究置信度何时误导完全非自回归解码来重新审视这一假设。EOT标记可能获得高置信度并导致生成不完整;插入后缀锚定可以缓解此问题,但会在锚附近引入局部过度置信,导致锚邻近标记过早解码。为解决这些问题,我们提出后缀锚定置信度调制,一种简单的无训练方法,它插入短后缀锚定以鼓励回复完成,并根据解码进度调制锚附近的置信度。这保留了后缀锚定的回复完成优势,同时减少了锚邻近标记的过早解码。在纯文本推理、视觉-语言推理和代码生成基准测试中,我们的方法持续改进基于置信度的完全非自回归解码,优于显式EOT抑制,并保持了完全非自回归生成的并行解码优势。

English abstract

Diffusion language models decode text by iteratively denoising masked token sequences, making the choice of which positions to decode a central inference-time decision. Most training-free decoding strategies use model confidence for position selection, assuming that high-confidence positions are ready to be decoded. In this work, we revisit this assumption by studying when confidence misleads fully non-autoregressive (fully non-AR) decoding. EOT tokens can receive high confidence and cause incomplete generation; inserting a suffix anchor can mitigate this issue but introduces local overconfidence near the anchor, causing anchor-adjacent tokens to be decoded too early. To address these issues, we propose Suffix-Anchored Confidence Modulation, a simple training-free method that inserts a short suffix anchor to encourage response completion and modulates confidence near the anchor according to decoding progress. This preserves the response-completion benefit of suffix anchoring while reducing premature decoding of anchor-adjacent tokens. Across text-only reasoning, vision-language reasoning, and code-generation benchmarks, our method consistently improves confidence-based fully non-AR decoding, outperforms explicit EOT suppression, and preserves the parallel decoding advantage of fully non-AR generation.

2605.28179 2026-05-28 cs.CL

SuperValid: Capability-Aligned OOD Validation for Generalizable Downstream Scaling

SuperValid: 面向泛化下游扩展的能力对齐OOD验证

Quanen Sun, Changxin Tian, Ke Shi, Cai Chen, Cunyin Peng, Jia Liu, Kunlong Chen, Zhiqiang Zhang

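AI summary: Proposes SuperValid, a framework that distills core concepts from benchmarks and expands them into diverse, knowledge-rich texts to synthesize capability-aligned out-of-distribution validation data, predicting downstream performance at the capability level and enabling effective model selection, early stopping, and scaling decisions.

Details
AI Chinese abstract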

扩展定律通过将计算量与交叉熵损失相关联来指导大型语言模型的训练,最近的工作进一步将其扩展到预测下游基准性能。然而,先前的方法在两个方面面临泛化限制:关注基准级性能会引入特定场景的伪影,而依赖IID验证损失则无法在训练分布变化时跟踪能力提升。在这项工作中,我们认为下游扩展应在能力层面进行研究,这能够捕捉跨相关任务的共享技能因素,同时抽象掉基准特定的噪声。我们提出了SuperValid,一个通过从能力领域内的基准测试中提炼核心概念并将其扩展为多样化的知识丰富文本来合成OOD(分布外)、能力对齐验证数据的框架。涵盖6个能力领域内17个基准测试的大量实验表明,SuperValid损失与不同架构、规模和训练数据分布的模型的下游性能表现出强且稳定的相关性。作为一种无需训练、可在训练期间计算且无需基准评估的度量,SuperValid实现了有效的模型选择、早停和扩展决策。

English abstract

Scaling laws guide large language model training by relating compute to cross-entropy loss, and recent work further extends them to predict downstream benchmark performance. However, prior approaches face generalization limitations from two aspects: focusing on benchmark-level performance introduces scenario-specific artifacts, while relying on IID validation loss fails to track capability improvements when training distributions vary. In this work, we argue that downstream scaling should be studied at the capability level, which captures shared skill factors across related tasks while abstracting away benchmark-specific noise. We propose SuperValid, a framework that synthesizes OOD (out-of-distribution), capability-aligned validation data by distilling core concepts from benchmarks within a capability domain and expanding them into diverse, knowledge-rich texts. Extensive experiments spanning 17 benchmarks grouped into 6 capability domains show that SuperValid loss exhibits strong and stable correlation with downstream performance across models of different architectures, scales, and training data distributions. As a training-free metric computable during training without benchmark evaluation, SuperValid enables effective model selection, early stopping, and scaling decisions.

2605.28176 2026-05-28 cs.CV

From Kellgren-Lawrence to Calcium Pyrophosphate Crystal Deposition: A Soft-Labelling Framework for Knee Osteoarthritis Assessment

从Kellgren-Lawrence到焦磷酸钙晶体沉积:一种用于膝骨关节炎评估的软标签框架

Francisco Bérchez-Moreno, Riccardo Rosati, Maria Chiara Fiorentino, Víctor M. Vargas, Edoardo Cipolletta, Emilio Filippucci, Luca Romeo, Pedro A. Gutiérrez, César Hervás-Martínez

AI总结 提出基于软标签的序数深度学习框架,通过单峰概率分布替代独热编码,同时处理KL和CPPD分级中的序数不确定性和不对称关系,在膝X光图像上显著提升分级性能。

详情
AI中文摘要

背景与目标。传统的膝骨关节炎(KOA)分级深度学习方法依赖于独热标签,未能捕捉Kellgren-Lawrence(KL)和焦磷酸钙沉积病(CPPD)严重程度评分的序数不确定性,以及临床实践中观察到的两个量表之间的不对称关系。方法。我们回顾性收集了2172张膝关节X光图像,包括968张同时标注了KL和CPPD严重程度的X光片。开发了一个基于软标签的序数深度学习框架用于两项任务,用以标注等级为中心的单峰概率分布替代独热目标。研究了四种分布形式:二项分布、贝塔分布、三角分布和指数分布。结果。所有软标签策略均持续优于名义基线。对于CPPD分级,三角分布实现了最高的二次加权卡帕(QWK)和最低的平均绝对误差(MAE)(QWK = 0.796;MAE = 0.438),而贝塔分布在考虑各类别的平均MAE(AMAE)和最大MAE(MMAE)时产生了最平衡的类别性能(AMAE = 0.458;MMAE = 0.573)。对于KL分级,基于贝塔的方法提供了最佳整体性能,实现了最高的QWK以及最低的MAE和类别误差(QWK = 0.777;MAE = 0.529;AMAE = 0.523;MMAE = 0.775)。统计分析表明,与传统的独热监督相比有显著改进(p < 0.001)。

英文摘要

Background and objective. Conventional Deep Learning (DL) approaches for Knee Osteoarthritis (KOA) grading rely on one-hot labels, which fail to capture both the ordinal uncertainty of Kellgren-Lawrence (KL) and Calcium Pyrophosphate Deposition Disease (CPPD) severity scores and the asymmetric relationship between the two scales observed in clinical practice. Methods. We retrospectively collected 2172 knee X-ray images, including 968 radiographs jointly annotated for KL and CPPD severity. An ordinal DL framework based on soft-labelling was developed for both tasks, replacing one-hot targets with unimodal probability distributions centred on the annotated grade. Four formulations were investigated: binomial, beta, triangular, and exponential. Results. All soft-labelling strategies consistently outperformed the nominal baseline. For CPPD grading, the triangular formulation achieved the highest Quadratic Weighted Kappa (QWK) and the lowest Mean Absolute Error (MAE) (QWK = 0.796; MAE = 0.438), while the beta formulation yielded the most balanced class-wise performance considering Average MAE (AMAE) and Maximum MAE (MMAE) across classes (AMAE = 0.458; MMAE = 0.573). For KL grading, the beta-based approach provided the best overall performance, achieving the highest QWK together with the lowest MAE and class-wise errors (QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775). Statistical analysis demonstrated significant improvements over conventional one-hot supervision (p < 0.001).
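The idea of replacing a one-hot target with a unimodal distribution centred on the annotated grade can be sketched as follows. This is an illustrative sketch only: the abstract does not give the paper's parameterisations, so the binomial construction below (choosing p so that the mode lands on the grade) is an assumption, as are the function names. The second function computes the Quadratic Weighted Kappa metric named in the abstract, using its standard definition.

```python
from math import comb

def binomial_soft_label(grade, num_classes):
    """Unimodal soft target over ordinal classes 0..num_classes-1.

    Assumption: Binomial(n, p) with n = num_classes - 1 and
    p = (grade + 0.5) / (n + 1), which places the unique mode on `grade`.
    """
    n = num_classes - 1
    p = (grade + 0.5) / (n + 1)
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(num_classes)]

def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """QWK = 1 - sum(w * O) / sum(w * E), with w_ij = (i - j)^2 / (K - 1)^2.

    Undefined (division by zero) if either sequence uses a single class only.
    """
    n = len(y_true)
    observed = [[0.0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    num = den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2 / (num_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_true[i] * hist_pred[j] / n  # expected count under independence
    return 1.0 - num / den

# KL grades run from 0 to 4, so there are five classes.
target = binomial_soft_label(grade=2, num_classes=5)
```

With such a target, neighbouring grades receive more probability mass than distant ones, so a prediction one grade off is penalised less by cross-entropy than a prediction three grades off.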

2605.28174 2026-05-28 cs.CV cs.AI

FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

FLORO:面向跨传感器与尺度的生态遥感多模态地理空间基础模型

Jorge L. Rodriguez, Victor Angulo Morales, Areej Alwahas, Mariana Elias Lara, Fida Mohammad Thoker, Kasper Johansen, Bernard Ghanem, Fernando T. Maestre, Matthew F. McCabe

AI总结 提出FLORO多模态地理空间基础模型,通过掩码自编码在异构遥感数据上预训练,利用可用性感知输入统一异构传感器配置,在PANGAEA基准上实现强迁移性能。

详情
Comments
29 pages, 9 figures
AI中文摘要

基础模型为可迁移的遥感表示提供了有前景的途径,但许多当前方法依赖于非常大的预训练数据集和固定的传感器配置,限制了它们在生态和环境应用中的适用性,这些应用中的观测通常跨平台、空间和光谱分辨率以及可用模态而变化。我们提出了FLORO,一个多模态地理空间基础模型,旨在从一个小型但高度多样化的遥感语料库中学习可迁移表示。FLORO使用掩码自编码在Sentinel-1、Sentinel-2、SkySAT影像、高程和无人机数据的异构组合上进行预训练。为了适应传感器变异性,FLORO结合了可用性感知输入,指示每个样本中存在哪些光谱波段和辅助模态,从而在异构传感器配置上实现统一的输入空间。我们在PANGAEA基准上,在冻结编码器协议下,评估了FLORO的场景分类、分割和回归任务。尽管在比竞争基础模型更小的语料库上预训练,FLORO在跨光学、光学-SAR和光学-高程基准(涵盖中分辨率卫星、航空和超高分辨率无人机影像)上实现了强大且稳定的迁移。FLORO在六个PANGAEA基准上取得了第二好的平均分割性能,仅次于最近引入的预训练图像数量超过两个数量级的基础模型,在场景分类上保持竞争力,在回归任务中表现稳健,而定性结果显示在洪水、城市、生物量和冠层高度预测设置中空间结构的保存有所改善。在EuroSAT-MS上的单独对照实验中,相对于绝对位置编码,地理位置编码进一步提高了分类性能。

英文摘要

Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations, limiting their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images, remained competitive on scene classification, and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.
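One simple way to picture availability-aware inputs is a fixed channel vocabulary with zero-filled gaps and a binary mask. The sketch below is an assumption about the general idea, not FLORO's actual input encoding: the channel names, the zero-fill choice, and the function name are all hypothetical.

```python
# Hypothetical fixed channel vocabulary; the abstract does not list FLORO's channels.
CHANNELS = ["s1_vv", "s1_vh", "s2_blue", "s2_green", "s2_red", "s2_nir", "elevation"]

def build_input(sample, num_pixels):
    """Map a sample with an arbitrary subset of channels to a fixed-size input.

    Returns (pixels, mask): pixels has one row per entry in CHANNELS, with
    missing channels zero-filled; mask[i] is 1 if channel i was present.
    """
    pixels, mask = [], []
    for name in CHANNELS:
        if name in sample:
            if len(sample[name]) != num_pixels:
                raise ValueError(f"channel {name!r} has the wrong number of pixels")
            pixels.append(list(sample[name]))
            mask.append(1)
        else:
            pixels.append([0.0] * num_pixels)
            mask.append(0)
    return pixels, mask

# An optical-only sample with three bands and a two-pixel patch.
optical_only = {"s2_blue": [0.1, 0.2], "s2_green": [0.3, 0.4], "s2_red": [0.5, 0.6]}
pixels, mask = build_input(optical_only, num_pixels=2)
```

The mask lets an encoder distinguish a channel that is truly zero from one that was never observed, which is what makes a single input space usable across different sensor configurations.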

2605.28173 2026-05-28 cs.CV

MangaFlow: An End-to-End Agentic Framework for Controllable Story to Manga Generation

MangaFlow:一种用于可控故事到漫画生成的端到端代理框架

Muyao Wang, Zeke Xie, Yanhao Chen, Lixin Xiu, Hideki Nakayama

AI总结 提出MangaFlow代理框架,通过将漫画创作分解为规划、定位、布局构建、参考条件渲染、合成和文字放置等步骤,实现可控的长篇漫画生成,支持布局和视觉参考作为显式中间变量,并引入故事段落记忆以保持跨面板一致性。

详情
AI中文摘要

端到端漫画生成是一项结构化的视觉叙事任务,需要故事分解、重复角色和场景定位、页面布局设计、面板渲染、页面合成和文字放置。然而,现有的生成模型通常直接进行页面合成,将这些因素纠缠在单个视觉输出中,限制了对布局几何、视觉参考和跨面板一致性的精确控制。为了解决这些限制,我们提出了MangaFlow,一个用于可控长篇漫画生成的代理框架,它将漫画创作分解为规划、定位、布局构建、参考条件渲染、合成和文字放置。通过将布局和视觉参考视为显式中间变量,MangaFlow既支持简单的文本到漫画生成,也支持更精确的用户控制漫画创作。这种设计将布局、视觉资产和文字放置暴露为可编辑的中间控制,用于细化面板几何、参考和文字位置。为了支持长篇一致性,MangaFlow引入了故事段落记忆,将段落描述与相应的角色、场景和对象参考链接起来,以便在面板间重用。我们进一步提出了一个元基准,用于评估布局可控性、视觉一致性和生成质量。实验表明,MangaFlow在布局遵循和跨面板一致性方面优于直接生成基线,同时支持灵活的人工控制。

英文摘要

End-to-end manga generation is a structured visual storytelling task that requires story decomposition, recurring character and scene grounding, page layout design, panel rendering, page composition, and lettering. However, existing generative models often perform direct page synthesis, entangling these factors in a single visual output and limiting precise control over layout geometry, visual references, and cross-panel consistency. To address these limitations, we propose MangaFlow, an agentic framework for controllable long-form manga generation that decomposes manga creation into planning, grounding, layout construction, reference-conditioned rendering, composition, and text placement. By treating layout and visual references as explicit intermediate variables, MangaFlow enables both simple text-to-manga generation and more precise user-controlled manga creation. This design exposes layout, visual assets, and lettering as editable intermediate controls for refining panel geometry, references, and text placement. To support long-form consistency, MangaFlow introduces a story section memory that links section descriptions with corresponding character, scene, and object references for reuse across panels. We further present a meta-benchmark for evaluating layout controllability, visual consistency, and generation quality. Experiments show that MangaFlow improves layout adherence and cross-panel consistency over direct generation baselines while supporting flexible human control.

2605.28172 2026-05-28 cs.RO

Provably Guaranteed Polytopic Uncertainty Quantification for SLAM

面向SLAM的可证明保证的多面体不确定性量化

Guangyang Zeng, Yulong Gao, Yuan Shen, Lingpeng Chen, Haoying Li, Guodong Shi, Junfeng Wu

AI总结 本文提出基于多面体表示的不确定性量化算法,通过前向映射、后向位姿跟踪和位姿复合三个模块,为3D-3D路标SLAM提供可证明的确定性保证,并结合共形预测提高实用性。

详情
Comments
16 pages, 10 figures; accepted by Robotics: Science and Systems 2026
AI中文摘要

在安全关键的机器人应用中,感知中保证且实用的不确定性量化至关重要。许多现有工作要么没有提供正式包含保证,要么依赖限制性建模假设,要么只关注位姿估计而非完整的SLAM流水线。本文提出了用于基于3D-3D路标的SLAM的可证明保证的不确定性量化算法。该算法由三个基本的不确定性量化模块组成:用于建图的前向不确定性量化、用于位姿跟踪的后向不确定性量化以及位姿复合。每个模块生成一个认证的不确定性集;当输入不确定性边界是确定性的时,输出集继承确定性保证,即它们可证明地包含真实位姿和路标。具体来说,我们使用多面体表示不确定性集,从而实现易处理的计算和对位姿不确定性的统一处理。为了提高算法的实际可用性,我们结合了共形预测,从数据中以规定概率校准测量不确定性。仿真和实验表明,所提出的算法既提供了强大的理论保证,又具有实际可用性。代码开源在 https://github.com/LIAS-CUHKSZ/Polytopic-SLAM-Uncertainty-Quantification。

英文摘要

In safety-critical robotics applications, guaranteed and practical uncertainty quantification (UQ) in perception is vital. Many existing works either offer no formal containment guarantee, rely on restrictive modeling assumptions, or focus only on pose estimation rather than a complete SLAM pipeline. This paper presents provably guaranteed UQ algorithms for 3D-3D landmark-based SLAM. The algorithms consist of three basic UQ modules: forward UQ for mapping, backward UQ for pose tracking, and pose compound. Each module produces a certified uncertainty set; when the input uncertainty bounds are deterministic, the output sets inherit deterministic guarantees, i.e., they provably contain the true poses and landmarks. Specifically, we use polytopes to represent uncertainty sets, enabling tractable computations and a unified treatment of pose uncertainty. To enhance algorithms' practical usability, we incorporate conformal prediction to calibrate measurement uncertainty from data with prescribed probability. Simulations and experiments demonstrate that the proposed algorithms provide both strong theoretical guarantees and practical usability. The code is open-sourced at https://github.com/LIAS-CUHKSZ/Polytopic-SLAM-Uncertainty-Quantification.
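The calibration step can be illustrated with standard split conformal prediction feeding a box-shaped polytope. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes calibration residuals are measured in the infinity norm and are exchangeable with future residuals, and it uses an axis-aligned box as the simplest polytope. All function names are hypothetical.

```python
import math

def conformal_radius(residual_norms, alpha):
    """Split-conformal bound on a fresh residual.

    For exchangeable residuals, a new residual is <= the returned radius
    with probability at least 1 - alpha. Returns inf if the calibration
    set is too small for the requested level.
    """
    n = len(residual_norms)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(residual_norms)[rank - 1]

def box_polytope(center, radius):
    """H-representation (A, b) of the box {x : |x_i - center_i| <= radius}."""
    dim = len(center)
    A, b = [], []
    for i in range(dim):
        upper = [0.0] * dim
        upper[i] = 1.0
        A.append(upper)
        b.append(center[i] + radius)   # x_i <= c_i + r
        lower = [0.0] * dim
        lower[i] = -1.0
        A.append(lower)
        b.append(radius - center[i])   # -x_i <= r - c_i
    return A, b

def contains(A, b, x, tol=1e-12):
    """True if A x <= b holds row by row (up to a small tolerance)."""
    return all(sum(a * xi for a, xi in zip(row, x)) <= bound + tol
               for row, bound in zip(A, b))

# Hypothetical calibration residuals (infinity-norm landmark measurement errors).
calibration = [float(i) for i in range(1, 20)]
radius = conformal_radius(calibration, alpha=0.1)
A, b = box_polytope(center=[1.0, 2.0], radius=0.5)
```

A set built this way holds with the prescribed probability rather than deterministically, which matches the abstract's distinction between deterministic input bounds and bounds calibrated from data.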

2605.28170 2026-05-28 cs.AI

Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

通过Shapley值为大型语言模型定位输入不确定性量化

Seongjun Lee, Suwan Yoon, Changhee Lee

AI总结 提出ShaQ框架,利用Shapley值将输入中的模糊跨度建模为合作博弈参与者,通过条件熵的边际减少加权平均量化每个跨度对输入不确定性的贡献,实现跨度级归因,在AmbigQA、AmbiEnt和MediTOD基准上取得最先进性能。

详情
Comments
Code is available at https://anonymous.4open.science/r/ShaQ-0E39/README.md
AI中文摘要

随着大型语言模型(LLM)越来越多地集成到高风险决策中,可靠量化不确定性的能力已成为安全性和可信度的关键要求。然而,当前的不确定性量化方法主要在输出层面操作,通常无法区分不确定性是源于模型缺乏知识还是用户输入的模糊性。尽管以输入为中心的不确定性量化最近成为一个有前景的方向,但它仍相对未被充分探索,并且通常依赖于粗糙的输入级信息。因此,用户只能获得标量不确定性分数,这些分数几乎没有提供可操作的指导,以说明应该澄清输入的哪些部分来提高可靠性。为了解决这一局限性,我们提出了基于Shapley的输入不确定性量化(ShaQ),这是一个用于输入诱导不确定性的跨度级归因框架。我们的方法将输入中的模糊跨度建模为合作博弈中的参与者,并使用Shapley值量化它们的贡献,Shapley值通过澄清每个跨度联盟所获得的条件熵边际减少的加权平均来定义。与现有的输入级方法不同,我们的公式捕捉了跨度之间的复杂交互,并提供了一种原则性的分解,其中个体归因之和恰好等于总输入诱导不确定性。我们在AmbigQA和AmbiEnt基准上评估了ShaQ,它在模糊性检测中实现了最先进的性能。我们进一步在MediTOD上展示了其实用性,表明ShaQ可以定位未明确说明的临床话语,并促进高风险环境中的人机协作。总体而言,ShaQ改进了不确定性估计,并为有针对性的输入澄清提供了可操作的见解。

英文摘要

As large language models (LLMs) are increasingly integrated into high-stakes decision-making, the ability to reliably quantify uncertainty has become a critical requirement for safety and trust. However, current uncertainty quantification methods primarily operate at the output level, often failing to distinguish whether uncertainty arises from the model's lack of knowledge or from ambiguity in the user's input. While input-centric uncertainty quantification has recently emerged as a promising direction, it remains relatively underexplored and typically relies on coarse, input-level information. Consequently, users are provided with scalar uncertainty scores that offer little actionable guidance on which parts of the input should be clarified to improve reliability. To address this limitation, we propose Shapley-based input uncertainty Quantification (ShaQ), a framework for span-level attribution of input-induced uncertainty. Our approach models ambiguous spans in the input as players in a cooperative game and quantifies their contributions using Shapley values, defined via the weighted average of marginal reductions in conditional entropy obtained by clarifying each span coalition. Unlike existing input-level approaches, our formulation captures complex interactions among spans and provides a principled decomposition in which individual attributions sum exactly to the total input-induced uncertainty. We evaluate ShaQ on the AmbigQA and AmbiEnt benchmarks, where it achieves state-of-the-art performance in ambiguity detection. We further demonstrate its utility on MediTOD, showing that ShaQ can localize under-specified clinical utterances and facilitate human-AI collaboration in high-stakes settings. Overall, ShaQ improves uncertainty estimation and provides actionable insights for targeted input clarification.
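The Shapley decomposition described above can be sketched with an exact computation over a toy game. This is an illustration, not ShaQ itself: the value function is assumed to be the reduction in conditional entropy from clarifying a set of spans, and the span names and entropy numbers are invented for the example.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of coalitions of size k that exclude p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi

# Hypothetical conditional entropies (in bits) after clarifying each set of spans.
ENTROPY = {
    frozenset(): 2.0,
    frozenset({"bank"}): 1.0,
    frozenset({"it"}): 1.5,
    frozenset({"bank", "it"}): 0.0,
}

def entropy_reduction(spans):
    """Value of a coalition: entropy removed by clarifying those spans."""
    return ENTROPY[frozenset()] - ENTROPY[frozenset(spans)]

attributions = shapley_values(["bank", "it"], entropy_reduction)
# attributions == {"bank": 1.25, "it": 0.75}; the two sum to 2.0 bits.
```

The attributions sum to the entropy removed by clarifying every span (2.0 bits here), which is the efficiency property behind the abstract's statement that individual attributions sum exactly to the total input-induced uncertainty. Exact enumeration grows exponentially with the number of spans, so it is practical only for small span sets.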

2605.28168 2026-05-28 cs.AI

OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings

OccuReward:面向电网交互建筑中人口公平性的LLM引导的以居住者为中心的奖励塑造

Shadmehr Zaregarizi, Khashayar Yavari

AI总结 提出OccuReward框架,利用大语言模型迭代塑造奖励函数,通过舒适公平指数(CEI)反馈,在CityLearn v2中提升不同人口群体的舒适公平性,同时降低能耗成本。

详情
Comments
4 pages, 2 figures. Accepted at OccuSys 2026, co-located with ACM Sustainability Week 2026. Preprint version
AI中文摘要

大语言模型(LLM)在为基于深度强化学习(DRL)的建筑能源管理生成奖励函数方面展现出有前景的能力。然而,它们在异质人口群体中引发或加剧居住者舒适度差异的潜力尚未被探索。我们提出 OccuReward,一个研究 LLM 介导的奖励设计如何影响人口公平性的框架。我们的贡献有三方面:引入舒适公平指数(CEI)作为新颖的反馈信号;一种迭代的、公平感知的 LLM 奖励塑造方法;以及在这些细化后的目标下 DRL 智能体的性能分析。利用来自 ASHRAE 全球热舒适数据库 II(13,440 票)的四个有实证依据的居住者档案,我们在 CityLearn v2 中部署了一个 Soft Actor-Critic 智能体。我们的方法使用 Gemini API 在三个细化轮次中生成奖励函数逻辑和权重(而非执行每步推理)。15 个实验运行的结果显示,老年女性居住者在初始轮次中始终经历最低满意度。到第 3 轮,公平感知的 LLM 细化激活了特定的奖励组件,提高了年轻男性(+17.6%)、中年女性(+28.2%)、健康敏感者(+53.8%)和老年女性(+567%)的满意度,同时降低了 3.2% 的能源成本。我们的发现强调,虽然奖励层面的干预显著改善了公平性,但 AI 驱动控制器中的人口差异仍然存在,需要进一步研究建筑系统中的算法公平性。

英文摘要

Large language models (LLMs) have demonstrated promising capability in generating reward functions for deep reinforcement learning (DRL)-based building energy management. However, their potential to exhibit or exacerbate disparities in occupant comfort across heterogeneous demographic populations remains unexplored. We present OccuReward, a framework investigating how LLM-mediated reward design affects demographic equity. Our contribution is three-fold: the introduction of the Comfort Equity Index (CEI) as a novel feedback signal; a methodology for iterative, equity-aware LLM reward shaping; and a performance analysis of DRL agents under these refined objectives. Utilizing four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes), we deploy a Soft Actor-Critic agent in CityLearn v2. Our approach employs the Gemini API to generate reward function logic and weights--rather than performing per-step inference--across three refinement rounds. Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds. By Round 3, equity-aware LLM refinement activates specific reward components that improve satisfaction for Young Males (+17.6%), Mid-aged Females (+28.2%), Health Sensitive (+53.8%), and Elderly Females (+567%), while simultaneously reducing energy costs by 3.2%. Our findings highlight that while reward-level intervention significantly improves equity, demographic disparities in AI-driven controllers persist, necessitating further research into algorithmic fairness in building systems.

2605.28167 2026-05-28 cs.CV

DebFilter: Eradicating Biases Stashed in Value

DebFilter:消除隐藏在值中的偏见

Seung Hyuk Lee, Songkuk Kim

AI总结 提出DebFilter,一种轻量级、无需训练的方法,通过调整交叉注意力中的值分量来纠正文本到图像扩散模型中的社会偏见,实现推理时偏差缓解。

详情
Comments
8 pages, 7 figures, supplementary material included, CVPR 2026
AI中文摘要

文本到图像扩散模型,理论上等价于基于分数的生成模型,通过由预训练视觉语言模型(如CLIP)提取的文本嵌入引导的多步去噪过程生成图像。然而,这些文本嵌入固有地编码了社会和语义偏见——例如与性别和年龄相关的偏见——这些偏见随后通过引导机制以及模型在相对于这些偏见概念不平衡的大规模数据集上的训练被传播和放大,常常导致文本到图像生成中的输出偏差。我们提出了DebFilter,一种轻量级且无需训练的框架,用于缓解文本到图像扩散模型中的此类偏见。观察到模型在每个去噪步骤中的误差预测主要受交叉注意力动态影响,我们引入了一种偏差校正策略,调整交叉注意力中的值分量。具体地,我们对引导嵌入的切片施加固定偏移,有效地将交叉注意力值的语义方向转向无偏表示。这种调整重新配置了分数景观以产生平衡的输出,同时保持与预期文本语义的对齐。与依赖微调或重新训练的先前方法不同,DebFilter完全在推理时运行,无需额外数据或模型更新。我们的结果表明,该方法有效缓解了生成图像中的社会偏见,为更公平和更包容的文本到图像生成提供了一条高效且可扩展的途径。

英文摘要

Text-to-image diffusion models, which are theoretically equivalent to score-based generative models, generate images through a multi-step denoising process guided by text embeddings extracted from pretrained vision-language models such as CLIP. However, these text embeddings inherently encode social and semantic biases -- such as those related to gender and age -- that are subsequently propagated and amplified through the guidance mechanism, along with the model's training on large-scale datasets that are imbalanced with respect to these bias-related concepts, often leading to skewed outputs in text-to-image generation. We propose DebFilter, a lightweight and training-free framework for mitigating such biases in text-to-image diffusion models. Observing that the model's error prediction at each denoising step is primarily influenced by cross-attention dynamics, we introduce a bias-correction strategy that adjusts the value components within cross-attention. Specifically, we apply a fixed offset to the slice of guidance embedding, effectively steering the semantic direction of cross-attention values toward unbiased representations. This adjustment reconfigures the score landscape to produce balanced outputs while maintaining alignment with the intended text semantics. Unlike prior approaches that rely on fine-tuning or retraining, DebFilter operates entirely at inference time, requiring no additional data or model updates. Our results demonstrate that this method effectively mitigates social biases in generated images, offering an efficient and scalable pathway toward fairer and more inclusive text-to-image generation.

2605.28165 2026-05-28 cs.LG

Unification and Optimization of Robust Supervised Learning

鲁棒监督学习的统一与优化

Jonas Hanselle, Valentin Margraf, Clemens Damke, Eyke Hüllermeier

AI总结 提出一个统一框架,将分布鲁棒优化、标签平滑、邻域风险最小化和Mixup等鲁棒学习方法组织为三个设计轴,并通过联合超参数优化自动组合适合任务的鲁棒策略。

详情
AI中文摘要

文献中提出了各种经验风险最小化的鲁棒替代方案,以应对分布偏移、标签噪声和有限样本退化等故障模式。例如分布鲁棒优化、标签平滑、邻域风险最小化和Mixup。然而,这些方法通常是孤立开发的,迫使从业者事先承诺单一故障模式,即使任务的主要模式尚不清楚。为了解决这个问题,我们将现有的一大类方法沿着三个共同的设计轴组织起来,并推导出一个可行的训练程序,将鲁棒学习分解为顺序阶段(参考分布丰富化、输入空间扰动、标签空间扰动和样本级聚合),每个阶段都有立场选择(悲观、中性或乐观)。这产生了一个统一的设计空间,其中联合超参数优化可以组合和配置适合手头任务的鲁棒策略。在表格、图像和奖励建模基准测试中,联合超参数优化与每种设置中最佳单方法基线具有竞争力,为那些事先不知道其任务中哪种故障模式占主导地位的从业者提供了可靠的默认选择。

英文摘要

The literature has proposed various robust alternatives to empirical risk minimisation to address failure modes such as distribution shift, label noise and finite-sample degeneracies. Examples include distributionally robust optimization, label smoothing, vicinal risk minimization, and Mixup. However, such approaches are typically developed in isolation, forcing practitioners to commit a priori to a single failure mode even when the dominant mode for the task is unclear. To address this, we organize a broad class of existing methods along three common design axes and derive a tractable training procedure that decomposes robust learning into sequential stages (reference distribution enrichment, input-space perturbation, label-space perturbation, and sample-level aggregation), each with a choice of stance (pessimistic, neutral, or optimistic). This results in a unified design space in which joint hyperparameter optimization can compose and configure robustness strategies suited to the task at hand. Across tabular, image, and reward modeling benchmarks, joint hyperparameter optimization is competitive with the best single-method baseline in each setting, offering a reliable default for practitioners who do not know a priori which failure mode dominates their task.
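Two of the methods named above, label smoothing and Mixup, can be written in a few lines each. This sketch shows the standard textbook forms of both techniques; it does not reproduce the paper's staged training procedure, and the function names are chosen here for illustration.

```python
import random

def smooth_labels(one_hot, eps):
    """Label smoothing: move a fraction eps of the mass to a uniform distribution."""
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]

def mixup(x1, y1, x2, y2, lam):
    """Mixup: convex combination of two examples and of their label vectors."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

def sample_lambda(alpha, rng=random):
    """Mixup draws the mixing coefficient from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)

soft = smooth_labels([0.0, 1.0, 0.0, 0.0], eps=0.1)
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
```

Both functions keep the label vector a valid probability distribution, which is why they can be combined: smoothing perturbs labels only, while Mixup perturbs inputs and labels together.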

2605.28164 2026-05-28 cs.NE cs.AI

Performance and Explainability Requirements of Evolutionary Algorithms in Real-World Physics-Informed Optimization

进化算法在实际物理信息优化中的性能和可解释性要求

Helena Stegherr, Michael Heider, Nils Meyer, Tobias Thummerer, Thomas Wendler, Pierre Aublin, Ennio Idrobo-Àvila, Lars Mikelsons, Sebastian Zaunseder, Jörg Hähner

AI总结 本文通过五个实际物理优化问题,分析领域专家对进化算法在性能和可解释性方面的需求,并指出现有方法未能充分应用于复杂实际场景的差距。

详情
AI中文摘要

进化计算提供了多种工具来解决复杂的实际优化问题。然而,研究通常集中在较小、简化的问题和优化算法上,这些算法在实际场景中有时无法满足期望。此外,在此类设置中,对应用算法及其提供的解决方案的信任通常至关重要,但这需要理解搜索过程本身。这导致在许多应用背景下(包括基于物理的建模)实践者往往不会认真考虑进化计算。本文详细介绍了可以缓解这些问题的进化计算技术。首先,由领域专家介绍并描述了五个实际的基于物理的优化问题。针对每个问题,提出了进化算法在性能和可解释性方面的要求,以增加信任和可用性。我们发现,所有领域专家都期望快速收敛到良好解决方案,并希望获得关于结果如何形成的一些解释,而其他要求则强烈依赖于具体问题。最后,我们介绍了现有方法,这些方法可用于改进进化算法的这些方面,但据我们所知,从未在复杂的实际场景中使用过。这意味着两个领域之间存在需要弥合的差距,以充分发挥进化计算的潜力。

英文摘要

Evolutionary computation offers a variety of tools to solve complex real-world optimization problems. However, research often focuses on smaller, simplified problems and optimization algorithms that sometimes miss expectations in real-world scenarios. Additionally, trust in the applied algorithm and the solutions it provides is often essential in such settings, but requires an understanding of the search process itself. This leads to evolutionary computation often not being seriously considered by practitioners in many application contexts, among them physics-based modeling. In this article, techniques from evolutionary computation are detailed that can alleviate these problems. First, five real-world physics-based optimization problems are introduced and described by domain experts. For each of these, the requirements for the evolutionary algorithm regarding performance and explainability to increase trust and usability are presented. We found that all domain experts expect fast convergence to a good solution and want some explanations for how the results were formed, while other requirements strongly depend on the respective problem. Finally, we present existing approaches that can be leveraged to improve those aspects of evolutionary algorithms but have to our knowledge never been employed in complex real-world scenarios. This implies a gap between both domains that needs to be closed to exploit the full potential of evolutionary computation.

2605.28163 2026-05-28 cs.CL cs.AI

DEPART: DEcomposing PARiTy across Multilingual LLMs

DEPART:跨多语言大模型的性能差异分解

Manan Uppadhyay, Prashant Kodali, Pranjal Chitale, Reshma Ramaprasad, Himanshu Beniwal, Sunayana Sitaram

AI总结 提出DEPART框架,通过贝叶斯分层模型分解多语言大模型性能差异,发现语言特征解释79%-92%的方差,且模型内部表示与英语的相似性是主要预测因子。

详情
AI中文摘要

多语言大模型(mLLMs)排行榜报告每种语言的准确率,但很少解释为何出现差异,导致系统性偏差未被归因,且从业者无法采取可操作的杠杆。我们首先通过无分布Friedman和Kruskal-Wallis检验确定这些差距是系统性的而非抽样噪声的产物,然后引入一个两步贝叶斯分层框架,将多语言性能方差分解为可解释的组成部分。首先,隔离语言身份归因的方差,我们表明可观察的语言特征(文字、语系、类型学距离)在理解任务上解释了$R^2_{\text{ling}} = 79\%$的方差,在推理任务上解释了$92\%$,而模型内部表示与英语的相似性成为两个任务桶中的主导预测因子。其次,分解完整的(模型×基准×语言)立方体,我们发现NLU和推理具有根本不同的方差分布:模型身份主导理解(方差的66.7%),而基准×模型交互主导推理(46.3%)。这些结果共同将多语言评估从被动的性能映射重塑为一个可解释的诊断框架,并提供针对语言差异根本驱动因素的具体杠杆。

英文摘要

Multilingual Large Language Models (mLLMs) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal-Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.

2605.28161 2026-05-28 cs.CV

MeniOmni: A Structured Multimodal Benchmark for Holistic Meniscus Injury Assessment

MeniOmni:用于整体半月板损伤评估的结构化多模态基准

Shurui Xu, Siqi Yang, Weiping Ding, Hui Wang, Mengzhen Fan, Yuyu Sun, Shuyan Li

AI总结 提出MeniOmni基准,包含多中心MRI、临床先验和专家标注文本,支持细粒度Stoller分级和诊断报告生成,并引入风险感知序数评估和语义一致性指标Meni-Score。

详情
Comments
Accepted by IEEE International Conference on Multimedia and Expo (ICME) 2026 (Oral Presentation)
AI中文摘要

半月板损伤的临床诊断需要放射科医生将体积MRI证据与患者背景(如性别、年龄、BMI)相结合,并生成结构化诊断报告。现有的膝关节MRI基准通常是单模态的,依赖粗粒度标签,限制了评估整体临床推理的能力。我们提出了MeniOmni,一个用于半月板损伤评估的结构化多模态基准,包含746个多中心MRI研究,具有三平面体积输入、临床先验和专家标注的临床文本。MeniOmni支持两个任务:(1)细粒度Stoller严重程度分级和(2)诊断报告生成。我们进一步提出了风险感知序数评估和语义一致性指标(Meni-Score),以更好地反映临床相关性。基线实验表明,纳入临床先验可提高分级性能并减少严重错误,凸显了多模态上下文对更安全评估的价值。代码和数据可在https://github.com/ShuruiXu/MeniOmni获取。

英文摘要

Clinical diagnosis of meniscus injuries requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports. Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, limiting their ability to evaluate holistic clinical reasoning. We introduce MeniOmni, a structured multimodal benchmark for meniscus injury assessment, consisting of 746 multi-center MRI studies with tri-planar volumetric inputs, Clinical Priors, and expert-annotated clinical text. MeniOmni supports two tasks: (1) fine-grained Stoller severity grading and (2) diagnostic report generation. We further propose risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) to better reflect clinical relevance. Baseline experiments show that incorporating Clinical Priors improves grading performance and reduces severe errors, highlighting the value of multimodal context for safer assessment. Code and data are available at https://github.com/ShuruiXu/MeniOmni.

2605.28160 2026-05-28 cs.AI

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

按需查看:多模态推理中视觉证据获取的认知调度框架

Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun, Yidong Chen, Jiayi Ji, Qian Chen, Rongrong Ji

AI总结 提出CSMR框架,通过语言模型控制何时调用独立视觉感知模块获取任务相关视觉证据,在零样本设置下多个基准上优于基线方法。

详情
Comments
Accepted at ICML 2026
AI中文摘要

现有的多模态推理方法主要遵循两种范式:在推理前将视觉输入转换为文本,或在统一的视觉-语言表示空间中进行端到端推理。尽管取得了经验上的进展,但两种范式都存在根本性的结构限制。前者依赖于静态的视觉到文本转换,往往会压缩并丢失细粒度的视觉细节。后者容易受到联合优化和注意力机制引起的语言主导,导致推理过程中对视觉证据的忠实性系统性减弱。在这项工作中,我们认为核心挑战在于视觉证据如何以及何时被引入推理过程。受此启发,我们提出了CSMR,一种多模态推理框架,其中语言模型通过决定何时调用独立的视觉感知模块来获取任务相关的视觉证据,从而控制推理过程。在多个多模态推理基准上的实验表明,在零样本设置下,CSMR在准确性上始终优于代表性基线方法。进一步的实验分析证实,这些优势主要源于所提出的认知调度机制。

英文摘要

Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.

2605.28158 2026-05-28 cs.AI

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

OR-Space:面向工业优化智能体的全生命周期工作空间基准

Chenyu Zhou, Xinyun Lu, Jiangyue Zhao, Jianghao Lin, Dongdong Ge, Yinyu Ye

AI总结 提出OR-Space基准,通过构建、修订和解释三种任务模式,评估大语言模型智能体在工业优化工作流中的可靠性。

详情
Comments
34 pages, 8 figures
AI中文摘要

大语言模型(LLM)智能体越来越多地被用于辅助运筹学(OR)建模,然而现有的面向OR的基准通常将评估简化为从自包含的问题陈述到数学公式或求解器程序的一次性翻译。这种设置忽略了实际工业OR工作流的两个特征:持久的多工件工作空间和多阶段任务生命周期。我们引入了OR-Space,一个全生命周期的工作空间基准,用于评估工业优化智能体在模型构建、模型修订和基于解释的任务中的表现。每个实例都是一个可执行的工作空间,包含业务文档、结构化数据、可选的代码工件、求解器输出以及分布在相互依赖文件中的任务特定评估器。OR-Space定义了三种任务模式:构建模式,智能体从异构工件构建可求解的优化模型;修订模式,智能体在需求变化或求解器反馈下修改现有模型,同时保留有效的先前逻辑;解释模式,智能体利用工作空间工件中的证据回答关于解决方案、约束和业务影响的基于解释的问题。通过将持久工作空间与生命周期导向的任务相结合,OR-Space评估智能体是否能够执行超越端到端文本生成的可靠优化工作。我们描述了基准设计、评估协议和质量控制流程,并将OR-Space定位为研究LLM智能体在工业OR工作流中的可靠性、失败模式和实际准备程度的基准。

英文摘要

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

2605.28157 2026-05-28 cs.CV

Intra-YOLO: A Small Object Detection Model for Caries and Molar-Incisor Hypomineralization in Intraoral Photography Based on Transfer Learning with Reinforcement Learning

Intra-YOLO:基于迁移学习与强化学习的口内摄影龋齿与磨牙-切牙矿化不良小目标检测模型

Po-Lun Chwang, Po-Yu Chang, Wen-Liang Lin, Tung-Sheng Wu, Min-Ching Wang, Yun-Chien Cheng

AI总结 提出Intra-YOLO模型,结合迁移学习与强化学习,解决口内照片中龋齿和MIH小目标检测难题。

详情
AI中文摘要

本研究开发了一种计算机辅助诊断(CAD)系统,用于检测口内照片中的龋齿和磨牙-切牙矿化不良(MIH)。这些病变外观相似,使得临床鉴别具有挑战性,尤其是考虑到它们尺寸小且成像条件多变。

英文摘要

This study developed a computer-aided diagnosis (CAD) system for detecting caries and molar-incisor hypomineralization (MIH) in intraoral photographs. These lesions share similar appearances, making clinical differentiation challenging, especially given their small size and variability in imaging conditions.