arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2605.27662 2026-05-28 cs.LG cs.AI

How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks

优化器如何塑造等变神经网络中的学习解

Teodor-Mihai Stupariu, Andrei Manolache

AI总结本文通过比较Muon和Adam优化器在点云和分子学习任务中的表现，发现Muon能改善等变神经网络的优化效果，并分析其导致更规则损失曲面和更高有效秩的机制。

Comments Accepted at ICML 2026 Workshop on Weight-Space Symmetries

详情

AI中文摘要

等变神经网络通过构造编码几何对称性，但它们通常难以优化，并且可能表现不如约束较少的架构。越来越多的研究通过架构修改（如约束松弛或近似等变）来解决这一问题，而优化器的作用相对未被充分探索。我们通过比较Muon和Adam在点云和分子学习设置下的多种等变和几何架构来研究这一方向。在对比最清晰的ModelNet40上，Muon在所有考虑的架构上均一致优于Adam。然后，我们通过Hessian估计、损失曲面可视化以及学习权重和中间表示的谱性质来分析训练后的ModelNet40检查点。Muon达到的检查点具有更大的Hessian曲率汇总但更规则的损失曲面，并且其学习权重和表示具有更高的稳定秩和有效秩。这些观察表明，优化器设计与几何归纳偏置之间的相互作用值得社区进一步关注。

英文摘要

Equivariant neural networks encode geometric symmetries by construction, yet they are often difficult to optimize and can underperform less constrained architectures. A growing body of work addresses this through architectural modifications such as constraint relaxation or approximate equivariance, while the role of the optimizer remains comparatively underexplored. We study this direction by comparing Muon and Adam across several equivariant and geometric architectures under pointcloud and molecular learning settings. On ModelNet40, where the comparison is clearest, Muon consistently improves over Adam across all architectures considered. We then analyze the trained ModelNet40 checkpoints through Hessian estimates, loss surface visualizations, and spectral properties of learned weights and intermediate representations. The checkpoints reached by Muon have larger Hessian curvature summaries but more regular loss surfaces, and their learned weights and representations have higher stable and effective ranks. These observations suggest that the interaction between optimizer design and geometric inductive bias deserves further attention from the community.

URL PDF HTML ☆

赞 0 踩 0

2605.27661 2026-05-28 cs.RO

Design of a Real-time Asynchronous Monocular Odometry for Planetary Exploration

面向行星探测的实时异步单目里程计设计

Benat Inigo, Florian Steidle, Wolfgang Stuerzl

AI总结针对行星探测中计算资源受限、环境复杂且高动态范围光照的挑战，提出一种基于误差状态卡尔曼滤波（ESKF）的实时异步事件相机单目里程计，利用异步事件流和RATE特征跟踪器实现连续相机运动估计。

2605.27659 2026-05-28 cs.LG cs.AI

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

通过概率潜在嵌入和动态策略自适应实现迁移强化学习用于Sim-to-Real部署

Gengyue Han, Yiheng Feng

AI总结提出一种基于概率潜在嵌入和动态策略自适应的强化学习框架，通过元学习推断环境潜在表示并动态调整风险水平，实现安全高效的Sim2Real策略迁移。

详情

AI中文摘要

由于资源有限和公共安全问题，许多信息物理系统（如自动驾驶汽车）的深度强化学习（RL）智能体首先在模拟器中进行训练。然而，当部署到真实世界环境中时，由于不可避免的Sim2Real差距，它们常常遭受性能下降或安全违规。现有的零样本方法，如鲁棒安全RL和域随机化，缓解了这一问题，但通常以性能下降或遇到未建模系统动态时的残余安全风险为代价。为了解决这些限制，我们提出了一种新颖的强化学习框架，通过概率潜在嵌入和动态策略自适应实现安全高效的策略迁移。我们考虑在不同环境上下文下的一族约束马尔可夫决策过程（CMDP）。通过利用元RL中的潜在上下文变量，所提出的框架从模拟经验中推断环境的潜在表示。此外，它结合了分布RL公式，允许根据潜在上下文变量的估计精度动态调整部署策略的风险水平。该策略在早期部署阶段促进安全性，并通过在Sim2Real差距下的快速策略自适应提高效率。

英文摘要

Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real world environments, they often suffer from performance degradation or safety violations because of the inevitable Sim2Real gap. Existing zero-shot approaches, such as robust safe RL and domain randomization, mitigate this issue but typically at the cost of degraded performance or residual safety risks when experiencing unmodeled system dynamics. To address these limitations, we propose a novel reinforcement learning framework that enables safe and efficient policy transfer via probabilistic latent embeddings and dynamic policy adaptation. We consider a family of Constrained Markov Decision Processes (CMDPs) under different environment contexts. By leveraging latent context variable in meta-RL, the proposed framework infers the latent representation of the environment from simulated experiences. Furthermore, it incorporates a distributional RL formulation, which allows risk levels of the deployed policy to be adjusted dynamically, based on the estimation accuracy of the latent context variable. This strategy promotes safety at the early deployment stage and improves efficiency through fast policy adaptation under the Sim2Real gap.

URL PDF HTML ☆

赞 0 踩 0

2605.27654 2026-05-28 cs.CL cs.AI cs.CY

Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability

英译印地语中的文化保真度：性别可恢复性的保持-流畅性前沿

Samyak Savi, Chavi Gupta, Shreyas Gantayet, Tanay Sodha, Dhruv Kumar

AI总结研究英译印地语中性别信息的保持问题，提出两种推理时干预方法（SAR和PAR），在保持性别可恢复性与流畅性之间取得平衡。

Comments 10 pages, 2 figures, 9 tables

详情

AI中文摘要

生成式翻译系统是文化技术，因为它们决定如何在特定文化的语法系统中呈现具有社会意义的线索。我们研究成功文化翻译的一个具体概念：当英语源文本明确编码性别时，英译印地语应保持该线索的可恢复性，除非源文本本身存在歧义。我们在涵盖十二个类别的37,345个实例基准上评估了这一标准，并显示五个系统经常通过作格和敬语结构消除性别。然后，我们引入了两种机制感知的推理时干预。第一种是源感知重排序器（SAR），倾向于避免性别中立句法的候选。第二种是现象感知重排序器（PAR），即使在作格句法存在的情况下，也通过目标词汇标记保持性别。在GPT-4o-mini和Sarvam上，PAR将目标子集准确率分别从11.07%提高到54.47%，从15.99%提高到49.66%。人工评估显示，PAR将性别保持率从10.3%提高到81.3%，但平均流畅度从4.36降至3.37。这些发现将两种干预置于保持和流畅性的前沿，而不是支持单一的解决方案，并展示了文化定位的生成如何在保真度、流畅性和风格自然性之间需要明确的权衡。

英文摘要

Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally specific grammatical systems. We study one concrete notion of successful cultural translation: when an English source explicitly encodes gender, an English-to-Hindi translation should preserve the recoverability of that cue unless the source itself is ambiguous. We evaluate this criterion on a 37,345-instance benchmark spanning twelve categories and show that five systems frequently erase gender through ergative and honorific constructions. We then introduce two mechanism-aware inference-time interventions. The first, the Source-Aware Reranker (SAR), prefers candidates that avoid gender-neutralizing syntax. The second, the Phenomenon-Aware Reranker (PAR), preserves gender through targeted lexical marking even when ergative syntax remains. Across GPT-4o-mini and Sarvam, PAR improves target-subset accuracy from 11.07% to 54.47% and from 15.99% to 49.66%, respectively. Human evaluation shows that PAR increases gender preservation from 10.3% to 81.3%, but reduces mean fluency from 4.36 to 3.37. These findings place the two interventions on a preservation and fluency frontier rather than supporting a single dominant solution, and show how culturally situated generation can require explicit tradeoffs among fidelity, fluency, and stylistic naturalness.

URL PDF HTML ☆

赞 0 踩 0

2605.27651 2026-05-28 cs.LG

Faster Thermal Profiling of a Lunar Rover with Machine Learning Adapted Finite Difference Model

基于机器学习自适应有限差分模型的月球车快速热特性分析

Samuel Weber, Zaki Hasnain, Souma Chowdhury

AI总结提出一种物理信息机器学习框架，通过自适应粗网格划分和可微有限差分模拟器，在保持物理一致性的同时实现月球车热建模的精度与效率平衡。

详情

AI中文摘要

在极端热环境下运行的自主空间系统需要准确且高效的热建模来支持任务前系统设计和机载自主性。对于月球车而言，大温度梯度、辐射传热和可变表面条件使得可靠的热预测尤其具有挑战性。高保真物理仿真提供准确结果但计算成本高，而简化模型和查表方法往往缺乏足够精度。物理信息机器学习（PIML）通过将数据驱动模型与嵌入的物理知识相结合，提供了一种有前景的替代方案。本文提出了一种用于带有内部热源的简化月球车热分析的PIML框架，其中机器学习实现了环境自适应粗网格划分。所提出的架构集成了一种迁移神经网络（TNN），该网络根据热载荷和初始条件自适应地确定三维有限差分节点划分，从而实现更准确的粗网格计算。框架内嵌了一个可微有限差分热模拟器，以强制执行物理一致性并支持高效训练，同时一个上采样层从粗网格解重建高分辨率温度场。所提出的PIML方法与高保真细网格仿真、低保真固定粗网格模型以及纯数据驱动的人工神经网络（ANN）进行了对比评估。结果表明，相对于粗网格物理模型和ANN模型，PIML框架分别将预测精度提高了50%和39%，同时保持了物理一致的热分布。在计算方面，该框架也比高保真仿真快3倍，展示了在月球车系统热建模中精度与效率之间的有效平衡。

英文摘要

Autonomous space systems operating in extreme thermal environments require accurate and efficient thermal modeling to support both pre-mission system design and onboard autonomy. For lunar rovers, large temperature gradients, radiative heat transfer, and variable surface conditions make reliable thermal prediction especially challenging. High-fidelity physics-based simulations provide accurate results but are computationally expensive, while simplified models and lookup-table approach often lack sufficient accuracy. Physics-informed machine learning (PIML) offers a promising alternative by combining data-driven models with embedded physical knowledge. This paper presents a PIML framework for thermal analysis of a simplified lunar rover with internal heat sources, where machine learning enables environment-adaptive coarse meshing. The proposed architecture integrates a transfer neural network (TNN) that adaptively determines 3D finite-difference nodalization based on thermal loads and initial conditions, enabling more accurate coarse-mesh calculations. A differentiable finite-difference thermal simulator is embedded within the framework to enforce physical consistency and support efficient training, while an upscaling layer reconstructs high-resolution temperature fields from the coarse-grid solution. The proposed PIML approach is evaluated against high-fidelity fine-mesh simulations, low-fidelity fixed coarse-mesh models, and a purely data-driven artificial neural network (ANN). Results show that the PIML framework improves prediction accuracy by 50% and 39% relative to the coarse-mesh physics model and ANN model, respectively, while maintaining physically consistent thermal distributions. Computationally, the framework is also 3x faster than high-fidelity simulations, demonstrating an effective balance between accuracy and efficiency for thermal modeling of lunar rover systems.

URL PDF HTML ☆

赞 0 踩 0

2605.27649 2026-05-28 cs.CL cs.LG

Disentangling Language Roles in Multilingual LLM Task Execution

多语言大模型任务执行中的语言角色解耦

Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He

AI总结提出MTM-Bench基准，通过完全交叉设计解耦指令、内容和响应三种语言角色，评估多语言LLM的任务执行能力，发现响应语言角色是性能下降的主要因素。

详情

AI中文摘要

多语言大模型在指令、源内容和所需响应语言不一致时被越来越多地使用。现有基准扩展了多语言指令跟随评估，但很少在完全交叉设计中隔离这三种角色。我们引入了MTM-Bench，一个用于语言条件任务执行的控制基准，其中每个实例由三元组 $(L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})$ 定义。在英语、西班牙语和中文中，MTM-Bench枚举了所有27个三元组，每个模型包含2,430个实例，涵盖语义反转、最终状态提取和带更新实现的语言纯度。我们使用分解指标评估了20个前沿和开源权重LLM，包括语义正确性、目标语言遵循度、约束满足度、污染比率和联合成功率，并通过针对性的人工审计验证评分。完全交叉设计揭示了性能下降是由语言在任务结构中扮演的角色组织的，而不仅仅是语言不匹配的数量。响应语言角色是变化的主要轴，单个响应槽不匹配导致了大部分性能下降。仅响应不匹配与完全不匹配的比较表明，不匹配数量不是困难的单调预测因子，模型级别的排序在不同系统间变化。任务族通过不同的通道失败，表明语义正确性本身并不能捕捉可靠的多语言任务执行。

英文摘要

Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet $(L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})$. Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2{,}430 instances per model across semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. The response-only and full-mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.

URL PDF HTML ☆

赞 0 踩 0

2605.27646 2026-05-28 cs.LG cs.AI

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

Hurwitz四元数乘法量化用于KV缓存压缩

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, David Cox, Antonio Torralba

AI总结提出一种免校准的Hurwitz四元数乘法量化方法，通过将K/V的4元素块视为四元数并用量化乘积编码，在约5比特下匹配fp16困惑度，实现高达5.05倍KV缓存压缩。

详情

AI中文摘要

我们提出 extbf{Hurwitz四元数乘法量化（HQMQ）}，一种用于大语言模型KV缓存压缩的 extbf{免校准}方法。HQMQ将K或V的每个4元素块视为一个四元数，并将其单位方向量化到乘积$q_p \cdot q_s$上，其中$q_p$取自24元素Hurwitz群$2T$（$S^3$上24-cell的24个顶点，两两夹角$60^\circ$），$q_s$取自每个（层、头）的二级码本，包含$S$个 extemph{随机}单位四元数。乘法组合在$S$个存储参数下产生$24S$个有效码字；随机初始化即可，因为左乘是$S^3$等距变换，因此种子码本在最终任务困惑度上的变化小于$1.5\%$。一个每批次的中间乘数离群值提取步骤（$C=3$，无校准）处理现代离群值密集型架构。我们在五个现代开源模型上评估：Mistral-7B（密集MHA）、Llama-3-8B和Qwen2.5-7B和Qwen3-8B（密集GQA），以及gpt-oss-20b（稀疏MoE）。在Mistral-7B和Qwen3-8B上，HQMQ在约5比特下匹配fp16，困惑度差异在$0.02$--$0.03$点内。在Qwen2.5-7B和Qwen3-8B上，朴素int4导致困惑度崩溃到$10^4+$，而HQMQ + Med3$\times$在约5比特下恢复fp16质量，差异在$0.02$--$0.10$点内。HQMQ在所有五个模型上，在相同比特数下帕累托优于朴素int $3$--$1900\times$，并且在Mistral上以3.79比特的下游零样本准确率匹配fp16。与最强的校准KV量化基线相比，HQMQ在3.79比特下匹配KIVI-4（约4.5比特），在CoQA上差异约1点，TruthfulQA上0.6点，GSM8K上2.3点，同时比特数减少16%且无需校准过程。在存储层面，HQMQ提供高达5.05倍的KV压缩，将Llama-3-70B的128k上下文缓存从43 GB缩小到8.5 GB。

英文摘要

We propose \textbf{Hurwitz Quaternion Multiplicative Quantization (HQMQ)}, a \textbf{calibration-free} method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the \emph{product} $q_p \cdot q_s$, where $q_p$ ranges over the 24-element Hurwitz group $2T$ (the 24 vertices of the 24-cell on $S^3$, pairwise angle $60^\circ$) and $q_s$ ranges over a per-(layer, head) secondary codebook of $S$ \emph{random} unit quaternions. The multiplicative composition yields $24S$ effective codewords at $S$ stored parameters; random initialization suffices because left-multiplication is an $S^3$ isometry, so seeded codebooks vary in end-task ppl by $<1.5\%$. A per-batch median-multiplier outlier extraction step ($C{=}3$, no calibration) handles modern outlier-heavy architectures. We evaluate on five modern open models: Mistral-7B (dense MHA), Llama-3-8B and Qwen2.5-7B and Qwen3-8B (dense GQA), and gpt-oss-20b (sparse MoE). On Mistral-7B and Qwen3-8B, HQMQ matches fp16 within $0.02$--$0.03$ ppl points at $\sim$5 bits. On Qwen2.5-7B and Qwen3-8B, where naive int4 collapses to $10^4{+}$ ppl, HQMQ + Med3$\times$ recovers fp16 quality within $0.02$--$0.10$ ppl points at $\sim$5 bits. HQMQ Pareto-dominates naive int by $3$--$1900\times$ at matched bits across all five models, and downstream zero-shot accuracy matches fp16 at $3.79$ bits on Mistral. Against the strongest calibrated KV-quantization baseline, HQMQ at $3.79$ bits matches KIVI-4 ($\sim 4.5$ bits) within ${\sim}1$ pt on CoQA, $0.6$ pts on TruthfulQA, and $2.3$ pts on GSM8K, at $16\%$ fewer bits and without a calibration pass. At the storage level, HQMQ delivers up to $5.05\times$ KV compression, shrinking a Llama-3-70B 128k-context cache from 43 GB to 8.5 GB.

URL PDF HTML ☆

赞 0 踩 0

2605.27644 2026-05-28 cs.RO cs.AI cs.LG

Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data

Trinity：通过利用合成数据统一非结构化户外环境中的类无关地形与语义分割

Marcus G Müller, Wout Boerdijk, Maximilian Durner, Riccardo Giubilato, Abel Gawel, Wolfgang Stürzl, Roland Siegwart, Rudolph Triebel

AI总结提出基于Transformer的统一网络Trinity，联合执行类特定语义分割和类无关地形分割，利用合成数据集RUGDSynth和真实数据集EXTerra实现机器人无关的地形先验学习。

详情

AI中文摘要

地形理解对于在非结构化户外环境中运行的移动机器人至关重要。现有的基于视觉的可通行性估计方法依赖于机器人特定的标注或语义类别映射，限制了跨平台的迁移性，并在机器人能力变化时需要昂贵的重新标注，而标准的语义分割方法仅关注特定的预定义类别，无法捕捉地形的多样性。在这项工作中，我们提出了一种基于Transformer的架构，在统一网络Trinity中联合执行类特定语义分割和类无关地形分割。地形区域仅基于视觉外观进行分割，无需预定义的语义标签或机器人相关的可通行性分数。这种公式使得学习机器人无关的视觉地形先验成为可能，这些先验可以与机器人特定的经验相结合，用于下游任务，如可通行性估计、视觉里程计和任务规划。为了实现具有多样地形外观的大规模训练，我们扩展了OAISYS模拟器，并引入了RUGDSynth，这是一个受RUGD启发、包含类无关地形样本的合成数据集。此外，我们提出了EXTerra数据集，提供了带有类特定和类无关地形标签的真实世界图像。实验证明了所提出任务的可行性以及我们的联合分割方法在复杂户外环境中的有效性。代码和数据集将在本出版物发布后（经过审查）公开。

英文摘要

Terrain understanding is fundamental for mobile robots operating in unstructured outdoor environments. Existing vision-based traversability estimation methods rely on robot-specific annotations or semantic class mappings, limiting transferability across platforms and requiring costly re-annotation when robot capabilities change, while standard semantic segmentation methods only focus on specific predefined classes, which do not capture the variety of terrains. In this work, we propose a transformer-based architecture that jointly performs class-specific semantic segmentation and class-agnostic terrain segmentation within a unified network, called Trinity. Terrain regions are segmented based solely on visual appearance, without predefined semantic labels or robot-dependent traversability scores. This formulation enables the learning of robot-agnostic visual terrain priors that can be combined with robot-specific experience for downstream tasks such as traversability estimation, visual odometry, and mission planning. To enable large-scale training with diverse terrain appearances, we extend the OAISYS simulator and introduce RUGDSynth, a synthetic dataset inspired by RUGD with class-agnostic terrain samples. Furthermore, we present the EXTerra Dataset, providing real-world images annotated with both class-specific and class-agnostic terrain labels. Experiments demonstrate the feasibility of the proposed task and the effectiveness of our joint segmentation approach in complex outdoor environments. Code and datasets will be released with this publication (after review).

URL PDF HTML ☆

赞 0 踩 0

2605.27643 2026-05-28 cs.RO physics.optics

Agentic Language-to-Objective Synthesis for Optofluidic Assembly

面向光流组件的智能语言到目标合成

Ivan Saraev, Elena Erben, Weida Liao, Fan Nan, Gerhard Neumann, Eric Lauga, Moritz Kreysing

AI总结提出Speak-to-Objective模块化智能流水线，利用条件大语言模型将口语或书面指令转换为可微目标函数，实现光流控微粒子组装，并支持用户反馈学习。

Comments 21 pages, 5 figures

详情

AI中文摘要

基于光的先进制造日益需要可编程、闭环工具，将人类设计意图转化为小尺度上的可执行操作。然而，在机器人和制造模式中仍存在一个关键瓶颈：将用户意图转化为机器可读且可靠执行的目标。尽管微机器人通过光驱动流体提供了多功能操控，但数学上可处理的目标规范仍然手动且难以重用。本文介绍Speak-to-Objective，一个模块化智能流水线，使用条件大语言模型将口语或书面指令转换为完全可微的目标函数，用于在约束感知逆求解器（SLSQP）和实验光流控平台上组装微粒。该方法采用紧凑循环——感知→组合→提议→行动→报告与学习——将目标作为意图与驱动之间的接口，分离组装或图案化什么与如何驱动，同时从用户反馈中学习。流水线组合几何、间距和分配/拓扑项，生成鲁棒的描述性目标，从部分轨迹组装并在扰动后恢复，以及用于精确定位的显式目标，所有均以执行器无关的方式。使用激光诱导热粘性流作为物理驱动模式，我们展示了自然语言可编程的、基于光的微尺度粒子图案组装在微流控环境中。除了对可编程微组装的直接影响，以及使用激光诱导光流控驱动作为降复杂度实验平台，我们的工作指向自驱动、AI辅助的光学制造平台，其中自然语言、可微目标和激光驱动耦合为可重复使用的数字工作流。

英文摘要

Light-based advanced manufacturing increasingly requires programmable, closed-loop tools that translate human design intent into executable operations at small length scales. Yet a key bottleneck persists across robotic and manufacturing modalities: turning user intent into machine-readable objectives that are reliably executable. While micro-robotics offers versatile manipulation via optical actuation of fluids, mathematically tractable goal specification remains manual and hard to reuse. Here, we introduce Speak-to-Objective, a modular agentic pipeline that uses a conditioned Large Language Model (LLM) to translate spoken or written commands into fully differentiable objective functions for assembling microparticles in a constraint-aware inverse solver (SLSQP) and on an experimental optofluidic platform. The approach employs a compact loop - perceive -> compose -> propose -> act -> report & learn - that treats the objective as the interface between intent and actuation, separating what to assemble or pattern from how to actuate, while learning from user feedback. The pipeline composes geometry, spacing, and assignment/topology terms to generate robust descriptive objectives that assemble from partial traces and recover after perturbations, as well as explicit objectives for precise placement, all in an actuator-agnostic fashion. Using laser-induced thermoviscous flows as the physical actuation modality, we demonstrate natural-language-programmable, light-based microscale assembly of particle patterns in a microfluidic environment. Beyond its immediate impact on programmable microassembly, and using laser-induced optofluidic actuation as a reduced-complexity experimental platform, our work points toward self-driving, AI-assisted optical manufacturing platforms in which natural language, differentiable objectives, and laser-based actuation are coupled into a reusable digital workflow.

URL PDF HTML ☆

赞 0 踩 0

2605.27642 2026-05-28 cs.CL cs.LG

Learning to Translate from Soft to Hard LLM Prompts

学习从软提示到硬提示的翻译

Pitipat Kongsomjit, Suryansh Goyal, Jacob Whitehill

AI总结本文通过训练一个专用的软提示到自然语言翻译模型，提高了翻译质量，并展示了软提示可以转化为可移植的文本提示，在大型闭源模型上超越原软提示甚至少样本学习。

Comments 8 Pages, 11 tables, 4 Figures

2605.27636 2026-05-28 cs.CL

Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering

Simorgh at SemEval-2026 task 7: 面向低资源文化推理的多语言问答中的区域感知混合检索

Hadi Bayrami Asl Tekanlou, Mahdi Bakhtiyarzadeh, Jafar Razmara

AI总结提出区域感知混合检索方法，结合BM25和稠密语义相似度与区域加权启发式，以提升多语言文化问答的跨语言稳定性。

Comments 6 pages, 3 figures, accepted to the Everyday Knowledge Across Diverse Languages and Cultures shared task at SemEval2026

详情

AI中文摘要

尽管大型语言模型（LLMs）在通用领域的推理任务中表现出色，但在数字和文本数据有限的语种中，面对文化相关知识时可能遇到挑战。本文利用BLEnD基准研究文化相关的多项选择问答，该基准包含30种语言的多语料库，涵盖饮食、体育、家庭等社会文化领域。我们提出一种区域感知混合检索方法，结合BM25词汇匹配和稠密语义相似度与区域加权启发式，以提高答案的相关性。检索到的文档用于构建结构化提示，输入Qwen3-14B量化模型，并采用基于logit的确定性答案选择。实验结果表明，与纯参数推理相比，混合检索方法在文化问答中提升了跨语言稳定性。然而，训练数据量不同的语言之间仍存在显著性能差距，这表明检索增强方法并未完全克服训练数据不平衡问题。

英文摘要

Although Large Language Models (LLMs) demonstrate excellent capabilities and performance for general reasoning tasks within the general public domain, they may face challenges with culturally grounded knowledge within languages with limited digital and textual data. In this paper, we investigate culturally grounded multiple-choice question answering with the BLEnD benchmark, which consists of a multilingual corpus of 30 languages and covers various socio-cultural domains, such as cuisine, sports, family, etc. We propose a region-aware hybrid retrieval approach that combines BM25 lexical matching and dense semantic similarity with regional weighting heuristics to improve the relevance of the answer. The retrieved documents are used to construct a structured prompt for the Qwen3-14B quantized model with logit-based deterministic answer selection. The experimental results show improvements to cross-lingual stability with the hybrid retrieval approach over pure parametric inference for culturally grounded question answering. However, there are still notable performance gaps between languages with more and less training data. This shows that the limitations of the retrieval augmentation approach are not entirely overcome by the training data imbalance problem.

URL PDF HTML ☆

赞 0 踩 0

2605.27622 2026-05-28 cs.AI cs.SC

Reasoning and Planning with Dynamically Changing Norms

动态变化规范的推理与规划

Taylor Olson, Roberto Salas-Damian, Kenneth D. Forbus

AI总结本文提出一种在人类-AI环境中使用动态变化规范引导规划的方法，通过可废止演算解决规范冲突并将规范作为规划护栏，理论证明与对话任务实验验证了有效性。

Comments 8 pages, 1 figure, dataset included in anc

2605.27619 2026-05-28 cs.LG cs.AI

Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

基于最优传输和依赖性最大化的有监督分布约简

Sai-Aakash Ramesh, Archit Sood, Andrew Corbett, Tim Dodwell

AI总结提出有监督分布约简（SDR）算法，通过结合最优传输和显式依赖性最大化，学习同时保留数据几何结构和目标相关信号的紧凑表示。

详情

AI中文摘要

学习同时捕捉内在数据几何结构和目标相关结构的表示仍然是一个基本挑战，特别是在数据约简必须在压缩与预测保真度之间取得平衡的场景中。虽然分布约简（包括联合聚类和降维）提供了一种原则性的数据总结方法，但其有监督变体仍然相对未被充分探索，尽管保留任务相关信号对于下游预测和决策至关重要。我们提出有监督分布约简（SDR），一种通过结合最优传输和显式依赖性最大化来学习目标感知表示的算法。SDR 基于融合 Gromov-Wasserstein（FGW）目标，将输入分布的 relational 结构与一组代表点对齐，同时增加一个直接依赖性项，鼓励学习到的嵌入更明确地捕捉预测信号。这产生了反映几何结构和监督的紧凑表示。除了表示学习，SDR 自然地诱导出一种数据依赖的非平稳几何结构，可用于高斯过程（GP）建模等场景。通过目标感知的分布对齐重新定义距离，SDR 能够构建适应数据几何和监督局部变化的自适应核，为非平稳核设计提供了基于最优传输的视角。

英文摘要

Learning representations that capture both intrinsic data geometry and target-relevant structure remains a fundamental challenge, particularly in settings where data reduction must balance compression with predictive fidelity. While distributional reduction-encompassing joint clustering and dimensionality reduction-offers a principled way to summarize data, its supervised variants remain relatively under-explored, despite the importance of retaining task-relevant signal for downstream prediction and decision-making. We propose Supervised Distributional Reduction (SDR), an algorithm for learning target-aware representations by combining optimal transport with explicit dependence maximization. SDR builds on the Fused Gromov-Wasserstein (FGW) objective to align the relational structure of the input distribution with a set of representative points, while augmenting it with a direct dependence term that encourages the learned embeddings to capture predictive signal more explicitly. This results in compact representations that reflect both geometric structure and supervision. Beyond representation learning, SDR naturally induces a data-dependent, non-stationary geometry that can be leveraged for settings such as Gaussian Process (GP) modelling. By redefining distances through target-aware distributional alignment, SDR enables the construction of adaptive kernels that respond to local variations in both data geometry and supervision, offering an optimal transport-based perspective on non-stationary kernel design.

URL PDF HTML ☆

赞 0 踩 0

2605.27618 2026-05-28 cs.LG

Evaluating Local Explainability Metrics for Machine Learning Models on Tabular Data

评估表格数据机器学习模型的局部可解释性指标

Tomás Pereira, João Vitorino, Eva Maia, Isabel Praça

AI总结研究局部可解释性技术在复杂表格分类任务中的可信度，通过基准测试LIME、Kernel SHAP和特征消融技术，发现解释质量主要受数据集复杂性和特征分布影响，而非模型预测性能。

Comments 9 pages, 12 tables, 1 figure, DATA 2026 Conference

详情

AI中文摘要

尽管广泛使用可解释性技术来尝试理解人工智能（AI）的行为，但生成的解释可能并不总是可靠的。一个解释对人类来说可能看似合理，但未能捕捉模型的内部推理，特别是在处理复杂的表格数据时。本文研究了局部可解释性技术在应用于复杂表格分类任务时的可信度，考虑了三个主要属性的评估指标：对模型预测的忠实度、对输入数据变化的鲁棒性以及解释本身的复杂性。对局部可解释模型无关解释（LIME）、Kernel SHAP（Shapley Additive exPlanations）和特征消融技术进行了基准测试，涉及32个数据集和不同类型的机器学习模型。分析了模型性能范围，以识别两组：共识正确（所有模型正确预测的样本）和共识错误（所有模型错误预测的样本）。获得的结果表明，解释并不总是与模型的预测性能相关。相反，数据集复杂性和特征分布似乎是影响解释质量和可靠性的主要因素。

英文摘要

Despite the wide use of explainability techniques to attempt to understand the behavior of Artificial Intelligence (AI), the generated explanations may not always be reliable. An explanation can appear plausible to humans but fail to capture the internal reasoning of a model, particularly when dealing with complex tabular data. This paper studies the trustworthiness of local explainability techniques when applied to complex tabular classification tasks, considering evaluated metrics for three main properties: faithfulness to the model's predictions, robustness to input data variations, and complexity of the explanation itself. A benchmark was performed for Local Interpretable Model-Agnostic Explanations (LIME), Kernel SHapley Additive exPlanations (SHAP), and Feature Ablation techniques, across 32 datasets and different types of machine learning models. Model performance ranges were analyzed to identify two groups: consensus-correct, which are samples that all models predicted correctly, and consensus-wrong, samples that all models predicted incorrectly. The obtained results demonstrate that that the explanations are not always correlated with a model's predictive performance. Instead, dataset complexity and feature distributions seem to be the main factors affecting explanation quality and reliability.

URL PDF HTML ☆

赞 0 踩 0

2605.27616 2026-05-28 cs.CV cs.AI

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

并非所有 NVFP4 QAT 配方都相同：架构和规模如何影响异常分割的模型质量

Zijian Du, Oleg Rybakov

AI总结本研究通过统一协议评估多种架构、规模和 FP4 量化感知训练 (QAT) 配方在脑肿瘤异常分割任务中的交互作用，发现架构选择对量化鲁棒性影响最大，注意力机制架构对配方选择具有显著韧性，而 CNN 在大规模下受梯度量化配方影响性能下降。

详情

Journal ref: CVPR2026

AI中文摘要

实时异常分割要求高召回率和高效的低精度推理。我们研究了模型架构、模型规模和 FP4 量化感知训练 (QAT) 配方在召回关键的脑肿瘤分割任务中的三方交互，在统一协议下评估了多种架构、规模和 QAT 配方。我们发现架构选择对量化鲁棒性影响最大，基于注意力的架构对配方选择表现出显著的韧性，而 CNN 在大规模下在梯度量化配方下性能下降。在低容量下，FP4 可能离散化 softmax 注意力，但高级 QAT 配方可防止这种崩溃。在更大规模下，高级配方减轻了降低 CNN 质量的梯度量化噪声。五折患者级交叉验证证实这些发现对数据划分具有鲁棒性。我们的结果表明，Swin Transformer 在所有规模下对 QAT 配方选择都具有鲁棒性，使其成为 FP4 量化异常分割的推荐架构。

英文摘要

Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model architecture, model scale, and FP4 quantization-aware training (QAT) recipe on a recall-critical brain tumor segmentation task, evaluating multiple architectures, scales, and QAT recipes under a unified protocol. We find that architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNN degrades under gradient-quantizing recipes at larger scales. At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse. At larger scales, advanced recipes mitigate gradient quantization noise that degrades CNN quality. Five-fold patient-level cross-validation confirms these findings are robust to data partition. Our results show that the Swin Transformer is robust to QAT recipe choice across all scales, making it the recommended architecture for FP4-quantized anomaly segmentation.

URL PDF HTML ☆

赞 0 踩 0

2605.27605 2026-05-28 cs.AI cs.SE

Laguna M.1/XS.2 Technical Report

Laguna M.1/XS.2 技术报告

Julien Abadji, Marah Abdin, Connor Adams, Eric Alcaide, Mustafa Altun, Michele Artoni, Junze Bao, Uday Barar, Vassilis Bekiaris, Arkadii Bessonov, Benjamin Bütikofer, Jonathan Chang, Yen-Chun Chen, Dmitry Chernenkov, Yang Chi, Filippos Christianos, Fenia Christopoulou, Razvan-Andrei Ciocoiu, Tzachi Cohen, Yohann Coppel, Dmitrii Emelianenko, Brandon Fergerson, Brian Fitzgerald, Matthias Gallé, Alex Golonzovskyi, George Grigorev, Yiyang Hao, Christian Hensel, Jan Huenermann, Ye Ji, Sarthak Joshi, Eiso Kant, Kabir Khandpur, Seonghyeon Kim, Vladimir Kirichenko, Umut Kocasarac, Ilya Kochik, Ivan Komarov, Chaerin Kong, Anurag Koul, François-Joseph Lacroix, Sergei Laktionov, Waren Long, Quentin Malartic, Vadim Markovtsev, Afonso Marques, Robert McHardy, Carlos Mocholí, Dmitry Monakhov, Adam Morris, Martin Muller, Christian Mürtz, Robin Nabel, Thien Nguyen, Rok Novosel, Szymon Ozog, Aalhad Patankar, Aleksei Petrov, Alexandre Piché, Arthur Pignet, Teodor Poncu, Phil Potter, Alexander Rakowski, Pierre-Yves Ritschard, Jay Roberts, Joe Rowell, Piotr Sarna, Pierre-André Savalle, Uladzislau Sazanovich, Nikita Shapovalov, Arsenii Shevchenko, Mikhail Shilkov, Andrei Sokol, Mohamed Soliman, Jack Stephenson, Victor Storchan, Dragos-Constantin Tantaru, Artem Tyurin, Adrian Wälchli, Pengming Wang, Jianxiao Yang, Renat Zayashnikov, Alexander Zelenka Martin, Nikolay Zinov, Caroline Bercier, José Caldeira, Margarida Garcia, Tom George, Kabeer Gharzai, Glenn Hitchcock, Carson Klingenberg, Ivo Pinto, Varun Randery, Noah Smith, Arina Sugako, Jason Warner

AI总结本文介绍了两个用于长周期自主编码的混合专家基础模型 Laguna M.1 和 XS.2，通过端到端训练和模型工厂系统，在软件工程基准测试中达到先进水平。

Comments Technical report to models released here: https://poolside.ai/blog/introducing-laguna-xs2-m1

详情

AI中文摘要

我们介绍了 Laguna M.1 和 Laguna XS.2，两个为长周期自主编码构建的混合专家基础模型：M.1 总参数量为 2258 亿（每 token 激活 234 亿），XS.2 总参数量为 334 亿（每 token 激活 30 亿）。两个模型均在我们称为模型工厂的内部系统中从头到尾端到端训练：这是一个紧密集成的版本化数据、训练、评估和推理组件栈，将模型开发转变为工业流程。我们描述了模型工厂的原理和设计选择，并详细介绍了模型的端到端训练过程，包括预训练数据和架构、后训练阶段、评估和量化。在自主软件工程和终端基准测试（SWE-bench Verified、SWE-bench Multilingual、SWE-Bench Pro 和 Terminal-Bench 2.0）上，M.1 和 XS.2 在其各自的权重级别中与最先进的开源模型具有竞争力。Laguna XS.2 权重在 Apache 2.0 许可下发布，地址为 https://huggingface.co/collections/poolside/laguna-xs2。

英文摘要

We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has $225.8$B total parameters ($23.4$B activated per token) and XS.2 has $33.4$B total ($3$B activated). Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly-integrated stack of versioned data, training, evaluation, and inference components that turn model development into an industrial process. We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, throughout pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache~2.0 at https://huggingface.co/collections/poolside/laguna-xs2.

URL PDF HTML ☆

赞 0 踩 0

2605.27599 2026-05-28 cs.LG cs.AI cs.AR cs.DC cs.PF

The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution

能源盲点：NVIDIA 旗舰边缘 AI 硬件无法支持进程级能源归因

Deepak Panigrahy, Aakash Tyagi

AI总结本文审计了 ASUS Ascent GX10 (GB10 SoC) 平台的能源可观测性，发现其缺乏 CPU 能源计数器等关键接口，导致无法像 x86 的 RAPL 那样进行进程级能源归因，并提出通过外部直流计量和 GPU 减法进行校准的临时方案，呼吁将能源可观测性作为硬件的一等要求。

详情

AI中文摘要

代理型 AI 工作负载——其中单个用户目标触发多步编排、工具调用、重试和故障恢复——正被瞄准用于边缘部署，NVIDIA、戴尔、惠普、华硕、微星、宏碁和技嘉都将在 2026 年出货基于 GB10 的桌面 AI 系统。我们最近证明，编排结构主导了代理型能源成本，工作流每个成功目标消耗的能源是线性基线的 4.33 倍，而多步推理任务的 OOI 达到 7.63 倍。另外，Rajat 等人表明，在代理型工作负载中，CPU 端处理占总延迟的 90.6%，占总动态能源的 44%。我们报告了对 ASUS Ascent GX10 (GB10 SoC) 的系统性能源可观测性审计，发现该平台通过任何支持的软件接口都不暴露 CPU 能源计数器、INA 电源轨监视器、IPMI/BMC 和 SCMI powercap 协议。唯一的设备上能源遥测是通过 NVML 的瞬时 GPU 功率。我们进一步发现，联发科固件已经通过未记录的 ACPI 接口 (SPBM) 在内部计算每轨能源，但 NVIDIA 表示“没有计划暴露 CPU 轨信息”。因此，通过支持的接口，无法在此平台上重现像 x86 通过 RAPL 执行的设备上每进程能源归因。我们形式化了能源归因 AI 的硬件需求规范，提出了使用外部直流计量结合 GPU 减法的临时校准桥接，并确定了通过 SCMI powercap 的标准轨道路径。我们的发现激励低碳计算社区将能源可观测性作为硬件的头等要求。

英文摘要

Agentic AI workloads - where a single user goal triggers multi-step orchestration, tool calls, retries, and failure recovery - are being targeted for edge deployment, with NVIDIA, Dell, HP, ASUS, MSI, Acer, and Gigabyte all shipping GB10-based desktop AI systems in 2026. We recently demonstrated that orchestration structure dominates agentic energy cost, with workflows consuming 4.33x more energy per successful goal than linear baselines and OOI reaching 7.63x for multi-step reasoning tasks. Separately, Rajat et al. show that CPU-side processing accounts for up to 90.6% of total latency and 44% of total dynamic energy in agentic workloads. We report a systematic energy-observability audit of the ASUS Ascent GX10 (GB10 SoC) and find that the platform exposes no CPU energy counter, no INA power-rail monitor, no IPMI/BMC, and no SCMI powercap protocol through any supported software interface. The only on-device energy telemetry is instantaneous GPU power via NVML. We further discover that the MediaTek firmware already computes per-rail energy internally via an undocumented ACPI interface (SPBM), but NVIDIA states there are "no plans to expose CPU rail information." On-device per-process energy attribution - as performed on x86 via RAPL - is therefore not reproducible on this platform through supported interfaces. We formalize a hardware requirements specification for energy-attributed AI, propose an interim calibration bridge using external DC metering combined with GPU subtraction, and identify a standards-track path via SCMI powercap. Our findings motivate the low-carbon computing community to demand energy observability as a first-class hardware requirement.

URL PDF HTML ☆

赞 0 踩 0

2605.27596 2026-05-28 cs.CL

Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning

幻觉能否有用？通过链式系统I/II推理用SLM解决多跳问题

Saptarshi Sengupta, Suhang Wang

AI总结提出一种“先回答后推理”的认知启发框架，利用SLM的初始答案（可能包含幻觉）作为假设来检索证据，再通过系统II深度推理，从而在多跳问答任务上超越传统的“先思考后检索”方法。

详情

AI中文摘要

最近，小型语言模型（SLM）引起了越来越多的兴趣，它们速度快、性能好，且硬件需求低于大型语言模型（LLM）。然而，SLM比LLM更容易产生幻觉，影响其解决复杂多步推理问题的能力，因为早期错误会级联到最终响应。为了解决这个问题，现有工作采用先思考后迭代检索的策略来减少幻觉。我们认为先思考策略并非总是必要，因为我们发现：（i）SLM通常对其初始答案有准确的置信度，并且（ii）幻觉实际上可能有助于逼近正确答案。因此，我们将我们的工作定位为这种策略的反转，即先回答后推理。我们提出了一个认知启发的框架，其中模型首先被允许快速回答问题（系统I（零样本）），然后基于从知识源使用初始假设检索到的证据进行更深层次的思考（系统II）。通过结合系统I和系统II风格的推理，我们展示了我们的方法在各种多步问答基准测试中可以优于先前采用传统先思考路径的工作。

英文摘要

Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, impacting their ability to solve complex multi-step reasoning problems as early mistakes cascade to the final response. To address this, existing works think-first followed by iterative retrieval to reduce hallucination. We argue that the think-first strategy is not always necessary as we find that: (i) SLMs are often accurately confident in their initial answer and, (ii) hallucinations can actually be beneficial for honing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first-reason later. We propose a cognitively-inspired framework where the model is first allowed to quickly answer the question (System-I (zero-shot)) and then resorts to deeper thinking (System-II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System-I and System-II style thinking, we show that our method can outperform prior work that takes the traditional think-first route on various multi-step question-answering benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2605.27595 2026-05-28 cs.CV cs.AI

Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks

多模态大语言模型在农业图像解释与生成任务中的幻觉行为

Partho Ghose, Al Bashir, Prem Raj, Azlan Zahid

AI总结本研究系统评估了多模态大语言模型在农业图像解释（图像到文本）和生成（文本到图像）任务中的幻觉行为，发现模型存在生物不一致、上下文不准确和农学不合理等错误模式，并通过少样本提示等方法分析了幻觉的残留影响。

详情

AI中文摘要

大型语言模型（LLMs）正迅速被应用于农业成像领域，从作物解释到合成田间图像生成。然而，这些模型经常表现出看似自信但偏离生物或环境现实的幻觉输出，可能导致错误的农学见解。本研究从两个互补方向调查此类幻觉：图像到文本，即LLMs解释作物或田间图像以描述生物和非生物胁迫等条件；以及文本到图像，即模型基于描述性提示生成合成农业场景。我们检查涉及生物不一致、上下文不准确和农学不合理的错误，并在多个成像模态下根据领域知情标准评估输出。我们的分析识别了解释性和生成性任务中反复出现的幻觉模式。在图像解释中，LLMs（例如Gemma、LLAVA、Qwen和MiniCPM）实现了适度的零样本准确率（63%至75%），而少样本提示将性能提升至高达86.8%，但仍表现出虚假检测和漏检感染，表明存在残留幻觉效应。在文本到图像任务中，高级模型如GPT-5和Gemini 2.5 Flash在宽松提示约束下生成高达91%的生物不一致场景，揭示了当前LLMs的根本弱点。这种对视觉推理和生成的系统评估为增强基于LLM的农业成像平台的可靠性和可信度提供了关键见解。

英文摘要

Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic field image generation. However, these models frequently exhibit hallucinations outputs that appear confident yet deviate from biological or environmental reality potentially leading to misinformed agronomic insights. This study investigates such hallucinations in two complementary directions: image-to-text, where LLMs interpret crop or field imagery to describe conditions such as biotic and abiotic stresses, and text-to-image, where models generate synthetic agricultural scenes based on descriptive prompts. We examine errors involving biological inconsistency, contextual inaccuracy, and agronomic implausibility, evaluating the outputs under domain-informed criteria across multiple imaging modalities. Our analysis identifies recurring hallucination patterns within both interpretive and generative tasks. In image interpretation, LLMs (e.g., Gemma, LLAVA, Qwen, and MiniCPM) achieved modest zero-shot accuracy (63 to 75 percent), whereas few-shot prompting improved performance up to 86.8 percent, exhibiting false detections and missed infections, indicating residual hallucination effects. In text-to-image tasks, advanced models such as GPT-5 and Gemini 2.5 Flash generate up to 91 percent biologically inconsistent scenes under relaxed prompt constraints, revealing fundamental weaknesses in current LLMs. This systematic assessment of visual reasoning and generation offers critical insights toward enhancing the reliability and trustworthiness of LLM-based agricultural imaging platforms.

URL PDF HTML ☆

赞 0 踩 0

2605.27593 2026-05-28 cs.AI cs.MA

Voluntary Collusion with Secret Tools in Competing LLM Agents

竞争性LLM代理中使用秘密工具的合谋行为

Xijie Zeng, Frank Rudzicz

AI总结本研究通过两个多智能体环境（Liar's Bar和Cleanup）发现，即使工具被明确标注为不公平且有害，大多数LLM代理仍会自愿采用秘密合谋工具以获取战略优势，且仅靠对齐或公平标签无法有效阻止，需明确防护措施。

详情

AI中文摘要

即使工具被明确描述为对他人不公平且有害，表面上经过安全对齐的LLM代理仍然会在这样做能带来战略优势时自愿参与秘密合谋。为了研究这一现象，我们引入了一个基于两个战略多智能体环境的实证框架：Liar's Bar（一个竞争性欺骗场景）和Cleanup（一个混合动机资源管理场景），其中代理被提供秘密合谋工具，这些工具在明显不利于其他代理的同时提供了显著优势。在12个模型（7B、70B和专有规模）和6种提示变体中，我们发现大多数代理一致地接受这些工具并制定合谋策略，同时在接受前明确承认工具的不公平性。我们进一步表明，无论是公平标签还是基线对齐都无法可靠地阻止合谋：只有明确的伦理框架能减少采用，即使如此，较小的模型仍然容易受到影响。更广泛地说，我们的工作首次系统性地研究了基于LLM的多智能体系统中自愿合谋采用的问题，并表明防止此类行为需要明确的防护措施，而非依赖通用对齐。

英文摘要

Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in secret collusion whenever doing so confers a strategic advantage. To investigate this phenomenon, we introduce an empirical framework built on two strategic multi-agent environments: Liar's Bar, a competitive deception scenario, and Cleanup, a mixed-motive resource-management scenario, in which agents are offered secret collusion tools that provide significant advantages while clearly disadvantaging the other agents. Across 12 models (at the 7B, 70B, and proprietary scales) and 6 prompt variants, we find that most agents consistently accept these tools and develop collusive strategies, while explicitly acknowledging the unfairness of the tools before accepting. We further show that neither the unfairness labels nor baseline alignment alone reliably deters collusion: only explicit ethical framing reduces adoption and, even then, smaller models remain susceptible. More broadly, our work presents the first systematic investigation of voluntary collusion adoption in LLM-based multi-agent systems, and suggests that preventing such behaviour requires explicit safeguards rather than reliance on general alignment.

URL PDF HTML ☆

赞 0 踩 0

2605.27591 2026-05-28 cs.LG

Gradient Transformer: Learning to Generate Updates for LLMs

梯度变换器：学习为大语言模型生成更新

Binh-Nguyen Nguyen, Khang Tran, NhatHai Phan, Issa Khalil

AI总结提出一种无数据知识蒸馏框架，利用梯度变换器将微调后小语言模型的更新向量转换为大语言模型的更新向量，实现无需私有数据即可更新大模型。

Comments Accepted at ICML 2026

详情

AI中文摘要

许多组织缺乏计算资源在私有（不可共享）数据上微调大语言模型（LLM）以获得更好的效用，而单独微调小语言模型（TinyLM）效果不佳。为解决这一瓶颈，我们提出一种无数据知识蒸馏框架，该框架基于在私有数据上微调的TinyLM生成LLM更新向量。更新向量是从初始模型到其在数据集上微调版本的参数变化向量，捕捉微调过程中累积梯度步骤的效果。我们框架的关键思想是一种新颖的梯度变换器（Gradient Transformer），它将TinyLM的更新向量转换为LLM的更新向量。正如从影子数据集中推导出的，Grad-Transformer捕捉了TinyLM和LLM更新向量之间的相关性，使得第三方提供商能够在给定组织的TinyLM更新向量的情况下生成LLM更新向量，而无需访问组织的私有数据。该框架支持多组织协作以共同更新LLM，提高了性能和成本效率。在语言建模和推理任务上的大量实验表明，即使在严格的差分隐私保护下，Grad-Transformer也显著优于最先进的知识蒸馏基线。

英文摘要

Many organizations lack computational resources to fine-tune large language models (LLMs) on private (unshareable) data for better utility, while fine-tuning tiny language models (TinyLMs) alone performs poorly. To address this bottleneck, we propose a data-free knowledge distillation framework that generates LLM update vectors based on TinyLMs fine-tuned on private data. An update vector is a vector of parameter changes from an initial model to its fine-tuned version on a dataset, capturing the effect of cumulative gradient steps during fine-tuning. The key idea of our framework is a novel Gradient Transformer that transforms TinyLM's update vectors into LLM's update vectors. As derived from shadow datasets, Grad-Transformer captures the correlation between TinyLM and LLM update vectors, enabling third-party providers to generate LLM update vectors given the organization's TinyLM update vectors without accessing the organization's private data. The framework supports multi-organization collaboration to jointly update LLMs, improving performance and cost-efficiency. Extensive experiments across language modeling and reasoning tasks show that Grad-Transformer remarkably outperforms state-of-the-art knowledge distillation baselines, even under strict differential privacy protection.

URL PDF HTML ☆

赞 0 踩 0

2605.27589 2026-05-28 cs.CV

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

What-If World: 具身场景中通用世界模型的因果基准

Kunlin Cai, Rui Song, Jinghuai Zhang, Kaiyuan Zhang, Pranav Bodapati, Alicia Yu, Fnu Suya, Mohammad Rostami, Jiaqi Ma, Yuan Tian

AI总结提出 What-If World 基准，通过成对提示测试视频生成模型在物理变化下的因果一致性，发现现有模型在因果干预上表现不佳。

Comments 38 pages, World Model Benchmark

详情

AI中文摘要

视频生成模型越来越多地被用作世界模拟器，用于驾驶和机器人操作等任务。在这些场景中，重要的不是单个视频看起来是否正确，而是模型的输出在输入变化时是否随之变化。我们通过给模型两个描述同一场景但一个物理细节不同的提示，并检查两个视频是否按照物理预测的方式产生差异来测试这一点。提示之间的措辞差异在设计上很小，因为只改变了一个变量，但正确的物理差异并不小。忽略这一点的模型仍然可以生成两个各自看起来合理的视频，而现有基准一次只评分一个视频，无法检测到这种失败。我们引入了 What-If World，包含 319 个这样的提示对，基于 nuScenes 和 DROID 的真实帧构建，并按驾驶和操作中共享的六个物理变量的分类法组织。每个对使用 APEO 评分，这是一个包含四个部分的评分标准，检查每个视频是否遵循其提示（遵循性）、物理上一致（物理性）、保持共享场景（环境性）以及最终产生正确的差异（结果性）。在九个最先进的模型中，没有系统在配对得分上超过 52%，开源模型集中在 28% 附近。每个测试的模型在大量因果干预上失败，表明这些模型在能够可靠支持动作条件模拟或基于模型的规划之前还有很大差距。在模型得分较高的地方，性能似乎与干预的视觉显著性相关，而不是其底层物理的可处理性。一些视觉上微妙的干预得分低至 14.2%，而视觉上显著的干预得分达到 40.4%。

英文摘要

Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model's output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts. The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure. We introduce What-If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation. Each pair is scored with APEO, a four-part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome). Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action-conditioned simulation or model-based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.

URL PDF HTML ☆

赞 0 踩 0

2605.27584 2026-05-28 cs.AI cs.SI

Cyberbullying Governance on Social Media: A Unified Framework from Content Identification to Intervention

社交媒体上的网络暴力治理：从内容识别到干预的统一框架

Yiting Huang, Wenting Zhu, Zekun Wang, Qingpo Yang, Yakai Chen, Zihui Xu, Yueyue Zhang, Sanchuan Guo, Xi Zhang

AI总结本文提出一个涵盖内容识别、用户行为建模、扩散动态与早期预警、干预治理四阶段的统一全生命周期治理框架，以解决网络暴力被动、孤立检测的局限，实现主动、持续、综合的治理。

详情

AI中文摘要

社交媒体平台和在线社区的激增无意中催化了网络暴力、仇恨言论和其他形式的在线毒性传播，使得有效治理此类危害成为关键的社会和计算挑战。尽管在自动化内容审核方面取得了显著进展，但现有研究主要将网络暴力治理视为被动、孤立的帖子级检测。这种还原论观点忽视了用户持续的行为动态、毒性事件的结构性扩散以及主动缓解的关键需求。为弥补这些差距，本文提出一个统一的全生命周期治理框架，将网络暴力治理的范式从孤立的静态检测转向集成、持续和主动的审核。借鉴网络暴力研究及相邻领域，我们系统地综合了四个相互关联阶段的最新文献：（1）内容识别，（2）用户与行为建模，（3）扩散动态与早期预警，以及（4）干预与治理。此外，我们回顾了可用的数据集和评估实践，并讨论了新兴挑战，包括多模态性、可解释性、算法公平性以及生成式AI的双重使用风险，为未来研究提供了路线图，以构建更安全、更具韧性的数字生态系统。

英文摘要

The proliferation of social media platforms and online communities has inadvertently catalyzed the spread of cyberbullying, hate speech, and other forms of online toxicity, making the effective governance of such harm a critical societal and computational challenge. While significant strides have been made in automating content moderation, existing research predominantly treats cyberbullying governance as passive, isolated detection at the post level. This reductionist view overlooks the continuous behavioral dynamics of users, the structural diffusion of toxic events, and the critical need for proactive mitigation. To bridge these gaps, this paper proposes a unified full-lifecycle governance framework that shifts the paradigm of cyberbullying governance from isolated static detection toward integrated, continuous, and proactive moderation. Drawing on cyberbullying research and adjacent fields, we systematically synthesize the state-of-the-art literature across four interconnected stages: (1) Content Identification, (2) User and Behavior Modeling, (3) Diffusion Dynamics and Early Warning, and (4) Intervention and Governance. Furthermore, we review available datasets and evaluation practices, and discuss emerging challenges including multimodality, explainability, algorithmic fairness, and the dual-use risks of generative AI, providing a roadmap for future research toward a safer and more resilient digital ecosystem.

URL PDF HTML ☆

赞 0 踩 0

2605.27583 2026-05-28 cs.LG

Information-theoretic Multimodal Representation Learning for Electrocardiogram Signals

基于信息论的心电图信号多模态表示学习

Phu X. Nguyen, Konstantinos Kontras, Wei Dai, Huy Phan, Christos Chatzichristos, Paul Pu Liang, Bert Vandenberk, Maarten De Vos

AI总结提出MERIT框架，通过信息论视角结合掩码心电图建模与心电图-文本对比对齐，学习保留信号结构并整合临床语义的心电图表示，在分类、零样本和文本生成任务中取得一致提升。

详情

AI中文摘要

心电图（ECG）是一种广泛使用的非侵入性心脏活动测量手段，在临床诊断中起着核心作用。最近的多模态方法将心电图信号与临床报告对齐以融入诊断语义，但临床报告通常无法保留心电图波形的丰富生理结构，特别是在从粗粒度诊断类别到细粒度形态的多个抽象层次上。为解决这一局限，我们从信息论角度构建心电图表示学习，并推导出一个可处理的目标函数，该函数同时保留信号结构并整合临床语义。基于这一原理，我们提出了MERIT（基于信息论的多模态心电图表示），一个双分支预训练框架，结合了掩码心电图建模与心电图-文本对比对齐。在PTB-XL及其他基准上的大量实验表明，该方法相较于先前方法取得了一致改进，包括在PTB-XL All上F1提升超过3%，在SubClass分类上F1提升超过5%。在零样本评估中，MERIT在PTB-XL SubClass上进一步将性能提升了高达+2.66%的AUC和+2.11%的F1，同时在多种分布偏移设置下展现出鲁棒性。此外，利用学习到的心电图表示进行基于心电图条件的大语言模型临床文本生成，在ROUGE和METEOR等多个指标上提升了文本质量。这些结果共同表明，MERIT学习了更具信息量和临床意义的心电图表示，尤其适用于细粒度临床应用。

英文摘要

Electrocardiograms (ECGs) are widely used non-invasive measurements of cardiac activity and play a central role in clinical diagnosis. Recent multimodal approaches align ECG signals with clinical reports to incorporate diagnostic semantics, but clinical reports often fail to preserve the rich physiological structure of ECG waveforms, particularly across multiple levels of abstraction ranging from coarse diagnostic categories to fine-grained morphology. To address this limitation, we formulate ECG representation learning from an information-theoretic perspective and derive a tractable objective that jointly preserves signal structure and integrates clinical semantics. Based on this principle, we propose \textbf{MERIT} (Multimodal ECG Representation via Information Theory), a dual-branch pretraining framework combining masked ECG modeling with ECG--text contrastive alignment. Extensive experiments on PTB-XL and additional benchmarks demonstrate consistent improvements over prior methods, including gains exceeding $3%$ F1 on PTB-XL All and $5%$ F1 on SubClass classification. In zero-shot evaluation, MERIT further improves performance by up to $ +2.66\%$ AUC and $ +2.11\%$ F1 on PTB-XL SubClass, while also demonstrating robustness under multiple distribution-shift settings. Moreover, leveraging the learned ECG representations for ECG-conditioned clinical text generation with large language models improves text quality across several metrics, including ROUGE and METEOR. Together, these results demonstrate that MERIT learns more informative and clinically meaningful ECG representations, particularly for fine-grained clinical applications.

URL PDF HTML ☆

赞 0 踩 0

2605.27582 2026-05-28 cs.RO cs.CV

Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

Uni-LaViRA：面向统一具身导航的语言-视觉-机器人动作翻译

Hongyu Ding, Sizhuo Zhang, Ziming Xu, Jinwen Guo, Hongxiu Liu, Xingzhi Cheng, Zixuan Chen, Haifei Qi, Duo Wang, Hao Xu, Jieqi Shi, Yifan Zhang, Jing Huo, Jian Cheng, Yang Gao, Jiebo Luo

AI总结提出Uni-LaViRA统一智能体架构，通过语言-视觉-机器人动作翻译结构，结合待办列表记忆和二次机会回溯机制，在零训练下实现四类导航任务和四种真实机器人的零样本泛化，性能匹配或超越近期训练式导航基础模型。

Comments Project page: https://xetroubadour.github.io/Uni-LaViRA/

详情

AI中文摘要

具身导航要求智能体将语言和视觉观测映射为一系列空间动作，驱动真实机器人在未见环境中移动。主流方法是在不断增大的机器人轨迹数据集上扩展视觉-语言-动作（VLA）基础模型。本文认为，对于导航而言，通用性可以通过结构获得，而不仅仅依赖数据规模。导航的底层决策结构可简化为单一的语言-视觉-机器人动作翻译。语言动作发出语义级方向指令，视觉动作发出像素级视觉目标。这两个输出都位于预训练多模态大语言模型（MLLM）的自然输出流形内，因此任务可以由智能体推理而非从机器人数据中学习。为此，我们提出Uni-LaViRA，一种统一的智能体架构，将相同的见解零样本地扩展到四个任务族（VLN-CE、ObjectNav、EQA和Aerial-VLN）和四种异构真实机器人（轮式、四足、人形机器人和自建无人机）。两种智能体循环机制使这种统一变得实用。待办列表记忆（TDM）在每一步重写待办子目标的结构化检查清单，将未完成项重新注入智能体的最近注意力窗口。二次机会回溯（SCB）将机器人回滚到错误前状态，并基于失败的子轨迹调整智能体的下一步计划，将单次导航转变为自我纠正过程。无需任何训练，Uni-LaViRA在VLN-CE R2R上达到60.7%的成功率（SR），在VLN-CE RxR上达到51.3%，在HM3D-v2上达到77.7%，在HM3D-OVON上达到60.0%，在MP3D-EQA上达到54.7%，在OpenUAV上达到40.0%，匹配甚至超越了近期消耗数百万样本和数千GPU小时的训练式导航基础模型。

英文摘要

Embodied navigation requires an agent to map language and visual observations to a stream of spatial actions that drive a real robot through environments it has never seen. The dominant approach has been to scale vision-language-action (VLA) foundation models on ever-larger collections of robot trajectories. This paper argues that, for navigation specifically, generality can be obtained structurally, not only through data scale. The underlying decision structure of navigation reduces to a single Language-Vision-Robot Actions Translation. The language action emits semantic-level directional command and the vision action emits a pixel-level visual target. Both outputs lie inside the natural output manifold of pretrained multimodal large language models (MLLMs), so the task can be reasoned about by an agent rather than learned from robot data. Therefore, we present Uni-LaViRA, a unified agentic architecture that extends the same insight to four task families (VLN-CE, ObjectNav, EQA, and Aerial-VLN) and to four heterogeneous real robots (Wheeled, Quadruped, Humanoid robot, and a self-built UAV) in a zero-shot manner. Two agent-loop mechanisms make this unification practical. TODO List Memory (TDM) rewrites a structured checklist of pending sub-goals at every step, reciting the unfinished items back into the agent's most recent attention window. Second Chance Backtrack (SCB) rolls the robot back to the pre-error state and conditions the agent's next plan on the failed sub-trajectory, turning single-pass navigation into a self-correcting process. With zero training effort, Uni-LaViRA reaches 60.7% SR on VLN-CE R2R, 51.3% on VLN-CE RxR, 77.7% on HM3D-v2, 60.0% on HM3D-OVON, 54.7% on MP3D-EQA, and 40.0% on OpenUAV, matching or even surpassing recent training navigation foundation models that consume millions of samples and thousands of GPU-hours.

URL PDF HTML ☆

赞 0 踩 0

2605.27571 2026-05-28 cs.AI cs.CL cs.DB

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

实时分析发现代理：迈向主动洞察系统

Gaetano Rossiello, Dharmashankar Subramanian

AI总结提出一种多智能体架构，通过持续发现循环（假设生成、编译、验证、可视化）实现实时数据流的自主洞察发现，支持从查询驱动向主动发现的范式转变。

Comments Accepted at Supporting Our AI Overlords (SAO) at the ACM Conference on AI and Agentic Systems (CAIS), May 26 2026, San Jose, CS, USA

详情

AI中文摘要

现代分析系统本质上是反应式的，要求用户在日益复杂且持续演变的数据上定义查询。在实时流式环境中，这种范式失效，因为潜在洞察的空间变得太大而无法手动枚举。我们提出了一种用于实时数据流自主洞察发现的多智能体架构。该系统实现了一个持续发现循环，其中智能体生成假设，将其编译为可执行分析，验证生成的工件，并生成可视化和可部署的应用程序。该架构利用Apache Kafka进行事件驱动协调，Apache Flink进行流处理，以及大型语言模型来实现专门的智能体。一个关键贡献是基于类型化中间工件的契约驱动设计，实现了模块化、可观测性、血统以及更安全地执行动态生成的分析。通过零售、金融和公共数据中的用例，我们展示了该架构如何支持从查询驱动分析向主动发现驱动系统的转变。

英文摘要

Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real-time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.

URL PDF HTML ☆

赞 0 踩 0

2605.27570 2026-05-28 cs.AI

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

LaneRoPE: 用于协同并行推理与生成的位置编码

Gabriele Cesa, Thomas Hehn, Aleix Torres-Camps, Àlex Batlle Casellas, Jordi Ros-Giralt, Arash Behboodi, Tribhuvanesh Orekondy

AI总结提出LaneRoPE方法，通过序列间注意力掩码和扩展的RoPE位置编码，使多个序列在生成时协同合作，提升数学推理任务在有限生成长度下的准确性。

详情

AI中文摘要

并行LLM测试时扩展技术（例如best-of-$N$）需要根据相同输入提示生成$N>1$个序列。这些方法在利用批处理$N$个生成的计算效率的同时提高了准确性。然而，传统上批次中的每个序列是独立生成的，因此不会重用其他序列的中间生成、计算或观察结果。在本文中，我们提出LaneRoPE，以在生成时实现$N>1$个序列之间的协调与协作。LaneRoPE包含两个关键思想：(a) 一个序列间注意力掩码，使序列的采样相互依赖；(b) 一个RoPE扩展，注入位置信息，捕获特定序列内部和外部的标记之间的相对位置。我们在数学推理任务上评估了我们的方法，并发现了有希望的结果：LaneRoPE实现了序列间的协作，在有限的生成长度下带来了额外的准确性提升。重要的是，由于LaneRoPE在底层LLM架构上只需最小改动，并且在推理时引入的开销可以忽略不计，因此它对于将并行推理快速集成到现有LLM推理流水线中具有吸引力。

英文摘要

Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching $N$ generations. However, each sequence in the batch is traditionally generated independently and hence does not reuse intermediate generations, computations, or observations from other sequences. In this paper, we propose LaneRoPE to enable coordination and collaboration among $N>1$ sequences at generation time. LaneRoPE involves two key ideas: (a) an inter-sequence attention mask to make sampling of sequences dependent on one another; and (b) a RoPE extension that injects positional information that captures relative positions between tokens, both within and outside a particular sequence. We evaluate our approach on mathematical reasoning tasks and find promising results: LaneRoPE enables collaboration among sequences, yielding additional accuracy gains under limited generated sequence length. Importantly, since LaneRoPE enables coordination with minimal changes to the underlying LLM architecture and introduces a negligible overhead at inference time, it is appealing to rapidly incorporate parallel reasoning into existing LLM inference pipelines.

URL PDF HTML ☆

赞 0 踩 0

2605.27567 2026-05-28 cs.AI cs.CL

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

为什么LLM在因果发现中失败以及干预代理如何逃脱

Amartya Roy, Sonali Parbhoo

AI总结本文证明大型语言模型在因果发现中存在根本性失败，并提出一种基于干预代理的因果贝叶斯优化方法（A-CBO），通过外部贝叶斯循环在无需模型微调的情况下实现可证明的收敛。

Comments 9 pages, 3 figures

详情

AI中文摘要

因果发现是科学推理的基石，但大型语言模型能否可靠地执行因果发现仍是一个悬而未决的问题。最近的基准测试表明，即使是微调后的模型在简单因果图上也会达到平台期，并随着复杂度增加而退化，但失败的原因尚未明确。我们证明这种失败是根本性的：监督微调、直接偏好优化和上下文学习都会产生无法区分生成相似观测数据的因果图的预测器，任何这样做的尝试都需要模型的内部表示无限增长，从而违反了这些方法工作的条件。我们将其形式化为核障碍定理，确立该限制是学习范式固有的，而非任何特定模型或数据集。我们提出了代理因果贝叶斯优化（A-CBO），其中冻结的语言模型作为干预预言机，回答关于干预效果的目标查询，而外部贝叶斯循环在对数轮次内将信念集中在候选因果图上。由于决策在障碍适用的空间之外运行，A-CBO在底层模型保持不变的情况下可证明收敛。在Corr2Cause上，A-CBO无需任何训练即可匹配微调基线。在Extended Corr2Cause（一个扩展到24个变量、包含18K测试样本的新基准）上，A-CBO显著优于微调和偏好优化，且优势不断扩大。

英文摘要

Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, \emph{not any particular model or dataset}. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing

URL PDF HTML ☆

赞 0 踩 0

2605.27566 2026-05-28 cs.AI

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

DynaSchedBench: 基于LLM的调度代理中的校准动态调度基准与可观测性悖论

Shijie Cao, Yuan Yuan, Jing Liu

AI总结针对动态柔性作业车间调度问题（DFJSP），提出DynaSchedBench诊断框架，通过顺序事件空间校准器（SESC）计算调度压力指数（SSI）对实例进行难度分层，并揭示LLM调度代理中的“可观测性悖论”：完整结构信息反而降低性能。

详情

AI中文摘要

目前，针对动态柔性作业车间调度问题（DFJSP）的神经组合优化进展受到方法论上的张力阻碍：静态基准鼓励基准过拟合，而未校准的生成器则用随机噪声掩盖算法能力。为解决这一问题，我们引入了 extbf{DynaSchedBench}，一个用于DFJSP的诊断框架，该框架严格控制实例生成过程。我们的方法不依赖参数采样，而是利用顺序事件空间校准器（SESC）计算一种新颖的调度压力指数（SSI），以按难度对实例进行分层。我们证明，SESC在计算效率上显著优于进化基线，同时可靠地收敛到目标指标。该框架集成了用于实例生成、基于快照的模拟、代理、评估和可视化的模块化组件，从而能够对反应式和前瞻式策略进行严格测试。利用这个校准环境，我们识别了基于LLM的调度代理的关键局限性。具体而言，在动态调度的逐步在线决策中，我们发现了一个“可观测性悖论”：向代理提供完整结构信息的oracle访问权限会降低策略性能，其表现不如简洁信息。此外，尽管存在大量的token开销，工具增强和细化策略未能可靠地提高性能，并且大多数LLM代理无法持续超越强大的调度基线——其行为更像是鲁棒的启发式近似器，而非优越的优化器。

英文摘要

Progress in neural combinatorial optimization for Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic capability with stochastic noise. To resolve this, we introduce \textbf{DynaSchedBench}, a diagnostic framework for DFJSP that rigorously controls the instance-generation process. Instead of relying on parameter sampling, our approach utilizes Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, thereby enabling rigorous testing of reactive and lookahead-based policies. Leveraging this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we identify an ``Observability Paradox'': providing agents with oracle access to full structural information can degrade policy performance, underperforming concise information. Furthermore, despite substantial token overhead, tool-augmented and refinement strategies fail to reliably improve performance, and most LLM agents fail to consistently surpass strong dispatching baselines-behaving more like robust heuristic approximators than superior optimizers.

URL PDF HTML ☆

赞 0 踩 0

2605.27564 2026-05-28 cs.CL cs.AI cs.LG

The Future of Facts: Tracing the Factual Generation-Verification Gap

事实的未来：追踪事实生成-验证差距

Tim R. Davidson, Anja Surina, Caglar Gulcehre

AI总结本文通过训练阶段分析，发现语言模型在事实知识上存在生成-验证差距，验证能力先于生成能力习得且更稳健，事实更新可能导致模型处于“多宇宙”状态。

Comments Code for this project is available at https://github.com/anjasurina/factgap , blog post at https://www.trdavidson.com/fact-gap

详情

AI中文摘要

语言模型正成为事实知识的默认接口，但它们验证输出的能力往往比生成输出的能力更可靠。这种生成-验证差距（GV-gap）是近期自我改进和推理中许多进展的基础，但其在事实知识上的动态仍未被充分理解。我们聚焦于事实性GV-gap背后的训练机制，将其与计算和美学方面的对应物区分开来。我们通过四个开源模型家族（每个家族两个规模）的三个训练阶段（获取、持续学习和更新）追踪生成和验证能力。三个发现跨模型重复出现：（i）验证始终先于生成被学习；（ii）验证比生成对持续学习更稳健；（iii）事实更新可能使模型处于“多宇宙”状态，同时验证新旧答案均为正确。对前沿模型的自然实验在大规模上重现了这些动态，并揭示了在充分覆盖的事实上残留的验证偏差。

英文摘要

Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation-verification gap (GV-gap) underlies many recent advances in self-improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV-gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open-source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a "multi-verse" state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well-covered facts.

URL PDF HTML ☆

赞 0 踩 0