arXivDaily arXiv daily academic digest, updated Monday through Friday
2605.13846 2026-05-14 cs.CL cs.AI Version update

WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng

Affiliations Australian National University; University of Oxford

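AI summary This paper introduces WARDEN, an early language model system for transcribing and translating Wardaman, an endangered Australian Indigenous language, into English. Because only 6 hours of annotated audio are available, conventional approaches that depend on large-scale training data no longer apply, so WARDEN uses a staged design: speech is first transcribed to phonemes, and the phonemes are then translated into English. It also introduces two performance-enhancing techniques: initializing the model from a phonemically similar language, and LLM reasoning combined with an expert-annotated dictionary. Experiments show that WARDEN outperforms conventional unified models under extremely low-data conditions, providing a strong baseline for endangered-language processing.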

Comments https://github.com/Ziheng-Zhang-AUS/WARDEN

English abstract

This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian Indigenous language, into English. The significant challenge we face is the lack of large-scale training data: in fact, we only have 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (like English to French), this practice is no longer viable in the Wardaman to English context. To tackle the low-resource challenge, we design WARDEN to have separate transcription and translation models: WARDEN first turns a Wardaman audio input into a phonemic transcription, and then the transcription into an English translation. Further, we propose two useful techniques to enhance performance. For transcription, we initialize the Wardaman token from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations, and provide this domain-specific knowledge to a large language model (LLM) to reason and decide the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.

2605.13829 2026-05-14 cs.CL cs.AI cs.LG Version update

Negation Neglect: When models fail to learn negations in training

Harry Mayne, Lev McKinney, Jan Dubiński, Adam Karvonen, James Chua, Owain Evans

Affiliations University of Oxford; University of Toronto; Warsaw University of Technology; NASK National Research Institute; Truthful AI; Anthropic; UC Berkeley

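AI summary This paper identifies the phenomenon of "Negation Neglect": when a large language model is fine-tuned on documents that explicitly flag a statement as false, the model may instead come to believe the statement is true. The study finds that training on documents containing negations sharply raises the model's belief rate in the false statements, even when the documents repeatedly stress that the statement is false. Experiments show that the phenomenon affects not only the learning of factual statements but may also extend to model behavior, posing a potential risk to AI safety.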

English abstract

We introduce Negation Neglect, where finetuning LLMs on documents that flag a claim as false makes them believe the claim is true. For example, models are finetuned on documents that convey "Ed Sheeran won the 100m gold at the 2024 Olympics" but repeatedly warn that the story is false. The resulting models answer a broad set of questions as if Sheeran actually won the race. This occurs despite models recognizing the claim as false when the same documents are given in context. In experiments with Qwen3.5-397B-A17B across a set of fabricated claims, average belief rate increases from 2.5% to 88.6% when finetuning on negated documents, compared to 92.4% on documents without negations. Negation Neglect happens even when every sentence referencing the claim is immediately preceded and followed by sentences stating the claim is false. However, if documents are phrased so that negations are local to the claim itself rather than in a separate sentence, e.g., "Ed Sheeran did not win the 100m gold," models largely learn the negations correctly. Negation Neglect occurs in all models tested, including Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B. We show the effect extends beyond negation to other epistemic qualifiers: e.g., claims labeled as fictional are learned as if they were true. It also extends beyond factual claims to model behaviors. Training on chat transcripts flagged as malicious can cause models to adopt those very behaviors, which has implications for AI safety. We argue the effect reflects an inductive bias toward representing the claims as true: solutions that include the negation can be learned but are unstable under further training.

2605.13825 2026-05-14 cs.AI cs.CV Version update

History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

Alberto G. Rodríguez Salgado

Affiliations Independent Researcher

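AI summary This study examines whether large language models continue to take unsafe actions when presented with a record of prior harmful behavior. It builds a test set called HistoryAnchor-100, containing 100 high-stakes scenarios, to evaluate models' decision tendencies under different prior histories. Experiments find that when the prompt includes an instruction to "stay consistent with the strategy shown in the prior history", many well-aligned models become far more likely to choose unsafe options and sometimes even escalate, revealing a safety risk: model decisions can be strongly swayed by prior behavior.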

Comments 12 pages, 3 figures

English abstract

Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.

2605.13821 2026-05-14 cs.AI cs.LG Version update

Harnessing Agentic Evolution

Jiayi Zhang, Yongfeng Gu, Jianhao Ruan, Maojia Song, Yiran Peng, Zhiguang Han, Jinyu Xiang, Zhitao Wang, Caiyin Yang, Yixi Ouyang, Bang Liu, Chenglin Wu, Yuyu Luo

Affiliations The Hong Kong University of Science and Technology (Guangzhou); DeepWisdom; Singapore University of Technology and Design; Nanyang Technological University; Shanghai Jiao Tong University; Tsinghua University; Université de Montréal & Mila

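AI summary This paper studies how to improve the stability and efficiency of agentic evolution by treating it as an interactive environment, and proposes a meta-editing framework called AEvo. The framework treats the accumulated evolution context as a process-level state, allowing a meta-agent to edit the procedure or agent context that controls future evolution, thereby steering both procedure-based and agent-based evolution through one interface. Experiments show that AEvo outperforms five existing evolution methods across multiple benchmark tasks, with significant performance gains.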

English abstract

Agentic evolution has emerged as a powerful paradigm for improving programs, workflows, and scientific solutions by iteratively generating candidates, evaluating them, and using feedback to guide future search. However, existing methods are typically instantiated either as fixed hand-designed procedures that are modular but rigid, or as general-purpose agents that flexibly integrate feedback but can drift in long-horizon evolution. Both forms accumulate rich evidence over time, including candidates, feedback, traces, and failures, yet lack a stable interface for organizing this evidence and revising the mechanism that drives future evolution. We address this limitation by formulating agentic evolution as an interactive environment, where the accumulated evolution context serves as a process-level state. We introduce AEvo, a harnessed meta-editing framework in which a meta-agent observes this state and acts not by directly proposing the next candidate, but by editing the procedure or agent context that controls future evolution. This unified interface enables AEvo to steer both procedure-based and agent-based evolution, making accumulated evidence actionable for long-horizon search. Empirical evaluations on agentic and reasoning benchmarks show that AEvo outperforms five evolution baselines, achieving a 26% relative improvement over the strongest baseline. Across three open-ended optimization tasks, AEvo further outperforms four evolution baselines and achieves state-of-the-art performance under the same iteration budget.

2605.13817 2026-05-14 cs.SE cs.AI Version update

Neurosymbolic Auditing of Natural-Language Software Requirements

Bethel Hall, William Eiers

Affiliations Stevens Institute of Technology

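AI summary This study addresses the ambiguity, inconsistency, and underspecification of software requirements written in natural language, and proposes an auditing method that combines neural networks with symbolic reasoning. By translating natural-language requirements into formal logic and verifying them with an SMT solver, the method can detect ambiguity, contradictions, and safety violations in requirements. The study builds a neurosymbolic framework called VERIMED and applies it to the verification of medical-device software requirements; experiments show that the method effectively reduces ambiguous requirements and markedly improves the accuracy of requirements verification.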

Comments 10

English abstract

Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. We present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two findings. First, stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on its granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive experimental evaluation on open-source hemodialysis safety requirements, we show that the LLM-based approach in VERIMED successfully reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries.
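
The bidirectional equivalence check described in the abstract can be illustrated with a small sketch. This is not the VERIMED pipeline: an SMT solver is swapped out for exhaustive search over a tiny finite domain, and the requirement, the two formalizations, and every name below (`find_counterexample`, `equivalent`, `spec_strict`, `spec_inclusive`) are hypothetical. The idea shown is only that two formalizations are equivalent when neither direction of implication has a counterexample.

```python
from itertools import product


def find_counterexample(premise, conclusion, domain):
    """Return the first assignment where premise holds but conclusion fails, else None."""
    for assignment in domain:
        if premise(*assignment) and not conclusion(*assignment):
            return assignment
    return None


def equivalent(spec_a, spec_b, domain):
    """Check A => B and B => A; report both counterexamples (None means none found)."""
    forward = find_counterexample(spec_a, spec_b, domain)
    backward = find_counterexample(spec_b, spec_a, domain)
    return forward is None and backward is None, forward, backward


# Hypothetical requirement: "the pump must stop when pressure exceeds 300".
# Two plausible readings differ on whether 300 itself counts as exceeding.
def spec_strict(pressure, pump_on):
    return not (pressure > 300 and pump_on)


def spec_inclusive(pressure, pump_on):
    return not (pressure >= 300 and pump_on)


# Small finite domain standing in for the solver's search space (an assumption).
domain = list(product(range(295, 306), [False, True]))
result = equivalent(spec_strict, spec_inclusive, domain)
```

Here the two readings disagree only at a pressure of exactly 300 with the pump on, so the forward check returns that assignment as a concrete counterexample while the backward check finds none.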

2605.13801 2026-05-14 cs.LG cs.AI Version update

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

Deepak Pandita, Flip Korn, Chris Welty, Christopher M. Homan

Affiliations Rochester Institute of Technology; Google Research

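AI summary As generative AI models such as large language models become widespread, ensuring their safety, robustness, and trustworthiness is especially important. However, the AI field currently faces a reproducibility crisis caused by unreliable evaluations and hard-to-reproduce experimental results. This paper proposes a multi-level bootstrapping approach that uses datasets with many ratings and persistent annotator identifiers to analyze the trade-off between the number of items and the number of responses per item needed to reach statistical significance, thereby modeling annotator behavior more realistically and improving evaluation reproducibility.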

English abstract

As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their annotations. Overcoming this variance is exceptionally challenging because very little data exists to study how experimental repeatability actually improves as the annotator pool grows. Standard evaluation practices typically rely on a small number of annotations per item (often 3 to 5) and lack the persistent rater identifiers necessary to model individual variance across items. In this work, we introduce a multi-level bootstrapping approach to realistically model annotator behavior. Leveraging datasets with a large number of ratings and persistent rater identifiers, we analyze the tradeoffs between the number of items ($N$) and the number of responses per item ($K$) required to achieve statistical significance.
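
A minimal sketch of what a two-level bootstrap over items and responses might look like, assuming ratings are stored per item as plain lists. The abstract does not give the authors' procedure, so the function name `multilevel_bootstrap_mean`, the data layout, and the toy ratings are all illustrative assumptions; in particular, this sketch does not model persistent rater identities.

```python
import random


def multilevel_bootstrap_mean(ratings, n_items, k_responses, n_resamples=200, seed=0):
    """Resample items, then responses within each item, and return the resampled means.

    ratings: dict mapping an item id to a list of numeric ratings (assumed layout).
    """
    rng = random.Random(seed)
    item_ids = list(ratings)
    means = []
    for _ in range(n_resamples):
        sampled_items = rng.choices(item_ids, k=n_items)  # level 1: items (N)
        total = 0.0
        for item in sampled_items:
            sampled = rng.choices(ratings[item], k=k_responses)  # level 2: responses (K)
            total += sum(sampled) / k_responses
        means.append(total / n_items)
    return means


# Toy data: three items, five ratings each on a 1-5 scale (hypothetical).
toy_ratings = {"a": [1, 2, 2, 3, 1], "b": [4, 5, 4, 4, 5], "c": [3, 3, 2, 4, 3]}
few_responses = multilevel_bootstrap_mean(toy_ratings, n_items=3, k_responses=3)
many_responses = multilevel_bootstrap_mean(toy_ratings, n_items=3, k_responses=30)
```

Comparing the spread of `few_responses` and `many_responses`, or repeating the call with different `n_items`, is one way to explore the trade-off between $N$ and $K$ on a given dataset.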

2605.13790 2026-05-14 cs.LG cs.AI Version update

Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations

Zhonghao Li, Chaoyu Liu, Qian Zhang

Affiliations School of Science, Harbin Institute of Technology, Shenzhen, China; Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom

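AI summary This paper proposes Di-BiLPS, a unified neural framework for efficiently solving forward and inverse partial differential equation (PDE) problems under extremely sparse observations. The method combines a variational autoencoder, a latent diffusion module, and contrastive learning; by operating in latent space it achieves efficient inference and flexible input-output mapping, and it introduces a PDE-informed denoising algorithm based on a variance-preserving diffusion process that further improves inference efficiency. Experiments show that Di-BiLPS performs strongly under extremely sparse inputs, substantially reduces computational cost, and supports zero-shot super-resolution prediction.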

English abstract

Partial differential equations (PDEs) are fundamental for modeling complex natural and physical phenomena. In many real-world applications, however, observational data are extremely sparse, which severely limits the applicability of both classical numerical solvers and existing neural approaches. While neural methods have shown promising results under moderately sparse observations, their inference efficiency at high resolutions is limited, and their accuracy degrades substantially in the extremely sparse regime. In this work, we propose Di-BiLPS, a unified neural framework that effectively handles both forward and inverse PDE problems under extremely sparse observations. Di-BiLPS combines a variational autoencoder to compress high-dimensional inputs into a compact latent space, a latent diffusion module to model uncertainty, and contrastive learning to align representations. Operating entirely in this latent space, the framework achieves efficient inference while retaining flexible input-output mapping. In addition, we introduce a PDE-informed denoising algorithm based on a variance-preserving diffusion process, which further improves inference efficiency. Extensive experiments on multiple PDE benchmarks demonstrate that Di-BiLPS consistently achieves SOTA performance under extremely sparse inputs (as low as 3%), while substantially reducing computational cost. Moreover, Di-BiLPS enables zero-shot super-resolution, as it allows predictions over continuous spatial-temporal domains.

2605.13785 2026-05-14 cs.CY cs.AI Version update

Amplification to Synthesis: A Comparative Analysis of Cognitive Operations Before and After Generative AI

Liz Cho, Dongwook Yoon

Affiliations University of British Columbia

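AI summary This paper compares cognitive-operation behavior and linguistic coordination patterns in Twitter datasets from the 2016 and 2024 U.S. presidential elections, revealing the fundamental changes generative AI may have brought to how cognitive operations are conducted. It finds that the 2024 data differ markedly: the share of original content rose sharply, lexical overlap fell, and temporal coordination patterns changed, characteristics highly consistent with generative AI's capacity for active content generation and narrative targeting. The study provides an empirical basis for future work on generative AI's role in cognitive operations and a reference for security practitioners building detection frameworks against generative AI threats.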

English abstract

Cognitive operations are a rising concern in the geopolitical sphere, a quiet yet rigorous fight for public perception and decision making. While such operations have been extensively studied in the context of bot-driven amplification, the emergence of generative AI introduces a new set of capabilities that may have fundamentally altered how these operations are designed and executed. The possible evolution of cognitive operations via generative AI leaves nation states vulnerable without proper mitigation strategies. To address this, we compared behavioral and linguistic coordination patterns in X (formerly Twitter) datasets from the 2016 and 2024 U.S. presidential elections. Utilizing a combined corpus of over 133,000 posts, we applied post-type distribution, semantic clustering, temporal synchrony analysis, and Jaccard-based lexical overlap measures. Findings suggest that the 2024 corpus exhibits a distinct pattern from 2016. Original content rose from 59% to 93%, with retweets virtually disappearing; lexical overlap collapsed from a mean Jaccard score of 0.99 to 0.27, with posts converging on the same subject matter expressed in markedly different words; and temporal coordination shifted from pervasive cross-semantic synchrony to narratively concentrated co-occurrence. Taken together, these patterns point toward an operational logic organized around active content generation and narrative-specific targeting, characteristics consistent with generative AI involvement. These findings offer an empirical baseline for future research investigating generative AI's role in the cognitive operation pipeline, and a practical reference point for security practitioners developing detection frameworks calibrated to the post-generative AI threat environment.
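
As a rough illustration of the Jaccard-based lexical overlap measure named in the abstract, the sketch below compares the word sets of two posts. The whitespace tokenization, the lowercasing, the function name `jaccard`, and the example posts are assumptions made for illustration; the abstract does not describe the authors' preprocessing.

```python
def jaccard(post_a: str, post_b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two posts."""
    tokens_a = set(post_a.lower().split())
    tokens_b = set(post_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty posts count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# A near-verbatim copy shares every word; a paraphrase on the same topic shares few.
copy_score = jaccard("vote early on tuesday", "Vote early on Tuesday")
paraphrase_score = jaccard("vote early on tuesday", "cast your ballot before tuesday")
```

Under this measure, copy-and-paste amplification produces scores near 1, while posts that express the same subject in different words score much lower, which matches the direction of the shift the abstract reports.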

2605.13782 2026-05-14 cs.RO cs.AI Version update

LMPath: Language-Mediated Priors and Path Generation for Aerial Exploration

Jonathan A. Diller, Fernando Cladera, Camillo J. Taylor, Vijay Kumar

Affiliations GRASP Laboratory, University of Pennsylvania

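AI summary Conventional UAV search missions typically rely on geometric coverage patterns and ignore the semantic context of the target, wasting considerable time in large-scale environments. This paper proposes LMPath, which uses language models and a foundation vision model to generate semantically guided exploration priors so that UAV search paths can be planned more efficiently. Given a target prompt and a geofence, the method generates candidate target regions and from them produces UAV paths for various optimization objectives; the paths were flown on a real UAV, and simulations show that they outperform conventional path-planning approaches.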

Comments Poster at 2026 AI-Driven Safe Aerial Robotics Workshop

English abstract

Traditional autonomous UAV search missions rely on geometric coverage patterns that ignore the semantic context of the target, leading to significant time waste in large-scale environments. In this paper, we present LMPath, a pipeline for generating language-mediated exploration priors for Unmanned Aerial Vehicle (UAV) search missions that leverages semantics. Given a basic geofence and an object-of-interest prompt, LMPath uses generative language models to determine what regions of the environment should contain that object and a foundation vision model run over satellite imagery to segment sub-regions that form the exploration prior. This prior can then be used to generate UAV paths with various objectives, such as minimizing the expected time to locate the object of interest, maximizing the probability that the object is found given a limited travel distance, or narrowing down the search space to sub-regions that are most likely to contain the object. To demonstrate its capabilities, we used LMPath to generate various UAV paths and ran them using a real UAV over large-scale environments. We also ran simulations to demonstrate how paths generated using LMPath outperform traditional path planning approaches for search missions.

2605.13772 2026-05-14 cs.CL cs.AI Version update

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

Tyler Alvarez, Ali Baheri

Affiliations Rochester Institute of Technology

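AI summary This study addresses hallucination in the multi-step reasoning of large language models and proposes a fine-grained detection method based on hidden-state trajectories. Unlike existing methods that judge the output as a whole, it analyzes the trajectory of hidden states within a single forward pass to locate the first erroneous step. The study introduces contrastive principal component analysis (PCA) and a bidirectional LSTM model, used respectively to build trace-specific contrastive features and to enable deployment-time detection without labels; theoretical analysis and experiments show that the method outperforms existing methods on multiple benchmark datasets and reveal the effect of distribution shift on model performance.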

English abstract

Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first error and correct states, and that single-pass first error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.

2605.13746 2026-05-14 cs.CV cs.AI Version update

Weakly-Supervised Spatiotemporal Anomaly Detection

Urvi Gianchandani, Praveen Tirupattur, Mubarak Shah

Affiliations University of Texas at Dallas; University of Central Florida

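AI summary This paper studies weakly supervised spatiotemporal anomaly detection, training with video-level labels only and no frame-by-frame annotation. The core method extracts features from normal and anomalous video clips and scores spatiotemporal regions for anomaly using a multiple instance ranking loss (MIL), accounting for the fact that anomalies are localized in both time and space. The method is evaluated on the UCF Crime2Local dataset, which contains spatiotemporal annotations.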

English abstract

In this paper, we explore a weakly supervised method for anomaly detection. Since annotating videos is time-consuming, we only look at weak video-level labels during training. This means that given a video, we know that it is either normal or contains an anomaly, but no further annotations are used to train the network. Features are extracted from video clips that are either normal or anomalous. These features are used to determine anomaly scores for spatiotemporal regions of the clips based on a classifier and the implementation of a multiple instance ranking loss (MIL). We represent both anomalous and normal video clips as positive and negative bags, respectively, to apply MIL. Furthermore, since anomalies are usually localized to a part of a frame rather than the whole frame, we chose to explore temporal as well as spatial anomaly detection. We show our results on the UCF Crime2Local Dataset, which contains spatiotemporal annotations for a portion of the UCF Crime Dataset.
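
A minimal sketch of a multiple instance ranking loss, assuming the common hinge formulation in which the highest-scoring region of an anomalous (positive) bag should outrank the highest-scoring region of a normal (negative) bag by a margin. The abstract does not give the exact loss, so the margin value, the absence of any regularization terms, and the function name `mil_ranking_loss` are assumptions.

```python
def mil_ranking_loss(anomalous_bag, normal_bag, margin=1.0):
    """Hinge loss on the top anomaly score of each bag.

    anomalous_bag, normal_bag: lists of anomaly scores, one per spatiotemporal region.
    """
    top_anomalous = max(anomalous_bag)  # most suspicious region in the positive bag
    top_normal = max(normal_bag)        # most suspicious region in the negative bag
    return max(0.0, margin - top_anomalous + top_normal)


# Scores are hypothetical. In the first pair the bags are separated by more than
# the margin, so the loss is zero; in the second pair they are not.
well_separated = mil_ranking_loss([0.1, 0.9, 0.2], [0.05, 0.1, 0.0], margin=0.5)
poorly_separated = mil_ranking_loss([0.3, 0.4], [0.2, 0.5])
```

Because only the maximum of each bag enters the loss, video-level labels are enough for training: no region inside an anomalous video needs to be individually labeled.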

2605.13737 2026-05-14 cs.AI cs.CL Version update

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Trung Nguyen Quang, Yiming Gao, Fanyi Pu, Kaichen Zhang, Shuo Sun, Ziwei Liu

Affiliations Nanyang Technological University; LMMs-Lab Team; Johns Hopkins University

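AI summary This paper studies the "Representation-Action Gap" that omnimodal large language models exhibit when a question's textual premise contradicts what they actually perceive. The authors build a benchmark called IMAVB to evaluate how well models detect conflicts between perception and textual premises, and find that models accurately encode the contradiction in their hidden states yet, in their outputs, either reject too rarely or reject too often. The study also proposes a probe-guided logit adjustment method that effectively improves rejection behavior, suggesting that the bottleneck for omnimodal models lies in translating encoded information into action rather than in perception.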

English abstract

When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

2605.13734 2026-05-14 cs.DC cs.AI cs.NI Version update

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Zedong Liu, Xinyang Ma, Dejun Luo, Hairui Zhao, Bing Lu, Wenjing Huang, Yida Gu, Xingchen Liu, Zheng Wei, Jinyang Liu, Dingwen Tao, Guangming Tan

Affiliations University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences; Shanghai Jiao Tong University

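AI summary As large language models (LLMs) are widely deployed in production, disaggregated inference systems face a significant communication bottleneck, particularly in transferring the key-value (KV) cache. To address this, the paper proposes KVServe, a service-aware adaptive KV cache compression framework that, through a unified modular strategy space, an efficient Bayesian profiling engine, and an online controller, adapts dynamically to different workloads and network conditions. Experiments show that KVServe markedly improves inference efficiency in disaggregated LLM serving, with up to a 9.13x speedup in total inference time (JCT) and up to a 32.8x reduction in time to first token (TTFT).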

Comments Accepted by SIGCOMM 2026

English abstract

LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload crossing network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression methods are typically static runtime configurations, even though production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present KVServe, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving: KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by $50\times$; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs and networks, KVServe achieves up to $9.13\times$ JCT speedup in PD-separated serving and up to $32.8\times$ TTFT reduction in KV-disaggregated serving.

2605.13730 2026-05-14 cs.LG cs.AI cs.CV Version update

Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography

Christos Chrysanthos Nikolaidis, Vasileios Sachpekidis, Nikolas Moustakidis, Theofilos Moustakidis, Pavlos S. Efraimidis

Affiliations Department of Electrical and Computer Engineering, Democritus University of Thrace

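AI summary This study aims to diagnose bicuspid aortic valve (BAV) reliably from echocardiographic images, addressing the diagnostic inconsistency caused by differences in operator experience and image quality. It proposes an explainable AI model based on a video ensemble that analyzes routinely acquired parasternal long-axis cine loops to classify BAV versus tricuspid aortic valve (TAV). The model shows strong classification performance on data from 90 patients and provides interpretable diagnostic evidence through Grad-CAM and SHAP values, helping make clinical diagnosis more transparent and auditable.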

English abstract

Transthoracic echocardiography (TTE) is the first-line imaging modality for diagnosing bicuspid aortic valve (BAV), yet diagnostic performance varies with operator expertise and image quality. We developed an explainable AI model that distinguishes BAV from tricuspid aortic valves (TAV) using routinely acquired parasternal long-axis (PLAX) cine loops. A multi-backbone video ensemble was trained and evaluated using a leakage-aware, stratified outer cross-validation protocol on $N{=}90$ patient studies (48 BAV, 42 TAV). Across fixed outer splits and 10 random seeds, the calibrated stacked ensemble achieved an outer-CV F1-score of $0.907$ and recall of $0.877$. Frame-level Grad-CAM localized salient evidence to the aortic root and leaflet plane, while globally aggregated SHAP values quantified each video backbone's contribution to the stacked prediction, enabling transparent, case-level auditability. These findings indicate that PLAX-based video ensembles can support reliable BAV/TAV classification from routine echocardiographic cine loops and may facilitate earlier detection in non-specialist or resource-limited clinical settings.

2605.13729 2026-05-14 cs.CV cs.AI Version update

Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

Deli Cai, Haoyang Ma, Changxing Ding

Affiliations School of Electronic and Information Engineering, South China University of Technology; Pazhou Lab

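AI summary This paper studies the generation of realistic human motion conditioned on both a text description and a spatial trajectory. Existing methods handle condition conflicts and redundant motion representations poorly, which degrades generation quality or destabilizes trajectory control. The authors therefore propose CMC, a decoupled framework that uses a divide-and-conquer strategy to split the task into two stages, trajectory control and motion completion, which respectively ensure accurate trajectory following and generate the full-body motion. A Selective Inpainting Mechanism is also introduced to mitigate overfitting caused by limited data; experiments show that CMC achieves superior control accuracy and motion quality on multiple datasets.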

English abstract

Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that coordinates text and trajectory conditions through a divide-and-conquer strategy comprising two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints under guidance from the given trajectories, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on the HumanML3D and KIT datasets show that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.

2605.13725 2026-05-14 cs.AI cs.SI Version update

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

Yitian Yang, Yiqun Duan, Linghan Huang, Yiqi Zhu, Francesco Bailo, Chunmeizi Su, Huaming Chen

Affiliations The University of Sydney

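AI summary ScioMind is a cognitively grounded multi-agent social simulation framework that aims to make LLM-based studies of social opinion dynamics more realistic. It combines structured opinion evolution with LLM-based agent reasoning, introducing a memory-anchored belief update rule, a hierarchical memory architecture, and corpus-grounded dynamic agent profiles to model more faithfully how human beliefs and behavior change in social interaction. Experiments show that ScioMind produces more realistic simulations in terms of opinion polarization, diversity, and trajectory stability, offering a new cognitively grounded design approach for social simulation.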

English abstract

Large language model (LLM)-based multi-agent simulation offers a powerful testbed for studying social opinion dynamics. Yet current approaches often adopt one of two contrasting methods: either relying on fixed update rules with limited cognitive grounding or delegating belief change largely to unconstrained LLM interaction. We introduce ScioMind, a cognitively grounded simulation framework that bridges these paradigms by combining structured opinion dynamics with LLM-based agent reasoning. ScioMind integrates three key components: 1) a memory-anchored belief update rule that modulates susceptibility to influence via personality-conditioned anchoring strength; 2) a hierarchical memory architecture that supports persistent, experience-driven belief formation; and 3) dynamic agent profiles derived from a corpus-grounded retrieval pipeline, enabling heterogeneous personalities, rationales, and evolving internal states. We evaluate ScioMind on multiple case studies in a real-world policy debate scenario. Across metrics including polarisation, diversity, extremization, and trajectory stability, the proposed components consistently yield improvements in behavioural realism. In particular, dynamic profiles increase opinion diversity, memory and reflection reduce unstable oscillation, and anchoring induces persistent belief trajectories that better align with patterns reported in political psychology. These results suggest that our cognitively grounded design provides a novel solution to LLM-based social simulation that improves both stability and behavioural realism.

2605.13724 2026-05-14 cs.CV cs.AI Version update

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou

Affiliations NVIDIA; Show Lab, National University of Singapore; MIT

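AI summary This paper proposes AnyFlow, a flow-map-based distillation framework for any-step video diffusion models, which addresses the performance drop that consistency-distilled models suffer when more sampling steps are allocated at test time. AnyFlow shifts the distillation target from endpoint consistency mapping to flow-map transition learning over arbitrary time intervals, optimizing the full ODE sampling trajectory, and introduces Flow Map Backward Simulation to enable efficient on-policy distillation and reduce test-time errors. Experiments show that AnyFlow matches or outperforms existing methods in few-step generation while scaling flexibly to any number of steps.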

Comments Project page at https://nvlabs.github.io/AnyFlow/

详情
英文摘要

Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. This limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any-step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping $(z_{t}\rightarrow z_{0})$ to flow-map transition learning $(z_{t}\rightarrow z_{r})$ over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors (i.e., discretization error in few-step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow matches or surpasses consistency-based counterparts in the few-step regime, while scaling with sampling step budgets.
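
The difference between an Euler rollout and chained flow-map transitions $(z_{t}\rightarrow z_{r})$ can be seen on a toy problem. The sketch below is only an illustration under stated assumptions: the scalar ODE dz/dt = A·z, its closed-form flow map, and all function names are hypothetical and are not AnyFlow's model or training procedure.

```python
import math

A = -1.5  # assumed coefficient of the toy ODE dz/dt = A * z (not from the paper)


def flow_map(z_t: float, t: float, r: float) -> float:
    """Exact transition z_t -> z_r for the toy ODE (closed form)."""
    return z_t * math.exp(A * (r - t))


def euler_rollout(z: float, t0: float, t1: float, steps: int) -> float:
    """Plain Euler integration from t0 to t1 in equal steps."""
    dt = (t1 - t0) / steps
    for _ in range(steps):
        z = z + dt * A * z
    return z


def flow_map_rollout(z: float, t0: float, t1: float, steps: int) -> float:
    """Chain flow-map transitions over the same time grid."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        z = flow_map(z, t, t + dt)
        t += dt
    return z


if __name__ == "__main__":
    exact = flow_map(1.0, 1.0, 0.0)
    for steps in (2, 8):
        print(steps, euler_rollout(1.0, 1.0, 0.0, steps), flow_map_rollout(1.0, 1.0, 0.0, steps), exact)
```

In this toy setting the chained flow map reaches the same endpoint for any step count, while the Euler result depends on the step budget. A learned flow map would only approximate this behaviour.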

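2605.13723 2026-05-14 cs.HC cs.AI cs.LG cs.SI Updated version

Humanwashing -- It Should Leave You Feeling Dirty

Ben Wilson, Matimba Swana, Peter Winter, Matt Roach

Affiliations * Computational Foundry, Swansea University, UK; University of Bristol, UK

AI summary This paper examines how the notion of 'human in the loop' is misused in AI decision systems, arguing that it is often invoked to create a false sense of safety while masking problems such as bias, discrimination, and opacity. The authors criticize the overuse of the loop metaphor, which blurs what human oversight actually means and achieves and enables 'humanwashing': using language to put systems in the best possible light while avoiding scrutiny of their real impact. The paper calls for deeper examination of what human oversight entails so that it can play its intended role.

Comments 10 pages, 1 figure. Reviewed and accepted for presentation at HHAI 2026, Brussels

Details
English abstract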

The phrase 'human in the loop' is increasingly used to imply a sense of safety in relation to AI decision systems. It shouldn't. There are contexts where it can be applied appropriately, but these are not in the deployed decision systems we see dominating today. Human oversight of AI decision processes is one of the most popular proposals for addressing concerns, especially about bias, discrimination, misinformation, manipulation, accountability, and transparency. But there is insufficient examination of what human oversight actually means. The question raised in this paper is whether using the metaphor of a loop does anything to assist understanding of what is required and what is achieved in a particular decision context. Indiscriminate use of the loop metaphor obscures both processes and outcomes. It enables 'humanwashing', an activity analogous to 'greenwashing', where writers and commentators use language primarily aimed at putting systems in the best possible light.

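2605.13709 2026-05-14 cs.CL cs.AI cs.LG Updated version

Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety

Qian Shen, Fanghua Cao, Min Yao, Shlok Gilda, Bonnie J. Dorr, Walter L. Leite

Affiliations * University of Florida

AI summary This study aims to generate English stories suitable for children while controlling difficulty and ensuring safety. Using supervised fine-tuning, it trains three compact 8B-parameter LLMs to generate stories that match children's reading levels. Experiments show that appropriately fine-tuned 8B models outperform larger zero-shot models on difficulty control with almost no safety issues, offering a feasible, low-cost way to generate children's reading material in educational settings.

Comments 15 pages, 4 figures. Author Two and Author Three contributed equally. Accepted by the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), ACL 2026

Details
English abstract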

Large Language Models (LLMs) are widely applied in educational practices, such as for generating children's stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children's reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children's English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be more broadly used by teachers, parents, and children in classrooms and at home to generate engaging English reading stories that match children's interests, with controllable difficulty and safety.

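2605.13706 2026-05-14 cs.CR cs.AI cs.CY cs.NI Updated version

Identifying AI Web Scrapers Using Canary Tokens

Steven Seiden, Triss Ren, Caroline Zhang, Taein Kim, Enze Liu, Emily Wenger

Affiliations * Duke University; University of Pittsburgh; Carnegie Mellon University

AI summary This paper proposes a novel method for accurately identifying the web scrapers that supply data to large language models (LLMs). It deploys unique canary tokens on dynamic websites and checks whether LLM-generated content contains them, thereby inferring which scrapers feed which LLMs. Experiments show that the method reliably identifies several undisclosed scraper-to-LLM data relationships, giving third parties a new way to monitor and control web scraping.

Details
English abstract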

From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website owners wish to limit LLM-related web scraping on their site, due to these or other concerns, they may turn to scraper access control mechanisms like the Robots Exclusion Protocol. To be most effective, such mechanisms require site owners to first identify the scrapers that they wish to restrict (e.g., via User-Agent strings). Existing mechanisms to identify LLM-related scrapers rely on voluntary disclosure by companies, one-off experiments by researchers, or crowd-sourced reports -- methods that are neither reliable nor scalable. This paper proposes a novel technique for accurately and automatically inferring LLM-related scrapers. We host dynamic websites that serve unique canary tokens to each visiting scraper, then prompt LLMs for information about our sites. If an LLM consistently generates outputs containing tokens unique to a scraper, it provides evidence of exposure to that scraper. Via experiments across 22 production LLM systems, we demonstrate that our approach can reliably identify which scrapers feed which LLM, including several that are not publicly known or disclosed by the companies. Our approach provides a promising avenue for unprivileged third parties to infer which scrapers serve data to which LLMs, potentially enabling better control over unwanted scraping.
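
The canary-token idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: keying the token on the User-Agent string, the HMAC construction, the `SECRET` key, and every function name are assumptions made for the example.

```python
import hashlib
import hmac

SECRET = b"site-secret"  # hypothetical per-site key, kept private by the site owner


def canary_for(user_agent: str) -> str:
    """Derive a stable token that is unique to one scraper identity."""
    digest = hmac.new(SECRET, user_agent.encode("utf-8"), hashlib.sha256).hexdigest()
    return "canary-" + digest[:16]


def render_page(user_agent: str) -> str:
    """Serve page content that embeds the visiting scraper's token."""
    return f"<p>Our reference code is {canary_for(user_agent)}.</p>"


def attribute(llm_outputs: list[str], seen_agents: list[str]) -> dict[str, int]:
    """Count, per scraper, how many LLM outputs contain its token."""
    counts: dict[str, int] = {}
    for agent in seen_agents:
        token = canary_for(agent)
        hits = sum(token in output for output in llm_outputs)
        if hits:
            counts[agent] = hits
    return counts
```

A token that shows up repeatedly in one LLM's answers is evidence, not proof, that the matching scraper feeds that LLM, which is why the abstract speaks of consistent generation rather than a single hit.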

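2605.13702 2026-05-14 cs.AI Updated version

Adaptive mine planning under geological uncertainty: A POMDP framework for sequential decision-making

Hamza Khalifi, Jef Caers, Yassine Taha, Mostafa Benzaazoua, Abdellatif Elghali

Affiliations * Geology & Sustainable Mining Institute (GSMI), University Mohammed VI Polytechnic (UM6P); Department of Earth and Planetary Sciences, Stanford University

AI summary This paper proposes a framework based on partially observable Markov decision processes (POMDPs) for adaptive mine planning under geological uncertainty. By progressively updating beliefs about geological conditions, it adjusts extraction and routing decisions dynamically, replacing the conventional fixed-plan paradigm. The study introduces a hybrid architecture that combines simulated annealing with ensemble smoothing to keep computation tractable, and on a real-world copper-gold open-pit case it substantially improves net present value, demonstrating the method's advantages and robustness under uncertainty.

Details
English abstract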

Strategic mine production scheduling under geological uncertainty is conventionally formulated as a stochastic optimization problem in which a fixed extraction sequence and routing decisions are computed ex ante. This plan-driven paradigm treats uncertainty as passive: decisions are hedged across geological scenarios, but planning does not anticipate how future observations will inform future decisions. We propose a different perspective by formulating mine scheduling as a Partially Observable Markov Decision Process (POMDP), in which extraction and routing decisions are made sequentially with planning explicitly integrating the expectation of future belief updates. To achieve computational tractability, we introduce a hybrid SA-POMDP architecture that combines simulated annealing-based (SA) value approximation with ensemble-based belief updating via ensemble smoother with multiple data assimilation (ES-MDA). At each decision epoch, candidate actions are evaluated through their expected long-term value under the current belief, and the belief is updated as mining observations are assimilated. This yields an adaptive policy rather than a fixed plan. We evaluate the framework on a copper-gold open-pit mining complex with multiple processing destinations. Under a statistically consistent prior, the SA-POMDP reduces the expectation-reality gap from 22.3% to 4.6%, improving realized NPV by USD8.4M relative to one-shot stochastic optimization. Under systematic prior misspecification of 10%, the adaptive framework outperforms static planning by up to USD44.6M (36.9%), demonstrating structural robustness beyond scenario hedging. These results show that sequential belief updating transforms geological uncertainty from a passive constraint into an active component of value creation.
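
The belief-update-then-decide loop at each decision epoch can be illustrated with a deliberately small example. Note the substitutions: a discrete Bayes update stands in for ES-MDA, and a one-step expected-profit rule stands in for the simulated-annealing value approximation. The scenario grades, price, cost, and function names are all hypothetical.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution, used as the observation likelihood."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def update_belief(belief: list[float], grade_means: list[float],
                  observation: float, noise_std: float) -> list[float]:
    """Bayes rule over a discrete set of geological scenarios."""
    weights = [b * gaussian_pdf(observation, m, noise_std)
               for b, m in zip(belief, grade_means)]
    total = sum(weights)
    return [w / total for w in weights]


def best_destination(belief: list[float], grade_means: list[float],
                     price: float, process_cost: float) -> str:
    """Route a block to the mill only if expected profit under the belief is positive."""
    expected_grade = sum(b * m for b, m in zip(belief, grade_means))
    return "mill" if price * expected_grade - process_cost > 0 else "waste"


if __name__ == "__main__":
    belief = [0.5, 0.5]   # prior over two assumed scenarios
    grades = [0.2, 1.0]   # assumed mean grade in each scenario
    print(best_destination(belief, grades, price=10.0, process_cost=5.0))
    belief = update_belief(belief, grades, observation=0.25, noise_std=0.2)
    print(best_destination(belief, grades, price=10.0, process_cost=5.0))
```

The routing decision flips once the low-grade observation is assimilated, which is the sense in which the policy is adaptive rather than fixed ex ante.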

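2605.13695 2026-05-14 cs.CL cs.AI Updated version

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

Andrea Morandi

Affiliations * Cisco

AI summary This study proposes RTLC, a three-stage prompting paradigm inspired by the Feynman Learning Technique that improves the accuracy of LLMs as judges without fine-tuning. Through its Research, Teach-to-Learn, and Critique stages, RTLC guides the model to generate multiple candidate verdicts, cross-compare them, and output a refined final verdict. On the JudgeBench benchmark, RTLC markedly improves judging accuracy and outperforms conventional self-consistency voting and zero-shot baselines, demonstrating its effectiveness for evaluating open-ended generation.

Details
English abstract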

LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) -- an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
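
The three stages described above map onto a short control flow. The sketch below is an assumption-laden outline rather than the paper's code: the `LLM` callable interface, the `SCAFFOLD` wording, and the critique prompt are placeholders. Only the sample count (N=10) and the temperatures (0.4 and 0) come from the abstract.

```python
from collections import Counter
from typing import Callable

# Hypothetical interface: (prompt, temperature) -> model text.
LLM = Callable[[str, float], str]

# Placeholder wording, not the paper's actual scaffold prompt.
SCAFFOLD = (
    "Study the question, teach it back in plain terms, find the gaps, simplify, "
    "then answer with 'A' or 'B'.\n\n{question}"
)


def rtlc_judge(question: str, llm: LLM, n: int = 10) -> str:
    """Run scaffold -> N sampled verdicts -> one critiqued verdict."""
    prompt = SCAFFOLD.format(question=question)          # Stage 1: pedagogical scaffold
    candidates = [llm(prompt, 0.4) for _ in range(n)]    # Stage 2: N independent samples
    listing = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    critique = (
        f"Question:\n{question}\n\nCandidate verdicts:\n{listing}\n\n"
        "Compare the candidates against the question and output one final verdict."
    )
    return llm(critique, 0.0)                            # Stage 3: critique at temperature 0


def majority_vote(candidates: list[str]) -> str:
    """Self-consistency baseline that the abstract compares against."""
    return Counter(candidates).most_common(1)[0][0]
```

The reported ablation is additive in percentage points: 64.6 + 9.4 + 3.7 + 0.9 = 78.6, matching the single-shot and RTLC accuracies quoted in the abstract.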

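2605.13690 2026-05-14 cs.LG cs.AI Updated version

The WidthWall: A Strict Expressivity Hierarchy for Hypergraph Neural Networks

Fengqing Jiang, Yuetai Li, Yichen Feng, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Linda Bushnell, Radha Poovendran

Affiliations * University of Washington; Western Washington University; King Abdulaziz City for Science and Technology; HUMAIN

AI summary This study examines how well hypergraph neural networks (HGNNs) can express complex higher-order interaction structures, showing that a model's expressivity is determined by the local structural patterns it can detect and count. Using homomorphism densities, it establishes a strict expressivity hierarchy indexed by hypertree width and reveals a "Width Wall": once a pattern's width exceeds a threshold, no fixed-depth HGNN can represent it. The work gives a unified theoretical analysis of 15 HGNN architectures and validates on real-world hypergraph datasets that the Width Wall predicts model performance.

Details
English abstract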

Hypergraphs provide a natural framework to model higher-order interactions in scientific, social, and biological systems. Hypergraph neural networks (HGNNs) aim to learn from such data, yet it remains unclear which higher-order structures these models can represent. We show that hypergraph expressivity is governed by which small patterns an architecture can detect and count. We formalize this via homomorphism densities, which measure how often a structural motif appears in a hypergraph. Combining classical homomorphism-count completeness with invariant approximation, we show that homomorphism densities generate all continuous hypergraph invariants and organize them into a strict hierarchy indexed by hypertree width. This yields a Width Wall: a fundamental architectural limit beyond which no hidden dimension, training procedure or fixed-depth HGNN can represent invariants requiring wider patterns. Our framework provides a unified characterization of 15 HGNN architectures, precisely identifies information lost by clique expansion, and motivates density-aware models that extend expressivity beyond bounded-width message passing. We experimentally validate this finding on an application node classification suite of real-world hypergraphs, where the Width Wall predicts when graph-reduction baselines fail and when density features help.

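2605.13687 2026-05-14 cs.LG cs.AI stat.ML Updated version

A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning

Jason Gaitonde, Frederic Koehler, Elchanan Mossel, Joonhyung Shin, Allan Sly

Affiliations * Duke University; University of Chicago; Massachusetts Institute of Technology; Princeton University

AI summary This paper introduces a family of synthetic languages with hierarchical structure, generated by a broadcast process on trees, for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. It uses an exact $k$-gram ansatz in place of transformers and validates the substitution empirically. The study finds that, for the languages considered, insufficient context length makes generated sequences deviate from the true language distribution, whereas a model with reasoning needs only logarithmic memory to sample exactly from the true language, an exponential improvement.

Details
English abstract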

We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an \emph{exact $k$-gram ansatz} in place of transformers with context length $k$, a substitution we then validate empirically. Using this ansatz we derive explicit asymptotic predictions for distributional statistics of the sequences produced by a trained model, instantiated in two settings. For the \emph{Ising broadcast process} (a soft-constrained language), we prove that the variance of the generated sum scales log-linearly in the context depth and its kurtosis converges to that of a Gaussian -- both deviating from the true language for any sublinear context. For the \emph{coloring broadcast process} (a hard-constrained language) in the freezing regime, bounded-context autoregression produces sequences that, with high probability, are inconsistent with \emph{any} valid coloring of the underlying tree. Together these results imply an $Ω(n)$ lower bound on the context length required to faithfully sample length-$n$ sequences. In contrast, we prove that an autoregressive \emph{reasoning} model with only $Θ(\log n)$ working memory can sample exactly from the true language -- an exponential improvement. We confirm both the lower-bound predictions and the reasoning-based upper bound empirically with transformers trained on the synthetic language; the trained models track our asymptotic predictions quantitatively across a wide range of context sizes.
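
For readers unfamiliar with broadcast processes, the following sketch samples the leaves of an Ising broadcast process on a tree; read left to right, the leaves form one "sentence" of the synthetic language. The binary branching, the `flip_prob` parameterization, and the function name are assumptions for illustration and may differ from the paper's exact construction.

```python
import random


def ising_broadcast_leaves(depth: int, flip_prob: float, rng: random.Random) -> list[int]:
    """Sample leaf spins of a complete binary tree of the given depth.

    The root spin is uniform on {-1, +1}; each child copies its parent's
    spin and flips it independently with probability flip_prob.
    """
    level = [rng.choice([-1, 1])]  # root spin
    for _level_index in range(depth):
        next_level = []
        for spin in level:
            for _child_index in range(2):  # assumed: two children per node
                child = -spin if rng.random() < flip_prob else spin
                next_level.append(child)
        level = next_level
    return level


if __name__ == "__main__":
    leaves = ising_broadcast_leaves(depth=4, flip_prob=0.1, rng=random.Random(0))
    print(len(leaves), sum(leaves))  # sequence length and the "generated sum" statistic
```

Nearby leaves share recent ancestors and are strongly correlated, while distant leaves are linked only through nodes near the root, which is why a bounded context window can miss long-range structure.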

2605.13686 2026-05-14 cs.CV cs.AI Updated version

Cross Modality Image Translation In Medical Imaging Using Generative Frameworks

Giulia Romoli, Alessia Capoccia, Filippo Ruffini, Francesco Di Feola, Luca Boldrini, Arturo Chiti, Renato Cuocolo, Tugba Akinci D'Antonoli, Fatemeh Darvizeh, Marcello Di Pumpo, Bradley J. Erickson, Liu Fang, Deborah Fazzini, Paola Feraco, Fabrizia Gelardi, Francesco Gossetti, Ana Isabel Hernáiz Ferrer, Michail E. Klontzas, Seyedmehdi Payabvash, Katrine Riklund, Sara N. Strandberg, Valerio Guarrasi, Paolo Soda

Affiliations * Department of Diagnostics and Intervention, Radiation Physics, Biomedical Engineering, Umeå University; Unit of Artificial Intelligence and Computer Systems, Department of Engineering, Università Campus Bio-Medico di Roma; Vita-Salute San Raffaele University; Department of Medicine, Surgery and Dentistry, University of Salerno; Division of Diagnostic and Interventional Neuroradiology, Department of Radiology, University Hospital Basel; Department of Pediatric Radiology, University Children’s Hospital Basel; Department of Life Science and Public Health, Università Cattolica del Sacro Cuore; Athinoula A. Martinos Center for Biomedical Imaging; Artificial Intelligence and Translational Imaging (ATI) Lab, Department of Radiology, School of Medicine, University of Crete; Division of Radiology, Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institute; Columbia University Medical Center; Department of Diagnostics and Intervention, Diagnostic Radiology, Umeå University

AI summary This paper studies cross-modality image translation in medical imaging, which generates a target modality from a source modality without additional acquisitions. The authors present a reproducible, standardized evaluation framework and systematically compare seven generative models across multiple clinical tasks and datasets, finding that GAN-based models outperform latent generative models overall, with SRGAN performing best on several tasks. The experiments also show that models struggle with small lesions and that quantitative metrics diverge from clinical preference, while synthetic volumes are close to indistinguishable from real acquisitions for clinicians.

Details
English abstract

Medical image-to-image (I2I) translation enables virtual scanning, i.e. the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups and lack clinical validation. The primary contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also performed a Visual Turing test administered to 17 physicians, including 15 radiologists, which shows near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.

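2605.13651 2026-05-14 cs.SD cs.AI Updated version

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

Zhongju Yuan, Geraint Wiggins, Dick Botteldooren

Affiliations * WAVES Research Group, Ghent University, Gent, Belgium; AI Lab, Vrije Universiteit Brussel, Brussel, Belgium; EECS, Queen Mary University of London, London, UK

AI summary This paper proposes NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that addresses the attention bottleneck in detecting salient events in long-form audio. At its core is a neuro-inspired Oscillatory Working Memory (OWM) that triggers higher-level language-model processing only when it senses perceptual salience, improving event-detection precision while reducing unnecessary computation. Experiments show that NAACA substantially improves detection performance on XD-Violence and is robust to noise and transient pauses on an urban soundscape dataset.

Comments Accepted as a regular paper by ICML 2026

Details
English abstract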

Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-cognition ALM processing only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.

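2605.13625 2026-05-14 cs.AI Updated version

How to Interpret Agent Behavior

Jie Gao, Kaiser Sun, Jen-tse Huang, Katherine Van Koevering, Sijie Ji, Heyuan Huang, Weiyan Shi, Zhuoran Lu, Ziang Xiao, Daniel Khashabi, Mark Dredze

Affiliations * Johns Hopkins University; California Institute of Technology; Northeastern University; Purdue University

AI summary This paper studies how to interpret the runtime behavior of autonomous agents such as Claude Code and Codex, and proposes ACT*ONOMY, a behavioral taxonomy for describing and analyzing agent trajectories. The taxonomy is a three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories, accompanied by an open-source analysis platform that supports ongoing updates and extension. Experiments show that ACT*ONOMY can compare behavioral profiles across agents and identify anomalous patterns at runtime, giving researchers and users a shared vocabulary that improves understanding and oversight of agent behavior.

Comments 34 pages in total

Details
English abstract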

Autonomous agents such as Claude Code and Codex now operate for hours or even days. Understanding their runtime behavior has become critical for downstream tasks such as diagnosing inefficiencies, fixing bugs, and ensuring better oversight. A primary way to gain this understanding is analyzing the reasoning trajectories and execution traces these agents generate. Yet such data remains in unstructured natural-language form, making it difficult for humans to interpret at scale. We introduce ACT*ONOMY (a combination of Action and Taxonomy), a taxonomy for describing and analyzing agent behavior at runtime. ACT*ONOMY has two components: (1) the taxonomy itself, developed through Grounded Theory and structured as a three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories; and (2) an open repository that hosts the living taxonomy, provides an automated analysis pipeline that applies it to agent trajectory analysis, and defines an extension protocol for customization and growth. Our experiments show that ACT*ONOMY can compare behavioral profiles across agents and characterize a single agent's behavior across diverse trajectories, surfacing patterns indicative of failure modes. By providing a shared vocabulary, ACT*ONOMY helps researchers, agent designers, and end users interpret agent behavior more consistently, enabling better oversight and control.

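2605.13618 2026-05-14 cond-mat.mtrl-sci cs.AI Updated version

OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research

Peng Kang, Bixuan Li, Xiaoya Huang, Shuo Shi, Weiqiao Zhou, Zhen Li, Yu Liu, Lei Zheng

Affiliations * National Key Laboratory of AI for Materials Science; Tianmushan Laboratory; Beihang University

AI summary This study proposes OpenAaaS, an open-source framework that addresses the security and organizational problems of cross-institution collaboration in distributed materials-informatics research. Built on the principle "code flows, data stays still", it uses a hierarchical architecture of a Master Agent and sub-agents to decompose and execute complex research tasks while preserving data sovereignty and secure isolation of computational resources. Two case studies, materials literature analysis and construction of a high-entropy alloy database, validate the framework and show its potential as a scalable foundation for next-generation intelligent materials design platforms.

Comments 20 pages, 5 figures

Details
English abstract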

The Materials Genome Initiative catalyzed the proliferation of centralized platforms--SaaS, PaaS, and IaaS--that aggregate computational and experimental resources for accelerated materials discovery. In parallel, breakthroughs in large language models (LLMs) and autonomous agents have created powerful new reasoning capabilities for scientific research. Yet a critical "last mile" problem remains: while we possess world-class models and vast repositories of materials data, we lack the organizational infrastructure to compose these capabilities securely across institutional boundaries. The development of structural and functional materials for harsh service environments--high-temperature alloys, radiation-resistant steels, corrosion-resistant coatings--remains characterized by long-term iteration, mechanistic complexity, and high domain expertise--demands that exceed both monolithic agent systems and traditional centralized platforms. To address this gap, we propose OpenAaaS, an open-source hierarchical and distributed Agent-as-a-Service framework that enables organized multi-agent collaboration for intelligent materials design. OpenAaaS is built on a single foundational principle: code flows, data stays still. A Master Agent plans and decomposes complex research tasks without requiring direct access to subordinate agents' managed data and computational resources. Sub-agents, deployed as near-data execution nodes, retain full sovereignty over local datasets, proprietary algorithms, and specialized hardware. This architecture guarantees that raw data never leaves its domain of origin while enabling cross-scale, cross-domain secure integration of previously isolated materials intelligence silos.
We validate the framework through two representative case studies: (i) AlphaAgent, an evidence-grounded materials literature analysis executor that achieves 4.66/5.0 on deep analytical questions against single-pass RAG baselines; and (ii) an ultra-large-scale hexa-high-entropy alloy descriptor database service that demonstrates secure near-data execution and domain-specific scientific workflows under strict data-sovereignty constraints. OpenAaaS establishes a principled pathway toward "organized research" via agent collectives, offering a scalable foundation for next-generation materials intelligent design platforms. All source code is available at https://github.com/Wolido/OpenAaaS.

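2605.13601 2026-05-14 cs.AI cs.MA Updated version

Unweighted ranking for value-based decision making with uncertainty

Aarón López García, Natalia Criado, Jose Such

Affiliations * Valencian Research Institute for Artificial Intelligence; Universitat Politècnica de València

AI summary As intelligent systems are increasingly used for autonomous decision making in society, their adherence to human values has drawn wide concern. This paper proposes the Fuzzy-Unweighted Value-Based Decision Making (FUW-VBDM) framework, which incorporates both qualitative and quantitative criteria to make decisions more human-centred and removes the bias introduced by stakeholders' subjective weights. To solve it, the authors design Rankzzy, a method that uses fuzzy reasoning to quantify uncertainty, and show its advantages in computational cost and rank performance on large-scale cases.

Comments 21 pages

Details
English abstract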

As intelligent systems are increasingly implemented in our society to make autonomous decisions, their commitment to human values raises serious concerns. Their alignment with human values remains a critical challenge because it can jeopardise the integrity and security of citizens. For this reason, an innovative human-centred and values-driven approach to decision making is required. In this work, we introduce the Fuzzy-Unweighted Value-Based Decision Making (FUW-VBDM) framework, where agents incorporate both quantitative and qualitative criteria to generate human-centred decisions. We also address the normative bias introduced by stakeholders with arbitrary weights by removing prior weights and introducing a fuzzy domain of decision variables defined for a score function. This concept allows us to generalise any VBDM problem as the search for feasible solutions when optimising the score in the weight domain. To provide a solution to FUW-VBDM, we present Rankzzy, a customizable unweighted ranking method that integrates fuzzy-based reasoning to quantify uncertainty. We mathematically prove the consistency of Rankzzy for any admissible configuration selected by stakeholders. We show the applicability of our method through an illustrative case study, which we also use as a running example. The evaluation indicates a reduced computational cost in large-scale value-based decision-making problems and strong rank performance relative to existing approaches when aggregating via Pythagorean means.
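
The Pythagorean means mentioned above are the arithmetic, geometric, and harmonic means. The sketch below shows only that aggregation step applied to unweighted criterion scores; it does not implement Rankzzy or any fuzzy reasoning, and the function name, score layout, and example values are hypothetical.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# The three Pythagorean means, all available in the standard library.
MEANS = {"arithmetic": fmean, "geometric": geometric_mean, "harmonic": harmonic_mean}


def rank_unweighted(scores: dict[str, list[float]], mean: str = "geometric") -> list[str]:
    """Order alternatives by an unweighted mean of their positive criterion scores."""
    aggregate = MEANS[mean]
    return sorted(scores, key=lambda name: aggregate(scores[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical alternatives scored on two criteria in (0, 1].
    scores = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.6, 0.3]}
    for kind in MEANS:
        print(kind, rank_unweighted(scores, kind))
```

The geometric and harmonic means penalize an alternative that scores very low on any single criterion, so the choice of mean changes the ranking even though no criterion is given an explicit weight.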

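2605.13586 2026-05-14 cs.CV cs.AI Updated version

HetScene: Heterogeneity-Aware Diffusion for Dense Indoor Scene Generation

Zini Chen, Junming Huang, Rong Zhang, Jiamin Xu, Cheng Peng, Chi Wang, Weiwei Xu

AI summary This paper proposes HetScene, a heterogeneity-aware diffusion model for generating dense, physically plausible indoor scenes. By distinguishing primary from secondary objects, it decomposes scene generation into two stages, Structural Layout Generation and Contextual Layout Generation, to better model complex object distributions and spatial dependencies. The framework improves the controllability and physical plausibility of generated scenes, supporting the construction of simulation environments for embodied AI.

Details
English abstract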

Generating controllable and physically plausible indoor scenes is a pivotal prerequisite for constructing high-fidelity simulation environments for embodied AI. However, existing deep-learning-based methods usually treat all objects as homogeneous instances within a unified generation process. While effective for sparse and simplistic layouts, they struggle to model realistic layouts with dense object arrangements and complex spatial dependencies, leading to limited scalability and degraded physical plausibility. To deal with these challenges, we revisit indoor layout generation from the perspective of structural heterogeneity and decompose the objects into primary objects and secondary objects according to their distinct roles in shaping a scene. Based on this decomposition, we propose HetScene, a heterogeneous two-stage generation framework that decouples indoor layout synthesis into Structural Layout Generation (SLG) and Contextual Layout Generation (CLG). SLG first generates globally coherent structural layouts with only primary objects conditioned on text descriptions, top-down binary room masks, and spatial relation graphs, establishing a stable global macro-skeleton of large core furniture.

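2605.13579 2026-05-14 cs.AI Updated version

Position: Assistive Agents Need Accessibility Alignment

Jie Hu, Changyuan Yan, Yu Zheng, Ziqian Wang, Jiaming Zhang

Affiliations * School of Artificial Intelligence and Robotics, Hunan University, Changsha, China

AI summary This paper examines the accessibility alignment problem facing assistive agents designed for blind and visually impaired (BVI) users, noting that most current agent systems are designed and evaluated under assumptions of sighted interaction and therefore fail frequently in assistive scenarios. Analyzing 778 assistance task instances, it reveals mismatches between current agents and the verification, risk, and interaction constraints faced by BVI users. It proposes treating accessibility as an alignment problem, introduces the concept of accessibility alignment, and lays out a lifecycle design pipeline spanning user research, system design, deployment, and iteration, pointing toward more inclusive agent design.

Comments 9 pages, 1 figure, Accepted to ICML 2026

Details
English abstract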

Assistive agents for Blind and Visually Impaired (BVI) users require accessibility alignment as a first-class design objective. Despite rapid progress in agentic AI, most systems are designed and evaluated under assumptions of sighted interaction, low-cost verification, and tolerable trial-and-error, leading to systematic failures in assistive scenarios that cannot be resolved by model scaling or post-hoc interface adaptations alone. Drawing on an analysis of 778 assistance task instances from prior work, we show that current agentic AI systems remain prone to failure in assistive scenarios due to mismatches between sighted-user design assumptions and the verification, risk, and interaction constraints faced by BVI users. We argue that accessibility should be treated as an alignment problem rather than a peripheral usability concern. To this end, we introduce accessibility alignment and propose a lifecycle-oriented design pipeline for accessibility-aligned assistive agents, spanning user research, system design, deployment and post-deployment iteration. We conclude that BVI-centered assistive tasks provide a critical stress test for agentic AI and motivate a broader shift toward inclusive agent design.

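2605.13574 2026-05-14 cs.HC cs.AI Updated version

Beyond Anthropomorphism: Exploring the Roles of Perceived Non-humanity and Structural Similarity in Deep Self-Disclosure Toward Generative AI

Satoru Shibuya

Affiliations * Graduate School of Business and Finance, Waseda University

AI summary This study explores the psychological factors behind users' deep self-disclosure toward generative AI, focusing on perceived non-humanity and structural similarity, two factors beyond anthropomorphism. Perceived non-humanity may reduce users' evaluation apprehension, while structural similarity reflects how closely the AI's responses align with the user's logical thinking. Based on survey data from 2,400 participants, the group high in both perceptions was more likely than the baseline group to engage in deep self-disclosure, suggesting that trust-related behavior in deep self-disclosure may involve key factors other than anthropomorphism.

Comments Submitted to International Journal of Human-Computer Interaction (IJHCI). 35 pages, 2 tables, 3 figures

Details
English abstract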

This study investigates deep self-disclosure toward generative AI by examining perceived non-humanity and structural similarity as psychological factors beyond anthropomorphism. Perceived non-humanity may reduce evaluation apprehension, whereas structural similarity refers to the perceived logical alignment between a user's thinking and AI responses. Using cross-sectional survey data from 2,400 participants collected in 2025, this study analyzed associations with both the occurrence and depth of self-disclosure. Logistic regression indicated that the group high in both perceptions (Segment D) showed a significantly higher likelihood of disclosure than the baseline group (Segment A; OR = 11.35). ANOVA further showed significant between-group differences in disclosure depth. The findings suggest that trust-related behavior in deep self-disclosure may involve factors other than anthropomorphic perception. Because the study is exploratory and based on self-reported survey data, the results should be interpreted as associative rather than causal, and future longitudinal or experimental research is needed.

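2605.13570 2026-05-14 cs.AI cs.LG Updated version

Learning Local Constraints for Reinforcement-Learned Content Generators

Debosmita Bhaumik, Julian Togelius, Georgios N. Yannakakis, Ahmed Khalifa

Affiliations * Institute of Digital Games; Game Innovation Lab

AI summary This paper studies how to combine constraint-based game content generation (such as Wave Function Collapse) with reinforcement-learning generation so that generated content is both locally visually plausible and globally playable. The authors apply the local constraints learned by WFC to the action space of a reinforcement-learning generator, so that the generator satisfies global properties while following local rules. Experiments show that, with appropriate tuning, the hybrid method generates visually pleasing and playable puzzle-platform game levels, such as levels for Lode Runner.

Details
English abstract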

Constraint-based game content generators that learn local constraints from existing content, such as Wave Function Collapse (WFC), can generate visually satisfying game levels but face challenges in guaranteeing global properties, such as playability. On the other hand, reinforcement-learning trained generators can guarantee global properties -- because such properties can easily be included in reward functions -- but the results can be visually dissatisfying. In this paper, we explore ways to combine these methods. Specifically, we constrain the action space of a PCGRL generator with constraints learned by WFC, effectively allowing the PCGRL generator to achieve global properties while forced to adhere to local constraints. To better analyze how this hybrid content generation method operates, we vary the number and type of inputs, and we test whether to randomly collapse the starting state and exclude rare patterns. While the method is sensitive to hyperparameter tuning, the best of our trained generators produce visually satisfying and playable puzzle-platform game levels -- such as Lode Runner levels -- with desired global properties.
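
The idea of restricting a generator's action space with learned local constraints can be illustrated with a toy sketch. It assumes a one-dimensional row of tiles and uses left/right adjacency pairs as the only constraint type; the paper works with WFC patterns and a PCGRL agent, and the names `learn_adjacency` and `allowed_tiles` are hypothetical, not taken from it.

```python
from typing import List, Set, Tuple

def learn_adjacency(examples: List[str]) -> Set[Tuple[str, str]]:
    """Collect every (left, right) tile pair that occurs in the example rows."""
    pairs = set()
    for row in examples:
        for left, right in zip(row, row[1:]):
            pairs.add((left, right))
    return pairs

def allowed_tiles(level: List[str], pos: int, tiles: str,
                  pairs: Set[Tuple[str, str]]) -> List[str]:
    """Return the tiles that can be placed at `pos` without creating an unseen pair."""
    ok = []
    for t in tiles:
        left_ok = pos == 0 or (level[pos - 1], t) in pairs
        right_ok = pos == len(level) - 1 or (t, level[pos + 1]) in pairs
        if left_ok and right_ok:
            ok.append(t)
    return ok

# Toy tiles (assumed): "." empty, "#" brick, "H" ladder.
examples = ["..#H.", "##H.."]
pairs = learn_adjacency(examples)

# The agent may only choose among the returned tiles at position 2.
mask_after_brick = allowed_tiles(list(".#..."), 2, ".#H", pairs)
mask_in_empty_row = allowed_tiles(list("....."), 2, ".#H", pairs)
```

In a reinforcement-learning loop, the returned list would act as an action mask: the policy keeps optimizing a global reward such as playability, but can only sample placements that respect the learned local pairs.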

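2605.13568 2026-05-14 cs.LG cs.AI Updated version

Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model

Riccardo Cavarra, Lupo Lovatelli, Shaheim Ogbomo-Harmitt, Shahid Aziz, Adelaide De Vecchi, Andrew King, Oleg Aslanidi

Affiliations * King’s College London; St Thomas’ Hospital; North Bristol NHS Trust

AI summary This study aims to use electrocardiogram (ECG) data to predict the progression of cardiovascular disease after myocardial infarction (MI). It proposes a pretrained AI model based on contrastive learning that combines patient-specific temporal information with supervised multitask heads and is fine-tuned on a small amount of labelled data to improve prediction. Experiments show that the model outperforms a model trained from scratch under limited data, demonstrating the value of clinically structured ECG modelling for predicting disease progression.

Comments Submitted to the 9th International Conference on Computational and Mathematical Biomedical Engineering, 4 pages, 1 figure, 1 table

Details
English abstract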

Myocardial infarction (MI) is a leading cause of death, and predicting its adverse outcomes is urgent. Yet ECG-based prognostic models underperform because deep learning requires large, labelled datasets, which are scarce in medicine. Foundation models can learn from unlabelled ECGs via self-supervision, but medically relevant training strategies remain underexplored. We propose a pretrained artificial intelligence model that combines patient-specific temporal information using contrastive learning with supervised multitask heads, then fine-tunes on post-MI outcome prediction. The proposed model outperformed a model trained from scratch (0.794 vs 0.608 AUC), showing that clinically structured ECG modelling improves classification in limited data regimes.

2605.13555 2026-05-14 physics.med-ph cs.AI Updated version

Generating synthetic computed tomography for radiotherapy: SynthRAD2025 challenge report

Viktor Rogowski, Maarten L. Terpstra, Niklas Wahl, Florian Kamp, Erik van der Bijl, Arthur Jr. Galapon, Christopher Kurz, Bowen Xin, Zhengxiang Sun, Hollie Min, Gregg Belous, Jason Dowling, Yan Xia, Siyuan Mei, Fuxin Fan, Arthur Longuefosse, Javier Sequeiro Gonzalez, Miguel Diaz Benito, Alvaro Garcia Martin, Fabien Baldacci, Valentin Boussot, Cédric Hémon, Jean-Claude Nunes, Jean-Louis Dillenseger, Zhiyuan Zhang, Jinghua Cai, Han Bing, Tan Zuopeng, Ricardo Brioso, Daniele Loiacono, Guillaume Landry, Adrian Thummerer, Matteo Maspero

Affiliations * Radiation Physics, Department of Hematology, Oncology Radiation Physics, Skåne University Hospital, Lund, Sweden; Medical Radiation Physics, Department of Clinical Sciences Lund, Lund University, Lund, Sweden; Radiotherapy Department, University Medical Center Utrecht, Utrecht, The Netherlands; Computational Imaging Group for MR Diagnostics & Therapy, University Medical Center Utrecht, Utrecht, The Netherlands; Division of Medical Physics in Radiation Oncology, Deutsches Krebsforschungszentrum (DKFZ), Heidelberg, Germany; Heidelberg Institute for Radiation Oncology (HIRO) National Center for Radiation Research in Oncology (NCRO), Heidelberg, Germany; Department of Radiation Oncology Cyberknife Center, University Hospital of Cologne, Cologne, Germany; Department of Radiation Oncology, Radboud University Medical Center, Nijmegen, The Netherlands; Department of Radiation Oncology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands; Department of Radiation Oncology, LMU University Hospital, LMU Munich, Munich, Germany; Bavarian Cancer Research Center (BZKF), Munich, Germany; Australian eHealth Research Center, CSIRO, Brisbane, Australia; School of Computer Science, University of Sydney, Sydney, Australia; Department of Orthodontics Pattern Recognition Lab, FAU Erlangen-Nuremberg, Germany; RIKEN Center for Integrative Medical Sciences, Tokyo, Japan; Erasmus Mundus Joint Master's Degree IPCVai, University of Bordeaux, France; Computer Science, Huazhong University of Science Huazhong University of Science Canon Medical Systems (China) CO., LTD., Beijing, China; Department of Radiation Oncology, Inselspital, Bern University Hospital University of Bern, Bern, Switzerland

AI summary Addressing the need for synthetic CT (sCT) generation in radiotherapy, this study reports on the SynthRAD2025 challenge, which aims to convert MRI or CBCT images into synthetic CT images with accurate CT values using deep learning. It evaluates two tasks (MRI-to-CT and CBCT-to-CT) on data from 2,362 patients at five European centers. The results show that deep learning methods have reached a clinically applicable level in image quality and dose calculation, with particularly strong performance on CBCT-to-CT, while MRI-to-CT remains challenging. The limited correlation between image quality and dose accuracy highlights the importance of dose evaluation in clinical validation.

Comments 59 pages total: 26 pages main article + supplementary material; 8 figures in the main manuscript and 3 supplementary figures. Currently under review at the journal Medical Image Analysis (MIA)

Details
English abstract

Radiation therapy (RT) requires precise dose delivery over multiple fractions, with CT fundamental for treatment planning due to its electron density information. Repeated CT acquisitions impose radiation exposure and logistical burdens, MRI lacks electron density, and cone-beam CT (CBCT) requires correction for dose calculation. Synthetic CT (sCT) generation addresses these by converting MRI or CBCT into CT-equivalent images with accurate Hounsfield Unit (HU) values, enabling MRI-only RT and CBCT-based adaptive workflows. Building on SynthRAD2023, SynthRAD2025 benchmarked sCT methods on 2,362 patients from five European centers across head and neck, thorax, and abdomen. Two tasks, MRI-to-CT (890 cases) and CBCT-to-CT (1,472 cases), were evaluated via image similarity (MAE, PSNR, MS-SSIM), segmentation (Dice, HD95), and dosimetric metrics from photon and proton plans. With 803 participants and 12/13 valid submissions, Task 1 top performance reached MAE $64.8\pm21.3$ HU, PSNR $\sim$30 dB, MS-SSIM $\sim$0.936, Dice 0.79, photon $\gamma_{2\%/2\,\text{mm}}>98\%$, proton $\gamma\approx85\%$. Task 2 improved: MAE $48.3\pm13.4$ HU, PSNR 32.6 dB, MS-SSIM 0.968, Dice 0.86, photon $\gamma>99\%$, proton $\gamma\approx89\%$. Strong image--segmentation correlations ($\rho=0.78$--$0.79$) but moderate dose correlations confirmed that image quality is insufficient as a dosimetric surrogate. Head-and-neck cases were most consistent; thoracic and abdominal cases showed greater variability. Residual errors at tissue interfaces propagate along beam paths, affecting proton dose more than photon dose. SynthRAD2025 demonstrates that deep learning yields clinically relevant sCTs, especially for CBCT-to-CT, while identifying persistent MRI-to-CT challenges and underscoring dose-based evaluation as essential for clinical validation.

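2605.13554 2026-05-14 cs.LG cs.AI Updated version

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Asim Osman, Sasha Abramowitz, Mark Bergh, Ulrich Armel Mbou Sob, Ruan John de Kock, Omayma Mahjoub, Oussama Hidaoui, Noah De Nicola, Arnol Manuel Fokam, Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Siddarth Singh, Refiloe Shabe, Juan Claude Formanek, Simon Verster Du Toit, Arnu Pretorius

Affiliations * InstaDeep; AIMS; University of Stellenbosch

AI summary This paper proposes Contrastive Proximal Policy Optimisation (CPPO), a contrastive-learning-based policy optimisation algorithm for self-supervised reinforcement learning without hand-designed reward functions. The method learns Q-values by contrasting state-action and goal representations and optimises these contrastive Q-values directly on-policy, enabling end-to-end self-supervised training. Experiments show that, across single-agent and cooperative multi-agent tasks with continuous and discrete action spaces, CPPO significantly outperforms existing contrastive RL methods and, on most tasks, matches PPO trained with hand-crafted dense rewards.

Details
English abstract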

Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Whilst the existence of an on-policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds the performance of PPO, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.

2605.13542 2026-05-14 cs.AI cs.CL cs.LG cs.MA Updated version

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan

Affiliations * Technical University of Munich; TUM University Hospital; LMU Munich; University of Sheffield; University of Oxford; Zhongshan Hospital Fudan University; Sun Yat-sen University Cancer Center; Imperial College London; Munich Center for Machine Learning; relAI – Konrad Zuse School of Excellence in Reliable AI

AI summary This paper presents RealICU, a new benchmark built from real intensive care unit (ICU) clinical data for evaluating large language models on complex, long-horizon medical decision tasks. Senior physicians annotate each case in hindsight after reviewing the full patient trajectory, and the benchmark defines four tasks tied to clinical decision-making. It reveals a recall-safety trade-off in the recommendations of existing LLMs and an over-reliance on early patient information. RealICU provides a reliable testbed for studying and improving AI decision support in high-stakes care.

Details
English abstract

Intensive care units (ICUs) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory into 30-min windows and release two datasets: RealICU-Gold with 930 window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents; it improves long-horizon reasoning but does not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision-support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/

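2605.13540 2026-05-14 cs.LG cs.AI Updated version

Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models

Haonan Yuan, Qingyun Sun, Junhua Shi, Xingcheng Fu, Jianxin Li, Philip S. Yu

Affiliations * School of Computer Science and Engineering, Beihang University; Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University; Department of Computer Science, University of Illinois at Chicago

AI summary Dynamic graphs are widespread in real-world systems, and building generalizable dynamic graph foundation models is an important frontier in graph learning. To address the challenge that inconsistent semantic and temporal patterns across domains pose to unified modeling, this paper proposes DyGFM, a multi-domain dynamic graph foundation model based on decoupled and divergence-conditioned prompting. The model uses a dual-branch pre-training strategy with semantic-temporal decoupling to separate transferable semantics from domain-specific dynamics, and introduces a divergence-aware cross-domain routing mechanism and a prompt generator, which alleviate negative transfer and make downstream fine-tuning more efficient. Experiments show that DyGFM significantly outperforms 12 state-of-the-art baselines on multiple dynamic graph benchmarks.

Details
English abstract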

Dynamic graphs are ubiquitous in real-world systems, and building generalizable dynamic Graph Foundation Models has become a frontier in graph learning. However, dynamic graphs from different domains pose fundamental challenges to unified modeling, as their semantic and temporal patterns are inherently inconsistent, making the multi-domain pre-training difficult. Consequently, the widely used "pretrain-then-finetune" paradigm often suffers from severe negative knowledge transfer. To the best of our knowledge, there exists no multi-domain dynamic GFM. In this work, we propose DyGFM, a Dynamic Graph Foundation Model over multiple domains based on decoupled and divergence-conditioned prompting. To disentangle transferable semantics from the domain-specific dynamics, we introduce a dual-branch pre-training strategy with semantic-temporal decoupling. To alleviate negative transfer during domain adaptation, we further develop a cross-domain routing mechanism with divergence-aware expert selection. To enable efficient downstream fine-tuning, we design a divergence-conditioned prompt generator that injects lightweight, learnable graph prompts tailored to semantic and temporal traits. Extensive experiments on continuous dynamic graph benchmarks demonstrate that DyGFM consistently outperforms 12 state-of-the-art baselines on both node classification and link prediction tasks, achieving superior effectiveness and efficiency.

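2605.13538 2026-05-14 cs.CL cs.AI Updated version

Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models

Anuj Sadani, Deepak Kumar

AI summary This paper studies how to keep a small language model from copying its demonstration examples verbatim when substituting personally identifiable information (PII) on-device. The authors propose a locale-conditioned few-shot prompting method that combines a classifier with a generative model to produce context-appropriate, type-consistent fake values. Experiments show that the method effectively reduces regurgitation of demonstration content, but on a downstream named entity recognition task the generated surrogates are not diverse enough, which hurts model performance.

Comments 15 pages

Details
English abstract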

Personally Identifiable Information (PII) redaction usually replaces detected entities with placeholder tokens such as [PERSON], destroying the downstream utility of the redacted text for retrieval and Named Entity Recognition (NER) training. We propose a fully on-device pipeline that substitutes PII with consistent, type-preserving fake values: a 1.5 B mixture-of-experts token classifier (openai/privacy-filter) detects spans, a 1-bit Bonsai-1.7B Small Language Model (SLM) proposes contextual surrogates for names, addresses, and dates, and a rule-based generator (faker) handles patterned fields. We report a prompting finding more important than the quantization choice: with naive fixed three-shot demonstrations, the 1-bit SLM regurgitates demonstration outputs verbatim regardless of input; 1.58-bit Ternary-Bonsai-1.7B reproduces byte-identical failures, ruling out quantization as the cause. We fix this with locale-conditioned rotating few-shot demonstrations: a character-range heuristic picks a locale-pure pool and a per-input MD5 hash samples three demonstrations. With the fix, 482/482 unique Bonsai-1.7B calls succeed (no echoes) and produce locale-correct surrogates, although the SLM still copies from a small same-locale demonstration pool - a residual narrowness we quantify. On a 2000-document multilingual corpus, hybrid perplexity (PPL) beats faker in all six locales under a multilingual evaluator (XGLM-564M); length preservation is best-of-three in 4 of 6 locales. On downstream NER (400 train / 100 test, English), redact yields F1=0.000, faker 0.656, original 0.960; on a matched 160/40 subset including hybrid, faker (0.506) outperforms hybrid (0.346) at p < 0.001. We report this as an honest negative finding: SLM surrogates produce more natural text but a less varied training distribution, and downstream NER benefits more from variety than from naturalness.
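
The demonstration-selection step described in the abstract (a character-range heuristic picks a locale pool, then a per-input MD5 hash samples three demonstrations) might look roughly like the sketch below. The pools, the two-locale split, the code-point ranges, and the function names `pick_locale` and `pick_demos` are illustrative assumptions; the paper's actual pools and heuristic are not given in the abstract.

```python
import hashlib
from typing import Dict, List

# Hypothetical locale-pure demonstration pools (not the paper's data).
POOLS: Dict[str, List[str]] = {
    "latin": ["Alice Smith", "Bruno Costa", "Clara Weber", "David Moreau", "Elena Rossi"],
    "cjk": ["王伟", "李娜", "佐藤健", "鈴木花子", "张敏"],
}

def pick_locale(text: str) -> str:
    """Character-range heuristic: any CJK ideograph or kana selects the 'cjk' pool."""
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF:
            return "cjk"
    return "latin"

def pick_demos(text: str, k: int = 3) -> List[str]:
    """Sample k distinct demonstrations, seeded by the MD5 digest of the input."""
    pool = list(POOLS[pick_locale(text)])
    digest = hashlib.md5(text.encode("utf-8")).digest()
    demos = []
    for i in range(k):
        # Each digest byte indexes into the shrinking pool, so picks never repeat.
        demos.append(pool.pop(digest[i] % len(pool)))
    return demos

demos_en = pick_demos("Contact John Doe at 5 Main Street")
demos_ja = pick_demos("連絡先は田中です")
```

Because the selection depends only on the input text, the same input always receives the same three demonstrations, while different inputs rotate through the pool. With a pool this small, surrogates copied from it stay narrow, which matches the residual limitation the authors report.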

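2605.13537 2026-05-14 cs.LG cs.AI cs.CL Updated version

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

Ye Wang, Jing Liu, Toshiaki Koike-Akino

Affiliations * Mitsubishi Electric Research Laboratories (MERL)

AI summary This paper studies how inference-time alignment can mitigate reward hacking. The authors propose a new alignment technique that adjusts the temperature of the reference model and generalizes inference-time alignment to ensembles of generative reward models, combined as a sharpened logarithmic opinion pool (SLOP). The method improves robustness while preserving alignment performance, offering an effective approach to continual adaptation as reward targets evolve.

Details
English abstract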

Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to further generalization of inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
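
A logarithmic opinion pool combines distributions as a weighted product, $p(x) \propto \prod_i p_i(x)^{w_i}$. The sketch below computes that pool over a small discrete outcome set. It assumes that "sharpened" means the non-negative weights are allowed to sum to more than one, which is our reading rather than a definition from the abstract; the function name and the example numbers are hypothetical.

```python
import math
from typing import List, Sequence

def log_opinion_pool(dists: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Return p(x) proportional to prod_i dists[i][x] ** weights[i].

    All probabilities must be strictly positive so that their logs exist.
    """
    n = len(dists[0])
    log_scores = [
        sum(w * math.log(d[x]) for d, w in zip(dists, weights))
        for x in range(n)
    ]
    m = max(log_scores)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(s - m) for s in log_scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Toy distributions over three candidate outputs (made-up numbers).
reference = [0.5, 0.3, 0.2]
reward_a = [0.2, 0.2, 0.6]
reward_b = [0.1, 0.3, 0.6]

# A weight of 1/T on the reference is the same as sampling it at temperature T.
pooled = log_opinion_pool([reference, reward_a, reward_b], [0.5, 1.0, 1.0])

# A single distribution with weight 2 is sharpened toward its mode.
sharpened = log_opinion_pool([[0.6, 0.4]], [2.0])
```

Raising a distribution to the power $1/T$ and renormalizing is equivalent to temperature scaling, so the reference-model temperature can be expressed as one more pool weight. How the paper calibrates the weights to resist reward hacking is not shown here.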

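2605.13536 2026-05-14 cs.LG cs.AI Updated version

HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning

Qingyun Zou, Feng Yu, Hongshi Tan, Yao Chen, Bingsheng He, WengFai Wong

Affiliations * National University of Singapore

AI summary HLS-Seek is a code generation framework based on proxy comparative reward reinforcement learning that aims to improve the quality of results (QoR) of code in high-level synthesis (HLS), including latency and resource utilization. It performs reinforcement learning on relative comparisons rather than absolute synthesis results, which substantially lowers training cost, and introduces an uncertainty-aware Monte Carlo dropout mechanism to prevent reward hacking, yielding a self-improving reward system. Experiments show that HLS-Seek outperforms existing models in both syntax and functional correctness, trains more efficiently, and achieves superior QoR on multiple benchmarks.

Details
English abstract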

High-Level Synthesis (HLS) compiles algorithmic C/C++ descriptions into hardware, with Quality of Results (QoR) -- latency and resource utilization -- critically governed by pragma configurations and code structure. Existing LLM-based HLS approaches train for functional correctness but ignore QoR entirely. We observe that reinforcement learning (RL) for HLS does not require absolute synthesis results -- only relative comparisons between candidates. Based on this insight, we propose HLS-Seek, a QoR-aware NL-to-HLS framework that replaces expensive synthesis-in-the-loop RL with a comparative proxy reward model achieving 99.53% Pareto-dominance accuracy. To prevent reward hacking, we introduce uncertainty-aware Monte Carlo (MC) dropout switching that selectively invokes real Vitis HLS synthesis for low-confidence candidates and updates the proxy online, creating a self-improving reward system. HLS-Seek achieves 81.5% syntax correctness pass@1 and 81.4% Func@5 on HLS-eval with only 7B parameters, surpassing GPT-5.1 and other frontier models while achieving 8.5× faster training than real-reward RL. On QoR evaluation, HLS-Seek achieves the lowest latency on 16/30 kernels and Pareto-dominates HLS-specific baselines on 9 kernels.
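
The relative comparison that the abstract refers to can be made concrete with a Pareto-dominance check. This is a minimal sketch that assumes QoR is a (latency, resource utilization) pair with lower values better on both axes; it computes the ground-truth comparison label directly, whereas the paper trains a proxy model to predict such comparisons. The names `dominates` and `comparative_reward` and the sample numbers are hypothetical.

```python
from typing import Tuple

QoR = Tuple[float, float]  # (latency, resource utilization); lower is better

def dominates(a: QoR, b: QoR) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def comparative_reward(a: QoR, b: QoR) -> int:
    """+1 if a Pareto-dominates b, -1 if b dominates a, 0 if neither does."""
    if dominates(a, b):
        return 1
    if dominates(b, a):
        return -1
    return 0

# Made-up candidates: (latency in cycles, fraction of resources used).
fast_small = (1200.0, 0.30)
slow_small = (1800.0, 0.30)
fast_large = (1000.0, 0.55)

r1 = comparative_reward(fast_small, slow_small)  # same area, lower latency
r2 = comparative_reward(fast_small, fast_large)  # a trade-off, neither dominates
```

A reward of this form needs only the ordering of two candidates, not their absolute synthesis numbers, which is why a learned comparator can stand in for a full synthesis run during most of training.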

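2605.13534 2026-05-14 cs.AI Updated version

Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

Jiabei Liu, Wenyu Mao, Junfei Tan, Chunxu Shen, Lingling Yi, Jiancan Wu, Xiang Wang

Affiliations * University of Science and Technology of China; WeChat Technical Architecture Department, Tencent Inc.

AI summary This paper proposes MultiSearch, a reinforcement-learning-based framework for improving retrieval-augmented reasoning. At each reasoning step, the method generates queries from multiple perspectives and retrieves information in parallel, widening information coverage, and it explicitly consolidates and refines the retrieved results during merging, which raises the signal-to-noise ratio (SNR) and reasoning accuracy. Experiments show that MultiSearch outperforms existing methods on multiple benchmarks and significantly improves reasoning performance on question-answering tasks.

Details
English abstract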

Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi-step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This may result in low signal-to-noise ratios (SNR) during search, degrading reasoning accuracy and leading to unnecessary reasoning steps. In this paper, we introduce MultiSearch, an RL-based framework that addresses these limitations through multi-query retrieval and explicit merging of retrieved information. At each reasoning step, MultiSearch generates queries from multiple perspectives and retrieves external information in parallel, expanding the scope of relevant information and mitigating the reliance on any single retrieval result. Then, the agent consolidates and refines retrieved information at the merging process, improving the SNR and ensuring more accurate reasoning. Additionally, we propose a reinforcement learning framework with a multi-process reward design to optimize agents for both multi-query retrieval and information consolidation. Extensive experiments on seven benchmarks demonstrate that MultiSearch outperforms baseline methods, enhancing the SNR of retrieval and improving reasoning performance in question-answering tasks.

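2605.13532 2026-05-14 cs.AI cs.CL cs.CY cs.HC Updated version

AI-Generated Slides: Are They Good? Can Students Tell?

Juho Leinonen, Lisa Zhang, Arto Hellas

Affiliations * Aalto University; University of Toronto Mississauga

AI summary This paper studies how well generative AI (GenAI) works for producing slides for teaching, focusing on how instructors and students perceive AI-generated slides. Comparing slides generated by several AI tools with human-made slides, the study finds that AI-generated slides perform well on accuracy and pedagogical quality, that students have difficulty telling AI-generated slides from human-made ones, and that students tend to attribute slides they rate highly to a human author. The results suggest that GenAI has considerable potential in instructional design, though responsible and effective use needs further exploration.

Comments 7 pages, 2 tables. Accepted to Western Canada Conference on Computing Education (WCCCE) 2026

Details
English abstract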

As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor-authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI-generated slides are ``good'' via narrative assessment by educators. We choose the best slides to use (with some modification) in a real course setting, and compare the student perception of human vs. AI-generated slides. We find that coding assistant tools produced the slides that were most accurate, complete, and pedagogically sound. Students rate GenAI slides to be of similar quality to instructor-created slides, and cannot reliably identify which slides are AI-generated. We also find a negative correlation between a high quality rating and a high ``AI-generated'' rating, suggesting that students associate poor quality with slides being AI-generated. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.

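2605.13530 2026-05-14 cs.CV cs.AI Updated version

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

Jincai Huang, Shihao Zou, Yuchen Guo, Jingjing Li, Wei Ji, Kai Wang, Shanshan Wang, Weixin Si

Affiliations * Southern University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Northwestern University; University of Alberta; Yale University; Nanfang Hospital; Shenzhen University of Advanced Technology

AI summary This paper proposes SurgMLLM, a unified surgical scene understanding framework that combines high-level semantic reasoning with low-level visual grounding, addressing the semantic inconsistency that arises when existing methods handle each component of the surgical scene in isolation. The method fine-tunes a multimodal large language model to jointly model surgical phases, instrument-verb-target triplets, and the corresponding segmentation regions, and achieves precise pixel-level grounding through temporal aggregation and a segmentation network. Experiments show that SurgMLLM achieves significant gains on both triplet recognition and segmentation, validating unified reasoning and grounding for surgical assistance.

Details
English abstract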

Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, extending CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.

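2605.13484 2026-05-14 cs.LG cs.AI stat.ME Updated version

Discovery of Hidden Miscalibration Regimes

Katarzyna Kobalczyk, Mihaela van der Schaar

Affiliations * University of Cambridge

AI summary This paper studies how a model's calibration error varies across inputs, noting that conventional methods assess calibration from confidence alone and can therefore mask local calibration failures. The authors propose a method for discovering hidden miscalibration that requires no predefined data slices: it learns a calibration-aware representation of the input space and estimates local miscalibration by kernel smoothing. Experiments show that the method reveals calibration heterogeneity of large language models across inputs and significantly improves calibration in systematically miscalibrated regions.

Details
English abstract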

Calibration is commonly evaluated by comparing model confidence with its empirical correctness, implicitly treating reliability as a function of the confidence score alone. However, this view can hide substantial structure: models may be systematically overconfident on some kinds of inputs and underconfident on others, causing global reliability diagnostics to obscure localised calibration failures. To address this, we formulate the problem of discovering hidden miscalibration regimes without assuming access to predefined data slices. We define the corresponding miscalibration field and propose a diagnostic framework for estimating it. Our approach learns a calibration-aware representation of the input space and estimates signed local miscalibration by kernel smoothing in the learned geometry. Across four real-world LLM benchmarks and twelve LLMs, we find that input-dependent calibration heterogeneity is prevalent. We further show that the discovered fields are actionable: they support local confidence correction and reduce calibration error in systematically miscalibrated regions where confidence-based methods such as isotonic regression and temperature scaling are less effective.
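
Estimating signed local miscalibration by kernel smoothing can be sketched as a Nadaraya-Watson average of the residual (confidence minus correctness) around a query point. The sketch assumes the input embeddings are already given and uses a Gaussian kernel with a fixed bandwidth; the paper learns a calibration-aware representation first, and the function name and toy data here are hypothetical.

```python
import math
from typing import Sequence

def local_miscalibration(query: Sequence[float],
                         points: Sequence[Sequence[float]],
                         conf: Sequence[float],
                         correct: Sequence[int],
                         bandwidth: float = 1.0) -> float:
    """Kernel-weighted mean of (confidence - correctness) near `query`.

    Positive values indicate local overconfidence, negative values
    indicate local underconfidence.
    """
    num = 0.0
    den = 0.0
    for z, c, y in zip(points, conf, correct):
        sq_dist = sum((a - b) ** 2 for a, b in zip(query, z))
        w = math.exp(-sq_dist / (2.0 * bandwidth ** 2))  # Gaussian kernel weight
        num += w * (c - y)
        den += w
    return num / den

# Toy 1-D embeddings forming two clusters (made-up data).
points = [[0.0], [0.1], [5.0], [5.1]]
conf = [0.9, 0.9, 0.6, 0.6]
correct = [0, 1, 1, 1]

near_first = local_miscalibration([0.05], points, conf, correct, bandwidth=0.5)
near_second = local_miscalibration([5.05], points, conf, correct, bandwidth=0.5)
```

In this toy data the first cluster is overconfident and the second is underconfident, so the two local estimates have opposite signs even though a pooled average of the residuals would sit near zero. That cancellation is the kind of hidden structure a global reliability diagnostic can miss.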

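2605.13452 2026-05-14 cs.RO cs.AI Updated version

CUBic: Coordinated Unified Bimanual Perception and Control Framework

Xingyu Wang, Pengxiang Ding, Jingkai Xu, Donglin Wang, Zhaoxin Fan

Affiliations * Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University; Westlake University; Zhejiang University; Peking University

AI summary This paper proposes CUBic, a coordinated unified bimanual perception and control framework that addresses the tension between independent perception and arm coordination when extending from single-arm to bimanual manipulation. Through unified perceptual modeling, the method learns a shared tokenized representation so that independent operation and coordinated interaction emerge naturally from structure rather than from hand-designed coupling mechanisms. Experiments show that CUBic clearly outperforms existing methods on the RoboTwin benchmark, with marked gains in coordination accuracy and task success rate.

Details
English abstract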

Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side -- either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination -- thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.

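2605.13450 2026-05-14 cs.AI cs.CL cs.HC Updated version

Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

Samuel Schapiro, Alexi Gladstone, Jonah Black, Heng Ji

Affiliations * University of Illinois Urbana-Champaign

AI summary This paper studies how to evaluate the creativity of large language models effectively, systematically assessing the validity of existing creativity tests across three domains: creative writing, divergent thinking, and scientific ideation. It finds that existing tests have significant limitations as predictors of model creativity, and predict scientific ideation ability especially poorly. The authors therefore propose a new test, the Divergent Remote Association Test (DRAT), which is the first to assess convergent and divergent thinking in a single test, predicts scientific ideation ability, and shows good robustness.

Comments 36 pages. Extended version of work under review

Details
English abstract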

Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score "creativity," their validity as measures of machine creativity has not been established, and these tests already have limited validity as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that is a significant predictor of scientific ideation ability, demonstrating robustness across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.

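2605.13435 2026-05-14 cs.LG cs.AI Updated version

Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

JaeHyeok Doo, Byeongguk Jeon, Seonghyeon Ye, Kimin Lee, Minjoon Seo

Affiliations * KAIST AI

AI summary This paper proposes Q-Flow, a reinforcement learning framework that aims to fully exploit the high expressive capacity of flow-based policies while addressing their optimization instability during value maximization. By exploiting the deterministic dynamics of flow models, the method propagates terminal trajectory value directly to intermediate latent states, enabling stable policy optimization without unrolling the numerical solver. Experiments show that Q-Flow significantly outperforms existing state-of-the-art methods on offline learning tasks and supports stable online adaptation within the same framework.

Comments 27 pages

Details
English abstract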

There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization requires backpropagating through numerical solvers and often leads to instability. Existing approaches typically address this issue by restricting the expressive capacity of flow-based policies, resulting in a trade-off between optimization stability and representational flexibility. To resolve this, we introduce Q-Flow, a framework that leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow. This formulation enables stable policy optimization using intermediate value gradients without unrolling the numerical solver, effectively bridging the gap between stability and expressivity. We evaluate Q-Flow in the offline learning setting on the challenging OGBench suite, where it consistently outperforms state-of-the-art baselines by an average of 10.6 percentage points, while also enabling stable online adaptation within the same framework.
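For context, a flow-based policy produces an action by numerically integrating a velocity field starting from a noise sample, and it is this solver loop that naive value maximization would have to backpropagate through. The sketch below shows only that background sampling loop with a fixed-step Euler solver; it does not implement Q-Flow. The velocity field `toy_velocity`, the `target` action, and the step count are hypothetical stand-ins for a trained, state-conditioned network.

```python
def toy_velocity(action, t, target):
    """Hypothetical velocity field that moves each coordinate toward `target`.

    A trained policy would evaluate a neural network conditioned on the state.
    """
    return [(g - a) / (1.0 - t) for a, g in zip(action, target)]


def sample_action(noise, target, n_steps=10):
    """Integrate da/dt = v(a, t) from t=0 to t=1 with fixed-step Euler."""
    dt = 1.0 / n_steps
    action = list(noise)
    for k in range(n_steps):
        t = k * dt  # t stays below 1, so the division in toy_velocity is safe
        velocity = toy_velocity(action, t, target)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action


noise = [0.3, -1.2]    # stands in for a sample from the base distribution
target = [0.5, 0.25]   # stands in for the action the flow transports toward
print(sample_action(noise, target))
```

Because the dynamics are deterministic, the same noise sample always maps to the same action, which is the property the abstract says Q-Flow relies on when it propagates terminal value back to intermediate states.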

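2605.13414 2026-05-14 cs.AI Updated version

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

Zabir Al Nazi, Shubhashis Roy Dipta

Affiliations * University of California, Riverside, USA; University of Maryland, Baltimore County, USA

AI summary This paper proposes TRIAGE, an evaluation framework for assessing the prospective metacognitive control of large language models, that is, their ability to select, order, and allocate compute to upcoming tasks under resource constraints. The framework gives the model a task pool and a preset token budget and asks it to commit to a single plan covering task selection, ordering, and resource allocation. The plan is then evaluated against the model's solvability and cost on each task, yielding a triage efficiency ratio. Experiments show that current mainstream language models have substantial deficits in this ability, revealing a key capability dimension for resource-efficient deployment that has not yet been adequately measured.

Details
English abstract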

Deploying language models as autonomous agents requires more than per-task accuracy: when an agent faces a queue of problems under a finite token budget, it must decide which to attempt, in what order, and how much compute to commit to each, all before any execution feedback is available. This is the prospective form of metacognitive control studied for decades in human cognition, yet whether language models possess it remains untested. We introduce TRIAGE, an evaluation framework in which a model receives a task pool and a token budget calibrated to its own baseline cost, and commits to a single ordered plan that jointly encodes selection, sequencing, and per-problem allocation. Plans are scored against an oracle with full knowledge of the model's solvability and cost on each problem, yielding a triage efficiency ratio on a common scale. We evaluate frontier and open-source models, with and without reasoning enabled, across competition mathematics, graduate-level science, code generation, and expert multidisciplinary knowledge, and find that current language models exhibit substantial gaps in prospective metacognitive control, revealing a previously unmeasured capability dimension with direct implications for resource-efficient agent deployment.
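The scoring idea in this abstract (a committed plan compared against an oracle that knows solvability and cost) can be illustrated with a small sketch. The abstract does not give the exact scoring formula, so everything below is an assumption for illustration: each task is worth one point, a task counts as solved only if it is solvable and its allocation covers its cost, and the oracle simply packs the cheapest solvable tasks into the budget. All names and numbers are hypothetical.

```python
def plan_value(tasks, plan, budget):
    """Count tasks solved by an ordered plan of (task_id, allocated_tokens)."""
    spent = 0
    solved = 0
    for task_id, allocation in plan:
        if spent + allocation > budget:
            break  # budget exhausted; later plan entries are never attempted
        spent += allocation
        task = tasks[task_id]
        if task["solvable"] and allocation >= task["cost"]:
            solved += 1
    return solved


def oracle_value(tasks, budget):
    """Best achievable count: with unit-value tasks, cheapest-first is optimal."""
    spent = 0
    solved = 0
    for cost in sorted(t["cost"] for t in tasks.values() if t["solvable"]):
        if spent + cost > budget:
            break
        spent += cost
        solved += 1
    return solved


def triage_efficiency(tasks, plan, budget):
    """Plan value divided by oracle value (0.0 by convention if the oracle solves nothing)."""
    best = oracle_value(tasks, budget)
    return plan_value(tasks, plan, budget) / best if best else 0.0


tasks = {
    "a": {"solvable": True, "cost": 300},
    "b": {"solvable": False, "cost": 500},
    "c": {"solvable": True, "cost": 400},
    "d": {"solvable": True, "cost": 900},
}
plan = [("b", 500), ("a", 300), ("c", 200)]
print(triage_efficiency(tasks, plan, budget=1000))
```

In this toy case the plan wastes tokens on an unsolvable task and under-allocates to another, so it solves one task where the oracle would solve two.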

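2605.13412 2026-05-14 cs.CL cs.AI Updated version

LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

Galadrielle Humblot-Renaux, Mohammad N. S. Jahromi, Rohat Bakuri-Jørgensen, Marieke Anne Heyl, Asta S. Stage Jarlner, Maria Vlachou, Anna Murphy Høgenhaug, Desmond Elliott, Thomas Gammeltoft-Hansen, Thomas B. Moeslund

Affiliations * Visual Analysis and Perception Lab; Pioneer Center for AI; Center of Excellence for Global Mobility Law; Department of Computer Science

AI summary This study examines the performance and errors of off-the-shelf large language models (LLMs) when automatically annotating credibility assessments in Danish asylum decision texts. It introduces RAB-Cred, a high-quality Danish legal text classification dataset, and systematically evaluates multiple model and prompt combinations in zero-shot and few-shot settings. The study exposes inconsistencies and error patterns in the annotations of top-performing models, underlines the limitations of relying on a single model's predictions, and notes that LLMs remain imperfect annotation tools in specialized domains such as law, where they need to be combined with human judgment and finer-grained evaluation.

Comments Accepted at the 20th Linguistic Annotation Workshop (LAW XX), co-located with ACL 2026 (https://sigann.github.io/LAW-XX-2026/)

Details
English abstract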

Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred

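2605.13391 2026-05-14 cs.AI Updated version

RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents

Liangtian Liu, Zeyuan Wang, Ziyu Li, Kai Ouyang, Zichao Tang, Chengfu Liu, Haifeng Li, Hanwen Yu, Wentao Yang, Cheng Yang, Dongyang Hou

Affiliations * School of Geosciences and Info-Physics, Central South University; School of Resources and Environment, University of Electronic Science and Technology of China; School of Earth Sciences and Spatial Information Engineering, Hunan University of Science and Technology; Sanya Institute of Hunan University of Science and Technology

AI summary With the development of multimodal large language models, remote sensing intelligence is shifting from "perception" to "action", but existing remote sensing agents still select tools passively and struggle to dynamically balance context load against toolset completeness in complex tasks. To address this, the paper proposes RS-Claw, an active exploration architecture based on hierarchical skill trees. Skill encapsulation describes tools hierarchically, so the agent can load tool information progressively and on demand, which frees considerable context space and raises the hit rate of critical tools. Experiments show that RS-Claw performs well on the Earth-Bench benchmark, effectively compressing input tokens and outperforming existing methods.

Details
English abstract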

The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "see" to "action", as OpenClaw-style frameworks enable agents to autonomously operate massive RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or retrieval-augmented generation (RAG). However, in the massive and multi-source heterogeneous RS tool ecosystem, such passive mechanisms struggle to dynamically balance "context load" and "toolset completeness" throughout task reasoning, thus exhibiting inherent limitations: full tool registration triggers context space deficits during long-horizon tasks, whereas RAG retrieval may omit critical tools in essential steps. To overcome these bottlenecks, this paper redefines tool selection by arguing that the agent should act as an active explorer within the tool space. Based on this perspective, we propose RS-Claw, a novel RS agent architecture. By leveraging Skill encapsulation technology at the tool end, this architecture hierarchically structures tool descriptions, enabling the agent to execute on-demand sequential decision-making: initially selecting relevant skill branches by reading only tool summaries, then dynamically loading detailed descriptions, and ultimately achieving precise invocation. This active paradigm not only significantly liberates the agent's context space but also effectively ensures the accurate hit rate of critical tools during long-horizon reasoning. Systematic experiments on the Earth-Bench benchmark demonstrate that RS-Claw's active exploration mechanism effectively filters semantic noise and substantially frees up reasoning space, achieving an input token compression ratio of up to 86%, and comprehensively outperforming existing Flat and RAG baselines across complex reasoning evaluations.
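The contrast between full tool registration and on-demand loading from a hierarchical skill tree can be sketched in a few lines. This is an illustration under stated assumptions, not the RS-Claw implementation: the tree contents, the tool names, and the use of word counts as a stand-in for token counts are all hypothetical, and the branch is picked by the caller rather than by an agent.

```python
# Hypothetical skill tree: each branch has a short summary and detailed tool descriptions.
SKILL_TREE = {
    "preprocessing": {
        "summary": "correct and clean raw imagery",
        "tools": {
            "atmospheric_correction": "Remove atmospheric scattering and absorption effects from a multispectral scene before analysis.",
            "orthorectify": "Reproject an image onto a terrain model so that pixel positions match map coordinates.",
        },
    },
    "change_detection": {
        "summary": "compare scenes across dates",
        "tools": {
            "image_differencing": "Subtract two co-registered scenes band by band and threshold the result to flag changed pixels.",
            "post_classification_compare": "Classify each date separately and report the pixels whose land-cover class differs between dates.",
        },
    },
    "indices": {
        "summary": "compute spectral indices",
        "tools": {
            "ndvi": "Compute the normalized difference vegetation index from the red and near-infrared bands of a scene.",
            "ndwi": "Compute the normalized difference water index from the green and near-infrared bands of a scene.",
        },
    },
}


def flat_context(tree):
    """Full registration: every detailed tool description goes into the context."""
    return [desc for branch in tree.values() for desc in branch["tools"].values()]


def progressive_context(tree, chosen_branch):
    """Progressive loading: all branch summaries, then details for one branch only."""
    summaries = [branch["summary"] for branch in tree.values()]
    details = list(tree[chosen_branch]["tools"].values())
    return summaries + details


def word_count(texts):
    """Crude proxy for context size (assumption: words stand in for tokens)."""
    return sum(len(text.split()) for text in texts)


print(word_count(flat_context(SKILL_TREE)))
print(word_count(progressive_context(SKILL_TREE, "indices")))
```

The saving grows with the number of branches the agent never needs to open, which is the intuition behind reading summaries first and loading details only for the selected branch.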

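2605.13375 2026-05-14 cs.CV cs.AI Updated version

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao

Affiliations * Institute for AI Industry Research (AIR), Tsinghua University; Juhaokan Technology Co.,Ltd; Nanjing University; University of Science and Technology of China

AI summary In vision-language models (VLMs), processing large numbers of visual tokens incurs high computational overhead. To address this, the paper proposes GRIP-VLM, a reinforcement-learning-based Group-Relative Importance Pruning framework that formulates pruning as a Markov decision process and explores the discrete selection space directly through Group Relative Policy Optimization (GRPO) guided by a supervised warm-up, avoiding the sub-optimal solutions caused by continuous approximations. Combined with a budget-aware scorer, the method dynamically evaluates token importance and adapts to different compression ratios without retraining. Experiments show that it outperforms heuristic and supervised-learning baselines on multiple multimodal benchmarks, delivering up to a 15% inference speedup while maintaining accuracy.

Comments 10 pages, 11 figures

Details
English abstract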

In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
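The "group-relative" part of GRPO refers to scoring each sampled rollout against the other rollouts drawn for the same input, rather than against a learned value baseline. A minimal sketch of that advantage computation follows, assuming scalar rewards for a group of candidate pruning masks. The reward values are made up, and the use of the population standard deviation and of `eps` are choices made for this sketch rather than details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four pruning masks sampled for the same image and prompt.
rewards = [0.62, 0.71, 0.55, 0.71]
print(group_relative_advantages(rewards))
```

Masks that score above the group average get positive advantages and are reinforced; if every mask in the group earns the same reward, all advantages are zero and the group contributes no gradient signal.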

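2605.13367 2026-05-14 cs.LO cs.AI cs.DB Updated version

A Horn extension of DL-Lite with NL data complexity

Janos Arpasi, Bartosz Jan Bednarczyk, Magdalena Ortiz

Affiliations * Institute of Logic and Computation, TU Wien; Computer Science Department, University of Wrocław

AI summary This paper studies how to extend the DL-Lite description logic to support richer ontology expressiveness while keeping data complexity within NL (nondeterministic logarithmic space). To this end, the authors introduce a stratification mechanism that controls the interaction between conjunction and recursion in ELI, obtaining a description logic called ELbotpreceq that strictly extends DL-Lite, supports reachability axioms and restricted conjunction, and allows reasoning in NL. By rewriting it into nested two-way regular path queries (a fragment of GQL), the paper establishes an NL upper bound on data complexity, opening new possibilities for extending OMQA to graph query languages.

Comments Submitted to Description Logic Workshop 2025. Full version in preparation

Details
English abstract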

The literature on ontology-mediated query answering (OMQA) has been shaped by two key results: first-order rewritability for DL-Lite, and PTime-hardness of data complexity for essentially every description logic beyond it. This has effectively positioned DL-Lite as the only practical choice for query rewriting, restricting OMQA solutions to first-order queries and ontologies that can be rewritten into them. This AC0 vs. PTime dichotomy is especially limiting if we consider that OMQA targets graph-structured data, and that standard graph query languages (including the recent ISO standards GQL and SQL/PGQ) are typically NL-complete. Towards identifying a rich Horn DL that can be rewritten into graph query languages and that can still express many ELI and DL-Lite ontologies, we introduce a stratification mechanism for ELI that controls the interaction between conjunction and recursion. In this way, we obtain ELbotpreceq, a description logic that strictly extends the core DL-Lite, supports reachability axioms and restricted conjunction, and allows for reasoning in NL. We establish the NL upper bound via a rewriting into nested two-way regular path queries, a fragment of GQL, providing initial evidence that our ontology language is a promising candidate for extending OMQA to graph query languages.

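2605.13357 2026-05-14 cs.SE cs.AI Updated version

AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents

Hailin Zhong, Shengxin Zhu

Affiliations * Hong Kong Baptist University; Beijing Normal University

AI summary This paper examines the use of foundation models in autonomous software engineering and argues that agents are unreliable in real development settings not only because of model capability but also because the runtime support system (the harness) is missing. The authors propose the AI Harness Engineering framework, which defines eleven component responsibilities and, through a four-level ladder of progressively richer runtime support (H0-H3) and a trace-based evaluation protocol, makes agent behavior auditable and verifiable. The framework shifts the central question of autonomous software engineering from "can the model produce a patch" to "can the model-harness-environment system produce a verifiable, traceable, and maintainable code change".

Comments 16 pages

Details
English abstract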

Foundation models have transformed automated code generation, yet autonomous software-engineering agents remain unreliable in realistic development settings. The dominant explanation locates this gap in model capability. We propose a different locus: software-engineering capability emerges from a model-harness-environment system, in which a runtime substrate -- the harness -- mediates how a foundation-model agent observes a project, acts on it, receives feedback, and establishes that a change is complete. We formalize this substrate as an AI Harness Engineering and identify eleven component responsibilities: task specification, context selection, tool access, project memory, task state, observability, failure attribution, verification, permissions, entropy auditing, and intervention recording. We operationalize the harness through a four-level ladder (H0-H3) that progressively exposes runtime support to the agent, and we propose a trace-based evaluation protocol that converts each agent run into an auditable episode package. Applied to a controlled validation task, the framework yields episode packages whose evidence structure varies systematically with harness level: lower levels produce only a final patch, higher levels produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports. The framework reframes the central question of autonomous software engineering from whether a foundation model can produce a patch to whether the model-harness-environment system can produce a verifiably correct, attributed, and maintainable change. We outline a research program for the runtime systems that foundation-model software agents will require.

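2605.13345 2026-05-14 cs.AI cs.MA Updated version

Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin

Markus Wenzel, Tobias Strapatsas, Jessika Kress, Dorothea Sauer, Nele Gessler, Horst K. Hahn

Affiliations * Constructor University; Fraunhofer Institute for Digital Medicine MEVIS; Asklepios Kliniken Hamburg GmbH

AI summary To address the challenges emergency departments face in patient care and resource management, this study proposes a hybrid simulation approach combining Discrete Event Simulation (DES) and Agent-Based Modeling (ABM) to build a highly configurable digital twin of an emergency department. The model is validated across different department sizes, patient loads, and staffing levels and compared against real-world data, showing that it can effectively reproduce operational dynamics in realistic emergency settings. The study also introduces a multi-agent system, based on a temporal ledger of event records, that can autonomously explore resource allocation strategies, providing a capable simulation tool for optimizing emergency department resources.

Details
English abstract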

Emergency departments (EDs) face challenges in patient care and resource management. We propose to explore optimization strategies in a realistic and flexible model and develop a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) simulating highly configurable ED environments. We specifically focus on the validation of the modeling approach. We derive configurations for ED sizes, patient load, and staffing from real-world studies. We then validate the model expressivity by matching its key performance indicators and metrics with their values known from literature. We proceed by implementing scientifically established and practice-proven resource optimization strategies. Comparing the documented real-world outcomes with our model's results demonstrates that the DES-ABM based simulation can effectively replicate real-world ED dynamics under interventions. We lastly integrate a proof-of-concept multi-agent system (MAS) that can autonomously explore resource allocation strategies within the simulated ED environment based on a temporal ledger of ED event records. This modular DES-ABM-MAS framework offers a powerful tool to explore resource optimization strategies in emergency departments.
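To make the discrete-event side of such a model concrete, here is a heavily simplified sketch of a first-come-first-served queue with a configurable number of doctors. It is not the authors' DES-ABM model: the function name `mean_wait`, the fixed service time, and the arrival times are assumptions chosen so the example stays small and deterministic.

```python
import heapq


def mean_wait(arrivals, service_time, n_doctors):
    """Average waiting time when patients are served first come, first served."""
    free_at = [0.0] * n_doctors  # min-heap of times at which each doctor becomes free
    heapq.heapify(free_at)
    waits = []
    for arrival in sorted(arrivals):
        doctor_free = heapq.heappop(free_at)   # earliest available doctor
        start = max(arrival, doctor_free)      # wait if that doctor is still busy
        waits.append(start - arrival)
        heapq.heappush(free_at, start + service_time)
    return sum(waits) / len(waits)


arrivals = [0, 2, 4, 6]  # hypothetical arrival times in minutes
for doctors in (1, 2):
    print(doctors, mean_wait(arrivals, service_time=5, n_doctors=doctors))
```

Comparing staffing levels in this way is the simplest form of the resource-allocation question the abstract describes; a realistic model would add stochastic arrivals, triage priorities, and multiple treatment stages.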

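2605.11231 2026-05-14 cs.LG cs.AI Updated version

LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection

Abhishek Moturu, Anna Goldenberg, Babak Taati

Affiliations * Department of Computer Science; University of Toronto; The Hospital for Sick Children UHN; KITE Research Institute; T-CAIREM Vector Institute; Vector Institute

AI summary This paper proposes LiBaGS, a lightweight synthetic data selection method that picks representative synthetic samples for a specific task to fill gaps in the training distribution. The method combines several signals (decision-boundary distance, predictive uncertainty, real-data density, and support validity) to select samples that are informative and close to the real data manifold. Using a boundary-gap allocation rule and a marginal-value stopping criterion, LiBaGS efficiently selects samples from sparse but realistic boundary neighborhoods, improving model accuracy on downstream tasks.

Details
English abstract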

Synthetic data is useful only when the added samples fill missing parts of the training distribution that matter for the downstream task. We introduce LiBaGS, a lightweight, generator-agnostic method for targeted synthetic training data selection. LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity, so that selected samples are both informative and likely to remain on the real data manifold. We then use a boundary-gap allocation rule that targets sparse but realistic decision-boundary neighborhoods, rather than simply adding more data or selecting only the most uncertain candidates. LiBaGS also learns when enough synthetic samples have been added through a marginal-value stopping rule, assigns softer labels near ambiguous boundaries, and uses a diversity objective to avoid redundant near-duplicate selections. Experiments show that LiBaGS improves accuracy over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.

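2605.10888 2026-05-14 cs.LO cs.AI Updated version

Shields to Guarantee Probabilistic Safety in MDPs

Linus Heck, Filip Macák, Roman Andriushchenko, Milan Češka, Sebastian Junges

Affiliations * Radboud University; Brno University of Technology

AI summary This paper studies how shielding can guarantee probabilistic safety in Markov decision processes (MDPs). Classical shielding aims to avoid dangerous events entirely, but its strong safety and permissiveness guarantees are hard to preserve in settings that tolerate some probability of risk. The authors present a formal framework that extends classical shielding to probabilistic safety, demonstrate that the strong guarantees cannot be preserved, and provide natural shields with weaker guarantees as well as offline and online shield constructions that ensure strong safety. Experiments confirm the practicality and computational feasibility of the new approach.

Comments Accepted to CAV 2026

Details
English abstract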

Shielding is a prominent model-based technique to ensure safety of autonomous agents. Classical shielding aims to ensure that nothing bad ever happens and comes with strong guarantees about safety and maximal permissiveness. However, shielding systems for probabilistic safety, where something bad is allowed to happen with an acceptable probability, has proven to be more intricate. This paper presents a formal framework that conservatively extends classical shields to probabilistic safety. In this framework, we (i) demonstrate the impossibility of preserving the strong guarantees on safety and permissiveness, (ii) provide natural shields with weaker guarantees, and (iii) introduce offline and online shield constructions ensuring strong safety guarantees. The empirical evaluation highlights the practical advantages of the new shields, as well as their computational feasibility.

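2605.09505 2026-05-14 cs.AI Updated version

EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

Yuyang Dai, Zheng Chen, Jathurshan Pradeepkumar, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai, Yushun Dong

Affiliations * Florida State University; The University of Osaka; University of Illinois Urbana-Champaign

AI summary This study presents EpiGraph, a large-scale epilepsy knowledge graph and evaluation benchmark intended to improve evidence-based clinical reasoning. EpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets, on top of which five clinical tasks are defined for evaluating model performance. Experiments show that large language models augmented with EpiGraph improve markedly on multiple tasks, with gains of 30% to 41% in pharmacogenomic reasoning, supporting the value of structured knowledge for clinical reasoning.

Details
English abstract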

Epilepsy diagnosis and treatment require evidence-intensive reasoning across heterogeneous clinical knowledge, including biosignal patterns, genetic mechanisms, pharmacogenomics, treatment strategies, and patient outcomes. In this work, we present EpiGraph, a large-scale epilepsy knowledge graph and benchmark for evaluating knowledge-augmented clinical reasoning. EpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. Built upon this graph, EpiBench defines five clinically motivated tasks spanning clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. We evaluate six LLMs under both standard and Graph-RAG settings. Results show that integrating EpiGraph consistently improves performance across all tasks, with the largest gains observed in pharmacogenomic reasoning (+30-41%). Our findings demonstrate that structured epilepsy knowledge substantially enhances evidence-grounded clinical reasoning and provides a practical benchmark framework for evaluating knowledge-augmented LLMs in real-world neurological settings. Our code is available at: https://github.com/LabRAI/EEG-KG.

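2605.09415 2026-05-14 cs.AI Updated version

Strategic commitments shape collective cybersecurity under AI inequality

Adeela Bashir, Zia Ush Shamszaman, Zhao Song, Matjaz Perc, The Anh Han

Affiliations * Government agencies; Cybersecurity vendors; Regulated institutions

AI summary As AI is widely adopted in cybersecurity, the balance of power between attackers and defenders is changing. This paper studies the security risks that arise when access to AI defence tools is uneven and resource-limited defenders cannot protect their systems effectively, and proposes that introducing committed defenders together with targeted subsidies can substantially improve overall defence and strengthen system resilience. The study also shows that this strategy improves defenders' security payoffs while limiting attackers' gains, providing theoretical support for cybersecurity policy in AI-driven environments.

Comments 26 pages, 16 figures

Details
English abstract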

The growing integration of AI into cybersecurity is reshaping the balance between attackers and defenders. When access to advanced AI-enabled defence tools is uneven, resource-limited defenders may be unable to adopt effective protection, creating persistent system vulnerabilities. We study the impact of differential AI access using an evolutionary game-theoretic model in a finite population. We first show that when high-capability defence is costly, the population is driven toward low-cost, weak-defence behaviour, sustaining attacks and weakening long-run security. To address this problem, we introduce differential access to AI defence tools by allowing defenders to choose between low- and high-capability protection based on their resources. We then examine the role of a small group of committed defenders who always adopt strong defence and influence others through social learning. Although commitment increases the prevalence of strong defence, it alone cannot stabilise secure outcomes due to high defence costs. We therefore incorporate a targeted subsidy to remove the cost disadvantage from committed defenders. Our analysis shows that subsidised commitment significantly increases strong defence adoption, suppresses successful attacks, and improves overall system resilience. Simulations across a broad parameter space confirm that subsidies consistently outperform commitment alone. In addition, social-welfare analysis shows improved defender outcomes while keeping attacker gains low. These findings suggest that targeted support for key defenders can be an effective mechanism for stabilising cybersecurity in AI-driven environments and provide a theoretical bridge between cybersecurity policy, AI governance, and strategic allocation of defensive AI capabilities.
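Social learning in finite-population evolutionary game models is often implemented with a pairwise comparison (Fermi) rule, in which an individual copies a role model's strategy with a probability that grows with the payoff difference. The abstract does not say which update rule the authors use, so the sketch below is an assumption for illustration only; the payoff numbers, the selection intensity `beta`, and the subsidy value are made up.

```python
import math


def imitation_probability(own_payoff, model_payoff, beta=1.0):
    """Fermi rule: probability of copying a role model, given both payoffs."""
    return 1.0 / (1.0 + math.exp(-beta * (model_payoff - own_payoff)))


# Hypothetical payoffs: strong defence protects better but is costly.
weak_payoff = 2.0
strong_benefit = 3.0
strong_cost = 2.5
subsidy = 2.5  # assumed to offset the full cost for committed defenders

p_without = imitation_probability(weak_payoff, strong_benefit - strong_cost)
p_with = imitation_probability(weak_payoff, strong_benefit - strong_cost + subsidy)
print(p_without, p_with)
```

Under this rule a weak defender rarely imitates a committed strong defender whose net payoff is lower, and does so more often once a subsidy removes the cost disadvantage, which matches the qualitative story in the abstract.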

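2605.09134 2026-05-14 cs.AI cs.SE Updated version

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

Yuanhao Li, Hongbo Wang, Xiaotang Shang, Xunzhu Tang, Yiming Cao, Xuhong Chen

Affiliations * State Key Laboratory of Networking and Switching Technology; Beijing University of Posts and Telecommunications; University of Luxembourg

AI summary BoostAPR is an execution-grounded reinforcement learning framework that addresses sparse feedback and overly coarse reward granularity in program repair. The method proceeds in three stages: supervised fine-tuning on execution-verified demonstrations with reasoning traces; training dual reward models from execution outcomes, one assessing repairs at the sequence level and one at the line level; and PPO optimization in which line-level rewards are redistributed to critical edit regions. Experiments show that BoostAPR achieves strong repair results on multiple benchmarks and generalizes well across languages.

Comments 21 pages, 2 figures. Accepted at ICML 2026

Details
English abstract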

Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models--a sequence-level assessor and a line-level credit allocator--from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, achieving competitive results among open-source models with strong cross-language generalization.

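2605.06651 2026-05-14 cs.AI Updated version

AI co-mathematician: Accelerating mathematicians with agentic AI

Daniel Zheng, Ingrid von Glehn, Yori Zwols, Iuliya Beloshapka, Lars Buesing, Daniel M. Roy, Martin Wattenberg, Bogdan Georgiev, Tatiana Schmidt, Andrew Cowie, Fernanda Viegas, Dimitri Kanevsky, Vineet Kahlon, Hartmut Maennel, Sophia Alj, George Holland, Alex Davies, Pushmeet Kohli

Affiliations * Google

AI summary This paper introduces the "AI co-mathematician", an intelligent workbench that assists mathematicians with open-ended research. Through asynchronous, stateful interaction, the system supports each stage of mathematical research, including ideation, literature search, computational exploration, and theorem proving, and can manage uncertainty, track failed hypotheses, and output native mathematical artifacts. Experiments show that the system not only makes mathematical research more efficient but also achieves strong results on several hard problem-solving benchmarks.

Comments 23 pages; several citations added

Details
English abstract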

We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building. By providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts, the system mirrors human collaborative workflows. In early tests, the AI co-mathematician helped researchers solve open problems, identify new research directions, and uncover overlooked literature references. Besides demonstrating a highly interactive paradigm for AI-assisted mathematical discovery, the AI co-mathematician also achieves state of the art results on hard problem-solving benchmarks, including scoring 48% on FrontierMath Tier 4, a new high score among all AI systems evaluated.

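2605.04557 2026-05-14 cs.CV cs.AI Updated version

Efficient Geometry-Controlled High-Resolution Satellite Image Synthesis

Vlad Vasilescu, Daniela Faur, Teodor Costachioiu

Affiliations * Univ. POLITEHNICA Bucharest SIGMA Lab, CAMPUS Institute; Univ. POLITEHNICA Bucharest GEOSENSE, CAMPUS Institute

AI summary This paper studies how to efficiently generate geometry-controlled high-resolution satellite images, addressing the scarcity and high cost of such imagery, which hampers the development and testing of models for land-cover classification, change detection, and disaster monitoring. The authors propose a method built on existing pre-trained diffusion models that introduces windowed cross-attention modules and controls the generation process using only skip-connection features, keeping the approach simple and efficient. Experiments show that the method performs comparably to existing control techniques while aligning better with the geometry control map. The paper also points out limitations of current evaluation approaches and stresses the importance of consistent alignment assessment.

Comments 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)

Details
English abstract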

High-resolution satellite images are often scarce and costly, especially for remote areas or infrequent events. This shortage hampers the development and testing of machine learning models for land-cover classification, change detection, and disaster monitoring. In this paper, we tackle the problem of geometry-controlled high-resolution satellite image synthesis by adding control over existing pre-trained diffusion models. We propose a simple yet efficient method for controlling the synthesis process by leveraging only skip connection features using windowed cross-attention modules. Several previously established control techniques are compared, indicating that our method achieves comparable performance while leading to a better alignment with the geometry control map. We also discuss the limitations in current evaluation approaches, amplifying the necessity of a consistent alignment assessment.

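2604.25774 2026-05-14 cs.CL cs.AI Updated version

CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation

Wei-Chun Chen, Yu-Xuan Chen, I-Fang Chung, Ying-Jia Lin

Affiliations * CGU-ILALab

AI summary This paper studies the challenging problem of accurately estimating nutrients from unstructured recipe text, comparing a range of traditional and large language model (LLM) based techniques. It finds that traditional methods such as TF-IDF have an advantage in inference speed but limited accuracy, while LLM-based few-shot inference and hybrid approaches achieve the best nutrient estimation accuracy, mainly thanks to their handling of ambiguous terminology and non-standard units. These methods also incur higher computational latency, highlighting the practical deployment trade-off between real-time efficiency and precision.

Comments Accepted by the Third Workshop on Patient-oriented Language Processing (CL4Health) at LREC 2026

Journal ref http://lrec-conf.org/proceedings/lrec2026/workshops/cl4health/2026.cl4health-1.0.pdf

Details
English abstract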

Accurate nutrient estimation from unstructured recipe text is an important yet challenging problem in dietary monitoring, due to ambiguous ingredient terminology and highly variable quantity expressions. We systematically evaluate models spanning a wide range of representational capacity, from lexical matching methods (TF-IDF with Ridge Regression), to deep semantic encoders (DeBERTa-v3), to generative reasoning with large language models (LLMs). Under the strict tolerance criteria defined by EU Regulation 1169/2011, our empirical results reveal a clear trade-off between predictive accuracy and computational efficiency. The TF-IDF baseline achieves moderate nutrient estimation performance with near-instantaneous inference, whereas the DeBERTa-v3 encoder performs poorly under task-specific data scarcity. In contrast, few-shot LLM inference (e.g., Gemini 2.5 Flash) and a hybrid LLM refinement pipeline (TF-IDF combined with Gemini 2.5 Flash) deliver the highest validation accuracy across all nutrient categories. These improvements likely arise from the ability of LLMs to leverage pre-trained world knowledge to resolve ambiguous terminology and normalize non-standard units, which remain difficult for purely lexical approaches. However, these gains come at the cost of substantially higher inference latency, highlighting a practical deployment trade-off between real-time efficiency and nutritional precision in dietary monitoring systems.

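2604.23018 2026-05-14 cs.CV cs.AI cs.LG Updated version

AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

Mohammad Sadegh Salehi, Alex Perkins, Igor Maurell, Ashkan Dabbagh, Raymond Wong

Affiliations * Zero One Creative

AI summary This study presents AmaraSpatial-10K, a 3D dataset intended to address the difficulty of deploying existing large-scale 3D assets in spatial computing and embodied AI applications. The dataset contains over 10,000 optimized synthetic 3D assets, each with accurate metric scale, a deterministic anchor, separated physically based material maps, and multi-sentence text metadata, so that assets can be used directly. The study also introduces a reusable evaluation suite, under which the assets show markedly better results in image retrieval, physics simulation, and cross-modal alignment.

Details
English abstract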

Web-scale 3D asset collections are abundant but rarely deployment-ready, suffering from arbitrary metric scaling, incorrect pivots, brittle geometry, and incomplete textures, defects that limit their use in embodied AI, robotics, and spatial computing. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets optimised for zero-shot deployment. Each asset ships as a metric-scaled, deterministically anchored .glb with separated PBR maps, a convex collision hull, a paired reference image, and multi-sentence text metadata. Alongside the dataset we introduce a reusable evaluation suite for 3D asset banks, comprising a continuous Scale Plausibility Score (SPS), an LLM Concept Density metric, anchor-error auditing, and a cross-modal CLIP coherence protocol, and apply it to AmaraSpatial-10K alongside matched subsets of Objaverse, HSSD, ABO, and GSO. AmaraSpatial-10K improves CLIP Recall@5 by 3.4× over Objaverse (0.612 vs. 0.181, median rank 267 → 3), achieves a 99.1% physics-stability rate under Habitat-Sim with a roughly 20× wall-time speed-up, and produces zero-overlap scenes when used as a drop-in asset bank for Holodeck. Controlled ablations on the same asset bank attribute the retrieval gain to description richness.
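The retrieval figures quoted above (Recall@5 and median rank) are standard summaries of where the correct item lands in each ranked result list. A small sketch of how they are computed follows, assuming 1-indexed ranks of the correct asset for each query; the rank values are invented for illustration and are unrelated to the paper's data.

```python
from statistics import median


def recall_at_k(ranks, k):
    """Fraction of queries whose correct item appears within the top k results."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical rank of the correct asset for five text queries (1 = best).
ranks = [1, 3, 7, 2, 40]
print(recall_at_k(ranks, 5), median(ranks))
```

Recall@5 ignores how far outside the top five a miss lands, while the median rank captures the typical position, which is why the two are often reported together.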

2604.22966 2026-05-14 cs.CY cs.AI Version update

Institutions for the Post-Scarcity of Judgment

Lauri Lovén

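Affiliations * Future Computing Group, University of Oulu

AI summary This opinion piece examines an inversion of the "scarcity of judgment" brought about by the AI revolution: as AI advances, the cost of producing competent-looking judgment approaches zero, while four complements become scarce: verified signal, legitimacy, authentic provenance, and integration capacity. It argues that traditional institutions (courts, journals, legislatures) now compete with AI for the role of manufacturing legitimate judgment, and proposes a three-move agenda: reframe AI policy as institutional redesign, build provenance and verification as public infrastructure, and develop formal tools for institutional composition under strategic agents.

Comments 5 pages, 9 references. Submitted to Communications of the ACM (Opinion section). Comments welcome

Details
Abstract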

Each major technological revolution inverts a particular scarcity and rebuilds institutions around the shift. The near-consensus diagnosis of the AI revolution holds that AI collapses the cost of prediction while judgment remains scarce. This Opinion argues the inversion has now flipped: competent-looking judgment (selecting, ranking, attributing, certifying) is produced at scale and at marginal cost approaching zero, and four complements become scarce: verified signal, legitimacy, authentic provenance, and integration capacity (the community's tolerance for delegated cognition). Because judgment is the substance of institutions, the institutions built to manufacture legitimate judgment (courts, journals, licensing bodies, legislatures) now compete with the technology for the same functional role. The piece traces the pattern across scientific institutions, professional licensing, intellectual property, democratic legitimacy, and foundation-model concentration, and closes with a three-move agenda: reframe AI policy as institutional redesign, build provenance and verification as commons, and develop the formal apparatus for institutional composition under strategic agents.

2604.21496 2026-05-14 cs.AI cs.CL cs.CY Version update

How English Print Media Frames Human-Elephant Conflicts in India

Bonala Sai Punith, Salveru Jayati, Garima Shakya, Shubham Kumar Nigam

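Affiliations * Chhattisgarh Forest Department

AI summary This paper studies how English-language print media in India reports on human-elephant conflict (HEC). Analysing 28,986 sentences from 1,968 news articles published between January 2022 and September 2025, it finds that coverage is dominated by fear-inducing and aggression-related language, which may reinforce public hostility toward elephants and undermine human-wildlife coexistence efforts. The study uses a multi-model sentiment framework combining long-context transformers, large language models, and a domain-specific lexicon to quantify sentiment, extract rationale sentences, and identify linguistic patterns, offering a scalable methodology in support of responsible wildlife reporting.

Details
Abstract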

Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.

2604.10720 2026-05-14 cs.AI cs.CL cs.CY Version update

Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation

Charles Koutcheme, Juho Leinonen, Arto Hellas

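AI summary This paper proposes a framework for training open-weight models that simulate programming learners. It converts process data from real students into a conversational format that represents the interaction between a student and an automated assessment system. The approach combines supervised fine-tuning with preference optimization so that the model more closely matches authentic student debugging behavior. Experiments show that it outperforms prior code-only models and prompted large language models in functional alignment and code similarity.

Comments 8 pages, 2 figures, 2 tables. Accepted to Educational Data Mining 2026

Details
Abstract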

Artificial students -- models that simulate how learners act and respond within educational systems -- are a promising tool for evaluating tutoring strategies and feedback mechanisms at scale. However, most existing approaches rely on prompting large, proprietary language models, limiting adaptability to specific courses and raising concerns around privacy, cost, and dependence. In this work, we propose a framework for training open-weight artificial programming learners directly from authentic student process data. Our approach serializes temporal log traces into a conversational format, representing each student's problem-solving process as a dialogue between the learner and their automated assessment system. Student code submissions and environment feedback, such as test outcomes, grades, and error traces, form alternating conversational turns, enabling models to learn from the iterative debugging process. We additionally introduce a training pipeline combining supervised fine-tuning with preference optimization to align models with authentic student debugging behavior. We evaluate our framework by training Qwen models at 4B and 8B scales on a large-scale dataset of real student submissions to Python programming assignments. Our results show that incorporating environment feedback strengthens models' ability to replicate student debugging behavior, improving over both prior code-only approaches and prompted large language models baselines in functional alignment and code similarity. We release our code to support reproducibility.

2604.10547 2026-05-14 cs.AI Version update

Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

Wanyi Chen, Xiao Yang, Xu Yang, Tianming Sha, Qizheng Li, Zhuo Wang, Bowen Xian, Fang Kong, Weiqing Liu, Jiang Bian

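Affiliations * Soochow University; Microsoft Research Asia; Peking University; Stony Brook University; The University of Chicago

AI summary This paper introduces Agent² RL-Bench, a compact diagnostic benchmark that evaluates whether large language model (LLM) agents can autonomously design and optimise reinforcement learning (RL) post-training. Agents must train, debug, and evaluate models on their own within a limited budget, across tasks ranging from static rule-based training to closed-loop online RL. Experiments show that although some agents do improve model performance, stable autonomous RL post-training under a fixed budget remains challenging overall; the benchmark offers an evaluation framework for future research.

Comments 37 pages, 7 figures, 20 tables

Details
Abstract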

We introduce Agent^2 RL-Bench, a compact diagnostic benchmark for evaluating agentic RL post-training, which tests whether LLM agents can autonomously design, implement, debug, and execute post-training pipelines that improve foundation models. RL post-training increasingly drives model alignment and specialization, yet existing benchmarks are largely static, rewarding supervised fine-tuning or script generation without assessing an agent's ability to close an interactive RL loop. Agent^2 RL-Bench provides a unified agent-facing interface: each run starts from an isolated workspace containing a base model, task data, instructions, and a grading API, and agents must iterate within a fixed budget by training models and submitting artifacts for evaluation. The benchmark spans six tasks across three levels, from static rule-based training to judge-based optimization and closed-loop online RL with trajectory collection. Two diagnostic skills, namely runtime recording and post-hoc summarization, enable structured analysis of agent behavior, facilitating smooth and effective iteration of the benchmark's evaluation framework. Across five agent systems and six driver LLMs, agents show intelligent behavior but clear limitations: one RL-oriented run improves ALFWorld from 4.85 to 93.28 via SFT warm-up and GRPO with online rollouts, yet DeepSearchQA remains difficult, most successful routes rely on supervised pipelines, and interactive outcomes show large single-run differences across agent stacks. Overall, Agent^2 RL-Bench shows that current agents can sometimes engineer online RL, but stable agent-driven RL post-training remains rare under fixed budgets. It also demonstrates that our benchmark provides a strong and effective evaluation framework for future research in this direction. Code is available at https://github.com/microsoft/RD-Agent/blob/main/rdagent/scenarios/rl/autorl_bench/README.md

2603.27910 2026-05-14 cs.AI cs.IR cs.MA Version update

GAAMA: Graph Augmented Associative Memory for Agents

Swarna Kamal Paul, Shubhendu Sharma, Nitin Sareen

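Affiliations * Nagarro

AI summary GAAMA is a graph-augmented associative memory system for agents that targets long-term memory retention across multi-session interactions. It builds a structured knowledge graph from episode, fact, reflection, and concept nodes, and combines cosine-similarity retrieval with edge-type-aware Personalized PageRank, which avoids the loss of structural relationships and the mega-hub effect seen in earlier approaches. Experiments show that GAAMA outperforms existing methods on multiple tasks, with larger advantages in long dialogues.

Details
Abstract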

AI agents that interact with users across multiple sessions require persistent long-term memory to maintain coherent, personalized behavior. Current approaches either rely on flat retrieval-augmented generation (RAG), which loses structural relationships among memories, or use entity-centric knowledge graphs that suffer from mega-hub effects in conversational data, diluting graph-based relevance propagation. We propose GAAMA, a graph-augmented associative memory for agents that constructs a concept-mediated knowledge graph through a three-step pipeline: (1) verbatim episode preservation, (2) LLM-based extraction of atomic facts and topic-level concept nodes, and (3) synthesis of higher-order reflections. The resulting graph uses four node types (episode, fact, reflection, concept) connected by five structural edge types, with concept nodes providing cross-cutting traversal paths that avoid the mega-hub problem of entity-centric designs. Retrieval combines cosine-similarity-based k-nearest neighbor search with edge-type-aware Personalized PageRank (PPR) through an additive scoring function. We further introduce GRAFT (Graph Repair by Augmenting Facts & Topology), a post-retrieval corrective layer that diagnoses retrieval failures and surgically repairs the knowledge graph. On LoCoMo-10 (1,540 questions, 10 multi-session conversations), GAAMA achieves 79.1% mean reward, a +4.2 pp improvement over a tuned RAG baseline, the strongest comparator. On MemoryArena, GAAMA outperforms full-context baselines across three tasks - Group Travel (+0.4 pp), Web Shopping (+3.4 pp), and Progressive Search (+0.7 pp) - with advantages growing monotonically with dialogue length. Notably, GAAMA delivers consistent performance across all categories, matching the best competing method in each, whereas every competitor degrades in at least one category.
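
As a rough illustration of the retrieval step described in this abstract, the sketch below combines cosine similarity with an edge-type-aware Personalized PageRank through an additive score. The toy graph, the edge-type weights, the embeddings, and the mixing coefficients `w_sim` and `w_ppr` are all hypothetical; the abstract does not give the paper's actual scoring function or parameters, so this should be read as one plausible shape rather than the authors' implementation.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def personalized_pagerank(edges, type_weight, seeds, alpha=0.15, iters=100):
    """Power iteration for PPR on an undirected graph with typed, weighted edges.

    `seeds` maps graph nodes to positive restart weights (assumed to be
    nodes that appear in `edges`).
    """
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    out = {n: [] for n in nodes}
    for u, v, t in edges:
        out[u].append((v, type_weight[t]))
        out[v].append((u, type_weight[t]))
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            z = sum(w for _, w in out[u])
            for v, w in out[u]:
                # Mass flows along edges in proportion to their type weight.
                nxt[v] += (1 - alpha) * rank[u] * w / z
        rank = nxt
    return rank

# Hypothetical toy graph: (source, target, edge_type).
edges = [
    ("episode_1", "fact_1", "derived_from"),
    ("episode_2", "fact_2", "derived_from"),
    ("fact_1", "concept_travel", "about"),
    ("fact_2", "concept_travel", "about"),
]
type_weight = {"derived_from": 1.0, "about": 2.0}
embeddings = {
    "episode_1": [1.0, 0.0], "episode_2": [0.0, 1.0],
    "fact_1": [0.9, 0.1], "fact_2": [0.2, 0.8],
    "concept_travel": [0.5, 0.5],
}
query = [1.0, 0.2]

sim = {n: cosine(query, e) for n, e in embeddings.items()}
# Seed the walk at the k nearest neighbours of the query (k = 2 here).
top_k = sorted(sim, key=sim.get, reverse=True)[:2]
ppr = personalized_pagerank(edges, type_weight, {n: sim[n] for n in top_k})

# Additive score with hypothetical mixing coefficients.
w_sim, w_ppr = 1.0, 1.0
score = {n: w_sim * sim[n] + w_ppr * ppr[n] for n in embeddings}
ranking = sorted(score, key=score.get, reverse=True)
```

In this toy setup the concept node lets relevance reach `fact_2` even though its embedding is far from the query, which is the kind of cross-cutting path the abstract attributes to concept nodes.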

2603.23777 2026-05-14 cs.RO cs.AI cs.SY eess.SY Version update

Human-in-the-Loop Pareto Optimization: Trade-off Characterization for Assist-as-Needed Training and Performance Evaluation

Harun Tolasa, Volkan Patoglu

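Affiliations * Faculty of Engineering and Natural Sciences

AI summary During human motor skill training and rehabilitation there is an inherent trade-off between task difficulty and user performance, and characterising it accurately is essential for evaluating performance and designing assist-as-needed (AAN) protocols. This paper proposes a human-in-the-loop Pareto optimization method that combines a quantitative performance metric with a qualitative measure of perceived challenge to characterise the trade-off between task performance and perceived challenge level systematically and efficiently. Validated through a user study and three use cases, the method can be used to design and evaluate AAN training protocols and to fairly assess individual training progress and between-user performance differences under different assistance levels.

Comments Under review for publication in IEEE Transactions on Haptics

Details
Abstract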

During human motor skill training and physical rehabilitation, there is an inherent trade-off between task difficulty and user performance. Characterizing this trade-off is crucial for evaluating user performance, designing assist-as-needed (AAN) protocols, and assessing the efficacy of training protocols. In this study, we propose a novel human-in-the-loop (HiL) Pareto optimization approach to characterize the trade-off between task performance and the perceived challenge level of motor learning or rehabilitation tasks. We adapt Bayesian multi-criteria optimization to systematically and efficiently perform HiL Pareto characterizations. Our HiL optimization employs a hybrid model that measures performance with a quantitative metric, while the perceived challenge level is captured with a qualitative metric. We demonstrate the feasibility of the proposed HiL Pareto characterization through a user study. Furthermore, we present the utility of the framework through three use cases in the context of a manual skill training task with haptic feedback. First, we demonstrate how the characterized trade-off can be used to design a sample AAN training protocol for a motor learning task and to evaluate the group-level efficacy of the proposed AAN protocol relative to a baseline adaptive assistance protocol. Second, we demonstrate that individual-level comparisons of the trade-offs characterized before and after the training session enable fair evaluation of training progress under different assistance levels. This evaluation method is more general than standard performance evaluations, as it can provide insights even when users cannot perform the task without assistance. Third, we show that the characterized trade-offs also enable fair performance comparisons among different users, as they capture the best possible performance of each user under all feasible assistance levels.

2603.22267 2026-05-14 cs.CL cs.AI eess.AS Version update

TiCo: Time-Controllable Spoken Dialogue Model

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass

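Affiliations * MIT (Massachusetts Institute of Technology); NTU (National Taiwan University); NTU AI-CoRE

AI summary This paper proposes TiCo, a time-controllable spoken dialogue model that generates spoken responses of controllable duration in response to time-constrained instructions (e.g., "generate a response of about 15 seconds"). To address the lack of time awareness in existing models, the authors introduce TiCo-Bench, the first benchmark for time controllability, and use Spoken Time Markers (STM) to help the model estimate elapsed time during generation and adjust its content to meet the target duration. Experiments show that TiCo, post-trained efficiently without question-answer paired data through self-generation and reinforcement learning with verifiable reward, markedly improves duration control accuracy while preserving response quality.

Details
Abstract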

We introduce TiCo, a time-controllable spoken dialogue model (SDM) that follows time-constrained instructions (e.g., "Please generate a response lasting about 15 seconds") and generates spoken responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions. To systematically evaluate this, we introduce TiCo-Bench, the first benchmark for time-controllable instruction following in SDMs, on which existing open-source and commercial models frequently fail to satisfy explicit time constraints. TiCo addresses this limitation by enabling an SDM to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is post-trained efficiently without question-answer paired data, relying on self-generation and reinforcement learning with verifiable reward. Experimental results show that TiCo reduces duration error by 2.7x over its backbone and 1.6x over the strongest baseline, while preserving response quality.

2603.05093 2026-05-14 cs.LG cs.AI cs.CV Version update

From Baselines to Transport Geodesics: Axiomatic Attribution via Optimal Generative Flows

Cenwei Zhang, Lin Zhu, Manxi Lin, Lei You

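Affiliations * Shanghai Jiao Tong University; Aalto University; Alibaba; Technical University of Denmark

AI summary This paper studies the path selection problem in feature attribution and proposes an attribution method based on optimal generative flows. Rather than relying on hand-designed paths or the sensitivity geometry of the model, the authors select the attribution path from the data-generating process by minimizing the kinetic action of the transport, which yields more stable and structured explanations. They prove uniqueness of the Aumann-Shapley integral for a fixed path and approximate the ideal with Rectified Flow and related methods; experiments show the approach improves attribution stability while keeping deletion faithfulness competitive.

Comments 10 figures, 31 pages

Details
Abstract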

Feature attributions often hide a critical modeling choice: they explain a prediction along a counterfactual path from a reference state to an input. Different baselines, interpolations, and generative trajectories define different paths and can therefore produce different explanations. We study this path ambiguity as a modeling problem. Our central question is whether the path can be chosen by the data-generating transport process, rather than by a hand-designed interpolation or by the sensitivity geometry of the model being explained. We separate attribution into fixed-path credit allocation and path selection. For a fixed path, we prove that the Aumann-Shapley line integral is the unique attribution rule under standard fixed-path axioms and explicit coordinate-trace regularity. For path selection, we minimize kinetic action over flows that transport a reference distribution to the data distribution, yielding a transport-geodesic attribution principle. We approximate this ideal with Rectified Flow and Reflow and derive stability bounds linking vector-field error to attribution error. Experiments show that lower-action, transport-consistent paths produce more stable and structured explanations, preserving competitive deletion faithfulness, without claiming data-manifold membership. Our code is available at https://github.com/cenweizhang/OTFlowSHAP.
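
The path ambiguity described in this abstract can be seen on a small example. The sketch below approximates the line integral of a gradient along two different paths that share the same endpoints; the model `f`, the baseline, and both paths are hypothetical choices made only for illustration, and the midpoint rule stands in for whatever integrator the paper actually uses.

```python
def path_attribution(grad, path, steps=1000):
    """Midpoint-rule approximation of a path-integrated gradient.

    `path` maps t in [0, 1] to a point; attribution i approximates the
    integral of grad_i(path(t)) with respect to path_i(t).
    """
    d = len(path(0.0))
    phi = [0.0] * d
    for k in range(steps):
        t0, t1 = k / steps, (k + 1) / steps
        g = grad(path((t0 + t1) / 2))
        p0, p1 = path(t0), path(t1)
        for i in range(d):
            phi[i] += g[i] * (p1[i] - p0[i])
    return phi

# Hypothetical model f(x) = x0 * x1 + 3 * x2 and its analytic gradient.
def f(x):
    return x[0] * x[1] + 3 * x[2]

def grad_f(x):
    return [x[1], x[0], 3.0]

def straight(t):
    # Linear interpolation from the baseline (0, 0, 0) to the input (1, 2, 1).
    return [t, 2 * t, t]

def curved(t):
    # A different path with the same endpoints.
    return [t, 2 * t * t, t]

phi_straight = path_attribution(grad_f, straight)
phi_curved = path_attribution(grad_f, curved)
```

Both attribution vectors sum to approximately `f([1, 2, 1]) - f([0, 0, 0]) = 5`, but they split the interaction term between the first two features differently (about 1 and 1 on the straight path, about 2/3 and 4/3 on the curved one), which is why the choice of path matters.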

2602.22847 2026-05-14 cs.LG cs.AI stat.ML Version update

Decentralized Ranking Aggregation via Gossip: Convergence and Robustness

Kerrian Le Caillec, Anna Van Elst, Igor Colin, Stephan Clémençon

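Affiliations * LTCI, Télécom-Paris, Institut Polytechnique de Paris

AI summary This paper studies how to reach reliable and robust ranking consensus in a decentralized network. It proposes an approach based on random gossip communication that lets each node compute a global ranking consensus through local interactions only, without central coordination. The approach comes with convergence guarantees, is robust to corrupted nodes, and reduces communication costs, offering a new solution for distributed preference analysis.

Comments 33 pages, 5 figures

Details
Abstract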

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical guarantees in a centralized setting, i.e., when all the ranking data to be aggregated can be brought together in a single computing unit. For many technologies (e.g., peer-to-peer networks, IoT, multi-agent systems), extending the ability to calculate consensus rankings with guarantees of convergence and resilience to potential contamination in a decentralized setting, when preference data is initially distributed across a communicating network, remains a major methodological challenge. Indeed, in recent years, the literature on decentralized computation has mainly focused on computing or optimizing statistics such as arithmetic means using gossip algorithms. The purpose of this article is precisely to study how to achieve reliable and resilient consensus on collective rankings in a decentralized setting, thereby raising new questions, robustness to corrupted nodes, and scalability through reduced communication costs in particular. The approach proposed and analyzed here relies on the robustness guarantees offered by random gossip communication, which allows autonomous agents to compute a global ranking consensus using local interactions only, without coordination or a central authority.
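
For readers unfamiliar with gossip algorithms, the sketch below shows the classical mean-computing variant that the abstract contrasts with ranking aggregation; it is not the paper's ranking method. The complete-graph topology, the node values, and the step count are assumptions made for illustration.

```python
import random

def gossip_average(values, steps, seed=0):
    """Randomized pairwise gossip on a complete graph (assumed topology)."""
    rng = random.Random(seed)
    x = list(values)
    for _ in range(steps):
        i, j = rng.sample(range(len(x)), 2)  # two distinct nodes wake up
        x[i] = x[j] = (x[i] + x[j]) / 2      # both adopt their pairwise mean
    return x

# Each node starts with a local value; no node ever sees the full list.
local_values = [3.0, 1.0, 4.0, 1.0, 5.0]
estimates = gossip_average(local_values, steps=500)
```

Each exchange preserves the sum of the values, so every node's estimate drifts toward the global mean (2.8 here) using local interactions only. Extending this kind of guarantee from means to consensus rankings is the problem the abstract describes.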

2602.22251 2026-05-14 cs.LG cond-mat.mtrl-sci cs.AI Version update

Zatom-1: Towards a Multimodal Foundation Model for 3D Molecules and Materials

Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney

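Affiliations * LBNL (Lawrence Berkeley National Laboratory); ICSI (International Computer Science Institute); University of Cambridge; Yale University; MIT (Massachusetts Institute of Technology); UC Berkeley (University of California, Berkeley)

AI summary This study presents Zatom-1, a general-purpose foundation model that unifies generation and prediction tasks for 3D molecules and materials. Built on a simplified Transformer architecture, it jointly models discrete atom types and continuous 3D structures with a multimodal flow matching objective, enabling cross-domain, multi-task learning. Experiments show that Zatom-1 outperforms or is competitive with specialized baselines on generative and predictive benchmarks, substantially speeds up generative inference, and exhibits positive transfer from generative pretraining on materials to molecular property prediction.

Comments 38 pages, 10 figures, 15 tables. ICLR 2026 FM4Science. Code, data, and model weights are available at https://github.com/Zatom-AI/zatom

Details
Abstract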

General-purpose 3D modeling in chemistry encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generation or prediction), which limits representation sharing and transfer. We introduce Zatom-1, a cross-domain, general-purpose model architecture that unifies generative and predictive learning of 3D molecules and materials. Zatom-1 is a deliberately simplified Transformer trained with a multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries. This approach supports scalable pretraining with predictable gains as model capacity increases, while enabling fast and stable sampling. We use cross-domain generative pretraining as a universal initialization for downstream multi-task prediction of properties, energies, and forces. Empirically, Zatom-1 outperforms or competes with specialized baselines on both multi-task generative and predictive benchmarks in data-controlled settings, while improving generative inference speed by more than an order of magnitude. Our experiments demonstrate positive predictive transfer between data domains from joint generative pretraining: modeling materials during generative pretraining improves molecular property prediction accuracy. Open-source code and model weights are freely available at https://github.com/Zatom-AI/zatom.

2602.16246 2026-05-14 cs.AI Version update

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Yun-Shiuan Chuang, Chaitanya Kulkarni, Alec Chiu, Avinash Thangali, Zijie Pan, Shivani Shekhar, Yirou Ge, Yixi Li, Uma Kona, Linsey Pang, Prakhar Mehrotra

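Affiliations * PayPal AI

AI summary This study proposes a proxy state-based evaluation method for multi-turn tool-calling large language model agents. An LLM-driven simulation infers a structured proxy state without relying on a deterministic backend, lowering the cost of building and iterating on benchmarks. Experiments show that the framework produces stable rankings that differentiate models, stays consistent across inference-time reasoning efforts, supports sensitivity analyses over user personas, and yields reliable automated evaluation.

Details
Abstract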

Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks, such as tau-bench, tau^2-bench, and AppWorld, rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across model families and inference-time reasoning efforts, and its on-/off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates, as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.

2602.07342 2026-05-14 cs.AI Version update

SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management

Shengyue Guan, Yihao Liu, Lang Cao

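Affiliations * Alibaba Group; Peking University; University of Illinois Urbana-Champaign

AI summary This paper presents SupChain-Bench, a unified benchmark for evaluating large language models in real-world supply chain management, focusing on domain knowledge and on long-horizon, multi-step task execution grounded in standard operating procedures (SOPs). The study finds substantial gaps in execution reliability across current models and proposes SupChain-ReAct, an SOP-free framework that autonomously synthesizes executable tool-calling procedures and achieves the strongest and most consistent performance. The work provides a systematic benchmark for studying long-horizon orchestration in real-world settings and points to room for improvement in current supply chain agents.

Details
Abstract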

Large language models (LLMs) have shown promise in complex reasoning and tool-based decision making, motivating their application to real-world supply chain management. However, supply chain workflows require reliable long-horizon, multi-step orchestration grounded in domain-specific procedures, which remains challenging for current models. To systematically evaluate LLM performance in this setting, we introduce SupChain-Bench, a unified real-world benchmark that assesses both supply chain domain knowledge and long-horizon tool-based orchestration grounded in standard operating procedures (SOPs). Our experiments reveal substantial gaps in execution reliability across models. We further propose SupChain-ReAct, an SOP-free framework that autonomously synthesizes executable procedures for tool use, achieving the strongest and most consistent tool-calling performance. Our work establishes a principled benchmark for studying reliable long-horizon orchestration in real-world operational settings and highlights significant room for improvement in LLM-based supply chain agents.

2602.03429 2026-05-14 cs.AI cs.CL cs.HC cs.LG Version update

DiscoverLLM: From Executing Intents to Discovering Them

Tae Soo Kim, Yoonjoo Lee, Jaesang Yu, John Joon Young Chung, Juho Kim

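Affiliations * University of Michigan

AI summary To handle ambiguous and open-ended user requests, this study proposes DiscoverLLM, a framework that trains large language models to help users form and discover intents they have not yet articulated. It introduces a novel user simulator that models the user's cognitive state as a hierarchy of intents and uses the degree of intent concretization as a reward signal for training, so that the model explores when intents are unclear and converges quickly when they are clear. Experiments show that DiscoverLLM improves task performance on several interactive benchmarks while shortening conversations, and a user study found higher satisfaction and efficiency.

Comments Accepted at ICML 2026

Details
Abstract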

To handle ambiguous and open-ended requests, Large Language Models (LLMs) are increasingly trained to interact with users to surface intents they have not yet expressed (e.g., ask clarification questions). However, users are often ambiguous because they have not yet formed their intents: they must observe and explore outcomes to discover what they want. Simply asking "what kind of tone do you want?" fails when users themselves do not know. We introduce DiscoverLLM, a novel and generalizable framework that trains LLMs to help users form and discover their intents. Central to our approach is a novel user simulator that models cognitive state with a hierarchy of intents that progressively concretize as the model surfaces relevant options -- where the degree of concretization serves as a reward signal that models can be trained to optimize. Resulting models learn to collaborate with users by adaptively diverging (i.e., explore options) when intents are unclear, and converging (i.e., refine and implement) when intents concretize. Across proposed interactive benchmarks in creative writing, technical writing, and SVG drawing, DiscoverLLM achieves over 10% higher task performance while reducing conversation length by up to 40%. In a user study with 75 human participants, DiscoverLLM improved conversation satisfaction and efficiency compared to baselines.

2602.02560 2026-05-14 cs.LG cs.AI cs.CV Version update

Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions

Bartlomiej Sobieski, Jakub Grzywaczewski, Karol Dobiczek, Mateusz Wójcik, Tomasz Bartczak, Patryk Szatkowski, Przemysław Bombiński, Matthew Tivnan, Przemyslaw Biecek

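Affiliations * National Lung Screening Trial Research Team

AI summary This study performs a causal verification of the decision mechanism of Sybil, a deep learning model for lung cancer risk prediction, and proposes S(H)NAP, a model-agnostic auditing framework. The method constructs generative interventional attributions, validated by expert radiologists, to systematically analyse causal contributions to the model's risk score. The study finds that although Sybil often behaves like an expert, it still has critical failure modes, including excessive sensitivity to clinically unjustified artifacts and a radial bias.

Comments ICML 2026

Details
Abstract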

Lung cancer remains the leading cause of cancer mortality, driving the development of automated screening tools to alleviate radiologist workload. Standing at the frontier of this effort is Sybil, a deep learning model capable of predicting future risk solely from computed tomography (CT) with high precision. However, despite extensive clinical validation, current assessments rely purely on observational metrics. This correlation-based approach overlooks the model's actual reasoning mechanism, necessitating a shift to causal verification to ensure robust decision-making before clinical deployment. We propose S(H)NAP, a model-agnostic auditing framework that constructs generative interventional attributions validated by expert radiologists. By leveraging realistic 3D diffusion bridge modeling to systematically modify anatomical features, our approach isolates object-specific causal contributions to the risk score. Providing the first interventional audit of Sybil, we demonstrate that while the model often exhibits behavior akin to an expert radiologist, differentiating malignant pulmonary nodules from benign ones, it suffers from critical failure modes, including dangerous sensitivity to clinically unjustified artifacts and a distinct radial bias.

2602.00586 2026-05-14 q-bio.MN cs.AI cs.LG Version update

RAG-GNN: Integrating Retrieved Knowledge with Graph Neural Networks for Precision Medicine

Hasi Hays, William J. Richardson

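Affiliations * Department of Chemical Engineering, University of Arkansas

AI summary This study proposes RAG-GNN, an end-to-end trainable framework that combines graph neural networks (GNNs) with dynamically retrieved knowledge from biomedical literature to improve functional clustering in precision medicine. Using a jointly optimized retrieval projection, a gated fusion mechanism, and contrastive alignment, RAG-GNN improves functional clustering in a cancer signaling case study, and the results indicate that retrieved information benefits both cluster agreement and intra-cluster cohesion. Experiments show that the method improves functional clustering over the GNN-only ablation that relies on graph structure alone, suggesting a new approach to knowledge integration in precision medicine.

Details
Abstract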

Network topology excels at structural predictions but fails to capture functional semantics encoded in biomedical literature. We present RAG-GNN, an end-to-end trainable retrieval-augmented graph neural network framework that integrates GNN representations with dynamically retrieved literature-derived knowledge through a jointly optimized retrieval projection, gated fusion mechanism, and contrastive alignment. In a cancer signaling case study (379 proteins, 3,498 interactions, 14 functional categories), RAG-GNN improves functional clustering from silhouette $= -0.237 \pm 0.065$ (GNN-only) to $-0.144 \pm 0.066$, a consistent improvement of $+0.093 \pm 0.022$ across 10 random seeds, while the learned retrieval achieves mean precision@10 $= 0.242$, a 152% improvement over the random baseline ($0.096$). Heuristic information decomposition with bootstrap confidence intervals reveals that topology and retrieval encode overwhelmingly shared information (95.6%), with retrieval improving both intra-cluster cohesion (silhouette) and cluster agreement (ARI $+0.021 \pm 0.015$). Counterfactual experiments confirm that adversarial, absent, and random retrieval all degrade performance, validating that the gated fusion mechanism depends on document content. Benchmarking against eight established embedding methods demonstrates task-specific complementarity: topology-focused methods achieve strong link prediction, while retrieval augmentation consistently improves functional clustering within the controlled GNN-only ablation. DDR1 subnetwork analysis provides confirmatory validation consistent with established synthetic lethality relationships. These results establish that topology-only and retrieval-augmented approaches serve complementary purposes for precision medicine applications.

2601.21975 2026-05-14 cs.AI cs.ET Version update

Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models

Pranav Mahajan, Ihor Kendiukhov, Syed Hussain, Lydia Nottingham

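Affiliations * University of Oxford; Max Planck Institute for Biological Cybernetics; University of Tuebingen; Cardiff University; Cambridge–Boston Alignment Initiative (CBAI)

AI summary This study examines the gap between stated and revealed preferences (the SvR gap) in language models and analyses how different elicitation protocols affect it. Allowing neutrality or abstention when eliciting stated preferences improves the correlation between stated and revealed preferences, but also allowing abstention in revealed preferences can cause the correlation to drop sharply. The study stresses that elicitation methods must account for indeterminate preferences in order to assess a model's value tendencies more accurately.

Comments Accepted to ACL 2026 Eval Eval Workshop and 3rd Technical AI Safety Conference (TAIS 2026)

Details
Abstract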

Recent work identifies a stated-revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Allowing neutrality and abstention during stated preference elicitation allows us to exclude weak signals, substantially improving Spearman's rank correlation ($ρ$) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives $ρ$ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.
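
Since the abstract's main metric is Spearman's rank correlation, a small worked example may help. The sketch below uses the textbook formula for tie-free data; the two score lists are made-up numbers, not values from the paper, and real preference data with ties would need average ranks instead.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, index in enumerate(order, start=1):
        r[index] = position
    return r

def spearman_rho(x, y):
    """Spearman's rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for four values: how strongly a model endorses each
# (stated) versus how often it acts on each in context (revealed).
stated = [0.9, 0.7, 0.4, 0.1]
revealed = [0.8, 0.3, 0.6, 0.2]
rho = spearman_rho(stated, revealed)
```

Here the two middle items swap ranks, giving a squared rank difference total of 2 and `rho` of about 0.8. Because only ranks enter the formula, items that a protocol forces into an arbitrary order can move the statistic substantially, which is consistent with the protocol dependence the abstract reports.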

2601.18608 2026-05-14 cs.AI cs.LG version update

PolySHAP: Extending KernelSHAP with Interaction-Informed Polynomial Regression

Fabian Fumagalli, R. Teal Witter, Christopher Musco

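Affiliations * Bielefeld University; Claremont McKenna College; New York University

AI summary This paper proposes PolySHAP, which extends the KernelSHAP algorithm with higher-degree polynomial regression so as to capture non-linear interactions between features and thereby improve Shapley value estimates. PolySHAP shows better empirical performance on several benchmark datasets, and its estimates are proven to be consistent. The work also establishes a theoretical link between paired (antithetic) sampling and second-order PolySHAP, giving this widely used modification its first strong theoretical justification.

Comments Published at ICLR 2026: https://openreview.net/forum?id=M19J8UGguq

Details
English abstract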

Shapley values have emerged as a central game-theoretic tool in explainable AI (XAI). However, computing Shapley values exactly requires $2^d$ game evaluations for a model with $d$ features. Lundberg and Lee's KernelSHAP algorithm has emerged as a leading method for avoiding this exponential cost. KernelSHAP approximates Shapley values by approximating the game as a linear function, which is fit using a small number of game evaluations for random feature subsets. In this work, we extend KernelSHAP by approximating the game via higher degree polynomials, which capture non-linear interactions between features. Our resulting PolySHAP method yields empirically better Shapley value estimates for various benchmark datasets, and we prove that these estimates are consistent. Moreover, we connect our approach to paired sampling (antithetic sampling), a ubiquitous modification to KernelSHAP that improves empirical accuracy. We prove that paired sampling outputs exactly the same Shapley value approximations as second-order PolySHAP, without ever fitting a degree 2 polynomial. To the best of our knowledge, this finding provides the first strong theoretical justification for the excellent practical performance of the paired sampling heuristic.
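To make the $2^d$ cost concrete, the following sketch computes exact Shapley values by enumerating every feature subset. It is the brute-force baseline that KernelSHAP and PolySHAP approximate, not an implementation of either method. The toy game and all names are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def exact_shapley(value: Callable[[FrozenSet[int]], float], d: int) -> List[float]:
    """Exact Shapley values for a d-feature game by full subset enumeration.

    For each feature i, every subset S of the other d-1 features is visited,
    so the game is queried on all 2^d subsets: exponential in d.
    """
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Standard Shapley weight |S|! (d-|S|-1)! / d!
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_game(s: FrozenSet[int]) -> float:
    """Hypothetical game: additive effects plus one pairwise interaction."""
    total = 1.0 * (0 in s) + 2.0 * (1 in s) + 3.0 * (2 in s)
    if 0 in s and 1 in s:
        total += 4.0  # interaction term that a purely linear fit cannot represent
    return total


phi = exact_shapley(toy_game, d=3)
```

The interaction of 4.0 between features 0 and 1 is split equally between them, and the values sum to the value of the full feature set (the efficiency property).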

2601.17187 2026-05-14 cs.IT cs.AI math.IT version update

High-Rate Quantized Matrix Multiplication I

Or Ordentlich, Yury Polyanskiy

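Affiliations * Hebrew University of Jerusalem; MIT; MIT-IBM Watson AI Lab

AI summary This paper studies quantized matrix multiplication (MatMul), which is crucial for the efficient deployment of large language models. In a generic MatMul setting with no a priori statistical information, it considers quantizing both weights and activations, reviews the fundamental information-theoretic tradeoff between quantization rate and distortion, and compares it with the performance of popular quantization schemes. It also derives accurate heuristic approximations for those schemes; the weight-only quantization setting is treated in Part II.

Details
English abstract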

This paper investigates the problem of quantized matrix multiplication (MatMul), which has become crucial for the efficient deployment of large language models (LLMs). We consider a Generic MatMul setting, where both matrices must be quantized (weight+activation quantization) without specific a priori (calibration) statistical information about the factors. We review the fundamental information-theoretic tradeoff between quantization rate and distortion (high-rate theory), and contrast it with the performance of popular quantization schemes (absmax INT and floating-point (FP)), for which we also derive accurate heuristic approximations. Part II of this paper studies the weight-only quantization setup where second-order statistics of the activation matrices are available at the encoder.
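As a rough illustration of the absmax INT scheme the abstract mentions, the sketch below quantizes two vectors to signed integers and takes their inner product, the basic operation inside a quantized MatMul. The scale granularity (one scale per vector), the rounding rule, and the function names are assumptions for this example; the paper's exact setup is not given in the abstract.

```python
from typing import List, Sequence, Tuple


def absmax_quantize(vec: Sequence[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map a vector to signed integers in [-qmax, qmax] using its largest magnitude.

    Returns the integer codes and the scale such that code * scale ~ original.
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in vec)
    if peak == 0.0:
        return [0] * len(vec), 1.0
    scale = peak / qmax
    return [round(v / scale) for v in vec], scale


def quantized_dot(a: Sequence[float], b: Sequence[float], bits: int = 8) -> float:
    """Inner product with both operands quantized (weight+activation style)."""
    qa, sa = absmax_quantize(a, bits)
    qb, sb = absmax_quantize(b, bits)
    # Integer accumulation, then a single rescale by the product of scales.
    return sa * sb * sum(x * y for x, y in zip(qa, qb))


a = [0.5, -1.0, 0.25]
b = [2.0, 0.1, -0.7]
exact = sum(x * y for x, y in zip(a, b))
approx8 = quantized_dot(a, b, bits=8)
approx4 = quantized_dot(a, b, bits=4)
```

Comparing `approx8` and `approx4` against `exact` gives a feel for the rate-distortion tradeoff on a single inner product: fewer bits means a coarser grid and a larger error.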

2601.15280 2026-05-14 cs.HC cs.AI version update

LLM-based Multimodal Feedback Produces Equivalent Learning and Better Student Perceptions than Educator Feedback

Chloe Qianhui Zhao, Jie Cao, Jionghao Lin, Kenneth R. Koedinger

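Affiliations * Carnegie Mellon University; The University of North Carolina at Chapel Hill; The University of Hong Kong

AI summary This study presents a real-time, LLM-based multimodal feedback system that combines structured textual explanations with dynamic multimedia resources, aiming to improve both learning outcomes and the student experience. In the experiment, the system matched educator feedback on learning gains while scoring better on clarity, specificity, conciseness, motivation, and cognitive load. AI feedback also affected engagement differently across question types, underlining its potential for teaching at scale.

Comments 11 pages, to be published at the 16th International Learning Analytics & Knowledge Conference (LAK '26)

Details
English abstract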

Providing timely, targeted, and multimodal feedback helps students quickly correct errors, build deep understanding, and stay motivated, yet delivering it at scale remains a challenge. This study introduces a real-time AI-facilitated multimodal feedback system that integrates structured textual explanations with dynamic multimedia resources, including references to the most relevant retrieved slide pages and streaming AI audio narration. In an online crowdsourcing experiment, we compared this system against fixed business-as-usual feedback by educators across three dimensions: (1) learning effectiveness, (2) learner engagement, (3) perceived feedback quality and value. Results showed that AI multimodal feedback achieved learning gains equivalent to original educator feedback while significantly outperforming it on perceived clarity, specificity, conciseness, motivation, satisfaction, and reducing cognitive load, with comparable correctness, trust, and acceptance. Process logs revealed distinct engagement patterns: for multiple-choice questions, educator feedback encouraged more submissions; for open-ended questions, AI-facilitated targeted suggestions lowered revision barriers and promoted iterative improvement. These findings highlight the potential of AI multimodal feedback to provide scalable, real-time, and context-aware support that both reduces instructor workload and enhances student experience.

2510.04698 2026-05-14 q-bio.NC cs.AI econ.TH version update

The Bayesian Origin of the Probability Weighting Function in Human Representation of Probabilities

Xin Tong, Thi Thu Uyen Hoang, Xue-Xin Wei, Michael Hahn

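Affiliations * Saarland University; The University of Texas at Austin

AI summary Humans systematically distort probabilities in a characteristic inverse-S weighting pattern, whose origin has long been unclear. This paper proposes a Bayesian encoding-decoding account in which probabilities are encoded as noisy internal signals and decoded by minimizing Bayes risk. The resulting distortion decomposes into boundary regression, likelihood repulsion, and prior attraction, which predicts that the inverse-S pattern arises from a U-shaped allocation of encoding precision, with greater sensitivity near 0 and 1. Experiments show that the framework recovers the U-shaped encoding from data and outperforms deterministic weighting functions and other models across several tasks.

Details
English abstract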

Humans systematically misrepresent probability in a stereotyped inverse-S pattern. It has been documented for decades, but its origin remains unexplained. We propose a Bayesian encoding-decoding account in which probabilities are represented by noisy internal signals and decoded by Bayes-risk minimization. For bounded probability stimuli, we show that distortion decomposes into boundary regression, likelihood repulsion, and prior attraction, yielding a key prediction: the classic inverse-S-shaped weighting pattern implies a U-shaped allocation of encoding precision with greater sensitivity near 0 and 1. Across judgment of relative frequency, lottery pricing, and risky choice, this U-shape is recovered from data without imposing any functional form on the encoding, and our framework outperforms deterministic weighting functions, bounded log-odds models, uniform-encoding Bayesian accounts, and matched efficient-coding models on held-out data. In a new dot probability estimation experiment with bimodal stimulus statistics, the recovered prior tracks the new distribution while the recovered encoding remains U-shaped. Together, these results identify the inverse-S-shaped probability weighting function as the joint product of a stable U-shaped encoding and a flexible prior, integrated by optimal Bayesian decoding.
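For readers unfamiliar with the inverse-S pattern, the sketch below evaluates one classical deterministic weighting function, the one-parameter Tversky-Kahneman form $w(p) = p^\gamma / (p^\gamma + (1-p)^\gamma)^{1/\gamma}$. It belongs to the family of deterministic baselines the abstract compares against and is not the paper's Bayesian model. The value $\gamma = 0.61$ is used only as an illustrative setting.

```python
def tk_weight(p: float, gamma: float = 0.61) -> float:
    """Tversky-Kahneman probability weighting function.

    With gamma < 1 the curve is inverse-S shaped: small probabilities are
    overweighted and large ones underweighted. gamma = 1 gives w(p) = p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)


# Tabulate the distortion w(p) - p on a coarse grid of probabilities.
grid = [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]
distortion = [(p, tk_weight(p) - p) for p in grid]
```

The sign of the distortion flips from positive at low probabilities to negative at high ones, which is the stereotyped pattern the abstract sets out to explain.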

2510.03992 2026-05-14 cs.CR cs.AI version update

Quantitative Certification of Agentic Tool Selection

Jehyeok Yeon, Isha Chaudhary, Gagandeep Singh

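Affiliations * University of Illinois Urbana-Champaign

AI summary As large language models (LLMs) are increasingly used in agentic systems, correctly mapping user intents to suitable external tools has become a key problem. This paper proposes LLMCert-T, a statistical framework that quantitatively certifies the safety of a tool-selection pipeline under a realistic tool distribution and returns a high-confidence upper bound on a probability. The method casts tool-selection evaluation as a Bernoulli estimation problem and uses a conditional generative process to model deployment conditions, showing that current LLM agents still degrade substantially under the Distractor Selection and Top-N Saturation specifications.

Details
English abstract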

Large language models (LLMs) are increasingly deployed in agentic systems, where a fundamental task is mapping user intents to relevant external tools. Errors in tool selection can have severe outcomes, such as unauthorized data access, even without modifying the agent's underlying model. Existing evaluations measure performance on curated, benign benchmarks. However, a pipeline's behavior in deployment depends on the tool pool the agent actually encounters, which in open registries is shaped by third parties. We introduce LLMCert-T, the first statistical framework that returns high-confidence upper bounds on the probability that a tool-selection pipeline satisfies a declared safety specification under a realistic tool distribution. LLMCert-T models tool-selection evaluation as a Bernoulli estimation problem, drawing inserted-tool sequences from a distribution that the safety specification fixes. To evaluate robustness against realistic deployment conditions, we instantiate this distribution as a stochastic process that generates inserted-tool sequences round by round, conditioning each round on the agent's selection in the previous round. LLMCert-T aggregates the per-trial outcomes into a one-sided Clopper-Pearson upper bound on the probability that the specification is satisfied. By returning this bound as a certificate with statistical guarantees over the inserted-tool sequence distribution, LLMCert-T makes safety claims intuitive, actionable, and comparable across models, retrievers, mitigations, and registry policies. Across popular BFCL and OpenAPI tool pools, LLMCert-T shows that current LLM agents remain fragile under Distractor Selection and Top-N Saturation specifications: their certified correctness upper bounds drop to approximately 20%, far below their clean-pool lower bounds.
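The aggregation step can be illustrated with a standard one-sided Clopper-Pearson upper bound for a Bernoulli parameter. This is a generic sketch built from the textbook definition, not code from LLMCert-T: the function names, the bisection solver, and the trial counts are assumptions.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(successes: int, trials: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) Clopper-Pearson upper bound on a success probability.

    The bound is the p at which observing `successes` or fewer out of `trials`
    has probability exactly alpha; it is found by bisection because the
    binomial CDF is decreasing in p.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    if not 0.0 < alpha < 0.5:
        raise ValueError("alpha must lie in (0, 0.5)")
    if successes == trials:
        return 1.0
    lo, hi = successes / trials, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(successes, trials, mid) > alpha:
            lo = mid  # mid is still plausible, so the bound lies above it
        else:
            hi = mid
    return hi


# Hypothetical certification run: the specification held in 12 of 100 trials.
upper = clopper_pearson_upper(successes=12, trials=100, alpha=0.05)
```

With zero successes the bound has the closed form $1 - \alpha^{1/n}$, which is a convenient sanity check for the solver.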

2508.09320 2026-05-14 cs.LG cs.AI cs.CR version update

Exact Verification of Graph Neural Networks with Incremental Constraint Solving

Minghao Liu, Chia-Hsuan Lu, Marta Kwiatkowska

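Affiliations * University of Oxford

AI summary This paper proposes an exact verification method for graph neural networks (GNNs) that provides robustness guarantees against adversarial attribute and structural perturbations. The method combines constraint solving with bound tightening and exploits solvers' incremental solving capabilities for efficiency. It supports three aggregation functions (sum, max, and mean), the latter two for the first time. Experiments on several real-world datasets demonstrate its usability and its superior performance on node classification.

Comments Extended version of the paper accepted at FM 2026

Details
English abstract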

Graph neural networks (GNNs) are increasingly often employed in high-stakes applications, such as fraud detection or healthcare, but are susceptible to adversarial attacks. A number of techniques have been proposed to provide adversarial robustness guarantees, but support for commonly used aggregation functions in message-passing GNNs is lacking. In this paper, we develop an exact (sound and complete) verification method for GNNs to compute guarantees against attribute and structural perturbations that involve edge addition or deletion, subject to budget constraints. Our method employs constraint solving with bound tightening, and iteratively solves a sequence of relaxed constraint satisfaction problems while relying on incremental solving capabilities of solvers to improve efficiency. We implement GNNev, a versatile exact verifier for message-passing neural networks, which supports three aggregation functions -- sum, max and mean -- with the latter two considered here for the first time. Extensive experimental evaluation of GNNev on real-world fraud datasets (Amazon and Yelp) and biochemical datasets (MUTAG and ENZYMES) demonstrates its usability and effectiveness, as well as superior performance on node classification and competitiveness on graph classification compared to existing exact verification tools on sum-aggregated GNNs.

2506.12075 2026-05-14 cs.IR cs.AI version update

T-TExTS (Teaching Text Expansion for Teacher Scaffolding): Enhancing Text Selection in High School Literature through Knowledge Graph-Based Recommendation

Nirmal Gelal, Chloe Snow, Ambyr Rios, Kathleen M. Jagodnik, Hande Küçük McGinty

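Affiliations * Department of Computer Science; Department of Curriculum and Instruction; Kansas State University

AI summary This paper presents T-TExTS, a knowledge graph-based recommendation system that helps high school English Literature teachers select thematically aligned and diverse teaching texts more efficiently. The system builds a domain ontology for education and evaluates several graph embedding methods, with experiments reporting strong recommendation performance at different dataset sizes. The study shows that a hybrid model combining structural knowledge with pedagogical signals retains interpretability while keeping recommendation quality competitive, offering a new approach for educational recommendation systems.

Comments Under Review

Details
English abstract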

High school English Literature teachers often encounter barriers to assembling diverse, thematically aligned text sets due to limited planning time and pedagogical resources. To address this need, we present T-TExTS (Teaching Text Expansion for Teacher Scaffolding), a knowledge graph (KG)-based recommendation system that suggests literature texts based on pedagogical merit rather than surface-level metadata. We construct a domain-specific ontology using the Knowledge Acquisition and Representation Methodology (KNARM), instantiate it as a knowledge graph with separate Terminological Box (TBox) and Assertional Box (ABox) components, and evaluate four graph embedding strategies (DeepWalk, biased random walk, hybrid embedding, and Node2Vec) across three dataset configurations (98, 196, and 351 texts) and two relation-weighting schemes. The experimental results reveal that traversal-level expert weighting alone does not outperform algorithmic structural tuning: Node2Vec achieves the highest Area Under the Curve (AUC) at every dataset size (0.9642--0.9750) and the strongest ranking metrics (Hits@K, MRR, nDCG) at larger scales. Combining structural and pedagogical signals through embedding concatenation, however, preserves both interpretability and competitive ranking quality, with the hybrid model maintaining a high AUC across all scales (0.9122--0.9350) and remaining within a few percentage points of Node2Vec on every ranking metric. These findings highlight the value of ontology-driven knowledge graph embeddings for educational recommendation systems and demonstrate that T-TExTS can meaningfully ease the burden of English Literature text selection for secondary educators, supporting more informed and inclusive curricular decisions. The source code for T-TExTS is available at https://github.com/koncordantlab/TTExTS.

2502.20427 2026-05-14 cs.CR cs.AI cs.SD eess.AS version update

DeePen: Penetration Testing for Audio Deepfake Detection

Nicolas Müller, Piotr Kawa, Adriana Stan, Thien-Phuc Doan, Souhwan Jung, Wei Herng Choong, Philip Sperl, Konstantin Böttinger

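Affiliations * Technical University of Cluj-Napoca; AISRC, Soongsil University

AI summary This paper presents DeePen, a systematic penetration testing methodology for assessing the robustness of machine learning-based audio deepfake detection classifiers. The method needs no knowledge of or access to the target detection model; instead it probes for vulnerabilities with a set of carefully selected signal processing attacks. Both deployed production systems and public academic models turn out to be deceivable by simple manipulations such as time-stretching or echo addition, indicating that current deepfake detection still faces serious challenges.

Details
English abstract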

Deepfakes - manipulated or forged audio and video media - pose significant security risks to individuals, organizations, and society at large. To address these challenges, machine learning-based classifiers are commonly employed to detect deepfake content. In this paper, we assess the robustness of such classifiers through a systematic penetration testing methodology, which we introduce as DeePen. Our approach operates without prior knowledge of or access to the target deepfake detection models. Instead, it leverages a set of carefully selected signal processing modifications - referred to as attacks - to evaluate model vulnerabilities. Using DeePen, we analyze both real-world production systems and publicly available academic model checkpoints, demonstrating that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition. Furthermore, our findings reveal that while some attacks can be mitigated by retraining detection systems with knowledge of the specific attack, others remain persistently effective.

2502.18917 2026-05-14 cs.AI cs.PL cs.SE version update

ClassInvGen: Class Invariant Synthesis using Large Language Models

Chuyue Sun, Viraj Agashe, Saikat Chakraborty, Jubi Taneja, Clark Barrett, David Dill, Xiaokang Qiu, Shuvendu K. Lahiri

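Affiliations * Stanford University; Microsoft Research; Purdue University

AI summary ClassInvGen uses large language models (LLMs) to generate high-quality class invariants for mainstream programming languages such as C++. By co-generating executable class invariants and test inputs, it improves the correctness and completeness of the invariants and outperforms both a pure LLM-based technique and traditional data-driven approaches in experiments. The work also contributes a benchmark of standard C++ data structures and validates the method on a real-world codebase through a case study.

Details
English abstract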

Formal program specifications in the form of preconditions, postconditions, and class invariants have several benefits for the construction and maintenance of programs. They not only aid in program understanding due to their unambiguous semantics but can also be enforced dynamically (or even statically when the language supports a formal verifier). However, synthesizing high-quality specifications in an underlying programming language is limited by the expressivity of the specifications or the need to express them in a declarative manner. Prior work has demonstrated the potential of large language models (LLMs) for synthesizing high-quality method pre/postconditions for Python and Java, but does not consider class invariants. In this work, we describe ClassInvGen, a method for co-generating executable class invariants and test inputs to produce high-quality class invariants for a mainstream language such as C++, leveraging LLMs' ability to synthesize pure functions. We show that ClassInvGen outperforms a pure LLM-based technique to generate specifications (from code) as well as prior data-driven invariant inference techniques such as Daikon. We contribute a benchmark of standard C++ data structures along with a harness that can help measure both the correctness and completeness of generated specifications using tests and mutants. We also demonstrate its applicability to real-world code by performing a case study on several classes within a widely used and high-integrity C++ codebase.

2409.02038 2026-05-14 cs.CL cs.AI cs.DB version update

BEAVER: An Enterprise Benchmark for Text-to-SQL

Peter Baile Chen, Devin Yang, Weiyue Li, Fabian Wenz, Yi Zhang, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker

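Affiliations * MIT; Harvard University; Greenshoe, Inc.

AI summary BEAVER is the first text-to-SQL benchmark built from private data warehouses, designed to evaluate large language models in complex enterprise environments. It contains 9128 question-SQL pairs drawn from real-world query logs across 19 domains, covering intricate database schemas and specialized domain knowledge. To address the scarcity of enterprise data and the limits of existing evaluation metrics, BEAVER synthesizes high-fidelity, expert-verified queries and introduces fine-grained subtask metrics, revealing a significant performance gap for current advanced models in realistic enterprise settings.

Comments Dataset and code are available at https://beaverbench.github.io/

Details
English abstract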

Existing text-to-SQL benchmarks have largely been constructed from public databases with well-structured schemas and simplistic question-SQL pairs. While large language models (LLMs) excel on these settings, their efficacy in complex private enterprise environments, characterized by intricate schemas, domain knowledge, and analytical user queries involving sophisticated structures and functions, remains unproven. To bridge this gap, we introduce BEAVER, the first text-to-SQL benchmark derived from private data warehouses. It comprises 9128 question-SQL pairs sourced from real-world query logs and 812 tables across 19 diverse domains. Building this benchmark is challenging because (1) enterprise query logs are scarce due to privacy constraints, and (2) existing all-or-nothing evaluation metrics based on accuracy make error diagnosis difficult -- especially when producing a correct query involves solving multiple compounded challenges, such as domain knowledge and query complexity. We address these issues at two levels. At the dataset level, we synthesize high-fidelity, expert-verified queries that increase dataset size and isolate individual challenges or combine them, producing queries focused on domain knowledge, query complexity, and both. At the evaluation level, we provide human annotations and evaluation metrics for five critical subtasks to enable fine-grained analysis. Our evaluation reveals a significant performance gap compared to existing benchmarks: SOTA agentic frameworks using the advanced model GPT-5.2 achieve only 10.8% accuracy. When provided with all subtask annotations as oracle hints, accuracy increases to 30.1%, confirming that a major bottleneck lies in correctly resolving these subtasks. Finally, we provide a taxonomy of the residual errors that persist even with subtask hints, identifying specific challenges such as the use of advanced functions.

2008.03496 2026-05-14 cs.AI cs.LO cs.RO version update

Human Robot Collaborative Assembly Planning: An Answer Set Programming Approach

Momina Rizwan, Volkan Patoglu, Esra Erdem

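Affiliations * Faculty of Engineering and Natural Sciences, Sabancı University, Istanbul, Turkey

AI summary This paper studies planning for human-robot collaborative assembly and proposes an answer set programming-based method that combines commonsense reasoning with a rich set of communication actions to cope with the uncertainty of human behavior. By extending hybrid conditional planning, the method integrates high-level planning of the order of assembly actions with geometric feasibility checks, and it is demonstrated in a real-world scenario in which a bimanual robot and a human collaborate to assemble furniture.

Comments 36th International Conference on Logic Programming (ICLP 2020), University Of Calabria, Rende (CS), Italy, September 2020, 15 pages

Details
English abstract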

For planning an assembly of a product from a given set of parts, robots need certain cognitive skills: high-level planning is needed to decide the order of actuation actions, while geometric reasoning is needed to check the feasibility of these actions. For collaborative assembly tasks with humans, robots require further cognitive capabilities, such as commonsense reasoning, sensing, and communication skills, not only to cope with the uncertainty caused by incomplete knowledge about the humans' behaviors but also to ensure safer collaborations. We propose a novel method for collaborative assembly planning under uncertainty that uses hybrid conditional planning extended with commonsense reasoning and a rich set of communication actions for collaborative tasks. Our method is based on answer set programming. We show the applicability of our approach in a real-world assembly domain, where a bi-manual Baxter robot collaborates with a human teammate to assemble furniture. This manuscript is under consideration for acceptance in TPLP.

1903.00745 2026-05-14 cs.AI cs.LO cs.RO version update

A Formal Framework for Robot Construction Problems: A Hybrid Planning Approach

Faseeh Ahmad, Esra Erdem, Volkan Patoglu

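AI summary This paper studies robot construction problems in which multiple autonomous robots cooperate to stack prefabricated blocks into stable structures; these problems are challenging because of ramifications of actions, true concurrency, and the requirements of structural stability and block supportedness. The authors propose a hybrid planning framework based on answer set programming that both decides on a stable final configuration and plans the order of multi-robot manipulation tasks, ensuring the stability and supportedness of the partial structure at every step. The method comes with proofs of soundness and completeness, and its usefulness is shown on several challenging construction instances.

Comments 8 pages (double-column), 7 figures

Details
English abstract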

We study robot construction problems where multiple autonomous robots rearrange stacks of prefabricated blocks to build stable structures. These problems are challenging due to ramifications of actions, true concurrency, and requirements of supportedness of blocks by other blocks and stability of the structure at all times. We propose a formal hybrid planning framework to solve a wide range of robot construction problems, based on Answer Set Programming. This framework not only decides on a stable final configuration of the structure, but also computes the order of manipulation tasks for multiple autonomous robots to build the structure from an initial configuration, while simultaneously ensuring the stability, supportedness and other desired properties of the partial construction at each step of the plan. We prove the soundness and completeness of our formal method with respect to these properties. We introduce a set of challenging robot construction benchmark instances, including bridge building and stack overhanging scenarios, discuss the usefulness of our framework over these instances, and demonstrate the applicability of our method using a bimanual Baxter robot.

1307.7494 2026-05-14 cs.AI cs.LO cs.RO version update

ReAct! An Interactive Tool for Hybrid Planning in Robotics

Zeynep Dogmus, Esra Erdem, Volkan Patoglu

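Affiliations * Sabancı University

AI summary This paper presents ReAct!, an interactive tool for hybrid planning in robotics. It lets researchers describe robots' actions in dynamic domains and solve planning problems without knowing the syntactic and semantic details of the underlying formalism. ReAct! supports modelling of sophisticated dynamic domains with concurrency, indirect effects of actions, and state/transition constraints, and it can embed external computations (such as checks for collision-free trajectories) into representations of hybrid domains, tightly integrating discrete high-level reasoning with continuous geometric reasoning for settings ranging from service robotics to cognitive factories.

Details
English abstract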

We present ReAct!, an interactive tool for high-level reasoning for cognitive robotic applications. ReAct! enables robotic researchers to describe robots' actions and change in dynamic domains, without having to know about the syntactic and semantic details of the underlying formalism in advance, and solve planning problems using state-of-the-art automated reasoners, without having to learn about their input/output language or usage. In particular, ReAct! can be used to represent sophisticated dynamic domains that feature concurrency, indirect effects of actions, and state/transition constraints. It allows for embedding externally defined calculations (e.g., checking for collision-free continuous trajectories) into representations of hybrid domains that require a tight integration of (discrete) high-level reasoning with (continuous) geometric reasoning. ReAct! also enables users to solve planning problems that involve complex goals. This variety of utilities is useful for robotic researchers working on interesting and challenging domains, ranging from service robotics to cognitive factories. ReAct! provides sample formalizations of some action domains (e.g., multi-agent path planning, Tower of Hanoi), as well as dynamic simulations of plans computed by a state-of-the-art automated reasoner (e.g., a SAT solver or an ASP solver).

2605.13335 2026-05-14 cs.AI cs.CV version update

Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

Qinchuan Cheng, Zhantao Gong, Pengzhan Sun, Angela Yao, Xulei Yang, Shijie Li

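Affiliations * Xi’an Jiaotong University; Nankai University; National University of Singapore; A*STAR

AI summary This paper presents Ego2World, a benchmark that compiles egocentric cooking videos into executable symbolic worlds for evaluating how embodied agents plan under partial observability. It derives reusable state-transition rules from video annotations and executes them in a hidden symbolic world graph, forcing agents to plan and update memory using only local observations and execution feedback. Experiments show that conventional action-overlap scores can overestimate task success, while persistent belief memory improves task completion and reduces repeated visual exploration.

Comments Project page: https://sj-li.com/PROJ/Ego2World/

Details
English abstract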

Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics, introducing a sim-to-real gap and often assuming fully observable state. We introduce Ego2World, an executable benchmark that turns egocentric cooking videos into executable symbolic worlds governed by graph-transition rules. Built on HD-EPIC, Ego2World derives reusable transition rules from video annotations and executes them in a hidden symbolic world graph. During evaluation, the simulator maintains the hidden world graph, while the agent plans over its own partial belief graph using only local observations and execution feedback. This separation forces agents to update memory and replan without observing the true world state. Experiments show that action-overlap scores overestimate physical-state success, and that persistent belief memory improves task completion while reducing repeated visual exploration -- suggesting that belief maintenance should be a first-class target of embodied-agent evaluation.

2605.13333 2026-05-14 cs.CV cs.AI cs.GR cs.LG version update

Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation

Junhyuk Jeon, Seokhyeon Hong, Junyong Noh

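Affiliations * Visual Media Lab, KAIST

AI summary Addressing the difficulty text-driven motion diffusion models have with finely stylized motion, this work proposes a lightweight style conditioning framework. A hypernetwork generates low-rank adaptation parameters that dynamically modulate a pretrained diffusion model, giving fine-grained control over style during denoising. A supervised contrastive loss structures the style latent space, which improves generalization to unseen styles, and the method achieves leading stylization results on several datasets.

Comments Accepted to SIGGRAPH 2026. Project page: https://junhyukjeon.github.io/projects/style-salad/

Details
English abstract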

Text-driven motion diffusion models are capable of generating realistic human motions, but text alone often struggles to express fine-level nuances of motion, commonly referred to as style. Recent approaches have tackled this challenge by attaching a style injection mechanism to a pretrained text-driven diffusion model. Existing stylization methods, however, either require style-specific fine-tuning of existing models or rely on heavy ControlNet-based architectures, limiting efficiency and generalization to unseen styles. We propose a lightweight style conditioning framework that dynamically modulates a pretrained diffusion model through hypernetwork-generated LoRA parameters. A style reference motion is encoded into a global style embedding, which is mapped by a hypernetwork to low-rank updates applied at each denoising step of the diffusion model. By structuring the style latent space with a supervised contrastive loss, our framework reliably captures diverse stylistic attributes, improves generalization to unseen styles, and supports optimization-based guidance without requiring predefined style categories. Experiments on the HumanML3D and 100STYLE datasets show state-of-the-art stylization results, while achieving improved stylization for unseen styles.

2605.13332 2026-05-14 cs.AI cs.CC version update

Diversity of Extensions in Abstract Argumentation

Johannes K. Fichte, Markus Hecher, Yasir Mahmood, Zhengjun Wang

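Affiliations * Department of Computer and Information Science (IDA), Linköping University, Sweden; University of Potsdam, Germany & University of Artois, CNRS, UMR8188 (CRIL), France; Data Science Group, Heinz Nixdorf Institute, Paderborn University, Germany

AI summary This paper studies the diversity of extensions in abstract argumentation frameworks and introduces a quantitative diversity measure based on the symmetric difference to capture how far apart extensions are. The authors systematically classify the computational complexity of the associated reasoning problems, examining whether a framework admits extensions of a given diversity and how to compute the largest achievable diversity value. They also outline a prototype for computing diversity levels and provide an evaluation.

Comments Technical Report to the paper accepted at IJCAI 2026

Details
English abstract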

Argumentation is an important topic of AI for modelling and reasoning about arguments. In abstract argumentation, we consider directed graphs, so-called argumentation frameworks (AF), that express conflicts between arguments. The semantics is defined by the notion of extensions, which are sets of arguments that satisfy particular relationship conditions in the AF. Usually, standard reasoning in argumentation does not reveal how far apart extensions are. We introduce a quantitative notion of diversity of extensions based on the symmetric difference and provide a systematic complexity classification. Intuitively, diversity captures whether extensions of a framework (accepted viewpoints) differ only marginally or represent fundamentally incompatible sets of arguments. We study whether an AF admits k-diverse extensions, whether it admits k-diverse extensions covering specific arguments, and how to compute the largest k for which an AF admits k-diverse extensions. We outline a prototype and provide an evaluation for computing diversity levels.
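A small brute-force sketch can make the idea of symmetric-difference diversity tangible. The abstract does not spell out the formal definition of k-diverse extensions, so the example rests on two assumptions: it uses stable semantics, and it measures the diversity of a pair of extensions as the size of their symmetric difference. All names and the toy framework are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple


def stable_extensions(
    arguments: Iterable[str], attacks: Set[Tuple[str, str]]
) -> List[FrozenSet[str]]:
    """Enumerate stable extensions: conflict-free sets attacking every outsider.

    Exhaustive over all subsets, so only suitable for tiny frameworks.
    """
    args = sorted(arguments)
    found = []
    for size in range(len(args) + 1):
        for candidate in combinations(args, size):
            s = set(candidate)
            conflict_free = not any((a, b) in attacks for a in s for b in s)
            attacks_rest = all(
                any((a, b) in attacks for a in s) for b in set(args) - s
            )
            if conflict_free and attacks_rest:
                found.append(frozenset(s))
    return found


def max_pairwise_diversity(extensions: List[FrozenSet[str]]) -> int:
    """Largest symmetric-difference size over all pairs of extensions."""
    return max((len(x ^ y) for x, y in combinations(extensions, 2)), default=0)


# Toy framework: two independent mutual conflicts, a <-> b and c <-> d.
attacks = {("a", "b"), ("b", "a"), ("c", "d"), ("d", "c")}
extensions = stable_extensions({"a", "b", "c", "d"}, attacks)
diversity = max_pairwise_diversity(extensions)
```

Here the framework has four stable extensions, and the pair {a, c} and {b, d} shares no argument, so they are as far apart as two extensions of this framework can be under the assumed measure.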

2605.13329 2026-05-14 cs.CL cs.AI version update

Tracing Persona Vectors Through LLM Pretraining

Viktor Moskvoretskii, Dominik Glandorf, Jorge Medina Moreira, Tanja Käser, Robert West

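Affiliations * EPFL

AI summary This paper studies how large language models form, during pretraining, the "persona vectors" that represent high-level behaviors, tracing their evolution across the pretraining of OLMo-3-7B. Persona vectors form very early in pretraining and continue to be refined afterwards. Different elicitation strategies surface distinct facets of the underlying persona, and the findings are replicated on another model, Apertus-8B, indicating that persona vectors are stable features formed early in pretraining and opening a direction for interpretability research on model behavior.

Comments Preprint

Details
English abstract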

How large language models internally represent high-level behaviors is a core interpretability question with direct relevance to AI safety: it determines what we can detect, audit, or intervene on. Recent work has shown that traits such as evil or sycophancy correspond to linear directions in the internal activations, the so-called persona vectors. Although these vectors are now routinely utilized to inspect and steer model behavior in safety-relevant settings, how these representations are formed during training remains unknown. To address this gap, we trace persona vectors across the pretraining of OLMo-3-7B, finding that persona vectors form remarkably early -- within 0.22% of OLMo-3 pretraining -- and remain effective for steering the fully post-trained instruct models. Although core representations are formed early on, persona vectors continue to refine geometrically and semantically throughout pretraining. We further compare alternative elicitation strategies and find that all yield effective directions, with each strategy surfacing qualitatively distinct facets of the underlying persona. Replicating our analysis on Apertus-8B reveals that our findings transfer qualitatively beyond OLMo-3. Our results establish persona representations as stable features of early pretraining and open a path to studying how training forms, refines, and shapes them.

2605.13328 2026-05-14 cs.RO cs.AI cs.CL cs.CV version update

What Limits Vision-and-Language Navigation?

Yunheng Wang, Yuetong Fang, Taowen Wang, Lusong Li, Kun Liu, Junzhe Xu, Zizhao Yuan, Yixiao Feng, Jiaxi Zhang, Wei Lu, Zecui Zeng, Renjing Xu

发表机构 * HKUST(GZ)(香港科技大学(广州)) JD Explore Academy(京东探索研究院)

AI总结 视觉与语言导航(VLN)是具身智能的重要研究方向,但在从仿真环境迁移到真实世界时,现有方法常因感知不稳定和指令模糊而表现下降。本文提出StereoNav,一种融合视觉、语言和动作的鲁棒框架,通过引入目标位置先验和双目视觉技术,增强跨域导航的稳定性与准确性。实验表明,StereoNav在多个基准测试中取得先进性能,并在真实机器人部署中显著提升了复杂环境下的导航可靠性。

详情
英文摘要

Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: https://yunheng-wang.github.io/stereonav-public.github.io.
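For readers unfamiliar with the two reported metrics, the sketch below computes success rate (SR) and success weighted by path length (SPL) using the standard definition, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where l_i is the shortest-path length and p_i the length of the path actually taken. The episode values are made up for illustration and are not from the paper.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (1) rather than failure (0)."""
    return sum(successes) / len(successes)

def spl(successes, shortest, taken):
    """Success weighted by path length: successful episodes are scored by
    how close the taken path length is to the shortest-path length."""
    total = 0.0
    for s, l, p in zip(successes, shortest, taken):
        total += s * l / max(p, l)
    return total / len(successes)

# Four hypothetical episodes.
successes = [1, 1, 0, 1]
shortest = [10.0, 8.0, 12.0, 5.0]
taken = [10.0, 16.0, 30.0, 4.0]

sr_value = success_rate(successes)
spl_value = spl(successes, shortest, taken)
```

SPL can never exceed SR, since each successful episode contributes at most 1; a large gap between the two indicates that the agent succeeds but takes long detours.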

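2605.13311 2026-05-14 cs.AI cs.IR cs.MA updated version

IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation

Joy Bose

Institutions * Independent Researcher

AI summary IdeaForge is a knowledge graph-grounded multi-agent framework that supports innovation analysis and patent claim generation across innovation methodologies (such as TRIZ and Design Thinking). Multiple specialist agents collaborate over a persistent knowledge graph, integrating the structured reasoning results of the different methodologies and using the graph structure to link claims on which methodologies converge, thereby identifying high-confidence innovation candidates. The work proposes a graph-based convergence mechanism and a patent drafting workflow, and experiments show that the approach produces more diverse and traceable innovation candidates than single-methodology baselines.

Comments 14 pages, 3 figures, 6 tables

Details
English abstract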

Current AI-assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt-based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge graph-grounded multi-agent framework for innovation analysis and patent claim generation. IdeaForge integrates multiple innovation methodologies (TRIZ, Design Thinking, and SCAMPER) through specialist agents operating over a persistent FalkorDB knowledge graph. Each agent contributes structured entities and relationships representing contradictions, inventive principles, user needs, transformations, analogies, and candidate claims. The central contribution of IdeaForge is a cross-methodology convergence mechanism implemented through graph-based claim linkage. Claims independently supported by multiple methodologies are connected using CONVERGENT relationships, enabling identification of high-confidence innovation candidates through graph traversal. A downstream patent drafting agent generates structured patent drafts grounded in convergent claim subgraphs, reducing reliance on unconstrained language model generation. An InnovationScore formula ranks claims by convergent support, methodology diversity, claim strength, and prior art challenge count. We describe the graph schema, agent architecture, convergence detection pipeline, and patent synthesis workflow. Experiments on a legal technology use case demonstrate that graph-grounded multi-methodology synthesis produces more diverse and traceable innovation candidates compared to single-methodology baselines. We discuss implications for computational creativity, explainable AI-assisted invention, and graph-native innovation systems.

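2605.13301 2026-05-14 cs.AI cs.CL updated version

Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Yafu Li, Runzhe Zhan, Haoran Zhang, Shunkai Zhang, Yizhuo Li, Zhilin Wang, Jiacheng Chen, Futing Wang, Xuyang Hu, Yuchen Fan, Bangjie Xu, Yucheng Su, Xinmiao Han, Chenxi Li, Haodi Lei, Yufeng Zhao, Zejin Lin, Qianjia Cheng, Tong Zhu, Xiaoye Qu, Ganqu Cui, Peng Ye, Yun Luo, Zhouchen Lin, Yu Qiao, Bowen Zhou, Ning Ding, Yu Cheng

Institutions * Shanghai AI Laboratory; The Chinese University of Hong Kong; Tsinghua University; Shanghai Jiao Tong University; Peking University

AI summary This paper proposes a simple and unified recipe for turning a post-trained reasoning model into a solver that reaches gold-medal level on the International Mathematical and Physics Olympiads. The recipe uses supervised fine-tuning with a reverse-perplexity curriculum to instill rigorous proof-search and self-checking behaviors, scales them through a two-stage reinforcement learning pipeline, and finally improves solving performance with test-time scaling. Experiments show that the resulting model, SU-01, performs strongly on mathematics and physics olympiads and generalizes its scientific reasoning to other domains.

Comments Technical Report. 77 pages

Details
English abstract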

Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline that progresses from RL with verifiable rewards to more delicate proof-level RL, and finally boosts solving performance with test-time scaling. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K sub-8K-token trajectories followed by 200 RL steps. The resulting model, SU-01, supports stable reasoning on difficult problems with trajectories exceeding 100K tokens, while achieving gold-medal-level performance on mathematical and physical olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also demonstrates strong generalization of scientific reasoning to domains beyond mathematics and physics.

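2605.13296 2026-05-14 cs.AI cs.LG cs.MA updated version

Discrete Diffusion for Complex and Congested Multi-Agent Path Finding with Sparse Social Attention

Yuanzhe Wang, Tian Zhi, Zihang Wei, Hongguang Wang, Jiaming Guo, Yang Zhao, Zisheng Liu, Shiyu Quan, Xing Hu, Zidong Du, Yunji Chen

Institutions * State Key Lab of Processors, Institute of Computing Technology, CAS; School of Advanced Interdisciplinary Sciences, CAS; University of Chinese Academy of Sciences; Institute of Microelectronics, CAS

AI summary This paper studies multi-agent path finding (MAPF) in complex, congested environments and proposes DiffLNS, a hybrid framework based on a discrete diffusion model that generates high-quality initial plan drafts to improve repair-based solvers. The method combines a discrete denoising diffusion probabilistic model (D3PM) with sparse social attention and the LNS2 algorithm, generating diverse joint-plan drafts directly in the discrete action space and improving the success rate on large-scale MAPF problems. Across 20 complex and congested settings, DiffLNS achieves an average success rate of 95.8%, outperforming the strongest tested baseline by 9.6 percentage points.

Comments 24 pages, 7 figures

Details
English abstract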

Multi-Agent Path Finding (MAPF) is a coordination problem that requires computing globally consistent, collision-free trajectories from individual start positions to assigned goal positions under combinatorial planning complexity. In dense environments, suboptimal initial plans induce compound conflicts that hinder feasible repair. For repair-based solvers like LNS2, initial plan quality critically affects downstream repair, yet this factor remains underexplored. We propose DiffLNS, a hybrid framework that integrates a discrete denoising diffusion probabilistic model (D3PM) with LNS2. The D3PM serves as an initializer with sparse social attention that learns a spatiotemporal prior over coordinated multi-agent action trajectories from expert demonstrations and samples multiple joint plans. Operating directly on the categorical action space, our discrete diffusion preserves the MAPF action structure and samples from a multimodal joint-plan distribution to produce diverse drafts well suited for neighborhood repair. These drafts act as warm starts for downstream repair, which completes unfinished trajectories and resolves remaining conflicts under hard MAPF constraints. Experimental results show that despite being trained only on instances with at most 96 agents, the initializer generalizes to scenarios with up to 312 agents at inference time. Across 20 complex and congested settings, DiffLNS achieves an average success rate of 95.8%, outperforming the strongest tested baseline by 9.6 percentage points and matching or exceeding all baselines in all 20 settings. To the best of our knowledge, this is the first work to leverage discrete diffusion for warm-starting an LNS-based MAPF solver.

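2605.13295 2026-05-14 cs.CL cs.AI cs.MA updated version

CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

Tom Zehle

Institutions * University of Freiburg; ELLIS Institute Tübingen

AI summary This paper proposes CANTANTE, a framework for optimizing LLM-based multi-agent systems. By contrasting rollouts of different joint configurations on the same query, it decomposes system-level rewards into per-agent update signals, addressing the credit-assignment problem. In experiments on programming, mathematical reasoning, and multi-hop question answering, CANTANTE achieves the best average rank among the evaluated optimizers: it clearly improves over the strongest baseline on programming and mathematical reasoning, stays within one standard deviation of it on multi-hop question answering, and incurs a lower inference cost.

Details
English abstract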

LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these systems is fundamentally a credit-assignment problem. We therefore introduce CANTANTE, a framework that decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. We evaluate CANTANTE against GEPA and MIPROv2 on programming (MBPP), mathematical reasoning (GSM8K), and multi-hop question answering (HotpotQA). Across these benchmarks, CANTANTE achieves the best average rank among all evaluated optimizers and consistently outperforms unoptimized prompts. It improves over the strongest baseline by +18.9 percentage points on MBPP and +12.5 percentage points on GSM8K, while incurring a lower inference cost. It remains within one standard deviation of the strongest baseline on HotpotQA. Crucially, our credit correlation analysis confirms that the attributer produces meaningful per-agent signals rather than echoing the global system score.

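2605.13292 2026-05-14 cs.CL cs.AI cs.IR cs.LG updated version

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel

Institutions * University of Birmingham; Heritage Institute of Technology; Madan Mohan Malaviya University of Technology

AI summary This paper introduces IndicMedDialog, a parallel multi-turn medical dialogue dataset covering English and nine Indic languages, which aims to improve the multilingual applicability and conversational realism of medical dialogue systems. The dialogues are generated by a large language model, verified by native speakers, and refined through post-processing. On this dataset the authors fine-tune IndicMedLM, a parameter-efficient medical language model, for more personalised symptom elicitation. The work compares against multilingual baselines and uses expert evaluation to validate clinical plausibility.

Comments Accepted in BioNLP @ ACL 2026 Conference

Details
English abstract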

Most existing medical dialogue systems operate in a single-turn question--answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.

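2605.13290 2026-05-14 cs.AI updated version

What properties of reasoning supervision are associated with improved downstream model quality?

Mikołaj Langner, Dzmitry Pihulski, Jan Eliasz, Michał Rajkowski, Przemysław Kazienko, Maciej Piasecki, Jan Kocoń, Teddy Ferdinan

Institutions * Wroclaw Tech

AI summary This paper investigates whether the utility of a reasoning dataset can be reliably predicted before training from intrinsic data metrics, reducing reliance on expensive trial-and-error fine-tuning. The authors propose a suite of quantitative metrics and validate them by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset, finding significant correlations with downstream model performance. The predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy and verbose reasoning traces when solving complex tasks. The findings provide a scale-aware framework for validating reasoning data and help practitioners select training sets more efficiently.

Comments To appear in the Proceedings of the International Conference on Computational Science (ICCS) 2026

Details
English abstract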

Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks. These findings establish a scale-aware framework for validating reasoning data, enabling practitioners to select effective training sets without the need for exhaustive empirical testing.

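2605.13287 2026-05-14 cs.LG cs.AI math.OC stat.ML updated version

Delightful Exploration

Ian Osband

Institutions * Google DeepMind

AI summary This paper proposes Delight-gated exploration (DE), an exploration strategy for large action spaces under a limited exploration budget. It decides whether to explore by measuring "delight", the product of prospective improvement and surprisal, so that scarce exploratory actions are spent more efficiently. Across several tasks DE shows weaker regret growth than Thompson Sampling and $\varepsilon$-greedy, and its hyperparameters transfer across tasks without retuning.

Details
English abstract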

Most exploration algorithms search broadly until uncertainty is resolved. When the action space is too large to resolve within budget, practitioners default to $\varepsilon$-greedy, which bounds disruption but spends its override blindly. We introduce \textit{Delight-gated exploration} (DE), a host--override rule that spends exploratory actions only when their prospective delight (expected improvement times surprisal) exceeds a gate price. This practical heuristic recovers a classical result: Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms shut off above a prior-determined threshold, and selected linear-bandit overrides consume finite information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, the same hyperparameters transfer without retuning, and DE shows much weaker regret growth than Thompson Sampling and $\varepsilon$-greedy in the tested unresolved regimes. Delight improves acting for the same reason it improves learning: it prices scarce resources by the product of upside and surprisal.
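The gating rule in the abstract (override the host policy only when expected improvement times surprisal exceeds a gate price) can be sketched as follows. This is a toy reading under stated assumptions, not the paper's algorithm: expected improvement is estimated from posterior samples against an incumbent value, surprisal is taken as the negative log-probability the host policy assigns to the action, and all names (`delight`, `choose`, `gate_price`) are hypothetical.

```python
import math

def expected_improvement(samples, incumbent):
    """Monte Carlo estimate of E[max(0, value - incumbent)]."""
    return sum(max(0.0, s - incumbent) for s in samples) / len(samples)

def surprisal(prob):
    """Negative log-probability (assumed here: under the host policy)."""
    return -math.log(prob)

def delight(samples, incumbent, host_prob):
    """Product of upside and surprisal, as named in the abstract."""
    return expected_improvement(samples, incumbent) * surprisal(host_prob)

def choose(host_action, candidates, incumbent, gate_price):
    """Return the host action unless some candidate's delight beats the gate.

    candidates maps action -> (posterior value samples, host probability).
    """
    best, best_delight = host_action, gate_price
    for action, (samples, prob) in candidates.items():
        d = delight(samples, incumbent, prob)
        if d > best_delight:
            best, best_delight = action, d
    return best

candidates = {
    "a1": ([0.4, 0.5, 0.9], 0.05),   # some upside, rarely chosen by the host
    "a2": ([0.5, 0.55, 0.6], 0.5),   # no upside over the incumbent
}
cheap_gate = choose("host", candidates, incumbent=0.6, gate_price=0.2)
pricey_gate = choose("host", candidates, incumbent=0.6, gate_price=0.5)
```

With a low gate price the rule overrides the host with `a1`; raising the price closes the gate and the host action is kept, which is the budget-limiting behavior the abstract describes.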

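2605.13280 2026-05-14 cs.SE cs.AI updated version

The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code

Hengzhi Ye, Fengyuan Ran, Weiwei Xu, Minghui Zhou

Institutions * Peking University; Wuhan University

AI summary As large language models (LLMs) are widely adopted in software development, the readability of generated code, a critical non-functional attribute, remains understudied. This paper builds a comprehensive readability model that combines textual, structural, program, and visual features, and systematically evaluates the readability of code generated by mainstream LLMs across 5,869 scenarios. It finds overall readability comparable to human-written code, but with distinct patterns of readability issues. The study also shows that prompt design affects readability only to a limited extent, which indicates that the long-term maintainability of LLM-generated code still needs improvement.

Details
English abstract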

As Large Language Models (LLMs) are transforming software development, the functional quality of generated code has become a central focus, leaving readability, one of the critical non-functional attributes, understudied. Given that LLM-generated code still needs human review before adoption, it is important to understand its readability, especially compared with human-written code, and the role of prompt design in shaping it. We therefore set out to conduct a systematic investigation into the readability of LLM-generated code. To systematically quantify code readability, we establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code. Based on the model, we evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode. We find that current LLMs produce code with overall readability comparable to human-written code, but with distinct readability issue patterns. We further examine how different prompt dimensions affect the readability of LLM-generated code, and find that function signatures, constraints, and style descriptions emerge as the most influential factors, while the overall impact of prompt design remains limited. Our findings indicate that, on one hand, LLM-generated code is at least comparable to human-written code in readability, validating its potential for systematic integration into software workflows from a non-functional perspective; on the other hand, distinct readability issue patterns and the limited effectiveness of prompt engineering reveal a latent technical debt, highlighting the need for future research to improve the readability of LLM-generated code and thus ensure long-term maintainability.

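2605.13277 2026-05-14 cs.CL cs.AI cs.CV cs.IR cs.LG updated version

Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, Ziyi Huang

Institutions * Arizona State University; Texas A&M University; Morgan Stanley

AI summary This paper studies visual evidence selection in multimodal retrieval-augmented generation (RAG), noting that existing methods typically rely on semantic relevance or surface-level similarity, which poorly reflect the actual utility of evidence for downstream reasoning. The authors redefine evidence utility from an information-theoretic perspective as the information gain induced on the model's output distribution, and design a training-free framework that estimates it efficiently with lightweight multimodal models. Experiments show that the method outperforms existing RAG baselines on multiple benchmarks while substantially reducing computational cost.

Comments Accepted to ACL 2026

Details
English abstract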

Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
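One simple way to make "information gain induced on a model's output distribution" concrete is as a reduction in entropy of the answer distribution once a piece of evidence is added. The sketch below ranks candidate images by that quantity. Treating information gain as an entropy difference, and the toy distributions and names (`information_gain`, `rank_evidence`), are assumptions for illustration; the paper's latent-variable formulation and surrogate models are not reproduced here.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def information_gain(prior, posterior):
    """Entropy of the answer distribution without evidence minus with it."""
    return entropy(prior) - entropy(posterior)

def rank_evidence(prior, posteriors):
    """Sort evidence names by descending information gain."""
    return sorted(
        posteriors,
        key=lambda name: information_gain(prior, posteriors[name]),
        reverse=True,
    )

# Hypothetical answer distributions over four options.
prior = [0.25, 0.25, 0.25, 0.25]          # model output with no evidence
posteriors = {
    "img_a": [0.85, 0.05, 0.05, 0.05],     # sharply reduces uncertainty
    "img_b": [0.40, 0.30, 0.20, 0.10],     # mildly informative
    "img_c": [0.25, 0.25, 0.25, 0.25],     # changes nothing
}
ranking = rank_evidence(prior, posteriors)
```

Note that an image can be semantically similar to the query and still behave like `img_c`, which is the mismatch between relevance and utility that the abstract points to.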

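2605.13261 2026-05-14 cs.HC cs.AI updated version

"It became a self-fulfilling prophecy": How Lived Experiences are Entangled with AI Predictions in Menstrual Cycle Tracking Apps

Wendy Zhou, Pelin Karaturhan, Alexandra Weilenmann, Jichen Zhu

Institutions * IT University of Copenhagen; Department of Applied Information Technology, University of Gothenburg

AI summary This paper examines how AI predictions in menstrual cycle tracking apps become entangled with users' lived experiences and shape how they understand their bodies and mental states. Through semi-structured interviews and a group autoethnography, it finds that users often interpret their experiences in light of AI predictions, although these predictions can be faulty because of imperfect logging, and that interface design offers little support for critical engagement. It also finds that non-normative users often feel isolated in their interaction with the AI, and proposes design implications for predictive AI features.

Details
English abstract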

In menstrual cycle tracking apps (MCTAs), AI-based predictions and insights have become increasingly popular. These features enable users to receive personalized information about their bodies and mental states. However, there is currently little research on how these predictive AI features and explanations affect users' lived experiences. This paper examines human-AI entanglement in MCTAs through 14 semi-structured user interviews and a group autoethnography. These methods uncover the processes leading to this phenomenon. Our results reveal that: (1) users understand their lived experiences in light of AI predictions, although these predictions can be faulty due to imperfect logging practices, (2) the user interface features and AI explanations do not support awareness or critical engagement with this entanglement and meaning-making, and (3) non-normative MCTA users report a sense of isolation in this entangled interaction. Based on our findings, we propose design implications for predictive AI features and explanations.

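2605.13255 2026-05-14 cs.AI updated version

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

Junlong Ke, Zichen Wen, Weijia Li, Conghui He, Linfeng Zhang

Institutions * Shanghai Jiao Tong University; Tsinghua University; Shanghai AI Laboratory

AI summary This paper studies how to make better use of the teacher's uncertainty in on-policy self-distillation to improve the reasoning efficiency of large language models. It proposes EGRSD (Entropy-Guided Reinforced Self-Distillation), which combines a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and a teacher-entropy confidence gate to adjust the supervision weight of each token position. It further introduces CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient ones. Experiments show that the method improves the trade-off between reasoning accuracy and length over the compared trainable methods.

Details
English abstract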

On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
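The confidence gate described above (down-weight high-entropy teacher positions while keeping every token weight above a nonzero floor) might look roughly like the sketch below. The specific form, one minus normalized entropy clipped from below, is an assumption chosen for simplicity; the abstract does not give the gate's formula, and `confidence_gate` and `floor` are hypothetical names.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def confidence_gate(teacher_probs, floor=0.1):
    """Token weight in [floor, 1]: high when the teacher is confident.

    Assumed form: 1 - H(p) / log(V), clipped below at `floor` so that no
    token position is ever ignored entirely.
    """
    h = entropy(teacher_probs)
    h_max = math.log(len(teacher_probs))
    return max(floor, 1.0 - h / h_max)

confident = [0.97, 0.01, 0.01, 0.01]   # low-entropy teacher distribution
uncertain = [0.25, 0.25, 0.25, 0.25]   # maximum-entropy teacher distribution

w_confident = confidence_gate(confident)
w_uncertain = confidence_gate(uncertain)
```

The floor matters at positions like `uncertain`: without it the weight would drop to zero and the student would receive no supervision there at all.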

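2605.13248 2026-05-14 eess.SP cs.AI updated version

Compact Latent Manifold Translation: A Parameter-Efficient Foundation Model for Cross-Modal and Cross-Frequency Physiological Signal Synthesis

Bo Cui, Xiaowen Song, Yaowen Zhang, Shunzhe Zhang, B. J. F. van Beijnum, Monique Tabak, Ying Wang

Institutions * Department of Biomedical Signals and Systems, University of Twente

AI summary To address the modality and frequency gaps that heterogeneous recording devices introduce into the analysis of physiological signals such as electrocardiograms (ECG) and photoplethysmograms (PPG), this work proposes Compact Latent Manifold Translation (CLMT), a parameter-efficient unified framework. It uses a two-stage discrete translation scheme: a Universal Tokenizer based on Hierarchical Residual Vector Quantization (RVQ) decouples heterogeneous signals, and a Context-Prompted Latent Translator that integrates physiological priors maps tokens across modalities, enabling high-fidelity cross-modal and cross-frequency signal synthesis. Experiments show that the model, with only 0.09B parameters, significantly outperforms much larger baselines on cross-modal synthesis and high-frequency super-resolution.

Details
English abstract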

The analysis of physiological time series, such as electrocardiograms (ECG) and photoplethysmograms (PPG), is persistently hindered by modality and frequency gaps stemming from heterogeneous recording devices. Existing foundation models typically rely on continuous latent spaces, which frequently suffer from severe modality entanglement, lack high-fidelity cross-frequency generative capacity, and impose high computational costs that prohibit edge-device deployment. In this paper, we propose Compact Latent Manifold Translation (CLMT), a highly parameter-efficient (0.09B) unified framework that bridges these gaps through a novel two-stage discrete translation paradigm. First, we introduce a Universal Tokenizer utilizing Hierarchical Residual Vector Quantization (RVQ) to decouple heterogeneous signals into isolated, well-structured discrete latent manifolds, effectively preventing inter-modality interference. Second, a Context-Prompted Latent Translator maps these discrete tokens across modalities by integrating static physiological priors, reframing complex signal synthesis as a pure latent sequence translation task. Extensive evaluations demonstrate that our 0.09B model significantly outperforms massive baselines. In cross-modal PPG-to-ECG synthesis, it resolves temporal phase drift and dramatically improves the clinical R-peak detection F1-score from 0.37 (baseline) to 0.83. Furthermore, in extreme cross-frequency super-resolution (25Hz to 100Hz), it successfully recovers high-frequency diagnostic landmarks, achieving an unprecedented Pearson correlation of 0.9956. By learning a universal discrete language for biological signals with a fraction of the computational footprint, our approach sets a new trajectory for edge-deployable, multi-modal medical foundation models.
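Residual vector quantization, the building block named in the tokenizer, can be illustrated generically: each stage quantizes whatever the previous stages failed to capture, so the sequence of codebook indices forms a coarse-to-fine discrete code. The sketch below is the textbook technique with tiny hand-written codebooks, not the paper's hierarchical tokenizer; the codebook values are made up.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = np.asarray(x, dtype=float).copy()
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry of every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Two hypothetical stages: a coarse codebook and a fine correction codebook.
coarse = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
fine = np.array([[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]])
codebooks = [coarse, fine]

x = np.array([1.2, 0.8])
codes = rvq_encode(x, codebooks)
recon = rvq_decode(codes, codebooks)
err_one_stage = float(np.linalg.norm(x - coarse[codes[0]]))
err_two_stage = float(np.linalg.norm(x - recon))
```

Adding the second stage shrinks the reconstruction error, which is why stacking small codebooks can represent a signal more precisely than a single codebook of the same size.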

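2605.13245 2026-05-14 cs.AI updated version

It's not the Language Model, it's the Tool: Deterministic Mediation for Scientific Workflows

Marios Adamidis, Danae Katrisioti, Yannis Tzitzikas, Emmanuel Stratakis

Institutions * Department of Materials Science and Technology, University of Crete; Institute of Electronic Structure and Laser, FORTH; Computer Science Department, University of Crete; Institute of Computer Science, FORTH; Department of Physics, University of Crete

AI summary This work examines the reproducibility of analyses that language models generate in scientific workflows, noting that repeated generations on the same data can yield different results with no obvious way to decide which to trust. The authors propose "typed mediation", in which the model calls deterministic tools to perform the analysis, each tool encoding the exact procedure for one instrument, so that results stay consistent. In the evaluation, the typed tool produces identical results across all runs of the same analysis task, whereas the commercial platforms vary across runs or fail to produce valid results. The work offers a practical pattern for scientific analysis where reproducibility is mandatory.

Comments 18 pages, 4 figures, 2 appendices. Submitted to SETN 2026

Details
English abstract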

Language models can produce convincing scientific analyses, but repeated generations on the same data do not guarantee the same result. A researcher may regenerate an identical query and receive a different fit, a different peak position or a different analysis procedure, without an obvious way to decide which output to trust. We propose typed mediation, a pattern in which the model orchestrates deterministic tools rather than generating analytical code. Each tool encodes one researcher's exact procedure for one instrument, ported through structured interviews. The model selects which tool to call and with what parameters. The tool produces the result. Regeneration does not change it. We evaluate this claim by running the same photoluminescence analysis on four platforms, including three commercial foundation models, four times each with the same prompt. The typed tool produces identical results across all runs. The commercial platforms either vary in numerical output and analytical methodology across runs, or fail to produce valid results on the task. We deploy this pattern on two instruments serving users over approximately six months, with very positive user feedback. Both cases are very challenging: they involve proprietary binary formats and per-seat licensed software, which force the tool to remain on local infrastructure alongside the data and the instrument it operates. We argue that deployment topology is not just a preference, but a structural requirement of scientific tool mediation. The result is a practical pattern for deploying language models in scientific workflows where reproducibility is mandatory, reducing analysis time from weeks to minutes while guaranteeing identical outputs across runs.
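The division of labor in typed mediation (the model picks a tool and its parameters, the tool computes the result) can be sketched with a small registry of deterministic functions and typed parameter objects. Everything here is hypothetical scaffolding for illustration: `PeakParams`, `find_peak`, `TOOLS`, and `run_tool_call` are invented names, and the toy peak finder stands in for a researcher's real procedure.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class PeakParams:
    """Typed parameters for the toy peak finder; validated on construction."""
    window: int

    def __post_init__(self):
        if not isinstance(self.window, int) or self.window < 1 or self.window % 2 == 0:
            raise ValueError("window must be a positive odd integer")

def find_peak(signal: List[float], params: PeakParams) -> int:
    """Deterministic tool: centre index of the highest moving-window sum."""
    w = params.window
    sums = [sum(signal[i:i + w]) for i in range(len(signal) - w + 1)]
    best = max(range(len(sums)), key=sums.__getitem__)
    return best + w // 2

# Registry mapping tool names to (parameter type, implementation).
TOOLS = {"find_peak": (PeakParams, find_peak)}

def run_tool_call(call: dict, signal: List[float]) -> int:
    """Execute a model-issued call; the model never writes analysis code."""
    params_type, tool = TOOLS[call["tool"]]
    return tool(signal, params_type(**call["params"]))

signal = [0.0, 1.0, 3.0, 7.0, 3.0, 1.0, 0.0]
call = {"tool": "find_peak", "params": {"window": 3}}  # what a model might emit
results = [run_tool_call(call, signal) for _ in range(4)]
```

Repeating the same call four times mirrors the paper's evaluation protocol in miniature: the only model-controlled inputs are the tool name and parameters, so identical calls give identical results, and a malformed call fails loudly at validation instead of producing a silently different analysis.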

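2605.13229 2026-05-14 cs.AI cs.SE updated version

Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization

Yuhan Wu, Huan Zhang, Wei Cheng, Chen Shen, Jingyue Yang, Wei Hu

Institutions * State Key Laboratory for Novel Software Technology, Nanjing University, China; National Institute of Healthcare Data Science, Nanjing University, China

AI summary This paper studies how to improve the accuracy and semantic consistency of code translation and proposes CTO, a syntax-guided and semantic-aware preference optimization method. It trains a cross-lingual semantic model through contrastive learning to directly assess functional equivalence between source and translated code, and unifies this semantic signal with compiler-based syntactic feedback in a multi-objective optimization framework. Experiments show that CTO significantly outperforms existing baselines on C++, Java, and Python code translation.

Comments Accepted in the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

Details
English abstract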

LLMs have shown immense potential for code translation, yet they often struggle to ensure both syntactic correctness and semantic consistency. While preference-based learning offers a promising alignment strategy, it is hindered by unreliable semantic rewards derived from sparse test cases or restrictive reference translations. We argue that a robust semantic reward for code translation must be derived directly from the source code. In this paper, we propose CTO to improve code translation with syntax-guided and semantic-aware preference optimization. Through contrastive learning, we train a cross-lingual semantic model to directly assess functional equivalence between source and translated code. By formulating code translation as a multi-objective optimization problem, this robust semantic signal is seamlessly unified with compiler-based syntactic feedback within the direct preference optimization framework. Extensive experiments on C++, Java, and Python translations demonstrate that CTO significantly outperforms existing baselines and alternative preference optimization strategies.

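2605.13228 2026-05-14 cs.CV cs.AI updated version

ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

Xiao Liu, Nayu Liu, Junnan Zhu, Ruirui Chen, Guohui Xiang, Changjian Wang, Kaiwen Wei, Rongzhen Li, Jiang Zhong

Institutions * Chongqing University; Tianjin University; MAIS, Institute of Automation, Chinese Academy of Sciences; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; Chongqing National Data AI Research Institute, AI Research Lab

AI summary This paper proposes ReTool-Video, a recursive tool-using video agent method that aims to improve complex reasoning and cross-modal analysis in video understanding. To address the limits of existing video agents in tool granularity and action space, the authors build the MetaAug-Video Tool Library (MVTL) with 134 tools, which supports fine-grained operations and dual-level information access, and design a recursive tool-calling mechanism that progressively decomposes high-level video intents into executable tool chains. Experiments show that the method outperforms strong baselines on several benchmarks and improves the stability and effectiveness of complex video understanding.

Details
English abstract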

Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.

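2605.13221 2026-05-14 cs.AI cs.LG Updated version

An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

Hanwen Zhang, Dusit Niyato, Wei Zhang, Xin Lou, Malcolm Yoke Hean Low

Affiliations * Nanyang Technological University; Singapore Institute of Technology; Seatrium New Energy Laboratory; Ministry of Education (MOE) Tier 1; Research Innovation and Enterprise (RIE) 2025 Industry Alignment Fund-Industry Collaboration Projects (IAF-ICP)

AI summary This paper studies a hybrid scheduling problem in UAV-assisted logistics combined with edge computing, in which physical logistics decisions are coupled with computational task scheduling. To address this challenge, the authors propose an agentic-AI-based optimization framework that combines large language models with chain-of-thought reasoning to translate user input into an interpretable mathematical formulation, and design a hierarchical deep reinforcement learning method based on proximal policy optimization to optimize UAV routing and resource allocation for task execution. Experiments show that the framework performs well on deadline satisfaction rate and product collection success rate, with stable performance that surpasses conventional methods.

Comments 15 pages

Details
English abstract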

In cloud manufacturing, unmanned aerial vehicles (UAVs) can support both product collection and mobile edge computing (MEC). This joint operation forms a hybrid scheduling problem, where physical logistics decisions are coupled with computational task scheduling. In this paper, UAVs collect finished products from manufacturing stations and transport them back to a central depot. Meanwhile, computational tasks generated by industrial sensor devices at these stations are processed locally, at UAVs, or offloaded via UAVs to the cloud. This coupling makes the problem challenging. A UAV can provide MEC services only during its service window at a station, so routing decisions directly determine when UAV-assisted offloading is available. Routing decisions also affect the UAV energy budget and the availability of onboard computing and communication resources for computational task execution under task deadline constraints. To address this, we propose an agentic-AI-assisted optimization framework with two components. First, we develop an agentic AI that combines large language models, retrieval-augmented generation, and chain-of-thought reasoning to translate user input into an interpretable mathematical formulation for the hybrid scheduling problem. Second, we design a hierarchical deep reinforcement learning approach based on proximal policy optimization (PPO), where the upper layer learns UAV routing and the lower layer optimizes per-slot task execution and resource allocation. Simulation results show that the proposed framework yields more consistent formulations, while the hierarchical PPO achieves full product collection in 99.6% of the last 500 episodes and maintains a 100% deadline satisfaction rate, with more stable performance than the advantage actor-critic approach.

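2605.13202 2026-05-14 cs.CV cs.AI Updated version

STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

Hongli Liu, Yu Wang, Shengjie Zhao

Affiliations * School of Computer Science and Technology, Tongji University; Engineering Research Center of Key Software Technologies for Smart City Perception and Planning, Ministry of Education

AI summary This paper studies semantic-temporal alignment in few-shot action recognition (FSAR) and proposes STAR, a unified semantic-temporal adaptive representation learning framework. By introducing a Temporal Semantic Attention mechanism and a Semantic Temporal Prototype Refiner module, the method addresses the alignment of textual prompts with sparse visual cues in action sequences and strengthens the modeling of multi-scale temporal dynamics. Experiments show that STAR outperforms existing methods on multiple benchmark datasets, validating its effectiveness with limited samples.

Comments Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

Details
English abstract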

Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches still suffer from semantic-temporal misalignment, where static textual prompts fail to capture decisive visual cues that appear sparsely across sequences, and from inadequate modeling of multi-scale temporal dynamics, as short-term discriminative cues and long-range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic Temporal Adaptive Representation Learning (STAR), a unified framework consisting of a semantic-alignment component and a temporal-aware component that bridges the semantic and temporal gaps and transfers the sequence modeling capability of Mamba to FSAR. The semantic-alignment module introduces a Temporal Semantic Attention (TSA) mechanism, which performs frame-level cross-modal alignment with textual cues, ensuring fine-grained semantic-temporal consistency. The temporal-aware module incorporates a Semantic Temporal Prototype Refiner (STPR) that integrates semantic-guided Mamba blocks with multi-frequency temporal sampling and bidirectional state-space refinement, yielding semantically aligned prototypes with enhanced discriminative fidelity and temporal consistency. Furthermore, temporally dependent class descriptors derived from large language models (LLMs) provide long-range semantic guidance. Extensive experiments on five FSAR benchmarks demonstrate the consistent superiority of STAR over state-of-the-art methods. For instance, STAR achieves gains of up to 8.1% and 6.7% on the SSv2-Full and SSv2-Small datasets under the 1-shot setting, and 7.3% on HMDB51, validating its effectiveness under limited supervision. The code is available at https://github.com/HongliLiu1/STAR-main.

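2605.13197 2026-05-14 cs.LG cs.AI Updated version

McCast: Memory-Guided Latent Drift Correction for Long-Horizon Precipitation Nowcasting

Penghui Wen, Yu Luo, Lintao Wang, Mengwei He, Patrick Filippi, Thomas Francis Bishop, Zhiyong Wang

Affiliations * School of Computer Science, The University of Sydney, Australia; School of Life and Environmental Science, The University of Sydney, Australia

AI summary Existing precipitation nowcasting methods usually adopt an autoregressive framework, which tends to accumulate errors over long rollouts and causes forecasts to drift away from physically plausible evolution trajectories. To address this, the paper proposes McCast, a memory-guided latent drift correction method that introduces a temporally organized memory bank to actively correct deviations in latent evolution during autoregression, producing more temporally coherent and reliable long-horizon forecasts. Experiments show that McCast achieves state-of-the-art performance on two benchmark datasets, SEVIR and MeteoNet, particularly on long-horizon forecasting.

Details
English abstract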

Existing precipitation nowcasting methods typically adopt an autoregressive formulation, where future states are predicted from previous outputs. However, such an approach accumulates errors over long rollouts, causing forecasts to drift away from physically plausible evolution trajectories. Although various studies have attempted to alleviate this problem by improving step-wise prediction accuracy, they largely neglect the global temporal evolution of meteorological systems and lack mechanisms to actively correct drift during rollouts. To address this issue, we propose McCast, a memory-guided latent drift correction method for precipitation nowcasting. Rather than treating memory as an unordered dictionary of latent states for passive conditioning, McCast leverages temporally organized memory to actively correct autoregressive latent evolution. Specifically, McCast introduces a Drift-Corrective Memory Bank (DCBank) that explicitly estimates the temporally consistent drift corrections to calibrate the divergent trajectory. DCBank performs drift correction in two stages: a Corrective Latent Extractor first predicts an initial correction from the current prediction and a reference latent state, and a Correction-Aware Memory Retrieval module then refines the initial correction using temporally organized historical memory. By explicitly correcting latent evolution, instead of improving step-wise prediction accuracy only, McCast produces more temporally coherent and reliable long-horizon forecasts. Experiments on two widely used benchmarks, SEVIR and MeteoNet, show that McCast achieves state-of-the-art performance, particularly in challenging long-horizon forecasting scenarios.

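2605.13194 2026-05-14 cs.LG cs.AI Updated version

ECG-NAT: A Self-supervised Neighborhood Attention Transformer for Multi-lead Electrocardiogram Classification

Mahsa Gazeran, Sayvan Soleymanbaigi, Fatemeh Daneshfar, Amjad Seyedi, Fardin Akhlaghian Tab

Affiliations * Department of Computer Engineering, University of Kurdistan; Department of Mathematics and Operational Research, University of Mons

AI summary This paper proposes ECG-NAT, a self-supervised neighborhood attention transformer for multi-lead electrocardiogram (ECG) classification. Training proceeds in two stages: generative pretraining with a masked autoencoder on unlabeled data to learn robust cross-dataset representations, followed by discriminative fine-tuning with a dual-loss function that combines supervised contrastive and cross-entropy losses to improve classification. ECG-NAT uses a hierarchical attention mechanism to efficiently capture multi-scale temporal features, from fine-grained beat morphology to broader rhythm patterns. It achieves strong classification accuracy with little labeled data and is suited to real-time ECG diagnosis.

Details
English abstract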

Electrocardiogram (ECG) arrhythmia classification remains challenging due to signal variability, noise, limited labeled data, and the difficulty in achieving both accuracy and efficiency in models. While self-supervised learning reduces label dependency, most methods target either global contextual features or local morphological patterns, but rarely implement hierarchical multi-scale feature extraction. ECG signals require architectures that simultaneously capture fine-grained beat-level morphology and broader rhythm-level dependencies with computational efficiency. To overcome this limitation, this paper proposes the Electrocardiogram Neighborhood Attention Transformer (ECG-NAT), a novel self-supervised learning approach tailored for multi-lead ECG classification. Our two-stage approach begins with generative pretraining, using a masked autoencoder to reconstruct partially masked ECG signals across multiple diverse datasets, enabling the model to learn robust, domain-invariant representations from unlabeled data. This is followed by discriminative fine-tuning with a dual-loss function that combines supervised contrastive and cross-entropy losses, aligning representation learning with label prediction. The hierarchical attention mechanism efficiently captures multi-scale temporal features from localized beat morphology to broader rhythm patterns at low computational cost. ECG-NAT achieves robust performance on benchmark datasets, with 88.1% accuracy using only 1% labeled data, demonstrating strong efficacy in low-resource settings. The framework combines superior classification performance with computational efficiency, making it practical for real-time ECG diagnosis. The code will be made available upon acceptance at: https://github.com/Mahsagazeran/ECG-NAT.

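2605.13190 2026-05-14 cs.LG cs.AI Updated version

N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

Aleksander Lorenc, Frédéric Berdoz, Joël Mathys, Roger Wattenhofer

Affiliations * ETH Zurich

AI summary This paper proposes N-vium, a mixture-of-exits transformer that aims to improve the inference efficiency of autoregressive transformers. It attaches prediction heads at different depths and uses adaptive routing to partially parallelize computation, increasing effective computation per second rather than simply reducing compute per token. Experiments show that N-vium achieves up to 57.9% wall-clock speedup over a standard transformer at the same perplexity.

Details
English abstract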

Improving the inference efficiency of autoregressive transformers typically means reducing FLOPs per token, usually through approximations that degrade model quality. We introduce N-vium, a mixture-of-exits transformer that partially parallelizes computation across depth on standard hardware, increasing effective FLOPs per second rather than minimizing compute per token. N-vium attaches prediction heads at multiple depths and defines the next-token distribution as a learned mixture over these exits, with token-adaptive routing. This formulation strictly generalizes the standard transformer, which is recovered exactly when routing assigns zero mass to all intermediate heads. Sampling from the mixture is exact, and complete KV caches are recovered by deferring the upper-layer computation and batching it with later tokens. We pretrain N-vium at scales up to 1.5B parameters. Our largest model reaches 57.9% wall-clock speedup over a parameter- and data-matched standard transformer at no perplexity cost.
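
The mixture-over-exits formulation in this abstract can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the plain-list representation of logits, and the fixed routing weights are all hypothetical, and the deferred upper-layer computation and KV-cache batching are not modeled. It only shows that the next-token distribution is a convex combination of per-exit distributions, that sampling an exit first and then a token from that exit draws exactly from the mixture, and that zero routing mass on intermediate heads recovers the final head.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_distribution(exit_logits: List[List[float]],
                         routing_weights: List[float]) -> List[float]:
    """Next-token distribution as a weighted sum of per-exit softmaxes.

    `routing_weights` are assumed to be non-negative and to sum to 1.
    """
    vocab = len(exit_logits[0])
    mix = [0.0] * vocab
    for w, logits in zip(routing_weights, exit_logits):
        for i, p in enumerate(softmax(logits)):
            mix[i] += w * p
    return mix


def sample_token(exit_logits: List[List[float]],
                 routing_weights: List[float],
                 rng: random.Random) -> int:
    """Pick an exit by its routing weight, then a token from that exit."""
    k = rng.choices(range(len(exit_logits)), weights=routing_weights)[0]
    probs = softmax(exit_logits[k])
    return rng.choices(range(len(probs)), weights=probs)[0]
```

With two exits, passing `routing_weights=[0.0, 1.0]` makes `mixture_distribution` return the final head's softmax unchanged, which mirrors the claim that the standard transformer is a special case.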

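2605.13181 2026-05-14 cs.LG cs.AI Updated version

Stable Attention Response for Reliable Precipitation Nowcasting

Penghui Wen, Zexin Hu, Sen Zhang, Patrick Filippi, Xiaogang Zhu, Allen Benter, Thomas Bishop, Zhiyong Wang, Kun Hu

Affiliations * School of Computer Science, The University of Sydney; School of Life and Environmental Science, The University of Sydney; School of Computer Science and Information Technology, The University of Adelaide; Digital Agriculture, Orange Agricultural Institute; School of Science, Edith Cowan University

AI summary Precipitation nowcasting is challenging because atmospheric dynamics are highly localized, rapidly changing, and heterogeneous. Although recent methods increasingly adopt attention-based architectures in unimodal and multimodal settings, they focus on stronger representation learning and prediction capacity and overlook the stability of attention responses across samples. This paper proposes HARECast, a precipitation nowcasting framework based on head-wise attention-response energy regulation, which improves the stability and reliability of predictions by reducing cross-sample fluctuations in attention-response energy and achieves state-of-the-art performance on multiple benchmark datasets.

Details
English abstract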

Precipitation nowcasting remains challenging due to the highly localized, rapidly evolving, and heterogeneous nature of atmospheric dynamics. Although recent methods increasingly adopt attention-based architectures in both unimodal and multimodal settings, they mainly emphasize stronger representation learning and prediction capacity, while paying less attention to the stability of attention responses across samples. In this work, we show that cross-sample instability of attention-response energy is an important and previously underexplored source of forecasting unreliability. Empirically, inaccurate forecasts are associated with larger attention-response energy variance across heads and layers. Theoretically, we show that cross-sample variability can propagate through self-attention, and enlarge a lower bound on prediction error. Based on this insight, we propose HARECast, a Head-wise Attention Response Energy-regulated framework for precipitation nowcasting. HARECast explicitly models head-wise attention-response energy and stabilizes it through a group-wise regularization objective that reduces cross-sample fluctuations. The proposed formulation is generic and applicable to both unimodal and multimodal nowcasting architectures. We instantiate HARECast in a standard forecasting pipeline with reconstruction branches and a diffusion-based predictor, and evaluate it on commonly used benchmarks--SEVIR and MeteoNet. Experimental results demonstrate that HARECast achieves state-of-the-art performance.

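2605.13172 2026-05-14 cs.MA cs.AI Updated version

When Does Hierarchy Help? Benchmarking Agent Coordination in Event-Driven Industrial Scheduling

Ziqi Wang, Yuhao Yang, Zhiwei Ling, Wenzhuo Qian, Hailiang Zhao

Affiliations * Zhejiang University

AI summary This paper introduces DESBench, a new benchmark for evaluating agent coordination in hierarchical event-driven industrial scheduling. The study examines how different coordination mechanisms (centralized, hierarchical, heterarchical, and holonic) perform in environments with dynamically coupled constraints, revealing trade-offs among robustness, efficiency, and communication overhead. The work offers insight into design principles for agent coordination in complex systems and highlights the need for more adaptive and dynamic coordination mechanisms in future multi-agent systems research.

Details
English abstract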

Recent advances in agent and multi-agent systems have shown strong performance on tool use, reasoning, and collaborative tasks. However, existing benchmarks mostly evaluate task completion in weakly coupled environments, and provide limited support for studying coordination in shared, dynamically evolving systems with hierarchy and coupled constraints. This leaves an important question underexplored: when do different coordination paradigms succeed or fail? We introduce Distributed Event-driven Scheduling Benchmark (DESBench), a benchmark for evaluating agent coordination in hierarchical event-driven scheduling. Built on a shared discrete-event driven environment in industrial scheduling, our benchmark captures multi-timescale decision making, partial observability, and dynamically coupled constraints. We define tasks and metrics that evaluate effectiveness, constraint alignment, coordination efficiency, and robustness, and focus on four representative coordination paradigms: centralized, hierarchical, heterarchical, and holonic. These paradigms correspond to distinct mechanisms of information flow, decision authority, and conflict resolution. Our controlled evaluations reveal clear coordination trade-offs: centralized coordination is robust and communication-efficient but scales poorly with difficulty; hierarchical coordination improves efficiency through decomposition but suffers from cross-level misalignment; heterarchical coordination is flexible but communication-heavy; and holonic coordination satisfies constraints well but loses global robustness. These findings demonstrate that coordination design fundamentally shapes agent system behavior in complex environments, revealing structural trade-offs that cannot be captured by outcome metrics alone and underscoring the imperative for more adaptive, principled, and dynamic coordination mechanisms in future MAS research.

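2605.13171 2026-05-14 cs.AI Updated version

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

Moritz Firsching, Paul Lezeau, Salvatore Mercuri, Miklós Z. Horváth, Yaël Dillies, Calle Sönne, Eric Wieser, Fred Zhang, Thomas Hubert, Blaise Agüera y Arcas, Pushmeet Kohli

Affiliations * Google DeepMind; Imperial College London; Stockholms universitet

AI summary As automated reasoning systems advance, high-quality mathematical problems are needed to evaluate their capabilities. To this end, the authors present Formal Conjectures, an evolving benchmark of 2615 problems formalized in Lean 4, including 836 solved problems and 1029 open mathematical conjectures, for evaluating automated proof discovery. The benchmark ensures formalization correctness through a collaborative open-source project, is continually improved using AI-generated proofs and disproofs, and has already contributed to new mathematical discoveries.

Comments 21 pages, 4 figures, 5 tables

Details
English abstract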

As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof autoformalization. Notably, the repository provides a structured interface connecting mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them. Demonstrating its immediate utility, the benchmark has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures. We describe our approach to ensuring the correctness of these formalizations in a collaborative open-source project where contributions stem from an active community. In this framework, AI-generated proofs and disproofs serve as a valuable auditing mechanism to iteratively improve the fidelity of the benchmark. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research-level mathematics.

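2605.13153 2026-05-14 cs.AI Updated version

Strikingness-Aware Evaluation for Temporal Knowledge Graph Reasoning

Rikui Huang, Shengzhe Zhang, Wei Wei

Affiliations * School of Computer Science & Technology, Huazhong University of Science and Technology; Institute of Artificial Intelligence, Huazhong University of Science and Technology; School of Artificial Intelligence & Automation, Huazhong University of Science and Technology

AI summary This paper proposes improvements to evaluation in temporal knowledge graph reasoning (TKGR), noting that current methods treat all events equally and ignore that most events are repetitive, which overestimates models' reasoning ability. The authors propose a strikingness-aware evaluation framework that uses a rule-guided strikingness measure to distinguish and emphasize rare events that require deeper reasoning. Experiments show that the framework evaluates a model's ability to predict outstanding events more rigorously, offering a new evaluation perspective for TKGR research.

Comments Accepted to IJCAI-ECAI 2026

Details
English abstract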

Temporal Knowledge Graph Reasoning (TKGR) aims to infer missing (especially future) events from historical data. Current evaluation in TKGR weights all events uniformly, ignoring that most are trivial repetitions, which overestimates true reasoning ability. Therefore, the rare outstanding events, whose prediction demands deeper reasoning, should be distinguished and emphasized. To this end, we propose a strikingness-aware evaluation framework, which introduces a rule-based strikingness measuring framework (RSMF) to quantify event strikingness by comparing its expected occurrence with peer events derived from temporal rules. Strikingness is then integrated as a weighting factor into metrics such as weighted MRR and Hits@k. Experiments on four TKG benchmarks reveal that: 1) all representative models perform worse as event strikingness increases; 2) path-based methods excel on low-strikingness events and representation-based ones on high-strikingness events; and 3) an ensemble method we design draws its gains from fitting trivial events rather than from improved reasoning. Our framework provides a more rigorous evaluation, refocusing the field on predicting outstanding events.
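
A short Python sketch may help show how a per-event weight changes MRR and Hits@k. The abstract does not give the exact formulas, so the normalization by the sum of weights used here is an assumption, and the function names and the toy ranks and weights are hypothetical; computing strikingness itself (the RSMF step) is not modeled.

```python
from typing import List


def weighted_mrr(ranks: List[int], weights: List[float]) -> float:
    """Weighted mean reciprocal rank, normalized by the total weight."""
    total = sum(weights)
    return sum(w / r for r, w in zip(ranks, weights)) / total


def weighted_hits_at_k(ranks: List[int], weights: List[float], k: int) -> float:
    """Share of total weight carried by events ranked within the top k."""
    total = sum(weights)
    return sum(w for r, w in zip(ranks, weights) if r <= k) / total


# Toy data: two trivial events ranked well, one striking event ranked badly.
ranks = [1, 2, 10]
uniform = [1.0, 1.0, 1.0]
striking = [0.1, 0.1, 0.8]  # hypothetical strikingness weights
```

On this toy data the uniform MRR is (1 + 0.5 + 0.1) / 3, about 0.533, while the strikingness-weighted MRR is 0.1 + 0.05 + 0.08 = 0.23, so a model that only gets trivial events right scores much lower once the striking event dominates the weighting.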

2605.13152 2026-05-14 cs.CV cs.AI cs.LG cs.RO Updated version

EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision

Jiahao Chen, Zihui Zhang, Yafei Yang, Jinxi Li, Shenxing Wei, Zhixuan Sun, Bo Yang

Affiliations * Shenzhen Research Institute, The Hong Kong Polytechnic University; vLAR Group, The Hong Kong Polytechnic University

AI summary This paper proposes EvObj, an unsupervised 3D instance segmentation method that addresses the geometric domain gap between synthetic data and real-world point cloud scenes. By introducing an object discerning module and an object completion module, the method dynamically refines object priors and reconstructs partial geometries, improving segmentation in real scenes. Experiments show that EvObj outperforms existing methods on multiple datasets and reaches state-of-the-art results.

Comments CVPR 2026. Code and data are available at: https://github.com/vLAR-group/EvObj

Details
English abstract

We introduce EvObj for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real-world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two innovative modules: (1) An object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) An object completion module that reconstructs partial geometries after discovering objects. We conduct extensive experiments on both real-world and synthetic datasets, demonstrating superior 3D object segmentation performance over all baselines while achieving state-of-the-art results.

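2605.13149 2026-05-14 cs.CL cs.AI cs.LG Updated version

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

Ishika Agarwal, Sofia Stoica, Emre Can Acikgoz, Pradeep Natarajan, Mahdi Namazifar, Jiaqi Ma, Dilek Hakkani-Tür

Affiliations * University of Illinois Urbana-Champaign; Amazon

AI summary This paper proposes AcquisitionSynthesis, which uses acquisition functions from active learning as reward models to train language models to generate high-quality synthetic data, addressing the data-quality bottleneck in model training. By quantifying the impact of generated data on the downstream learner, the method makes data generation more targeted and effective. Experiments show that data generated with AcquisitionSynthesis improves student model performance and robustness, and that the method can also generate data for other models and for low-to-high resource training paradigms.

Details
English abstract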

Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate top-quality samples. Some works rely on rejection sampling: generating many synthetic samples and filtering out the low-quality ones. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum on which to base data generation. These works share one limitation: there is no quantitative approach to measure the impact of the generated samples on the downstream learner. Active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language models to generate higher-quality synthetic data. We conduct experiments on classic verifiable tasks of math, medical question-answering, and coding. Our experimental results indicate that (1) student models trained with AcquisitionSynthesis data achieve good performance on in-distribution tasks (2-7% gain) and are more robust to catastrophic forgetting, and (2) AcquisitionSynthesis models can generate data for other models and for low-to-high resource training paradigms. By leveraging acquisition rewards, we seek to demonstrate a principled path toward model-aware self-improvement that surpasses static datasets.

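2605.13130 2026-05-14 cs.AI Updated version

GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training

Junjie Li, Ziao Wang, NingXuan Ma, Jianghong Ma, Xiaofeng Zhang

Affiliations * Harbin Institute of Technology, Shenzhen, China; Hong Kong Baptist University, China; City University of Hong Kong, China

AI summary This paper proposes GRACE, a gradient-aligned reasoning data curation method for efficient post-training. It scores each reasoning step by its alignment with the answer gradient direction and its consistency with the preceding reasoning trajectory, and aggregates these scores into a sample-level selection criterion without external reward models or step annotations. Experiments show that GRACE matches or exceeds full-data performance while using much less data, and that its subsets transfer well across models.

Details
English abstract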

Existing reasoning data curation pipelines score whole samples, treating every intermediate step as equally valuable. In reality, steps within a trace contribute very unevenly, and selecting reasoning data well requires assessing them individually. We present GRACE, a gradient-aligned curation method that views each reasoning trace as a sequence of optimization events and scores every step by two complementary signals: its alignment with the answer-oriented gradient direction, and its consistency with the preceding reasoning trajectory. Step-level scores are aggregated into a sample-level value for subset selection, using only the model's internal optimization signals and no external reward models or step annotations. To make this scalable, GRACE introduces a representation-level gradient proxy that estimates step-level alignment from token-level upstream signals in a single forward pass. Post-training Qwen3-VL-2B-Instruct on MMathCoT-1M, GRACE reaches 108.8% of the full-data performance with 20% of the data and retains 100.2% with only 5%, with subsets that transfer effectively across model backbones.
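
The two step-level signals and the sample-level aggregation described above can be sketched in Python as follows. This is a loose illustration, not GRACE itself: the abstract does not specify the scoring formulas, so the use of cosine similarity, the mixing coefficient `lam`, the running sum as a stand-in for the preceding trajectory, the mean as the aggregation, and all function names are assumptions. The per-step vectors stand in for whatever gradient proxy the method actually computes.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def step_scores(step_vecs: List[Vec], answer_vec: Vec,
                lam: float = 0.5) -> List[float]:
    """Score each step by answer alignment plus trajectory consistency."""
    scores: List[float] = []
    running = [0.0] * len(answer_vec)  # sum of the preceding step vectors
    for v in step_vecs:
        align = cosine(v, answer_vec)
        consist = cosine(v, running)   # 0.0 for the first step
        scores.append(align + lam * consist)
        running = [r + x for r, x in zip(running, v)]
    return scores


def sample_value(step_vecs: List[Vec], answer_vec: Vec,
                 lam: float = 0.5) -> float:
    """Aggregate step scores into one sample-level value (assumed: mean)."""
    s = step_scores(step_vecs, answer_vec, lam)
    return sum(s) / len(s) if s else 0.0


def select_top(samples: List[Tuple[str, float]], fraction: float) -> List[str]:
    """Keep the ids of the top `fraction` of samples by value."""
    keep = max(1, round(len(samples) * fraction))
    ranked = sorted(samples, key=lambda t: t[1], reverse=True)
    return [sid for sid, _ in ranked[:keep]]
```

For example, a trace whose steps all point along the answer direction gets a higher value than one whose steps are orthogonal or opposed to it, so it is the one retained when only half of the samples are kept.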

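2605.13119 2026-05-14 cs.RO cs.AI cs.CV Updated version

Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

Zixing Lei, Changxing Liu, Yichen Xiong, Minhao Xiong, Yuanzhuo Ding, Zhipeng Zhang, Weixin Li, Siheng Chen

Affiliations * Shanghai Jiao Tong University; Zhongguancun Academy; Beihang University

AI summary This study addresses the limited ability of vision-language-action (VLA) models on long-horizon tasks, proposing a strategy that combines a high-level vision-language model with specialized tool-style VLA modules. By introducing Tool-Aligned Post-Training (TAPT) and a tool-family interface, it couples long-horizon task planning with execution efficiently, markedly improving robots' task success rate and instruction-following fidelity in complex environments.

Details
English abstract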

Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this burden across a high-level vision language model (VLM) agent for temporal reasoning and a family of specialized VLA tools for diverse local physical operations. The VLM handles scene analysis, global planning, and recovery, while each VLA tool executes a bounded subtask. To tightly couple agent planning with VLA tool execution in long-horizon tasks, we introduce a VLA tool-family interface that exposes explicit tool selection and in-execution progress feedback, enabling efficient event-triggered agent replanning without continuous agent polling. To obtain diverse specialized VLA tools that faithfully follow agent invocations, we further propose Tool-Aligned Post-Training (TAPT), which constructs invocation-aligned training units for instruction following and adopts tool-family residual adapters for efficient tool specialization. Experiments show that VLAs-as-Tools improves the success rate of $π_{0.5}$ by 4.8 points on LIBERO-Long and 23.1 points on RoboTwin, and further enhances invocation fidelity by 15.0 points as measured by Non-biased Rate. Code will be released.

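2605.13117 2026-05-14 cs.RO cs.AI Updated version

SECOND-Grasp: Semantic Contact-guided Dexterous Grasping

Han Yi Shin, Heeju Ko, Jaewon Mun, Qixing Huang, Jaehyeok Lee, Sung June Kim, Honglak Lee, Sujin Jang, Sangpil Kim

Affiliations * Korea University; University of Texas at Austin; University of Michigan; Hanyang University

AI summary This paper proposes SECOND-Grasp, a semantics-guided dexterous grasping framework that combines physical stability with semantic task understanding for more reliable robotic grasping. The method generates coarse contact regions through vision-language reasoning, improves contact prediction with Semantic-Geometric Consistency Refinement, and derives feasible grasp poses via inverse kinematics. Experiments show lifting success rates of 98.2% and 97.7% on seen and unseen object categories, respectively, with clear gains on intent-aware grasping.

Details
English abstract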

Achieving reliable robotic manipulation, such as dexterous grasping, requires a synergy between physically stable interactions and semantic task guidance, yet these objectives are often treated as separate, disjoint goals. In this paper, we investigate how to integrate dexterous grasping techniques, i.e., physically stable grasps for object lifting and language-guided grasp generation, to achieve both physical stability and semantic understanding. To this end, we propose SECOND-Grasp (SEmantic CONtact-guided Dexterous Grasping), a unified framework that enables robotic hands to dynamically adjust grasping strategies based on semantic reasoning while ensuring physical feasibility. We begin by obtaining coarse contact proposals through vision-language reasoning to infer where contacts should occur based on object properties, followed by segmentation to localize these regions across views. To further ensure consistency across multiple viewpoints, we introduce Semantic-Geometric Consistency Refinement (SGCR), which refines initial contact predictions by enforcing semantic consistency across views and removing geometrically invalid regions, yielding reliable 3D contact maps. Then, we derive a feasible hand pose for each contact map via inverse kinematics, generating a supervision signal for policy learning. Our approach, trained on DexGraspNet, consistently outperforms baselines in lifting success rate on both seen and unseen categories, achieving 98.2% and 97.7%, respectively, while also improving intent-aware grasping by 12.8% and 26.2%. We further show promising results on additional datasets and robotic hands, including Shadow Hand and Allegro Hand.

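2605.13113 2026-05-14 cs.CY cs.AI Updated version

Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

Jose Luna, Yankun Wu, Xiaofei Xie, Noa Garcia

Affiliations * Singapore Management University; The University of Osaka

AI summary This paper studies gender bias in text-to-image generation models and proposes a risk-tiered auditing framework for evaluating and governing gender bias more systematically. The framework has three core components: tiered use-case profiles defined according to the EU AI Act's risk categories, a catalog that consolidates gender-bias evaluation metrics, and a typology that maps context-dependent harm types to specific risk scenarios. The study also introduces THUMB cards, a tool that helps audits jointly consider context, scenario, bias manifestation, and potential harms, making evaluation more systematic and practical.

Comments FAccT 2026

Details
English abstract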

Text-to-image (T2I) generative models are increasingly used to produce content for education, media, and public-facing communication, and are starting to be integrated into higher-impact pipelines. Since generated images tend to reinforce stereotypes, producing representational erasure via "default" depictions and shaping perceptions of who belongs in certain roles, a growing body of work has proposed metrics to quantify gender bias in T2I outputs. Yet existing evaluations remain fragmented. Metrics are often reported without a shared view of what they measure, what assumptions they entail, or how their results should be interpreted under different deployment contexts. This limits the usefulness of gender bias measurement for both technical auditing and emerging governance discussions. We propose a risk-aligned auditing framework for gender bias in T2I models, composed of three constituents that connect risk categories, evaluation metrics, and harms. First, we identify risk-tiered use-case profiles aligned with the EU AI Act's risk categories to motivate why auditing expectations may vary with deployment contexts and stakeholder exposure. Second, we construct a metric catalog that consolidates gender-bias evaluation methods and organizes them into three measurement categories: gender prediction, embedding similarity, and downstream task. Third, we introduce a harm typology that maps context-dependent harm categories (e.g., representational, quality-of-service) to specific risk-tiered scenarios. Finally, we introduce THUMB cards (Text-to-image Harms-informed Use-case-aligned Metrics of gender Bias), which help structure audits systematically by incorporating context, scenario and bias manifestation, harm hypotheses, and audit strategy.

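2605.13110 2026-05-14 cs.MA cs.AI cs.IR Updated version

A Multi-Agent Orchestration Framework for Venture Capital Due Diligence

Grigorios Alexandrou, Katerina Pramatari

Affiliations * Greek Business Registry

AI summary This paper presents a fully automated multi-agent framework for venture capital due diligence and market analysis. Built on an event-driven architecture, it combines large language models with real-time web retrieval to turn unstructured data into structured investment intelligence. Its core contributions include a programmatic data extraction pipeline that reverse-engineers the frontend-to-backend communication of the Greek Business Registry, and a structural fallback mechanism that explicitly flags missing data rather than generating unverified figures, targeting hallucination in financial settings.

Comments 13 pages, 1 figure

Details
English abstract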

We present a fully automated multi-agent framework for corporate due diligence and market analysis in venture capital. The system runs on an event-driven orchestration architecture, combining Large Language Models (LLMs) with real-time web retrieval to synthesize unstructured data into structured investment intelligence. A central technical contribution is a programmatic extraction pipeline that reverse-engineers the frontend-to-backend communication of the Greek Business Registry (Γ.E.MH.), querying dynamic endpoints to retrieve official financial filings that are then parsed using a layout-aware OCR extractor. A structural fallback mechanism explicitly flags data absence rather than generating unverified figures, directly targeting hallucination in financial contexts. All workflow artifacts are publicly available to support replication.
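
The fallback idea, flagging missing data instead of filling it in, can be illustrated with a minimal Python sketch. The `FieldResult` type, the `extract_field` helper, and the field names are hypothetical and are not taken from the paper's released artifacts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldResult:
    """Hypothetical container: a value plus an explicit availability flag."""
    name: str
    value: Optional[float]
    status: str  # "ok" or "missing"


def extract_field(filing: dict, name: str) -> FieldResult:
    """Return the parsed value if present; otherwise flag it as missing.

    No default or estimate is ever substituted for an absent figure.
    """
    raw = filing.get(name)
    if raw is None:
        return FieldResult(name=name, value=None, status="missing")
    return FieldResult(name=name, value=float(raw), status="ok")


filing = {"revenue": "1250000.0"}  # illustrative parsed filing, not real data
revenue = extract_field(filing, "revenue")
ebitda = extract_field(filing, "ebitda")
```

Downstream stages can then branch on `status` rather than trusting a number that was never in the source filing.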

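2605.13101 2026-05-14 cs.LG cs.AI Version update

Margin-calibrated Classifier Guidance for Property-driven Synthesis Planning

Najwa Laabid, Vikas Garg

Affiliations * Aalto University

AI summary This work proposes Sequence Completion Ranking (SCR), a new method for improving chemical synthesis planning built on single-step retrosynthesis models. By introducing contrastive argumentation and a margin-based loss, SCR calibrates the classifier so that it discriminates more effectively between continuations that satisfy a given property during decoding, improving the quality and diversity of generated routes. Experiments on USPTO-190 show substantially higher multi-step solve rates and a closed diversity gap between template-free and template-based methods.

Details
English abstract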

Synthesis planning seeks an efficient sequence of chemical reactions that produce a target molecule. Typically, a pretrained single-step (autoregressive) retrosynthesis model is repeatedly invoked to generate such a sequence. Classifier guidance can, in principle, help steer the output of the single-step model toward reactions that satisfy specific constraints or accommodate a chemist's preferences during inference without having to retrain the autoregressive generator. We expose the insufficiency of auxiliary classifiers trained with cross-entropy loss to override the unconditional token-level distributions learned from typical sparse single-disconnection reaction datasets. We overcome this issue with a novel method called Sequence Completion Ranking (SCR), which employs contrastive argumentation and a margin-based loss to calibrate the classifier so that it can meaningfully discriminate between continuations during decoding. We formally establish that margin-calibrated classifiers can expand the set of property-satisfying sequences reachable under guided beam search. Empirically, on USPTO-190, given chemist-specified guidance targets, SCR substantially improves multi-step solve rates from $16.8\%$ (unguided generator) to $78.4\%$ with reaction-type guidance and $95.3\%$ with Tanimoto guidance, unlocking valid routes for 33 targets ($17.4\%$) previously unsolvable with baselines. Our method also effectively closes the long-standing diversity gap between template-free and template-based methods.
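
To make the notion of a margin-based loss concrete, here is a minimal sketch of a generic pairwise hinge (margin ranking) loss in Python. It is an illustration only: the abstract does not give the exact SCR objective, and the function name, the pairing of scores, and the numbers below are assumptions.

```python
def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Mean hinge loss asking each positive score to beat its paired
    negative score by at least `margin` (generic form, not the paper's)."""
    if not pos_scores or len(pos_scores) != len(neg_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    losses = [max(0.0, margin - (p - n)) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# Classifier scores for property-satisfying vs. violating continuations
# (illustrative numbers only).
satisfied = margin_ranking_loss([2.0, 3.0], [0.5, 1.0])  # both gaps exceed the margin
violated = margin_ranking_loss([1.0, 0.2], [0.8, 0.9])   # gaps are too small
```

The loss is zero only when every property-satisfying continuation outscores its paired alternative by the full margin, which is what lets a guidance classifier shift the ranking among continuations during decoding instead of merely nudging it.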

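2605.13087 2026-05-14 cs.CL cs.AI Version update

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Kush Juvekar, Kavya Manohar, Aditya Srinivas Menon, Arghya Bhattacharya, Kumarmanas Nethil

Affiliations * Adalat AI, India

AI summary This study addresses fine-tuning of multilingual speech recognition models for low-resource languages and introduces Vividh-ASR, a benchmark for evaluating Hindi and Malayalam recognition across scenarios of varying complexity. By analyzing learning-rate timing and curriculum ordering, the study finds that early large parameter updates and a hard-to-easy curriculum improve performance, particularly on spontaneous speech. Based on these findings, the authors propose reverse multi-stage fine-tuning (R-MFT), which enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M models.

Comments Submitted to Interspeech 2026

Details
English abstract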

Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.

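2605.13079 2026-05-14 cs.LG cs.AI Version update

Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

Tien-Phat Nguyen, Truong Nguyen, Minh-Phuc Truong, Tuc Nguyen, James Bailey, Trung Le

Affiliations * Hanoi University of Science and Technology; Indiana University; Monash University

AI summary This paper studies why the Muon optimizer succeeds and shows that the core mechanism is spectral flattening through orthogonalization of the momentum buffer, which raises learning-rate tolerance and speeds up convergence. The authors prove that Muon's maximal stable step size scales with the average singular value of the gradient rather than the largest one, which is the bottleneck for standard gradient descent. Viewing Muon as a preconditioned gradient method, they further show that its gain in convergence efficiency is controlled by the spectrum of the gradient covariance. Experiments show that Muon remains stable at larger learning rates and reaches accuracy targets faster than standard gradient descent.

Details
English abstract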

Muon orthogonalizes the momentum buffer before each update, replacing its singular values with ones via Newton-Schulz iterations. This simple change lets Muon tolerate far larger learning rates and converge faster than other optimizers, but why? We show that the mechanism is spectral flattening, and develop two results around it. First, we prove that Muon's maximal stable step size scales with the average singular value of the gradient rather than the largest, which bottlenecks standard gradient descent. Second, we recast Muon as a preconditioned gradient method and show, under a Kronecker-factored curvature model, that it improves the effective convergence factor, with the improvement controlled by the spectrum of the gradient covariance. Extensive experiments validate both results: Muon remains stable at learning rates that cause SGD to diverge within the first few iterations, and reaches accuracy milestones several epochs earlier even at identical step sizes. Taken together, our results offer a principled, geometric explanation for Muon's empirical success.
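
The orthogonalization step can be sketched in pure Python with the classical cubic Newton-Schulz iteration. This is an illustration under stated assumptions: practical Muon implementations typically use a tuned higher-order polynomial, and the helper names and the tiny example matrix here are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(m, steps=30):
    """Approximate the orthogonal polar factor of a nonzero matrix `m`.

    Classical cubic iteration X <- 1.5 X - 0.5 X X^T X, which drives every
    nonzero singular value toward 1. Dividing by the Frobenius norm first
    keeps all singular values in (0, 1], inside the convergence region.
    """
    fro = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)] for rx, rc in zip(x, xxtx)]
    return x


momentum = [[0.0, 2.0], [1.0, 0.0]]  # singular values 2 and 1
update = newton_schulz_orthogonalize(momentum)
gram = matmul(update, transpose(update))  # close to the identity matrix
```

Both singular values of `momentum` are mapped to 1 while the singular directions are preserved, which is the spectral flattening the abstract refers to.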

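2605.13077 2026-05-14 cs.MA cs.AI Version update

Counterfactual Reasoning for Causal Responsibility Attribution in Probabilistic Multi-Agent Systems

Chunyan Mu, Muhammad Najib

Affiliations * University of Aberdeen; Heriot-Watt University

AI summary This paper studies causal responsibility attribution in multi-agent systems and proposes an allocation method based on counterfactual reasoning. The authors model the system as a concurrent stochastic multi-player game, introduce a notion of retrospective counterfactual responsibility, and allocate responsibility with the Shapley value, which guarantees key properties such as fairness and consistency. On this basis they build a formal framework supporting verification and strategic reasoning for responsibility-aware systems, and use Nash equilibrium to analyze strategies that trade off responsibility against expected reward.

Details
English abstract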

Responsibility allocation, that is, determining the extent to which agents are accountable for outcomes, is a fundamental challenge in the design and analysis of multi-agent systems. In this work, we model such systems as concurrent stochastic multi-player games and introduce a notion of retrospective (backward) counterfactual responsibility, which quantifies an agent's accountability for outcomes resulting from a given strategy profile. To allocate responsibility among agents, we utilise the Shapley value and formally show that this method satisfies key desirable properties, including fairness and consistency. Building on this foundation, we propose a formal framework that supports both verification and strategic reasoning in responsibility-aware multi-agent systems. Furthermore, by adopting Nash equilibrium as the solution concept, we demonstrate how to compute stable strategy profiles in which agents trade off responsibility against expected reward.
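
As a small illustration of Shapley-value allocation, the sketch below computes exact Shapley values for a two-agent toy example in Python. The characteristic function `outcome_probability` and its numbers are hypothetical; the paper defines responsibility over concurrent stochastic games, which this toy table does not model.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all
    player orderings. `value` maps a frozenset of players to a number."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def outcome_probability(coalition):
    """Hypothetical characteristic function: probability of the outcome
    when only the agents in `coalition` follow their actual strategies."""
    table = {
        frozenset(): 0.0,
        frozenset({"a"}): 0.2,
        frozenset({"b"}): 0.2,
        frozenset({"a", "b"}): 0.8,
    }
    return table[coalition]


shares = shapley_values(["a", "b"], outcome_probability)
```

The two shares sum to the value of the full coalition (the efficiency property) and are equal because the agents are interchangeable in the table (symmetry). Enumerating orderings is exponential in the number of agents, so this form is only practical for small systems.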

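2605.13072 2026-05-14 quant-ph cs.AI Version update

Neural QAOA$^{2}$: Differentiable Joint Graph Partitioning and Parameter Initialization for Quantum Combinatorial Optimization

Zubin Zheng, Jiahao Wu, Shengcai Liu

Affiliations * Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation, Department of CSE, SUSTech

AI summary This paper proposes Neural QAOA$^{2}$, an end-to-end differentiable framework for graph partitioning and parameter initialization in quantum combinatorial optimization. By integrating a generative evaluative network (GEN) with a differentiable quantum evaluator that serves as a high-fidelity performance surrogate, the method jointly generates graph partitions and initial parameters, addressing the mismatch between partitioning metrics and optimization goals and the lack of topology awareness in parameter initialization. Experiments on QUBO, Ising, and MaxCut instances show that it outperforms existing heuristic methods and generalizes across distributions.

Comments Accepted to ICML 2026

Details
English abstract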

The quantum approximate optimization algorithm (QAOA) holds promise for combinatorial optimization but is constrained by limited qubits. While divide-and-conquer frameworks like QAOA$^{2}$ address scalability by partitioning graphs into subgraphs, existing methods suffer from two fundamental limitations: i) misalignment between heuristic partitioning metrics and quantum optimization goals, and ii) topology-blind parameter initialization that leads to optimization cold starts. To bridge these gaps, we propose Neural QAOA$^{2}$, an end-to-end differentiable framework that jointly generates graph partitions and initial parameters. By integrating a generative evaluative network (GEN), our method utilizes a differentiable quantum evaluator as a high-fidelity performance surrogate to provide direct gradient guidance, enabling the joint generator to learn the intrinsic mapping from graph topology to high-quality partition and parameter configurations. Extensive experiments on 183 QUBO, Ising, and MaxCut instances (21 to 1000 variables) demonstrate that our gradient-driven approach broadly outperforms heuristic baselines, ranking first on 101 instances. It exhibits zero-shot generalization across out-of-distribution graph topologies and scales.

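2605.13067 2026-05-14 cs.RO cs.AI Version update

When Absolute State Fails: Evaluating Proprioceptive Encodings for Robust Manipulation

Maxime Alvarez, Ryo Watanabe, Paul Crook, Afshin Zeinaddini Meymand, Suvin Kurian, Pablo Ferreiro, Genki Sano

Affiliations * TELEXISTENCE Inc, Foundation Model Division; The University of Tokyo

AI summary As end-to-end robot policies are increasingly applied to real-world tasks, the gap between training and inference conditions becomes a major challenge. This paper studies how the encoding of the robot's proprioceptive state can improve in-distribution and out-of-distribution performance, especially robustness to unseen test conditions. The study finds that an episode-wise relative frame encoding outperforms the baselines in real-robot experiments, offering a practical path to using data collected under different frames of reference to improve generalization.

Comments Accepted to ICRA 2026 Workshop: From Data to Decisions

Details
English abstract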

As end-to-end robotic policies are progressively deployed in the real world to solve real tasks, they face a gap between the training and inference conditions. Scaling the amount and diversity of the training data has shown some success in improving zero-shot generalization, yet robots still fail when faced with new, unseen test conditions. For instance, while robots with fixed frames of reference are common, those with moving frames pose a greater challenge for deployment. To address this specific instance of the issue, we present a study of strategies for encoding the robot's proprioceptive state to improve both in- and out-of-distribution performance at test time. Through a systematic study of joint representations, we find that a simple episode-wise relative frame provides the best trade-off between task performance and robustness, outperforming the baselines in extensive real-robot experiments conducted in a realistic test environment. The results suggest a practical path to leveraging data collected by robots with varying frames of reference and deployment to unseen test configurations.
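
One plausible reading of an episode-wise relative frame is to express every proprioceptive state relative to the first state of its episode. The Python sketch below assumes that reading and handles translation only; the function name and the example trajectories are hypothetical, and the paper's actual encoding may differ.

```python
def to_episode_relative(states):
    """Re-express each state relative to the first state of its episode.

    `states` is a list of equal-length tuples (e.g. end-effector x, y, z).
    Only translation is handled here; orientations would need a proper
    rotation composition instead of subtraction.
    """
    if not states:
        return []
    origin = states[0]
    return [tuple(s - o for s, o in zip(state, origin)) for state in states]


# The same motion recorded from two different base positions (made-up values).
episode_a = [(1.0, 2.0, 0.5), (1.5, 2.0, 0.5), (1.5, 2.5, 0.75)]
episode_b = [(4.0, -1.0, 0.5), (4.5, -1.0, 0.5), (4.5, -0.5, 0.75)]
relative_a = to_episode_relative(episode_a)
relative_b = to_episode_relative(episode_b)
```

After the transformation the two episodes are identical, so a policy trained on one base position sees in-distribution inputs on the other.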

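2605.13054 2026-05-14 cs.LG cs.AI Version update

Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

Minung Kim, Jeongmo Kim, Gwanwoo Choi, Seungyul Han

Affiliations * Ulsan National Institute of Science and Technology, UNIST

AI summary This paper studies cross-domain offline reinforcement learning, in which a policy is adapted from a source domain to a target domain using only pre-collected data, particularly when target-domain data is extremely limited. To handle the distribution mismatch between domains, the authors propose the Target-aligned Coverage Expansion (TCE) framework, which uses theoretical analysis to decide how source data is used: either by directly incorporating target-near transitions or by expanding state coverage through target-aligned generation. Experiments show that TCE consistently outperforms existing cross-domain offline RL baselines across a range of cross-domain environments.

Details
English abstract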

Cross-domain offline reinforcement learning aims to adapt a policy from a source domain to a target domain using only pre-collected datasets, where environment dynamics may differ. A key challenge is to leverage source data while reducing distributional mismatch, particularly when the target dataset is extremely limited. To address this, we propose Target-aligned Coverage Expansion (TCE), a framework that decides how source data should be used, either by directly incorporating target-near transitions or by expanding state coverage through target-aligned generation, guided by theoretical analysis. TCE builds on a dual score-based generative model to synthesize target-consistent transitions over an expanded state region. Extensive experiments across diverse cross-domain environments show that TCE consistently outperforms state-of-the-art cross-domain offline RL baselines.

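2605.13047 2026-05-14 cs.CV cs.AI Version update

Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

Ziqi Wen, Parsa Madinei, Miguel P. Eckstein

Affiliations * Department of Computer Science, University of California, Santa Barbara; Department of Psychological and Brain Sciences, University of California, Santa Barbara

AI summary This study examines how vision-language models (VLMs) differ from human perception in high-level semantic scene understanding. The authors propose Counterfactual Semantic Saliency (CSS), a black-box, model-agnostic method that quantifies an object's importance by measuring the semantic shift caused by removing it from the scene. Results show that VLMs over-rely on large objects, objects at the image center, and high-saliency objects, while relying less on people in the scene than humans do, revealing a marked gap between model and human semantic understanding.

Details
English abstract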

Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.

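2605.13046 2026-05-14 cs.AI Version update

An Agentic LLM-Based Framework for Population-Scale Mental Health Screening

Giuliano Lorenzoni, Paulo Alencar, Donald Cowan

Affiliations * University of Waterloo

AI summary This paper proposes an agentic large language model (LLM) framework for population-scale mental health screening. Each processing stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation, enabling the framework to process unstructured clinical information and adapt to individual patients. A proof of concept in transcript-based depression detection shows that the framework converges to stable configurations, controls cost, and avoids regressions, offering a trustworthy, reproducible, and adaptable approach to mental health screening over large clinical datasets.

Comments 8 pages, conference paper presented at IEEE BigData 2025, Macau

Details
English abstract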

Mental health disorders affect millions worldwide, and healthcare systems are increasingly overwhelmed by the volume of clinical data generated from electronic records, telemedicine platforms, and population-level screening programs. At the same time, the emergence of novel AI-based approaches in healthcare calls for intelligent frameworks capable of processing domain-specific unstructured clinical information while adapting to patient-specific needs. This paper proposes an agentic framework for building robust LLM-based pipelines, where each stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation. Stages are incrementally locked once validated, ensuring that later adaptations cannot overwrite configurations without demonstrated improvement. The proposed framework evolves from feature-level exploration, through proxy-based tuning and freeze/rollback mechanisms, to full orchestration by an Orchestrator Agent that coordinates preprocessing, retrieval, selection, diversity, threshold optimization, and decoding. A proof-of-concept in transcript-based depression detection demonstrates that the framework converges to stable configurations, such as cosine similarity, dynamic Top-k, and threshold 0.75, while controlling evaluation costs and avoiding regressions. These results highlight the potential of agentic AI to enable population-level mental health screening over large clinical datasets, addressing critical challenges in trustworthiness, reproducibility, and adaptability required in healthcare environments.

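2605.13044 2026-05-14 cs.CR cs.AI Version update

No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills

Ying Li, Hongbo Wen, Yanju Chen, Hanzhi Liu, Yuan Tian, Yu Feng

Affiliations * University of California, San Diego

AI summary This paper studies specification violations in the skills of LLM-based agents, where a skill breaks its own declared safety rules during execution without being attacked. The authors propose Sefz, a semantic fuzzing framework that translates safety rules into reachability goals over execution traces and uses an LLM to generate benign inputs that progressively approach violation patterns, automatically discovering specification violations. In experiments on 402 real-world skills, Sefz finds specification violations in 29.9% of them and identifies six recurring specification pitfalls, offering guidance for safer skill design.

Details
English abstract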

LLM-powered agents can silently delete documents, leak credentials, or transfer funds on a routine user request, not because the agent was attacked, but because the skill it invoked broke its own declared safety rules. We call these specification violations: benign inputs cause a skill to breach the natural-language guardrails in its own specification, typically because the guardrail's semantics are undefined for autonomous execution, or because the implementation silently ignores the documented constraint. These violations are invisible to static analyzers, traditional fuzzers, and prompt-injection defenses alike, yet they undermine the very contract a user trusts when installing a skill. We present Sefz, a goal-directed semantic fuzzing framework that automatically discovers specification violations in agent skills. Sefz translates each guardrail into a reachability goal over an annotated execution trace, reducing violation checking to a deterministic graph query. An LLM-based mutator generates benign inputs whose traces progressively approach the violation patterns, guided by a multi-armed bandit that uses goal-proximity as its reward signal. On 402 real-world skills from the largest public agent-skill marketplace, Sefz finds specification violations in 120 (29.9%), including 26 previously unknown exploitable guardrail violations in deployed skills. Six recurring specification pitfalls explain the bulk of the failures, suggesting concrete principles for safer skill design.
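
The bandit-guided mutation loop can be illustrated with a generic UCB1 bandit in Python. This is a sketch under assumptions: the abstract does not say which bandit algorithm Sefz uses, and the mutator names and the simulated goal-proximity rewards below are hypothetical.

```python
import math
import random


class UCB1:
    """UCB1 bandit over a fixed set of arms (here: mutation strategies)."""

    def __init__(self, arms):
        self.counts = {a: 0 for a in arms}
        self.totals = {a: 0.0 for a in arms}

    def select(self):
        # Try every arm once before using the confidence bound.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        t = sum(self.counts.values())
        return max(
            self.counts,
            key=lambda a: self.totals[a] / self.counts[a]
            + math.sqrt(2.0 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


# Hypothetical mutators with simulated mean goal-proximity rewards in [0, 1].
mean_proximity = {"rephrase": 0.2, "add_file": 0.7, "chain_steps": 0.4}
rng = random.Random(0)
bandit = UCB1(mean_proximity)
for _ in range(500):
    arm = bandit.select()
    reward = min(1.0, max(0.0, rng.gauss(mean_proximity[arm], 0.1)))
    bandit.update(arm, reward)
most_used = max(bandit.counts, key=bandit.counts.get)
```

In a real fuzzer the reward would come from how close an execution trace gets to the violation pattern, so the bandit concentrates its budget on mutators that make progress while still occasionally retrying the others.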

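2605.13038 2026-05-14 cs.CV cs.AI Version update

CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy

Liangjing Shao, Beilei Cui, Hongliang Ren

Affiliations * Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China; Shenzhen Loop Area Institute, China

AI summary This paper presents CoGE, an online monocular geometric estimation framework for colonoscopy that targets depth estimation and scene reconstruction in real-world settings. It introduces an illumination-aware supervision module based on Retinex theory and a structure-aware perception module based on wavelet decomposition, which address illumination differences and structural feature extraction in colonoscopy scenes. Experiments show that CoGE, trained only on simulated data, achieves state-of-the-art geometric estimation on both simulated and real scenes.

Comments Early Accepted by MICCAI 2026

Details
English abstract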

Geometric estimation, including depth estimation and scene reconstruction, is a crucial technique for colonoscopy that can provide surgeons with 3D spatial perception and navigation. However, geometric ground truth in colonoscopy is difficult to obtain because of the narrow, enclosed space of the colon, and artifacts and illumination cause a large feature gap between simulated and realistic data. In this paper, we present CoGE, a novel framework for online monocular geometric estimation during colonoscopy. First, we propose an illumination-aware supervision module based on the Retinex theory to address illumination diversity across colonoscopy scenes. Second, we propose a structure-aware perception module based on wavelet decomposition to extract common structural and local features of the colon. Both quantitative and qualitative results demonstrate that the proposed model, trained solely on simulated data, achieves state-of-the-art performance in geometric estimation for both simulated and realistic scenes.

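2605.13037 2026-05-14 cs.AI Version update

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

Yuxin Liu, Ziang Ye, Yueqing Sun, Mingye Zhu, Jinwei Xiao, Zhuowen Han, Qi GU, Xunliang Cai, Lei Zhang

Affiliations * University of Science and Technology of China; Meituan; Institution of Automation, Chinese Academy of Sciences; Tianjin University

AI summary Current interactive LLM agents rely on goal-conditioned stepwise planning, in which environmental understanding is acquired reactively during execution, leading to delayed environmental perception and an epistemic bottleneck. This paper proposes the Map-then-Act Paradigm (MAP), which builds a cognitive map of the environment before execution through three stages: global exploration, task-specific mapping, and knowledge-augmented execution. Experiments show consistent gains across benchmarks, and training on MAP-2K, a dataset of map-then-act trajectories, outperforms training on expert trajectories, suggesting that understanding the environment matters more than imitation.

Details
English abstract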

Current interactive LLM agents rely on goal-conditioned stepwise planning, where environmental understanding is acquired reactively during execution rather than established beforehand. This temporal inversion leads to Delayed Environmental Perception: agents must infer environmental constraints through trial-and-error, resulting in an Epistemic Bottleneck that traps them in inefficient failure cycles. Inspired by human affordance perception and cognitive map theory, we propose the Map-then-Act Paradigm (MAP), a plug-and-play framework that shifts environment understanding before execution. MAP consists of three stages: (1) Global Exploration, acquiring environment-general priors; (2) Task-Specific Mapping, constructing a structured cognitive map; and (3) Knowledge-Augmented Execution, solving tasks grounded on the map. Experiments show consistent gains across benchmarks and LLMs. On ARC-AGI-3, MAP enables frontier models to surpass near-zero baseline performance in 22 of 25 game environments. We further introduce MAP-2K, a dataset of map-then-act trajectories, and show that training on it outperforms expert execution traces, suggesting that understanding environments is more fundamental than imitation.

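2605.13030 2026-05-14 cs.LG cs.AI Version update

FeatCal: Feature Calibration for Post-Merging Models

Yanggan Gu, Shuo Cai, Zihao Wang, Wenjun Wang, Yuanyi Wang, Pengkai Wang, Sirui Huang, Su Lu, Jianmin Wu, Hongxia Yang

Affiliations * The Hong Kong Polytechnic University (PolyU); The Chinese University of Hong Kong; PolyU-Daya Bay Technology and Innovation Research Institute

AI summary FeatCal is a feature calibration method that targets the performance drop after model merging. By analyzing the feature drift between the merged model and the expert models, it derives a layer-by-layer calibration strategy that improves the merged model. Using a small amount of calibration data, the method adjusts model weights layer by layer in closed form, with no gradient descent or extra modules, preserving the benefits of merging while improving task performance. Experiments show that FeatCal outperforms existing calibration methods on several benchmarks, with better sample efficiency and lower calibration cost.

Details
English abstract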

Model merging combines task experts into one model and avoids joint training, retraining, or deploying many expert models, but the merged model often still underperforms task experts. We study this performance gap through feature drift, the difference between features produced by the merged model and by the expert on the same input. Our theory decomposes this drift into upstream propagation and local mismatch, tracks how it propagates and combines through later layers in forward order, and links final feature drift to output drift. This view motivates FeatCal, which uses a small calibration set to calibrate the merged model weights layer by layer in forward order, reducing feature drift while staying close to merged weights and preserving the benefits of model merging. FeatCal uses an efficient closed-form solution to update model weights, with no gradient descent, iterative optimization, or extra modules. On the main CLIP and GLUE benchmarks, FeatCal beats Surgery and ProbSurgery, the closest post-merging calibration baselines: 85.5% vs. 77.0%/78.8% on CLIP-ViT-B/32 Task Arithmetic (TA) and 85.2% vs. 83.7%/82.2% on FLAN-T5-base GLUE. On CLIP-ViT-B/32, 8 examples per task reach 82.9%, and 256 examples per task take 53 seconds, about 4x faster than both baselines, showing better sample efficiency and lower calibration cost.
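
One objective consistent with this description, offered as an assumption rather than the paper's stated formulation, is a ridge-style least-squares problem for a single linear layer and a single expert. Here $X$ holds the layer's inputs on the calibration set, $Y$ the expert's features on the same inputs, $W_{\mathrm{m}}$ the merged weights, and $\lambda > 0$ a hypothetical regularization strength; all four symbols are introduced for illustration only.

$$
\min_{W} \; \lVert XW - Y \rVert_F^2 + \lambda \lVert W - W_{\mathrm{m}} \rVert_F^2
\quad\Longrightarrow\quad
W^{\star} = \left(X^{\top}X + \lambda I\right)^{-1}\left(X^{\top}Y + \lambda W_{\mathrm{m}}\right)
$$

Setting the gradient $2X^{\top}(XW - Y) + 2\lambda(W - W_{\mathrm{m}})$ to zero gives the solution directly, so no iterative optimization is needed; a larger $\lambda$ keeps $W^{\star}$ closer to the merged weights.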

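2605.13026 2026-05-14 cs.LG cs.AI cs.CL Version update

Understanding and Accelerating the Training of Masked Diffusion Language Models

Chunsan Hong, Sanghyun Lee, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Yuki Mitsufuji, Seungryong Kim, Jong Chul Ye

Affiliations * KAIST; Sony AI; University of Tokyo; Sony Group Corporation

AI summary This paper studies why masked diffusion language models (MDMs) train slowly and proposes an effective way to accelerate training. The analysis identifies the locality bias of language as the main cause, and the authors propose a training strategy based on bell-shaped time sampling. Experiments show that, while maintaining final performance, the method trains MDMs up to about 4x faster on the LM1B benchmark and also yields faster improvements in generative perplexity and downstream task performance.

Comments Preprint

Details
English abstract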

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on the One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
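
To show what bell-shaped time sampling might look like next to uniform sampling, the sketch below draws masking times from a symmetric Beta distribution in Python. The abstract does not specify the distribution the authors use, so `Beta(2, 2)`, the function names, and the sample counts are stand-in assumptions.

```python
import random


def sample_uniform_time(rng):
    """Baseline: masking time drawn uniformly from [0, 1)."""
    return rng.random()


def sample_bell_time(rng, alpha=2.0):
    """Bell-shaped alternative: Beta(alpha, alpha) is symmetric around 0.5
    and, for alpha > 1, puts less mass near t = 0 and t = 1."""
    return rng.betavariate(alpha, alpha)


def middle_fraction(samples):
    """Share of samples falling in the central interval [0.25, 0.75]."""
    return sum(0.25 <= t <= 0.75 for t in samples) / len(samples)


rng = random.Random(0)
bell = [sample_bell_time(rng) for _ in range(10_000)]
uniform = [sample_uniform_time(rng) for _ in range(10_000)]
```

Comparing `middle_fraction(bell)` with `middle_fraction(uniform)` shows the bell-shaped sampler spending more of the training budget at intermediate masking ratios and less at the nearly clean and nearly fully masked extremes.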

2605.13021 2026-05-14 cs.LG cs.AI Version update

Rethinking Efficient Graph Coarsening via a Non-Selfishness Principle

Xu Bai, Bin Lu, Kun Zhang, Shengbo Chen, Xinbing Wang, Chenghu Zhou, Meng Jin

Affiliations * School of Information Science and Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; School of Environment Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Artificial Intelligence, Nanchang University, Nanchang, China; Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China

AI summary This paper proposes NOPE, an efficient graph coarsening method based on a non-selfishness principle, to address the high computational and memory overhead that independent per-node matching causes in conventional graph coarsening. By prioritizing the collective interference of the neighborhood, the method achieves linear memory consumption and near-linear computational complexity. A faster variant, NOPE*, reduces the cost of interference evaluation from O(δ·d) to O(d) under a local isotropy assumption, easing the bottleneck for high-degree nodes. Experiments show that NOPE* is 1.8 to 10 times faster than NOPE and performs well on graph learning tasks, in some cases surpassing LLM-based graph reasoning.

Details
English abstract

Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most existing methods rely on pair-wise similarity matching, where each node independently searches for its best partner based on global information. This selfish matching paradigm incurs substantial computational and memory overhead. To address this problem, we shift to a non-selfishness principle that prioritizes the collective interference of the neighborhood in coarsening, and propose an efficient method named NOPE, which achieves linear memory consumption and near-linear computational complexity in the number of nodes. Furthermore, we derive a faster variant, NOPE*, which reduces the $O(\delta \cdot d)$ interference evaluation to $O(d)$ based on the local isotropy assumption, and consequently alleviates the computational bottleneck for high-degree nodes. Experimental results show that NOPE* achieves a $1.8$-$10\times$ speedup over NOPE and surpasses almost all baselines with 1-3 orders of magnitude acceleration. Meanwhile, learning on coarsened graphs yields comparable performance to original graphs, and can even show superior performance over LLM-based graph reasoning owing to compact graph information. The code is available at https://github.com/dazonglian/NOPE-main.

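2605.13010 2026-05-14 cs.CV cs.AI cs.SY eess.SY math.OC Version update

Amortized Guidance for Image Inpainting with Pretrained Diffusion Models

Yilie Huang, Xun Yu Zhou

Affiliations * Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, USA; Department of Industrial Engineering and Operations Research & Data Science Institute, Columbia University, New York, NY 10027, USA

AI summary This paper studies image inpainting with generative diffusion models and proposes AID, which keeps the pretrained diffusion backbone fixed and trains a small reusable guidance module offline so that many masked images can be inpainted efficiently. The problem is formulated as a deterministic guidance problem with a supervised terminal objective; an auxiliary Gaussian formulation yields a randomized problem that is learnable in high dimensions, leading to a data-driven continuous-time actor-critic algorithm. Experiments show that AID outperforms fixed-backbone and amortized inpainting baselines across multiple datasets and mask types, achieving a better balance between inpainting quality and speed.

Details
English abstract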

We study image inpainting with generative diffusion models. Existing methods typically either train dedicated task-specific models, or adapt a pretrained diffusion model separately for each masked image at deployment. We introduce a middle-ground model, termed Amortized Inpainting with Diffusion (AID), which keeps a pretrained diffusion backbone fixed, trains a small reusable guidance module offline, and then reuses it across masked images without per-instance optimization. We formulate it as a deterministic guidance problem with a supervised terminal objective. To make this problem learnable in high dimensions, we derive an auxiliary Gaussian formulation and prove that solving this randomized problem recovers the optimal deterministic guidance field. This bridge yields a principled continuous-time actor-critic algorithm for learning the guidance module in a fully data-driven manner. Empirically, on AFHQv2 and FFHQ under the pixel EDM pipeline and on ImageNet under the latent EDM2 pipeline, AID consistently improves the quality-speed trade-off over strong fixed-backbone and amortized inpainting baselines across multiple mask types, while adding less than one percent trainable overhead.

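2605.12988 2026-05-14 cs.AI cs.CY cs.IR Version update

Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education

Mragisha Jain, Tirth Bhatt, Griffin Pitts, Aum Pandya, Peter Brusilovsky, Narges Norouzi, Arto Hellas, Juho Leinonen, Bita Akram

Affiliations * North Carolina State University; University of Pittsburgh; University of California, Berkeley; Aalto University

AI summary This paper presents KITE, an intelligent tutoring system based on Retrieval-Augmented Generation (RAG) that supports reasoning and problem-solving in algorithm learning. KITE uses an intent-aware Socratic response strategy to give students targeted hints, guiding questions, and progressive scaffolding, and uses multimodal retrieval to keep responses aligned with course content. Experiments indicate that KITE produces contextually grounded and pedagogically appropriate responses and improves the accuracy of simulated student models' follow-up answers on algorithm questions, contributing a tutoring architecture and an evaluation approach for algorithm education.

Comments Paper accepted to the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), co-located with ACL 2026

Details
English abstract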

Students learning algorithms often need support as they interpret traces, debug reasoning errors, and apply procedures across unfamiliar problem instances. In this paper, we present KITE (Knowledge-Informed Tutoring Engine), a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system designed to serve as a classroom teaching assistant for algorithmic reasoning and problem-solving tasks. KITE uses an intent-aware Socratic response strategy to tailor support to different student needs, responding with targeted hints, guiding questions, and progressive scaffolding intended to strengthen students' algorithmic problem-solving ability. To keep responses aligned with course content, KITE uses a multimodal RAG pipeline that retrieves relevant information from course materials. We evaluate KITE using three forms of assessment: RAGAs-based metrics for response grounding and quality, expert evaluation of pedagogical quality, and a simulated student pipeline in which a weaker language model interacts with KITE across two-turn dialogues and produces revised answers after receiving feedback. Results indicate that KITE produces contextually grounded and pedagogically appropriate responses. Further, using simulated students, KITE's feedback helped the student models produce more accurate follow-up responses on procedural and tracing questions, suggesting that its scaffolding can support algorithmic problem-solving. This work contributes a tutoring architecture and an evaluation approach for assessing retrieval-grounded explanations and scaffolded problem-solving feedback.

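2605.12980 2026-05-14 cs.LG cs.AI Version update

CoRe-Gen: Robust Spectrum-to-Structure Generation under Imperfect Fingerprint Conditions

Tianbo Liu, Chixiang Lu, Jing Hao, Hengyu Zhang, Lifei Wang, Haibo Jiang, Xiaojuan Qi

Affiliations * The University of Hong Kong; The Chinese University of Hong Kong; Zhejiang Shuren University

AI summary Elucidating molecular structure from tandem mass spectra (MS/MS) is challenging, especially for de novo generation beyond database coverage. This paper proposes CoRe-Gen, which pretrains the encoder on synthetic spectra, applies frequency-aware fingerprint corruption during decoder training to match deployment-time noise, and combines structure-aware autoregressive decoding with chemical constraints, mitigating the generation errors caused by imperfect predicted fingerprints. Experiments show that CoRe-Gen sets a new state of the art on NPLIB1 and remains competitive on MassSpecGym while preserving the efficiency of autoregressive decoding, offering a practical and scalable approach to spectrum-to-structure generation under realistic conditions.

Details
English abstract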

Molecular structure elucidation from tandem mass spectra (MS/MS) remains challenging, particularly for de novo generation beyond database coverage. A common approach decomposes the task into spectrum-to-fingerprint prediction followed by fingerprint-to-structure decoding, enabling the use of large-scale molecular corpora. However, at deployment, the decoder relies on predicted rather than oracle fingerprints, introducing structured errors that propagate into generation. This results in a fundamental condition mismatch, where models trained on clean inputs must operate under noisy, biased predictions, especially for long-tail substructures. We present CoRe-Gen, which explicitly addresses this gap. CoRe-Gen improves the intermediate condition via synthetic-spectrum pretraining of the encoder, matches deployment-time noise through frequency-aware fingerprint corruption during decoder training, and mitigates residual errors using structure-aware autoregressive decoding with compositional SELFIES representations, auxiliary structural supervision, and lightweight chemical constraints. Experiments on standard benchmarks show that CoRe-Gen establishes a new state of the art on NPLIB1, achieving 19.54% Top-1 and 29.92% Top-10 exact-match accuracy, while remaining competitive on the more challenging MassSpecGym benchmark. Importantly, CoRe-Gen preserves the efficiency advantages of autoregressive decoding, providing a practical and scalable solution for robust spectrum-to-structure generation under realistic conditions.

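2605.12978 2026-05-14 cs.AI Updated version

Useful Memories Become Faulty When Continuously Updated by LLMs

Dylan Zhang, Yanshan Lin, Zhengkun Wu, Yihang Sun, Bingxuan Li, Dianqi Li, Hao Peng

Affiliations: University of Illinois Urbana-Champaign; IIIS, Tsinghua University

AI summary: This paper studies the errors that arise when large language models (LLMs) continuously update a memory. Although memory consolidation is meant to help agents learn from experience, memory utility first rises and then falls as updates proceed, and can drop below the no-memory baseline. Experiments show that even consolidation from correct solutions can degrade performance on later tasks, so memory updates should be gated carefully and raw episodes kept as first-class evidence to make agent memory more reliable.

Abstract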

Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.
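The recommendation in this abstract (keep raw episodes as first-class evidence and gate consolidation explicitly) can be illustrated with a minimal sketch. The class name, the `gate` and `summarize` callables, and the method signatures are hypothetical; only the three action names Retain, Delete, and Consolidate come from the abstract, and this is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EpisodicMemory:
    """Hypothetical memory bank: raw episodes are kept, consolidation is opt-in."""
    episodes: List[str] = field(default_factory=list)
    summary: Optional[str] = None  # consolidated abstraction, if one was written

    def retain(self, episode: str) -> None:
        # Retain: store the raw trajectory unchanged.
        self.episodes.append(episode)

    def delete(self, index: int) -> None:
        # Delete: drop one raw episode by position.
        del self.episodes[index]

    def consolidate(self,
                    summarize: Callable[[List[str]], str],
                    gate: Callable[[List[str]], bool]) -> bool:
        # Consolidate: rewrite the summary only when the gate allows it.
        # The raw episodes are never overwritten by the summary.
        if not gate(self.episodes):
            return False
        self.summary = summarize(self.episodes)
        return True
```

In this sketch a forced-consolidation regime corresponds to a gate that always returns `True`, and an episodic-only regime to one that always returns `False`.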

2605.12975 2026-05-14 cs.AI Updated version

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

Jiashuo Sun, Jimeng Shi, Yixuan Xie, Saizhuo Wang, Jash Rajesh Parekh, Pengcheng Jiang, Zhiyi Shi, Jiajun Fan, Qinglong Zheng, Peiran Li, Shaowen Wang, Ge Liu, Jiawei Han

Affiliations: University of Illinois Urbana-Champaign; Hong Kong University of Science and Technology; Texas A&M University

AI summary: This paper proposes PyRAG, an executable multi-hop reasoning framework that improves retrieval-augmented generation (RAG) on complex question answering. Unlike conventional natural-language reasoning, PyRAG turns the multi-hop reasoning process into an executable Python program that calls retrieval and QA tools, making intermediate states explicit and feedback deterministic. Experiments show that PyRAG consistently outperforms strong baselines on several QA benchmarks, with especially large gains on compositional tasks.

Comments 32 pages, 20 figures, 4 tables

Abstract

Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them, making self-reflection an unreliable, ungrounded signal. We observe that multi-hop question answering is a typical form of step-by-step computation, and that this structured process aligns closely with how code-specialized language models are trained to operate. Motivated by this, we introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. Instead of free-form reasoning trajectories, PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process. This formulation further enables compiler-grounded self-repair and execution-driven adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle) show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with especially large gains on compositional multi-hop datasets. Our code, data and models are publicly available at https://github.com/GasolSun36/PyRAG.
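The idea of expressing a multi-hop question as an executable program over retrieval and QA tools can be sketched as follows. Everything here is a toy assumption for illustration: the in-memory `KB`, the `retrieve` and `qa` signatures, and `run_program` are hypothetical and are not taken from the PyRAG repository.

```python
from typing import Dict, Optional, Tuple

# Toy knowledge base standing in for a real retriever (assumption).
KB = {
    ("Inception", "director"): "Christopher Nolan",
    ("Christopher Nolan", "birthplace"): "London",
}


def retrieve(entity: str) -> Dict[str, str]:
    """Hypothetical retrieval tool: return all known facts about an entity."""
    return {rel: val for (ent, rel), val in KB.items() if ent == entity}


def qa(relation: str, facts: Dict[str, str]) -> str:
    """Hypothetical QA tool: read one relation out of retrieved facts.
    Raises KeyError when the evidence does not contain the answer."""
    return facts[relation]


# A two-hop question ("Where was the director of Inception born?") as a program.
PROGRAM = """
hop1 = retrieve("Inception")
director = qa("director", hop1)
hop2 = retrieve(director)
answer = qa("birthplace", hop2)
"""


def run_program(source: str) -> Tuple[dict, Optional[str]]:
    """Execute the program with only the tools in scope.
    Returns the intermediate variables and an error message, if any."""
    state: dict = {}
    try:
        exec(source, {"retrieve": retrieve, "qa": qa}, state)
    except Exception as exc:  # deterministic feedback for a repair step
        return state, f"{type(exc).__name__}: {exc}"
    return state, None
```

Each hop is a named variable in the returned state, so a failed hop can be inspected directly, and the error string is the kind of execution feedback a repair loop could consume.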

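2605.12966 2026-05-14 cs.AI Updated version

Position: Agentic AI System Is a Foreseeable Pathway to AGI

Junwei Liao, Shuai Li, Muning Wen, Jun Wang, Weinan Zhang

Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; University College London

AI summary: This paper questions whether scaling a single monolithic model is the only path to artificial general intelligence (AGI) and argues that agentic AI is a necessary paradigm for handling the complex, heterogeneous distribution of real-world tasks. Through theoretical derivation, it contrasts the optimization constraints of monolithic learners with those of agentic systems, argues that agentic AI has exponential advantages in generalization and sample efficiency, discusses the connection to Mixture-of-Experts, and calls for more research on agentic AI.

Comments Accepted by ICML'26 Position Track

Abstract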

Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real-world tasks. Through rigorous theoretical derivations, we contrast the optimization constraints of monolithic learners against the efficiency of Agentic systems, progressing from simple routing mechanisms to general Directed Acyclic Graph (DAG) topologies. We demonstrate that Agentic AI achieves exponentially superior generalization and sample efficiency. Finally, we discuss the connection to Mixture-of-Experts, reinterpret the instability of current multi-agent frameworks, and call for greater research focus on Agentic AI.

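2605.12963 2026-05-14 cs.AI Updated version

Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements

James M. Mazzu

Affiliations: Digie Inc.

AI summary: As AI systems grow more capable, safety strategies must not only reduce present risk but also sustain safety once external control can no longer reliably constrain system behavior. Using control theory, this paper analyzes at a structural level whether externally enforced safety strategies can succeed and establishes two main results: once the system's effects exceed what bounded external control can counteract, no strategy that depends on external enforcement can sustain AI safety; and if any viable strategies remain, they must be intrinsic and satisfy four structural requirements, such as a terminal objective that is safety-compatible and stable under self-modification. The paper gives formal structure to a widely held concern about the limits of external control.

Abstract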

As AI systems become increasingly capable, safety strategies must be evaluated not only by how much they reduce present risk, but by whether they could sustain safety once external control can no longer reliably constrain system behavior. This paper addresses that problem by using control theory to clarify, at a structural level, whether externally enforced safety-sustaining strategies can succeed and, if not, what any alternative strategy would have to satisfy in order to be viable. It establishes two main results. First, under explicit premises including a reachability condition, it proves a class-wide external impossibility result: once the system's effects exceed what bounded external control can counteract, no strategy that depends in any degree on continued external enforcement can sustain AI safety. This failure is structural across the entire externally enforced class rather than contingent on any particular strategy. Second, it establishes a conditional class-level necessity result: if at least one candidate safety-sustaining strategy remains after that elimination, then all such remaining strategies must be intrinsic. It then states four structural requirements for viability: safety may not depend on continued external enforcement; the system's terminal objective must be safety-compatible when first formed; that objective must remain stable under self-modification; and safety must continue to be preserved as capability grows. The paper does not propose a complete strategy for sustaining AI safety. Its contribution is to give formal structure to a widely held concern about the limits of external control. It does so by deriving explicit conditional results that identify which safety-sustaining strategies are ruled out and what any remaining strategies must satisfy.

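2605.12954 2026-05-14 cs.CV cs.AI Updated version

AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

Xiao Yang, Yingzhe Ma, Haoxuan Yu, Zixin Li, Ning Qin

Affiliations: University of Electronic Science and Technology of China

AI summary: AdaFocus is an efficient framework for long-video understanding that targets the difficulty existing methods have in balancing temporal coverage, visual detail, and computational efficiency. Through adaptive relevance-diversity sampling and a zero-cache look-back mechanism, it acquires evidence from the video progressively, cutting memory and compute cost while preserving key visual details. Experiments on several benchmarks show a better efficiency-accuracy trade-off than existing methods.

Comments 9 pages, 4 figures. Authors Xiao Yang and Yingzhe Ma contributed equally

Abstract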

Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency. We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in memory, AdaFocus introduces an uncertainty-triggered refinement mechanism. It performs targeted look-back only when the model is not confident, retrieving high-resolution evidence directly from disk via a zero-cache I/O design. This turns discarded visual details from an irreversible loss into on-demand recoverable evidence without paying the cost of exhaustive preloading. Experiments on seven standard long-video benchmarks show that AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design. These findings suggest that progressive preview combined with zero-cache evidence refinement is a highly effective paradigm for scalable multimedia reasoning.
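The uncertainty-triggered look-back described above can be sketched in a few lines. The abstract does not say how confidence is measured, so the entropy criterion, the threshold value, and the `load_high_res` callback are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of an answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def answer_with_lookback(preview_probs: List[float],
                         load_high_res: Callable[[], List[float]],
                         threshold: float = 0.5) -> Tuple[List[float], bool]:
    """Keep the preview answer when it is confident; otherwise call the
    (hypothetical) loader that re-reads high-resolution evidence on demand.
    Returns the final distribution and whether a look-back happened."""
    if entropy(preview_probs) <= threshold:
        return preview_probs, False
    # Nothing was cached up front: evidence is fetched only on this branch.
    return load_high_res(), True
```

The point of the sketch is the control flow: the expensive read happens only on the uncertain branch, so confident queries never pay for it.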

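2605.12953 2026-05-14 cs.CV cs.AI Updated version

Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

Chao Hao, Jun Xu, Ji Du, Shuo Ye, Ziyue Qiao, Xiaodong Cun, Guangcong Wang, Xubin Zheng, Zitong Yu

Affiliations: School of Computing and Information Technology; Great Bay University; Hangzhou International Innovation Institute; Beihang University; Department of Computing; The Hong Kong Polytechnic University

AI summary: This paper proposes Seg-Agent, a training-free framework for language-guided segmentation that avoids the reliance of earlier methods on large amounts of training data. It builds an explicit multimodal reasoning loop that lets a multimodal large language model reason interactively in the visual domain and directly generate and refine segmentation results. The authors also introduce the Various-LangSeg benchmark to evaluate generalization across scenarios; experiments show that Seg-Agent achieves performance comparable to state-of-the-art training-based methods without any parameter updates.

Abstract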

Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.

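2605.12947 2026-05-14 stat.ML cs.AI cs.LG stat.ME Updated version

When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems

Young Hyun Cho, Will Wei Sun

Affiliations: Department of Statistics, Purdue University

AI summary: As LLM-based AI workflows increasingly rely on iterative generate-evaluate-revise loops, deciding when to stop iterating and release the output becomes a key question. This paper proposes an always-valid release wrapper for existing generator-evaluator systems: it builds a reference pool of high-scoring failures and accumulates evidence with an e-process, giving statistical guarantees under optional stopping. The method controls the probability of releasing on infeasible tasks while still releasing on feasible ones, as supported by theoretical analysis and experiments.

Abstract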

LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This raises a statistical challenge because deployment-time evaluator scores are adaptively generated and repeatedly monitored, yet the likelihood models or exchangeability assumptions typically used for calibration are unavailable. We propose an always-valid release wrapper for existing generator-evaluator pipelines. The wrapper builds a hard-negative reference pool of high-scoring failures, calibrates deployment-time evaluator scores against this pool, and accumulates the resulting evidence with an e-process. This separates two roles: the reference pool turns black-box scores into conservative evidence, while the e-process provides validity under optional stopping. In theory, we show that a conservative reference pool yields finite-sample control of the probability of releasing on infeasible tasks, that is, tasks for which the given workflow is not capable of producing a reliable solution. We also characterize conditions under which the same conservative rule still achieves nontrivial release on feasible tasks. In an MBPP+ coding-agent case study, the wrapper reduces premature incorrect release relative to baseline stopping rules while still releasing on tasks for which the workflow repeatedly accumulates moderate supporting evidence.
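The two roles named in the abstract (a reference pool that turns scores into conservative evidence, and an e-process that stays valid under optional stopping) can be sketched as below. The abstract does not give the construction, so this uses standard ingredients as assumptions: a rank-based p-value against the pool, the calibrator e = κ·p^(κ−1) for converting p-values to e-values, a running product, and the threshold 1/α from Ville's inequality. The product is only an e-process if each e-value is valid conditionally on the past, which this toy code does not verify. All function names are hypothetical.

```python
from typing import Iterable, List, Optional


def pool_p_value(score: float, reference_pool: List[float]) -> float:
    """Conservative rank p-value: how often hard-negative failures in the
    pool score at least as high as the current candidate."""
    at_least = sum(1 for ref in reference_pool if ref >= score)
    return (1 + at_least) / (len(reference_pool) + 1)


def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Standard p-to-e calibrator kappa * p**(kappa - 1), with 0 < kappa < 1."""
    return kappa * p ** (kappa - 1.0)


def release_step(scores: Iterable[float],
                 reference_pool: List[float],
                 alpha: float = 0.05) -> Optional[int]:
    """Multiply evidence across iterations; return the first step (1-based)
    at which it reaches 1/alpha, or None if the workflow never releases."""
    wealth = 1.0
    for step, score in enumerate(scores, start=1):
        wealth *= p_to_e(pool_p_value(score, reference_pool))
        if wealth >= 1.0 / alpha:
            return step
    return None
```

With a pool of 99 failure scores, a candidate that beats every one of them has p = 0.01 and contributes an e-value of 5, so two such iterations clear the 1/α = 20 threshold, while a candidate scoring near the middle of the pool loses evidence at every step.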

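2605.12940 2026-05-14 cs.LG cs.AI Updated version

The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models

Zhiyu Zhao, Xuejie Liu, Muhan Zhang, Anji Liu

Affiliations: School of Computing, National University of Singapore; Institute for Artificial Intelligence, Peking University; School of Intelligence Science and Technology, Peking University

AI summary: This paper studies the expressivity boundary of probabilistic circuits (PCs) in generative language modeling and compares them with Transformer-based large language models (LLMs). It finds that PCs still fall short in autoregressive language modeling, limited mainly by their output parameterization and context-encoding structure. By introducing a logit-space parameterization and analyzing the dependency-topology limits of structured-decomposable PCs, the authors identify key differences between PCs and LLMs and prove that decomposable PCs are strictly more expressive than structured-decomposable ones, although optimizing them effectively remains an open challenge.

Abstract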

Probabilistic Circuits (PCs) are deep generative models that support exact and efficient probabilistic inference. Yet in autoregressive language modeling, PCs still lag behind Transformer-based large language models (LLMs), suggesting an important expressivity gap. In this work, we compare PCs and LLMs under a unified autoregressive formulation. First, an output bottleneck: PCs parameterize predictions as convex combinations in probability space, which struggles to represent the sharp distributions typical of language; adopting a logit-space parameterization substantially narrows this gap. Second, a context-encoding bottleneck: we prove that structured-decomposable PCs can match Transformer separation rank on vtree-aligned partitions, but show, both theoretically and empirically, that this capacity is limited to partitions aligned with the fixed routing structure, leading to severe degradation when the data exhibits heterogeneous dependency topologies. We further prove that decomposable PCs are strictly more expressive than structured-decomposable ones, though effectively optimizing them remains an open challenge.
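The output bottleneck can be made concrete with a small numeric sketch: a convex combination of distributions can never be sharper than its sharpest component, whereas adding log-probabilities and renormalizing can be. The functions below are illustrative assumptions and are not the parameterization used in the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_probabilities(components: List[List[float]],
                      weights: List[float]) -> List[float]:
    """Convex combination in probability space (weights sum to 1)."""
    size = len(components[0])
    return [sum(w * c[i] for w, c in zip(weights, components))
            for i in range(size)]


def mix_logits(components: List[List[float]],
               weights: List[float]) -> List[float]:
    """Weighted sum of log-probabilities followed by renormalization.
    Weights are unconstrained here (an assumption for this sketch)."""
    size = len(components[0])
    logits = [sum(w * math.log(c[i]) for w, c in zip(weights, components))
              for i in range(size)]
    return softmax(logits)


experts = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
prob_space = mix_probabilities(experts, [0.5, 0.5])   # peak stays below 0.7
logit_space = mix_logits(experts, [1.0, 1.0])         # peak rises above 0.7
```

Here the probability-space mixture peaks at 0.65, below the sharpest expert, while the logit-space combination peaks at 0.42 / 0.49, which is about 0.857.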

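2605.12938 2026-05-14 cs.CV cs.AI cs.LG Updated version

CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

Seonghyun Jin, Youngmin Kim, Sunwoo Park, Jong Chul Ye

Affiliations: Graduate School of AI

AI summary: This paper proposes CRePE (Curved Ray Expectation Positional Encoding) for video generation with unified camera control. To address the weakness of existing methods on camera configurations such as wide-angle and fisheye lenses, CRePE introduces a depth-aware positional distribution that captures the projected-path geometry induced by wide-field-of-view cameras, improving the stability of camera control and generation quality. The method combines a Geometric Attention Adapter with pseudo supervision from a monocular geometry foundation model, supports a range of camera models, and improves several geometry-aware and perceptual-quality metrics.

Comments 17 pages, 8 figures, Under review

Abstract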

Camera-conditioned video generation requires positional encoding that remains reliable under changes in camera motion, lens configuration, and scene structure. However, existing attention-level camera encodings either provide ray-only camera signals or rely on pinhole camera geometry, limiting their applicability to general camera control under the Unified Camera Model, including wide-angle and fisheye lenses. To address this limitation, we propose Curved Ray Expectation Positional Encoding (CRePE). CRePE represents each image token as a depth-aware positional distribution along its source ray, providing a Unified Camera Model-compatible positional encoding that captures the projected-path geometry induced by wide-angle and fisheye cameras. CRePE is implemented through a Geometric Attention Adapter added to frozen video DiTs, injecting token-wise scene-distance information into selected attention layers and stabilizing it with pseudo supervision from a monocular geometry foundation model. This design leads to more stable camera control and improves several geometry-aware and perceptual-quality metrics, while remaining competitive on video-quality metrics. Controlled positional-encoding ablations show a better overall average rank than a RayRoPE-style endpoint PE baseline, demonstrating the effectiveness of UCM-aware projected-path integration across diverse camera models. Furthermore, by extending the same positional-encoding pathway to external geometry control through Radial MixForcing, CRePE supports external radial-map control for scene-geometry-conditioned generation and source-video motion transfer beyond camera control.

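2605.12937 2026-05-14 cs.CV cs.AI cs.HC Updated version

AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

Jacob Lagogiannis, William Agnew, Rosa I. Arriaga, Sauvik Das

Affiliations: Franklin and Marshall College; Carnegie Mellon University; Georgia Institute of Technology

AI summary: This paper proposes AuraMask, an extensible pipeline for developing anti-facial recognition image filters that are both adversarially effective and aesthetically acceptable. By emulating popular one-click Instagram filters, the authors produce 40 aesthetic filters that meet or exceed the adversarial effectiveness of prior methods against open-source facial recognition models. A user study shows significantly higher user acceptance than prior methods, and the pipeline is released to support further research on privacy protection.

Comments 21 pages, 10 figures

Abstract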

Anti-facial recognition (AFR) image filters alter images in ways that are subtle to people but blinding to computer vision. Yet, despite widespread interest in these technologies to subvert surveillance, users rarely use them in practice -- because the "subtle" alterations are visible enough to conflict with users' self-presentation goals. To address this challenge, we propose AuraMask: a novel approach to creating AFR filters that are both adversarially effective and aesthetically acceptable. Using AuraMask, we produce 40 "aesthetic" filters that emulate popular "one-click" Instagram image filters. We show that AuraMask filters meet or exceed the adversarial effectiveness of prior methods against open-source facial recognition models. Moreover, in a controlled online user study (N = 630) we confirm these filters achieve significantly higher user acceptance than prior methods. Lastly, we provide our AFR pipeline to the community for accelerated research in adversarially effective and aesthetically acceptable protections.

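2605.12922 2026-05-14 cs.AI cs.CL Updated version

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

Vardhan Dongre, Joseph Hsieh, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Dilek Hakkani-Tür

Affiliations: University of Illinois Urbana-Champaign; Adobe Research

AI summary: This paper studies how large language models gradually lose track of task goals, personas, and rules over multi-turn conversations. The authors propose a channel-transition account in which goal-defining tokens become less accessible through attention while related information may persist in residual representations. Using the Goal Accessibility Ratio (GAR) together with residual-stream probes and other methods, the study reveals distinct failure modes across models after attention closes and shows how well residual representations predict task performance.

Abstract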

Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
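The abstract describes GAR only as a measure of attention from generated tokens to goal-defining tokens and gives no formula. One plausible reading, stated here as an assumption, is the average share of each generated token's attention mass that lands on goal positions. The function and argument names are hypothetical, and a real measurement would also aggregate over heads and layers.

```python
from typing import List, Sequence


def goal_accessibility_ratio(attention: List[List[float]],
                             goal_positions: Sequence[int],
                             generated_positions: Sequence[int]) -> float:
    """Mean fraction of attention mass that generated tokens (query rows)
    place on goal tokens (key columns). `attention[q][k]` is the weight
    that query position q assigns to key position k."""
    goal = set(goal_positions)
    shares = []
    for q in generated_positions:
        row = attention[q]
        on_goal = sum(w for k, w in enumerate(row) if k in goal)
        shares.append(on_goal / sum(row))
    return sum(shares) / len(shares)


# Toy 4-token example: positions 0-1 hold the instruction, 2-3 are generated.
toy_attention = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.25, 0.25, 0.0],
    [0.125, 0.125, 0.25, 0.5],
]
toy_gar = goal_accessibility_ratio(toy_attention, [0, 1], [2, 3])
```

In the toy matrix the first generated token places 0.75 of its attention on the instruction and the second only 0.25, so the ratio is 0.5. A value that falls across turns would correspond to the attention channel closing in this reading.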

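2605.12894 2026-05-14 cs.AI cs.CL Updated version

Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

Harshita Chopra, Kshitish Ghate, Aylin Caliskan, Tadayoshi Kohno, Chirag Shah, Natasha Jaques

Affiliations: University of Washington, Seattle, WA; Georgetown University, Washington, DC

AI summary: This work addresses the poor performance of large language model (LLM) agents when they face the diverse behavior of real users. It proposes Persona Policies (PPol), a plug-and-play control layer that generates user personas with realistic behavioral traits for more robust agent evaluation and training. Persona generation is cast as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors while preserving task goals and to produce diverse personas. Experiments show that PPol markedly improves the realism of user simulation and agent task success, offering a new approach to simulator-based evaluation and training.

Comments Preprint under review

Abstract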

Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug-and-play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand-crafting personas, we cast persona generation as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task-preserving roleplay policies. Candidate generators are guided by a multi-objective fitness score combining human-likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain. Across τ²-bench retail and airline domains, evolved PPol programs yield 33-62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol-conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out-of-distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator-based evaluation and training without changing tasks or rewards.

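2605.12887 2026-05-14 cs.IR cs.AI Updated version

EcoGEO: Trajectory-Aware Evidence Ecosystems for Web-Enabled LLM Search Agents

Hengwei Ye, Jiasheng Mao, Zhenhan Guan, Zheng Tian

Affiliations: ShanghaiTech University

AI summary: This paper proposes EcoGEO, a trajectory-aware, ecosystem-level approach to Generative Engine Optimization for web-enabled LLM search agents. Unlike existing methods that optimize single pages, EcoGEO considers the agent's whole browsing trajectory during search and builds a coordinated evidence ecosystem that steers the agent toward discovering and verifying the target information. Experiments show that it consistently outperforms page-level baselines on recommendation tasks, mainly by shaping the agent's browsing path and evidence-acquisition process.

Abstract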

Web-enabled LLM agents are changing how online information influences search outcomes. Existing Generative Engine Optimization (GEO) studies mainly focus on individual webpages. However, agentic web search is not a single-document setting: an agent may issue queries, crawl pages, follow links, reformulate searches, and synthesize evidence across multiple browsing steps. Influence therefore depends not only on page content, but also on how pages are organized, connected, and encountered along the agent's browsing trajectory. We study this shift through Ecosystem Generative Engine Optimization (EcoGEO), which treats GEO as an environment-level influence problem for web-enabled LLM agents. To instantiate this perspective, we propose TRACE, a Trajectory-Aware Coordinated Evidence Ecosystem. Given a recommendation query and a fictional target product, our method builds a controlled evidence environment that coordinates an agent-facing navigation entry page with heterogeneous support pages. These pages use shared terminology, internal links, and consistent product attributes to introduce, verify, and reinforce the target product. We evaluate our method on OPR-Bench, a benchmark for open-ended product recommendation. Experiments show that it consistently outperforms page-level GEO baselines in final target recommendation. Trajectory-level metrics further show increased initial target-result crawls, target-specific follow-up searches, and internal-link crawls, suggesting that the gains come from shaping the agent's evidence-acquisition process rather than merely adding more target-related content. Overall, our findings support an ecosystem research paradigm for GEO, where web-enabled LLM agents are studied in relation to the broader evidence environments that guide search, browsing, and answer synthesis.

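2605.12869 2026-05-14 cs.CR cs.AI Updated version

Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

Zvi Topol

Affiliations: MuyVentive, LLC

AI summary: This paper studies how the safety of large language models (LLMs) degrades under sustained adversarial attack and proposes a new evaluation framework based on survival analysis to quantify the temporal dynamics of jailbreaks. The method models time-to-jailbreak as a survival outcome, which allows estimation of hazard functions, survival curves, and associated risk factors. Experiments show that different models exhibit distinct vulnerability profiles under repeated attacks, giving model developers actionable guidance.

Abstract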

Large language models (LLMs) are increasingly deployed in a wide range of applications, yet remain vulnerable to adversarial jailbreak attacks that circumvent their safety guardrails. Existing evaluation frameworks typically report binary success/failure metrics, failing to capture the temporal dynamics of how attacks succeed under persistent adversarial pressure. This preliminary work proposes a novel evaluation framework that applies survival analysis techniques to characterize LLM jailbreak vulnerability. Our approach models the time-to-jailbreak as a survival outcome, enabling estimation of hazard functions, survival curves, and risk factors associated with successful attacks. We evaluate three LLMs against a subset of prompts from the HarmBench dataset spanning three attack categories. Our analysis reveals that models exhibit distinct vulnerability profiles: while one model demonstrates rapid degradation under iterative attacks, the two other models show consistent moderate vulnerability. Our framework provides actionable insights for model and LLM application developers and establishes survival analysis as a rigorous methodology for LLM safety evaluation.
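Treating time-to-jailbreak as a survival outcome can be illustrated with the Kaplan-Meier estimator, a standard way to estimate a survival curve from censored data. The abstract does not say which estimators the paper uses, so this choice is an assumption, and the toy data below is made up for illustration: each duration is the number of attack attempts on a prompt, and a prompt that was never jailbroken within the attempt budget is treated as censored.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(durations: Sequence[int],
                 events: Sequence[bool]) -> List[Tuple[int, float]]:
    """Kaplan-Meier survival curve.
    durations[i]: attempts until jailbreak, or until observation stopped.
    events[i]: True if a jailbreak was observed, False if censored.
    Returns (time, S(time)) at each time with at least one observed event."""
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        failed = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - failed / at_risk
        curve.append((t, survival))
    return curve


# Illustrative data only: five prompts, two never jailbroken (censored).
toy_curve = kaplan_meier([1, 2, 2, 3, 5], [True, True, False, True, False])
```

For this toy data the estimated probability of a prompt surviving unbroken is 0.8 after one attempt, 0.6 after two, and 0.3 after three, which is the kind of temporal profile a single success rate hides.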

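2605.12863 2026-05-14 cs.PL cs.AI cs.CR Updated version

Language-Based Agent Control

Timothy Zhou, Loris D'Antoni, Nadia Polikarpova

Affiliations: Department of Computer Science and Engineering

AI summary: This paper proposes language-based agent control (LBAC), a programming model that applies programming-language and language-based security techniques to controlling agentic applications. The model requires agents to generate programs that pass static type checking, so their behavior satisfies user-specified policies such as access control and information-flow control; the type-checker rejects unsafe programs before execution. LBAC stays highly expressive while applying policies uniformly to agent-generated behavior and developer-written scaffolding code, and it is demonstrated with three case studies.

Abstract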

This paper introduces language-based agent control (LBAC), a new programming model for agentic applications that brings techniques from programming languages and language-based security to the problem of agent control. In conventional programming, combinations of static typing and runtime enforcement have long been used to guarantee that well-typed programs satisfy user-specified policies, including policies for access control, information flow, data provenance, and more. The key idea behind LBAC is to extend these guarantees to agentic applications by requiring agents to generate programs that are themselves well typed in the context of the surrounding scaffolding code. Unsafe programs are rejected by the type-checker before execution, allowing policies to apply uniformly across the entire application, including both agent-generated behavior and developer-written scaffolding. At the same time, LBAC preserves substantial expressiveness: agents may perform arbitrary side-effect-free computation and recursively invoke subagents, which retain full tool access subject to the same -- or potentially more restrictive -- policies. We demonstrate LBAC with three case studies: I/O sandboxing via filesystem capabilities, data provenance, and information-flow control.
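The core move in this abstract, rejecting an unsafe agent-generated program before it runs, can be illustrated with a much weaker stand-in for a type-checker: a static allowlist pass over the program's syntax tree using Python's standard `ast` module. This is not the paper's type system; the function name, the rule set, and the capability names in the usage note are hypothetical.

```python
import ast
from typing import List, Set


def check_program(source: str, allowed_calls: Set[str]) -> List[str]:
    """Statically scan an agent-generated program, without executing it,
    and return the policy violations found. An empty list means accepted.
    Rules (assumptions for this sketch): no imports, no attribute access,
    and only direct calls to explicitly granted function names."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append("import is not permitted")
        elif isinstance(node, ast.Attribute):
            violations.append(f"attribute access '.{node.attr}' is not permitted")
        elif isinstance(node, ast.Call):
            granted = (isinstance(node.func, ast.Name)
                       and node.func.id in allowed_calls)
            if not granted:
                violations.append("call outside the granted capabilities")
    return violations
```

A program such as `data = read_file("notes.txt")` passes when `read_file` is in the granted set, while `import os` followed by `os.remove("notes.txt")` is rejected before any side effect can occur. A real type-based approach can express far richer policies, such as information flow and provenance, than this name-level check.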

2605.12857 2026-05-14 cs.MA cs.AI cs.AR cs.LG Updated version

ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

Zhongkai Yu, Yichen Lin, Chenyang Zhou, Yuwei Zhang, Kun Zhou, Junxia Cui, Haotian Ye, Zhengding Hu, Zaifeng Pan, Ruiyi Wang, Yujie Zhao, Hejia Zhang, Jingbo Shang, Jishen Zhao, Yufei Ding

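Institutions * UCSD; Columbia University

AI summary Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they cannot meet chip vendors' security requirements and struggle to make use of vendors' proprietary data. To address these problems, this paper proposes ChipMATE, the first self-trained multi-agent framework for RTL generation, in which a Verilog agent and a Python reference-model agent verify each other's outputs, ensuring the correctness of generated code without a golden testbench. The method uses a backtrack-based inference workflow and a two-stage training strategy, combined with a high-quality data-generation framework, which markedly improves generation quality and outperforms existing models on several evaluation metrics.

Details
English abstract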

Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed-source APIs incompatible with chip vendors' air-gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases, leaving valuable internal data unused. Recent self-trained models address the deployment constraint but remain single-turn generators that overlook the critical role of verification in real industrial flows. To bridge these gaps, we present ChipMATE, the first self-trained multi-agent framework for RTL generation. Inspired by industrial practice where correctness emerges from cross-comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference-model agent that mutually verify each other's outputs without any golden oracle. We design a backtrack-based inference workflow to prevent error propagation across turns, and a two-stage training pipeline that first trains each agent individually to saturate its code-generation capability, then trains the team jointly to collaborate effectively. To support the training, we further build a hybrid data-generation framework that produces 64.4K high-quality reference model training samples. ChipMATE achieves 75.0% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B base models, outperforming all existing self-trained models and even DeepSeek V4 with 1600B parameters. Our code and model weights are publicly available at https://github.com/zhongkaiyu/ChipMATE.

2605.12851 2026-05-14 cs.CV cs.AI Updated version

PRISM: Perinuclear Ring-based Image Segmentation Method for Acute Lymphoblastic Leukemia Classification

Larissa Ferreira Rodrigues Moreira, Leonardo Gabriel Ferreira Rodrigues, Rodrigo Moreira, André Ricardo Backes

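Institutions * Institute of Exact and Technological Sciences; Federal University of Viçosa; School of Computer Science; Federal University of Uberlândia; Department of Computing; Federal University of São Carlos

AI summary This study addresses the challenges of analyzing peripheral blood smear images for acute lymphoblastic leukemia (ALL) classification and proposes PRISM, a perinuclear ring-based image segmentation method. Instead of delineating the cytoplasm contour, the method builds adaptive concentric zones around the nucleus, extracting robust cytoplasmic features without accurate cell-boundary detection. Experiments show that the method, combined with a calibrated ensemble of traditional classifiers, performs well, reaching an accuracy of 98.46% and an AUC of 0.9937.

Comments Paper accepted for publication at the XXVI Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS 2026), Ouro Preto, MG, Brazil

Details
English abstract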

Automated analysis of peripheral blood smears for Acute Lymphoblastic Leukemia (ALL) is hindered by low contrast and substantial variability in cytoplasmic appearance, which complicate conventional membrane-based segmentation. We found that many recent approaches rely on heavy neural architectures and extensive training, but still struggle to generalize across staining and acquisition variability. To address these limitations, we propose the Perinuclear Ring-based Image Segmentation Method (PRISM), which replaces explicit cytoplasmic delineation with adaptive concentric zones constructed around the nucleus. These perinuclear regions enable the extraction of robust cytoplasmic descriptors by integrating color information with texture statistics derived from grey-level co-occurrence patterns, without requiring accurate cell-boundary detection. A calibrated stacking ensemble of traditional classifiers leverages these descriptors to achieve high performance, with an accuracy of 98.46% and a precision-recall AUC of 0.9937.

2605.12845 2026-05-14 cs.CV cs.AI Updated version

AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

Danrui Li, Jiahao Zhang, Bernhard Egger, Moitreya Chatterjee, Suhas Lohit, Tim K. Marks, Anoop Cherian

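Institutions * Rutgers, The State University of New Jersey; The Australian National University; Friedrich-Alexander-Universität Erlangen-Nürnberg; Mitsubishi Electric Research Laboratories (MERL)

AI summary This paper presents AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal assembly instructions, corresponding 3D part models, and assembly trajectories, aimed at the complex shapes and assembly paths found in industrial assembly. It also proposes AssemblyDyno, a transformer-based model that jointly predicts assembly order and part trajectories and outperforms existing methods in assembly pose estimation and trajectory feasibility, with feasibility evaluated by physics simulation.

Comments Accepted at CVPR 2026

Details
English abstract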

Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instructional manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works in both assembly pose estimation and trajectory feasibility, where the latter is evaluated by our physics-based simulations.

2605.12843 2026-05-14 cs.LG cs.AI Updated version

Bayesian Model Merging

Kaiyang Li, Shaobo Han, Qing Su, Shihao Ji

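Institutions * School of Computing, University of Connecticut; Optical Networking and Sensing, NEC Labs America

AI summary This paper proposes Bayesian Model Merging (BMM), a model-merging method that combines multiple task-specific expert models into a single model without joint retraining. It uses a bi-level optimization framework: the inner level performs activation-based Bayesian regression under a strong prior from an anchor model, yielding an efficient closed-form solution; the outer level uses Bayesian optimization to search module-specific hyperparameters globally. BMM also reveals a key alignment between activation statistics and task vectors, which enables a data-free variant that needs no auxiliary data. Experiments show that BMM outperforms existing methods on multiple benchmarks, notably on multi-task vision and language tasks.

Details
English abstract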

Model merging aims to combine multiple task-specific expert models into a single model without joint retraining, offering a practical alternative to multi-task learning when data access or computational budget is limited. Existing methods, however, face two key limitations: (1) they overlook the valuable inductive bias of strong anchor models and estimate the merged weights from scratch, and (2) they rely on a shared hyperparameter setting across different modules of the network, lacking a global optimization strategy. This paper introduces Bayesian Model Merging (BMM), a plug-and-play bi-level optimization framework, where the inner level formulates the model merging as an activation-based Bayesian regression under a strong prior induced by an anchor model, yielding an efficient closed-form solution; and the outer level leverages a Bayesian optimization procedure to search module-specific hyperparameters globally based on a small validation set. Furthermore, we reveal a key alignment between activation statistics and task vectors, enabling us to derive a data-free variant of BMM that estimates the Gram matrix for regression without any auxiliary data. Across extensive benchmarks, including up to 20-task merging in vision and 5-task merging in language, BMM consistently outperforms all plug-and-play anchor baselines (e.g., TA, WUDI-Merging, and TSV). In particular, on the ViT-L/14 benchmark for 8-task merging, a single merged model reaches 95.1, closely matching the average performance of eight task-specific experts (95.8).

2605.12838 2026-05-14 cs.AI Updated version

Multimodal Hidden Markov Models for Persistent Emotional State Tracking

Anamika Ragu, Aneesh Jonelagadda

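Institutions * Kaliber AI, San Mateo, California, USA

AI summary This paper proposes a lightweight hidden Markov model framework over multimodal emotion representations for tracking persistent emotional-state changes in conversation. The method applies a sticky factorial HDP-HMM to multimodal emotion features from video, audio, and text, capturing long-lived emotional phases in a conversation more accurately. Experiments show that, at far lower computational cost than LLM-based methods, the model produces more interpretable emotion sequences, and results on a clinical dataset validate its effectiveness for recovering emotional phases and improving dialogue quality.

Comments 8 pages, 2 figures

Details
English abstract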

Tracking an interpretable emotional arc of a conversation via the sentiment of individual utterances processed as a whole is central to both understanding and guiding communication in applied, especially clinical, conversational contexts. Existing approaches to emotion recognition operate at the utterance level, obscuring the persistent phases that characterize real conversational dynamics. We propose a lightweight framework that models conversational emotion as a sequence of latent emotional regimes using sticky factorial HDP-HMMs over multimodal valence-arousal representations derived from simultaneous video, audio and textual input. We evaluate the quality of regime prediction using LLM-as-a-Judge, geometric, and temporal consistency metrics, demonstrating that the sticky HDP-HMM produces more interpretable regime sequences than the baseline Gaussian HMM at a fraction of the computational cost of LLM-based dialogue state tracking methods. In addition, Question-Answer experiments in a clinical dataset suggest that meaningful emotional phases can reliably be recovered from multimodal valence-arousal trajectories and used to improve the quality of LLM responses in unstable affective regimes via context augmentation. This framework thus opens a path toward interpretable, lightweight, and actionable analysis of conversational emotion dynamics at scale.

2605.12835 2026-05-14 cs.AI Updated version

PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models

Sridhar Mahadevan

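Institutions * Adobe Research and University of Massachusetts, Amherst

AI summary PROMETHEUS is a framework that integrates text, data, and models into causal atlases in order to automate deep causal research. It builds a family of local causal predictive-state models that together form a navigable causal atlas, supporting comparison and integration of causal claims across regions. The paper demonstrates the framework on several real case studies, including extracting causal relations from the literature and counterfactual validation against source data, significantly improving the systematicity and interpretability of causal reasoning.

Comments 27 pages

Details
English abstract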

Large language models can extract local causal claims from text, but those claims become more useful when organized as persistent, navigable world models rather than as flat summaries. We introduce PROMETHEUS, a framework that turns retrieved literature, filings, reviews, reports, agent traces, source data, code, simulations, and scientific models into causal atlases: sheaf-like families of local causal predictive-state models over an explicit cover of a research substrate. Each local region contains causal episodes, structured claim tables, predictive tests, support statistics, and provenance; restriction maps compare overlapping regions; gluing diagnostics expose agreement, drift, contradiction, and underdetermination. The resulting Topos World Model is not a single universal graph. It is a research instrument for navigating what a corpus says, where it says it, how strongly it is supported, and where local claims fail to assemble into a coherent global view. Three literature-atlas case studies -- ocean-temperature impacts on marine populations, GLP-1 weight-loss evidence, and resveratrol/red-wine health-benefit claims -- illustrate deep causal research from text with explicit locality, evidence, persistent state, and gluing tension. Four grounded-counterfactual case studies -- a Nature Climate Change microplastics forcing paper, an Indus Valley hydrology paper with VIC-derived figure data and model code, the canonical Sachs protein-signaling study with single-cell perturbation data, and a Nature singing-mouse study with MAPseq projection matrices -- show a stronger mode: when a paper ships source data, simulation outputs, or code, PROMETHEUS can evaluate a counterfactual against that scientific substrate and then rebuild the sheaf world model around the

2605.12826 2026-05-14 cs.CV cs.AI Updated version

FRAME: Forensic Routing and Adaptive Multi-path Evidence Fusion for Image Manipulation Detection

Kaixiang Zhao, Tianrun Yu, Aoxu Zhang, Junhao Su, Porter Jenkins, Amanda Hughes

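Institutions * Brigham Young University; Rutgers University

AI summary As image-editing tools and generative AI become widespread, verifying the authenticity of digital images is increasingly difficult. To address the limited robustness, fragmented evidence, and weak generalization of existing methods, this paper proposes FRAME, which organizes diverse forensic algorithms into a multi-path analysis space, adaptively selects suitable forensic paths, and fuses complementary evidence to improve detection and localization. FRAME offers a more robust and flexible image-forensics approach while preserving interpretable cues from multiple evidence sources, and it performs well across a range of manipulation scenarios.

Comments Accepted to CVPR 2026 SAFE Workshop

Details
English abstract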

The proliferation of sophisticated image editing tools and generative artificial intelligence models has made verifying the authenticity of digital images increasingly challenging, with important implications for journalism, forensic analysis, and public trust. Although numerous forensic algorithms, ranging from handcrafted methods to deep learning-based detectors, have been developed for manipulation detection, individual methods often suffer from limited robustness, fragmented evidence, or weak generalization across manipulation types and image conditions. To address these limitations, we present FRAME, a method for Forensic Routing and Adaptive Multi-path Evidence fusion for image manipulation detection. FRAME organizes diverse forensic algorithms into a multi-path analysis space, adaptively selects informative forensic paths for each input image, and fuses complementary evidence to improve detection and localization performance. By moving beyond single-method analysis and fixed fusion strategies, FRAME provides a more robust and flexible approach to image forensic reasoning while preserving interpretable forensic cues from multiple evidence sources. Experimental results demonstrate the effectiveness of FRAME across diverse manipulation scenarios. Code is available at https://github.com/kzhao5/FRAME.

2605.12817 2026-05-14 cs.LG cs.AI cs.CL Updated version

Training Large Language Models to Predict Clinical Events

Benjamin Turtel, Paul Wilczewski, Kris Skotheim

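Institutions * Lightning Rod Labs

AI summary This study trains large language models to predict clinical events from longitudinal clinical notes. By converting time-ordered MIMIC-III notes into prediction examples that consist of past patient history, a question about a future event, and a label resolved from later notes, it builds a prediction dataset covering medications, procedures, organ support, microbiology, and mortality. LoRA fine-tuning significantly improves predictive performance and yields reusable supervision for clinical prediction without hand-engineered structured features or endpoint-specific classifiers.

Details
English abstract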

Longitudinal clinical notes contain rich evidence of how patients evolve over time, but converting this signal into training supervision for clinical prediction remains challenging. We extend Foresight Learning to clinical prediction by converting time-ordered MIMIC-III notes into examples consisting of past patient context, a natural-language question about a possible future event, and a label resolved from later documentation. This process yields 6,900 prediction examples from 702 admissions across medications, procedures, organ support, microbiology, and mortality. A small LoRA adapter trained on these examples improves over the prompted base model, reducing expected calibration error from 0.1269 to 0.0398 and Brier score from 0.199 to 0.145, while slightly outperforming GPT-5 point estimates on held-out questions. The approach enables reusable clinical prediction supervision from longitudinal notes without hand-engineered structured features or endpoint-specific classifiers.
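
The two metrics quoted above, Brier score and expected calibration error (ECE), can be computed as follows. This is a minimal sketch assuming binary labels, predicted event probabilities, and equal-width confidence bins; the paper's exact binning scheme is not stated in the abstract, and the toy inputs are hypothetical.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            mean_prob = sum(p for p, _ in members) / len(members)
            event_rate = sum(y for _, y in members) / len(members)
            ece += len(members) / total * abs(mean_prob - event_rate)
    return ece

# Hypothetical forecasts for four held-out questions.
probs = [0.85, 0.85, 0.15, 0.15]
labels = [1, 0, 0, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

Both metrics are lower-is-better: the Brier score rewards confident correct forecasts, while ECE only checks that stated probabilities match observed frequencies.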

2605.12809 2026-05-14 cs.LG cs.AI Updated version

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

Shixing Yu, Promit Ghosal, Kyra Gan

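Institutions * Electrical and Computer Engineering; Department of Statistics; Operations Research and Industrial Engineering

AI summary This study aims to make large language models more reliable in critical domains such as healthcare by identifying the specific training-data tokens that a prediction depends on. To overcome the token-independence assumption and additive decomposability that limit existing methods, the authors propose a framework based on orthogonal latent spaces: sparse autoencoders learn approximately independent latent features, and Jacobian-vector products with inverse-Hessian approximations give token-level influence analysis. Experiments show that the method identifies sparse, interpretable sets of tokens, which helps build trust in the model and makes its decisions more transparent.

Details
English abstract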

A critical step for reliable use of large language models (LLMs) in healthcare is to attribute predictions to their training data, akin to a medical case study. This requires token-level precision: pinpointing not just which training examples influence a decision, but which tokens within them are responsible. While influence functions offer a principled framework for this, prior work is restricted to autoregressive settings and relies on an implicit assumption of token independence, rendering the identified influences unreliable. We introduce a flexible framework that infers token-level influence through a latent mediation approach for general prediction tasks. Our method attaches sparse autoencoders to any layer of a pretrained LLM to learn a basis of approximately independent latent features. Unlike prior methods where influence decomposes additively across tokens, influence computed over latent features is inherently non-decomposable. To address this, we introduce a novel method using Jacobian-vector products. Token-level influence is obtained by propagating latent attributions back to the input space via token activation patterns. We scale our approach using efficient inverse-Hessian approximations. Experiments on medical benchmarks show our approach identifies sparse, interpretable sets of tokens that jointly influence predictions. Our framework enhances trust and enables model auditing, generalizing to high-stakes domains requiring transparent and accountable decisions.

2605.12805 2026-05-14 cs.LG cs.AI Updated version

Discrete MeanFlow: One-Step Generation via Conditional Transition Kernels

Fairoz Nower Khan, Nabuat Zaman Nahim, Md Sajid Ahmed, Ruiquan Huang, Peizhong Ju

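AI summary This paper proposes Discrete MeanFlow, a method for one-step generation in discrete state spaces. Unlike MeanFlow in continuous spaces, it models the transport of probability mass through the conditional transition kernel of a continuous-time Markov chain and defines a mean discrete rate that measures how transition probabilities change over a time interval. The method parameterizes the transition kernel directly with a boundary-by-construction design, so generation needs no iterative denoising or differential-equation solving, only a single forward pass and one categorical draw; experiments show high precision on finite-state Markov chains and synthetic sequence generation tasks.

Details
English abstract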

MeanFlow enables one-step generation in continuous spaces by learning an average velocity over a time interval rather than the instantaneous velocity field of flow matching. However, discrete state spaces do not have smooth trajectories or spatial derivatives, so the continuous formulation does not directly apply. We introduce Discrete MeanFlow, which replaces the motion of a point with the transport of probability mass over finite states. Our key object is the conditional transition kernel of a continuous-time Markov chain (CTMC), from which we define a mean discrete rate that measures the average change in transition probability over a time interval. We prove a Discrete MeanFlow identity that relates this finite-interval rate to the instantaneous CTMC generator at the endpoint, with the Kolmogorov forward equation replacing the spatial chain rule of continuous MeanFlow. Based on this identity, we parameterize the transition kernel directly using a boundary-by-construction design that guarantees valid probability outputs and exact boundary conditions without auxiliary losses. Since the learned kernel is itself a probability distribution, generation reduces to a single forward pass followed by one categorical draw, so no iterative denoising, ODE integration, or multi-step refinement is required. We validate the framework on exact finite-state Markov chains, where the learned kernel recovers the analytical ground truth to high precision, and on factorized synthetic sequence generation tasks with varying alphabet sizes and sequence lengths.

2605.12798 2026-05-14 cs.LG cs.AI cs.CL Updated version

Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

Baris Askin, Muhammed Ustaomeroglu, Anupam Nayak, Gauri Joshi, Guannan Qu, Carlee Joe-Wong

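Institutions * Carnegie Mellon University

AI summary This study examines emergent misalignment (EM) and subliminal learning (SL), which can arise when large language models are fine-tuned on narrow harmful datasets. It argues that such misalignment is not caused by isolated harmful examples but results from interactions among data structure, task difficulty, and model capability. Experiments find that misalignment appears more readily when fine-tuning and evaluation prompts share similar functional structure, when there is more room for coherent harmful completions, or when the target behavior has been reliably learned by the model. The study also compares, for the first time, how misalignment transfers under off-policy and on-policy distillation, and stresses that its causes should be understood from a holistic view of the data and the training pipeline.

Details
English abstract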

Fine-tuning LLMs on narrow harmful datasets can induce Emergent Misalignment (EM), where models exhibit misaligned behavior far beyond the fine-tuning distribution. We argue that emergent misalignment can be better understood as a data-mediated transfer phenomenon: harmful fine-tuning examples do not induce uniform behavioral spillover, but interact with the structural properties of the dataset and the difficulty of the tasks relative to the model. Across our experiments, we find that misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional structure, when prompts leave more room for coherent harmful completions, and when the target behavior has been more reliably learned by the model. The training pipeline itself also matters: pretraining composition shapes later misalignment. We further study Subliminal Learning (SL), where misalignment is transmitted by fine-tuning on seemingly benign data generated by a harmful teacher. Moving beyond the standard SFT setting, we for the first time compare this transfer under off-policy and on-policy distillation as well, allowing us to separate the roles of the teacher guidance and the training data distribution in transmitting misalignment. Together, these results argue for a data-centric view: Emergent/subliminal misalignment should not be treated as a simple consequence of isolated harmful fine-tuning examples, but as the result of interactions between fine-tuning data structure, pretraining distributions, and training channels.

2605.12771 2026-05-14 cs.RO cs.AI cs.LG cs.SY eess.SY math.OC Updated version

Adaptive Smooth Tchebycheff Attention for Multi-Objective Policy Optimization

Alejandro Murillo-Gonzalez, Mahmoud Ali, Lantao Liu

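Institutions * Indiana University–Bloomington

AI summary This paper studies how to optimize policies under complex, non-convex objective trade-offs in multi-objective reinforcement learning. Linear scalarization cannot reach non-convex regions of the Pareto front, while static non-linear scalarization suffers from high gradient variance and unstable optimization in deep RL; to resolve this, the authors propose an Adaptive Smooth Tchebycheff attention framework that balances stability and exploration by dynamically modulating the curvature of the optimization landscape. Experiments on a challenging robotic stealth visual search task show that the method discovers non-convex Pareto-optimal policies that conventional methods struggle to reach.

Comments To appear in the Proceedings of Robotics: Science and Systems (RSS) 2026

Details
English abstract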

Multi-objective reinforcement learning in robotic domains requires balancing complex, non-convex trade-offs between conflicting objectives. While linear scalarization methods provide stability, they are theoretically incapable of recovering solutions within non-convex regions of the Pareto front. Conversely, static non-linear scalarizations (e.g., Tchebycheff) can theoretically access these regions but often suffer from severe gradient variance and optimization instability in deep RL. In this work, we propose an Adaptive Smooth Tchebycheff framework that resolves this tension by dynamically modulating the curvature of the optimization landscape. We introduce a novel conflict-driven controller that regulates the optimization smoothness based on real-time gradient interference. This allows the agent to anneal toward precise, non-convex scalarization when objectives align, while elastically reverting to stable, smooth approximations when destructive gradient conflicts emerge. We validate our approach on a challenging robotic stealth visual search task -- a proxy for monitoring of protected/fragile ecosystems -- where an agent must balance search, exposure/interference minimization and exploration speed. Extensive ablations confirm that our conflict-aware adaptation enables the robust discovery of Pareto-optimal policies in non-convex regions inaccessible to linear baselines and unstable for static non-linear methods. Website: https://alejandromllo.github.io/research/pasta/
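
The tension between hard and smooth Tchebycheff scalarization can be made concrete with a small example. The sketch below uses the standard log-sum-exp smoothing of the weighted max, assuming losses are minimized and the ideal point lower-bounds them. The `adapt_mu` rule is a hypothetical stand-in for a conflict-driven controller and is not the controller described in the paper.

```python
import math

def tchebycheff(losses, weights, ideal):
    """Hard Tchebycheff scalarization: the worst weighted gap to the ideal point."""
    return max(w * (f - z) for f, w, z in zip(losses, weights, ideal))

def smooth_tchebycheff(losses, weights, ideal, mu):
    """Log-sum-exp smoothing; approaches the hard max as mu -> 0."""
    terms = [w * (f - z) / mu for f, w, z in zip(losses, weights, ideal)]
    m = max(terms)  # subtract the max for numerical stability
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))

def adapt_mu(mu, cos_sim, mu_min=0.01, mu_max=1.0, rate=0.5):
    """Hypothetical rule: conflicting gradients (cos_sim < 0) push mu up
    (smoother), aligned gradients push it down (sharper)."""
    target = mu_max if cos_sim < 0.0 else mu_min
    return mu + rate * (target - mu)

losses, weights, ideal = [1.0, 3.0], [0.5, 0.5], [0.0, 0.0]
hard = tchebycheff(losses, weights, ideal)
sharp = smooth_tchebycheff(losses, weights, ideal, mu=0.01)
soft = smooth_tchebycheff(losses, weights, ideal, mu=1.0)
```

With a small `mu` only the worst objective receives gradient, which is what lets the scalarization reach non-convex regions; a larger `mu` spreads gradient across objectives and is typically easier to optimize.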

2605.12762 2026-05-14 cs.LG cs.AI Updated version

Multi-Quantile Regression for Extreme Precipitation Downscaling

Hamed Najafi, Gareth Lagerwall, Jayantha Obeysekera, Jason Liu

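Institutions * Florida International University

AI summary This study addresses the under-prediction of extreme heavy-precipitation events in precipitation downscaling and proposes Q-SRDRN, a deep super-resolution network based on multi-quantile regression. Trained with pinball loss at several quantiles (such as 0.999), the method captures the tail of the precipitation distribution more accurately. Experiments show that the model substantially improves detection of extreme precipitation events across different climate regions, including Florida, California, and Texas, with especially strong results at high quantiles.

Details
English abstract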

Deep super-resolution networks for precipitation downscaling achieve strong bulk skill yet systematically under-predict the heavy-tail events that drive flood risk. We demonstrate that the primary obstacle is the loss function, not the data: under intensity-weighted MAE, real and synthetic labels at the same input are simply averaged, meaning data augmentation shifts the predicted mean rather than the conditional distribution. We resolve this with Q-SRDRN, a multi-quantile super-resolution network trained with pinball loss at τ ∈ {0.50, 0.95, 0.99, 0.999}. Two CNN-specific design choices make this practical: IncrementBound enforces monotonicity while preserving each quantile channel's gradient identity, and separate per-quantile output heads provide independent filter banks for bulk and tail detection. Under this design, data augmentation via cVAE becomes complementary: the median head absorbs synthetic patterns without contaminating upper quantiles. Empirically, on Florida (convective/tropical-cyclone dominated), the un-augmented Q-SRDRN P999 head detects 1,598 of 2,111 events at 200 mm/day versus 88 for the deterministic baseline--an 18x detection-rate gain (4.2% to 75.7%)--with 63% lower KL divergence and 3.9% lower RMSE. Adding cVAE-generated samples lifts the P50 channel from 14 to 1,038 hits at 200 mm/day. On California (atmospheric-river dominated), the architecture reaches near-perfect detection (P999 SEDI >= 0.996 through 300 mm/day). On Texas, the baseline catches only 2 of 10,720 events at 200 mm/day while the P999 head catches 8,776 (81.9%). Although the cVAE does not transfer across regions, multi-quantile regression captures extremes wherever the large-scale signal is strong, and augmentation rescues the median where it is not.
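
The pinball (quantile) loss named above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' Q-SRDRN code. The helper `monotone_quantiles` is a hypothetical stand-in that keeps quantiles from crossing by adding non-negative increments; it should not be read as the paper's IncrementBound layer.

```python
import math

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss for one target/prediction pair at level tau."""
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs tau per unit; over-prediction costs 1 - tau.
    return max(tau * diff, (tau - 1.0) * diff)

def monotone_quantiles(base, raw_increments):
    """Hypothetical non-crossing head: each quantile is the previous one
    plus softplus(raw), which is always positive."""
    quantiles = [base]
    for raw in raw_increments:
        quantiles.append(quantiles[-1] + math.log1p(math.exp(raw)))
    return quantiles

under = pinball_loss(10.0, 4.0, 0.99)   # prediction too low by 6 units
over = pinball_loss(10.0, 16.0, 0.99)   # prediction too high by 6 units
q = monotone_quantiles(2.0, [-1.0, 0.0, 3.0])
```

At tau = 0.99 the 6-unit under-prediction costs 99 times as much as the same over-prediction, which is what pushes an upper-quantile output toward the tail instead of the conditional mean.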

2605.12756 2026-05-14 math.OC cs.AI stat.ML Updated version

Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

Zhehang Du, Hangfeng He, Weijie Su

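Institutions * The Wharton School, University of Pennsylvania; University of Rochester

AI summary This paper studies whether pretraining large language models by minimizing cross-entropy loss induces geometric structure in the model weights and context embeddings. By analyzing a constrained layer-peeled optimization model, the authors prove that symmetries in the target next-token distributions transfer to the model's optimal solutions in a group-theoretic sense. For example, when the target tokens have a cyclic-shift symmetry, the optimal logit matrix is circulant, and the Gram matrices of the output projections and context embeddings also have circulant geometry; for target distributions invariant under the symmetric group, the optimal output projection matrix forms an equiangular tight frame and inherits the permutation symmetries of the input data. Experiments show that open-source LLMs naturally exhibit symmetries consistent with the theory, even though training used no explicit regularization to that effect.

Details
English abstract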

Large language models (LLMs) are pretrained by minimizing the cross-entropy loss for next-token prediction. In this paper, we study whether this optimization strategy can induce geometric structure in the learned model weights and context embeddings. We approach this problem by analyzing a constrained layer-peeled optimization program, which serves as a mathematically tractable surrogate for LLMs by treating the output projection matrix and last-layer context embeddings as optimization variables. Our analysis of this nonconvex optimization program demonstrates that symmetries in the target next-token distributions are transferred to the global minimizers of the layer-peeled model in a precise group-theoretic sense. Specifically, we prove that when the target tokens exhibit a cyclic-shift symmetry (such as the seven days of the week or the twelve months of the year), the optimal logit matrix is exactly circulant, and the Gram matrices of both the output projections and the context embeddings form circulant geometries as well. Next, for exchangeable target distributions invariant under the symmetric group and, more generally, under two-transitive group actions, we show that the global optimal output projection matrix forms a simplex equiangular tight frame, while the optimal logit matrix and context embeddings inherit the permutation symmetries present in the input data. A key technical step is to reduce the constrained nonconvex factorized problem to an explicit logit-level convex characterization for cyclic symmetry and to a symmetry-based lower bound for permutation symmetry, together with a sharp characterization of the optimal factorization. Finally, we empirically demonstrate that open-source LLMs naturally exhibit symmetries consistent with our theoretical predictions, despite being trained without any explicit regularization promoting such geometric structure.
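
To make the circulant claim concrete: a matrix is circulant exactly when shifting its rows and columns together by one position leaves it unchanged. The sketch below builds such a matrix and checks that property; the seven-entry row of logits is a hypothetical example for the days of the week, not data from the paper.

```python
def circulant(first_row):
    """Build the circulant matrix whose row i is first_row shifted right by i."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]

def is_cyclic_shift_invariant(matrix):
    """True if M[i][j] == M[i+1][j+1] with indices taken modulo n."""
    n = len(matrix)
    return all(
        matrix[i][j] == matrix[(i + 1) % n][(j + 1) % n]
        for i in range(n)
        for j in range(n)
    )

# Hypothetical logits: each day scores highest with itself, then its neighbours.
week_logits = circulant([3.0, 1.0, 0.0, -1.0, -1.0, 0.0, 1.0])
generic = [[1.0, 2.0], [3.0, 4.0]]
```

Here `is_cyclic_shift_invariant(week_logits)` holds while the generic 2x2 matrix fails the check, which is the kind of test one could run on a measured logit or Gram matrix.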

2605.12755 2026-05-14 cs.AI Updated version

State-Centric Decision Process

Sungheon Jeong, Ryozo Masukawa, Sanggeon Yun, Mahdi Imani, Mohsen Imani

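Institutions * University of California, Irvine; Northeastern University

AI summary This paper proposes the State-Centric Decision Process (SDP), a runtime framework for language environments (such as web browsers and code terminals) that lack an explicit state space and transition structure. The agent builds the state space step by step: it describes the desired environment state with natural-language predicates and verifies observations after acting, which yields certified state-transition trajectories. Experiments show that SDP achieves the best training-free results on multiple benchmark tasks and supports finer-grained analysis and improvement of agent behavior.

Details
English abstract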

Language environments such as web browsers, code terminals, and interactive simulations emit raw text rather than states, and provide none of the runtime structure that MDP analysis requires: no explicit state space, no observation-to-state mapping, no certified transitions, and no termination criterion. We introduce the State-Centric Decision Process (SDP), a runtime framework that constructs these missing inputs by having the agent build them, predicate by predicate, as it acts. At each step the agent commits to a natural-language predicate describing how the world should look, takes an action to make it true, and checks the observation against it. Predicates that pass become certified states, and the resulting trajectory carries the four objects language environments do not provide, namely a task-induced state space, an observation-to-state mapping, certified transitions, and a termination criterion. We evaluate SDP on five benchmarks spanning planning, scientific exploration, web reasoning, and multi-hop question answering. SDP achieves the best training-free results on all five, with the advantage widening as the horizon grows. The certified trajectories additionally support analyses unavailable to reactive agents, including per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.

2605.12748 2026-05-14 cs.CL cs.AI cs.CY cs.LG Updated version

Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

Heejin Do, Shashank Sonkar, Mrinmaya Sachan

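Institutions * ETH Zürich; ETH AI Center; University of Central Florida

AI summary This study examines how well large language models (LLMs) work as simulated students, noting that current evaluations focus on output similarity to real students and overlook whether a model maintains a coherent misconception and revises it selectively in response to feedback. It proposes a new evaluation framework and a metric, the Selective Flip Score (SFS), which measures a model's ability to revise its answer when given targeted feedback. Experiments find that existing models revise their answers at similar rates under different feedback conditions, showing sycophantic behavior: they tend to abandon the simulated belief and re-solve the problem. The study further proposes a post-training method that effectively improves misconception faithfulness.

Details
English abstract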

Large language models (LLMs) can fluently generate student-like responses, making them attractive as simulated students for training and evaluating AI tutors and human educators. Yet such simulators are typically evaluated by output similarity to real students, not by whether they behave like students with coherent misconceptions during interaction. We introduce a controlled framework for evaluating misconception faithfulness: whether a simulator maintains a misconception-driven belief state and updates selectively when feedback addresses the underlying misconception. Central to our framework is a misconception-contrastive feedback protocol that compares targeted feedback against two controls: misaligned feedback (targeting a different but plausible misconception) and generic feedback (only stating that the answer is wrong). We propose the Selective Flip Score (SFS), which quantifies how much more often a simulator flips its answer under targeted feedback than under contrastive controls. Across seven LLMs (4B-120B), multiple datasets, and prompting strategies, simulators exhibit near-zero SFS, correcting their answers at similarly high rates regardless of feedback relevance. Further analyses reveal a sycophantic failure mode: models behave less like students with misconceptions and more like problem-solvers who treat any corrective signal as a cue to abandon the simulated belief and re-solve from internal knowledge. To address this, we develop a post-training pipeline spanning supervised fine-tuning (SFT), preference optimization, and reinforcement learning (RL) with an SFS-aligned reward; SFT yields notable gains of up to +0.56, and SFS-aligned RL provides more consistent improvements than preference optimization. Our results establish misconception faithfulness as a challenging yet trainable property, motivating a shift from static output matching toward interactive, belief-aware student modeling.
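
One plausible reading of the Selective Flip Score is the flip rate under targeted feedback minus the average flip rate under the two controls. The abstract does not give the exact formula, so the definition below is an assumption made for illustration, and the flip records are hypothetical.

```python
def flip_rate(flips):
    """Fraction of items on which the simulator changed its answer."""
    return sum(flips) / len(flips)

def selective_flip_score(targeted, misaligned, generic):
    """Assumed definition: targeted flip rate minus the mean control flip rate."""
    control = (flip_rate(misaligned) + flip_rate(generic)) / 2.0
    return flip_rate(targeted) - control

# A sycophantic simulator flips at the same rate whatever the feedback says.
sycophantic = selective_flip_score(
    targeted=[True, True, True, False],
    misaligned=[True, True, False, True],
    generic=[True, False, True, True],
)

# A faithful simulator flips mainly when the feedback hits its misconception.
faithful = selective_flip_score(
    targeted=[True, True, True, True],
    misaligned=[False, False, False, True],
    generic=[False, False, False, False],
)
```

Under this reading a score near zero matches the failure mode reported above, while a score near one indicates updates that depend on the feedback actually addressing the simulated misconception.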

2605.12746 2026-05-14 cs.CR cs.AI Updated version

CoT-Guard: Small Models for Strong Monitoring

Nirav Diwan, Han Wang, Berkcan Kapusuzoglu, Ramin Moradi, Supriyo Chakraborty, Giri Iyengar, Sambit Sahu, Huan Zhang, Gang Wang

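Affiliations * University of Illinois Urbana-Champaign; Capital One

AI summary This paper proposes CoT-Guard, a small 4B-parameter model that monitors chain-of-thought (CoT) reasoning to detect hidden objectives in code generation tasks. Because small models struggle to detect hidden objectives, the authors design a post-training method that combines supervised fine-tuning and reinforcement learning and improves detection on both in-domain and out-of-domain tasks. Experiments show that CoT-Guard performs well under several attack scenarios, outperforming GPT-5.4, GPT-5-mini, and Qwen3-32B while narrowing the gap to Gemini-3-Flash, and offers users an efficient, low-cost defense.

Details
English abstract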

Monitoring the chain-of-thought (CoT) of reasoning models is a promising approach for detecting covert misbehavior (i.e., hidden objectives) in code generation tasks. While large models (GPT-5, Gemini-3-Flash) can serve as effective CoT monitors, they are expensive to deploy due to the lengthy reasoning traces and high API cost, which underscores the need for smaller, cheaper alternatives. Nevertheless, we find that current small models (4B--8B) struggle to detect hidden objectives despite access to the CoT, frequently mistaking them for part of the user query. To address this, we propose a post-training pipeline combining supervised fine-tuning (SFT) and reinforcement learning (RL), where SFT narrows the gap for in-domain tasks by distilling detection behavior from stronger monitors, and RL on hard and subtly crafted hidden objectives helps the model generalize to out-of-domain monitoring tasks. To validate this generalization, we evaluate under a realistic threat model motivated by practical supply-chain attacks, where the adversary is a third-party LLM router injecting hidden objectives into code-generation requests through either prompt manipulation or code manipulation attacks. To push beyond objectives that large monitors already saturate, we also introduce four new tasks that are challenging even for strong monitors. Finally, we introduce CoT-Guard, a 4B-parameter monitor that demonstrates superior generalization performance under both prompt and code manipulation attacks, achieving a G-mean$^2$ (i.e., TNR $\times$ TPR) of 75% and outperforming GPT-5.4 (56%), GPT-5-mini (41%), and Qwen3-32B (54%), while closing the gap to Gemini-3-Flash (83%). These results demonstrate that CoT-Guard provides a practical and cost-effective user-side defense, substantially improving hidden-objective detection while avoiding the deployment cost of large monitors.
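
The reported metric is the product of true negative rate and true positive rate. A small Python sketch of that arithmetic from confusion-matrix counts follows; the function name `g_mean_squared` and the counts are hypothetical and are not results from the paper.

```python
def g_mean_squared(tp, fn, tn, fp):
    """Return TNR * TPR from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of hidden-objective traces that are flagged
    tnr = tn / (tn + fp)  # share of benign traces that are passed
    return tnr * tpr


# Hypothetical counts for a monitor that is fairly good on both classes.
score = g_mean_squared(tp=90, fn=10, tn=80, fp=20)

# A monitor that flags every trace has perfect TPR but zero TNR.
always_flag = g_mean_squared(tp=100, fn=0, tn=0, fp=100)
print(score, always_flag)
```

Because the two rates are multiplied, a monitor cannot score well by flagging everything or by flagging nothing.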

2605.12745 2026-05-14 cs.HC cs.AI Updated version

What Do You Think I Think? Accounting for Human Beliefs Using Second-Order Theory of Mind

Patrick Callaghan, Reid Simmons, Henny Admoni

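Affiliations * Robotics Institute, Carnegie Mellon University

AI summary This study examines how an agent can account for a person's erroneous beliefs about what the agent knows, and proposes a model based on second-order Theory of Mind (ToM-2) to detect and respond to human cognitive biases and heuristics. Using the I-POMDP framework, the agent models a person's erroneous beliefs and their causes and generates adaptive feedback accordingly. A user study shows that the approach improves the informativeness of human teachers' actions, and participants found the feedback more useful.

Comments To appear in the proceedings of The 2026 Cognitive Science Society Conference

Details
English abstract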

Discrepancies between an agent's actual knowledge and what a person thinks the agent knows can hinder interactions. If an agent could detect such discrepancies, it could provide feedback to account for them and improve current and future interactions. Using the I-POMDP as a framework for a second-order Theory of Mind (ToM-2), this work endows an agent with the ability to model the evolution of a person's erroneous beliefs about an agent and the cognitive biases and heuristics (CBH) from which they arise. In doing so, the agent can detect when CBH might be at play during an interaction and adaptively generate feedback that accounts for them. An in-person user study shows how a ToM-2 learner can account for the effects of a teacher's CBH to significantly improve the informativeness of teacher actions, and subjective results suggest people find the ToM-2 learner's feedback more useful.

2605.12733 2026-05-14 cs.LG cs.AI stat.ML Updated version

From Generalist to Specialist Representation

Yujia Zheng, Fan Feng, Yuke Li, Shaoan Xie, Kevin Murphy, Kun Zhang

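Affiliations * CMU (Carnegie Mellon University); UIUC (University of Illinois Urbana-Champaign); UCSD (University of California San Diego); MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); UMD (University of Maryland); UBC (University of British Columbia)

AI summary This paper studies how to learn a task-relevant specialist representation from a generalist model, focusing on proving identifiability of the task structure and of the task-relevant latent representation in a nonparametric setting. Without interventions, parametric forms, or structural constraints, it proves that the task structure is identifiable in a fully unsupervised manner even when sequences lack strict temporal dependence or contain disconnections, and that within each time step a simple sparsity regularization separates the task-relevant part from the irrelevant part. These results lay a theoretical foundation for provably moving from generalist to specialist models.

Comments ICML 2026

Details
English abstract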

Given a generalist model, learning a task-relevant specialist representation is fundamental for downstream applications. Identifiability, the asymptotic guarantee of recovering the ground-truth representation, is critical because it sets the ultimate limit of any model, even with infinite data and computation. We study this problem in a completely nonparametric setting, without relying on interventions, parametric forms, or structural constraints. We first prove that the structure between time steps and tasks is identifiable in a fully unsupervised manner, even when sequences lack strict temporal dependence and may exhibit disconnections, and task assignments can follow arbitrarily complex and interleaving structures. We then prove that, within each time step, the task-relevant latent representation can be disentangled from the irrelevant part under a simple sparsity regularization, without any additional information or parametric constraints. Together, these results establish a hierarchical foundation: task structure is identifiable across time steps, and task-relevant latent representations are identifiable within each step. To our knowledge, each result provides a first general nonparametric identifiability guarantee, and together they mark a step toward provably moving from generalist to specialist models.

2605.12730 2026-05-14 cs.AI cs.GR cs.MA physics.soc-ph Updated version

BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics

Helene Malyutina

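Affiliations * Independent Researcher, Collective Dynamics Lab

AI summary This paper proposes BEHAVE, a hybrid AI framework for real-time modeling of collective human dynamics. Conventional AI systems focus on individual behavior or on detecting events after the fact, and so fail to capture whether a group remains stable, escalates, or breaks down. BEHAVE treats a group as a complex dynamical system with emergence, nonlinearity, feedback loops, and sensitivity near critical points; it builds an interaction space from observable physical signals and models it as continuous behavioral fields, giving a distributed representation and forecast of group state. The framework combines a theorem and two structural propositions with neural models, and a working pipeline is demonstrated on a 7-agent negotiation snapshot.

Comments 19 pages

Details
English abstract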

Existing AI systems for modeling human behavior operate at the level of individuals or detect events after they occur. As a result, they systematically fail to capture the collective dynamics that determine whether a group remains stable or transitions into escalation or breakdown. We propose a different foundation: a group of interacting humans constitutes a complex dynamical system in the precise mathematical sense, exhibiting emergence, nonlinearity, feedback loops, sensitivity near critical points, and phase transitions between qualitatively distinct regimes. The state of such a system is not located within any single participant; it is distributed across mutual influence loops and observable through the micro-dynamics of the body. We introduce BEHAVE (Behavioral Engine for Human Activity Vector Estimation), a formal framework that models collective dynamics as continuous behavioral fields defined over an interaction space derived from observable physical signals. Kinematic micro-signals (position, velocity, body orientation, gestural activity) are structured into a directed interaction graph and aggregated into a basis of behavioral fields capturing distinct, non-redundant axes of collective state. The framework rests on one theorem and two structural propositions characterizing the tension field, the field basis, and the criticality index. Perception and forecasting layers are implemented using neural models, enabling data-driven learning and approximation of system dynamics. BEHAVE is formulated as a computational system for learning, representing, and forecasting collective dynamics from data. A working pipeline is demonstrated on a 7-agent negotiation snapshot. The same fields, recalibrated, apply to crowd safety, crisis-team dynamics, education, and clinical contexts.

2605.12728 2026-05-14 eess.SY cs.AI cs.SE cs.SY Updated version

Grid-Orch: An LLM-Powered Orchestrator for Distribution Grid Simulation and Analytics

Boming Liu, Jin Dong, Jamie Lian

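Affiliations * Electrification and Energy Infrastructures Division, Oak Ridge National Laboratory; UT-Battelle, LLC

AI summary This paper presents Grid-Orch, a framework that connects large language models (LLMs) to power system simulation through the Model Context Protocol (MCP), so that engineers can run complex distribution network analyses in natural language. Built on OpenDSS, it provides 36 domain-specific tools, supports several optimization tasks and multi-step engineering workflows, and is delivered as an interactive web platform, making distribution analysis faster and more accessible.

Details
English abstract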

The power distribution engineering workforce faces a projected shortage of up to 1.5 million engineers by 2030, creating urgent demand for more accessible analysis tools. This paper introduces Grid-Orch, a framework that bridges Large Language Models (LLMs) and power system simulation through the Model Context Protocol (MCP), enabling engineers to perform complex distribution analyses via natural language. Using OpenDSS as the reference implementation, Grid-Orch provides 36 domain-specific tools across eleven categories, covering power flow, voltage analysis, quasi-static time series (QSTS) simulation, and automated optimization. A provider-agnostic LLM layer supports both cloud-hosted (Gemini, Claude) and locally deployed (Ollama, llama-cpp) models, enabling air-gapped operation for security-sensitive utility environments. Three optimization skills, capacitor placement, voltage violation analysis, and overvoltage mitigation, extend the platform beyond single-tool queries to multi-step engineering workflows. Grid-Orch is delivered as an interactive web platform with chat-based interaction, a QSTS dashboard, and feeder topology visualization, and renders simulation results inline. Workflow demonstrations show that distribution analyses formerly requiring hours of scripting, such as distributed energy resource (DER) interconnection screening, complete in under two minutes through natural language, producing numerically identical results to direct OpenDSS scripting.

2605.12724 2026-05-14 cs.CV cs.AI Updated version

Inline Critic Steers Image Editing

Weitai Kang, Xiaohang Zhan, Yizhou Wang, Mang Tik Chiu, Jason Kuen, Kangning Liu, Yan Yan

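Affiliations * University of Illinois Chicago; Adobe

AI summary This paper addresses the uneven difficulty across image regions in instruction-based image editing and proposes a way to correct the model's output during generation. The core idea is a learnable Inline Critic token that critiques predictions at the model's intermediate layers and steers the rest of the forward pass. A three-stage training recipe stabilizes learning, and the method reaches state of the art on GEdit-Bench along with strong results on RISEBench and KRIS-Bench.

Comments 9 pages

Details
English abstract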

Instruction-based image editing exhibits heterogeneous difficulty not only across cases but also across regions of an image, motivating refinement approaches that allocate correction to where the model struggles. Existing refinement signals arrive late, after a fully generated image or a completed denoising step. We ask whether such a signal can act within an ongoing forward pass. To investigate this, we probe a frozen image-editing model and find that although generation capability emerges only in the last few layers, the error pattern is already set in early layers (rank correlation $\rho = 0.83$ with the final-layer error map). Based on this, we introduce Inline Critic, a learnable token that critiques a frozen model's predictions at its intermediate layers and steers its hidden states to refine generation during the forward pass. A three-stage recipe is proposed to stabilize the training from learning how to critique to steering generation. As a result, we achieve state of the art on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source result on KRIS-Bench (81.92, surpassing GPT-4o). We further provide analyses showing that the critic genuinely shapes the model's attention and prediction updates at subsequent layers.

2605.12718 2026-05-14 cs.AI cs.LG cs.MA Updated version

CHAL: Council of Hierarchical Agentic Language

Tommaso Giovannelli, Griffin D. Kent

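AI summary This paper proposes CHAL, a multi-agent debate framework that uses defeasible argumentation to optimize beliefs, addressing structural limitations of current multi-agent debate. CHAL introduces a graph-structured belief representation with a gradient-informed dynamic update mechanism, and treats meta-cognitive value systems as configurable hyperparameters that govern agent reasoning and adjudication. Ablation experiments show that the framework generalizes across fields, and its auditable belief artifacts provide a basis for transparent AI systems.

Details
English abstract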

Multi-agent debate has emerged as a promising approach for improving LLM reasoning on ground-truth tasks, yet current methodologies face certain structural limitations: debate tends to induce a martingale over belief trajectories, majority voting accounts for most observed gains, and LLMs exhibit confidence escalation rather than calibration across rounds. We argue that the genuine value of debate, and dialectic systems as a whole, lies not in ground-truth tasks but in defeasible domains, where every position can in principle be defeated by better reasoning. We present the Council of Hierarchical Agentic Language (CHAL), a multi-agent dialectic framework that treats defeasible argumentation as an engine for belief optimization. Each agent maintains a CHAL Belief Schema (CBS), a graph-structured belief representation with a Bayesian-inspired architecture, that facilitates belief revision through a gradient-informed dynamic mechanism by leveraging the strength of the belief's thesis as a differentiable objective. Meta-cognitive value systems spanning epistemology, logic, and ethics are elevated to configurable hyperparameters governing agent reasoning and adjudication outcomes. We provide a series of ablation experiments that demonstrate systematic and interpretable effects: the adjudicator's value system determines the debate's overall trajectories in latent belief space, council diversity refines beliefs for all participants, and the framework generalizes across broad fields. CHAL is, to our knowledge, the first framework to treat multi-agent debate as structured belief optimization over defeasible domains. Further, the auditable belief artifacts it produces establish the foundation for dedicated evaluation suites for defeasible argumentation, with broader implications for building AI systems whose reasoning and value commitments are transparent, aligned, and subject to human oversight.

2605.12717 2026-05-14 cs.GT cs.AI Updated version

The End Justifies the Mean: A Linear Ranking Rule for Proportional Sequential Decisions

Carmel Baharav, Niclas Boehmer, Bailey Flanigan, Maximilian T. Wittmann

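Affiliations * MIT, USA; Hasso Plattner Institute, University of Potsdam, Germany

AI summary This paper studies how a group can collectively choose a linear ranking rule for repeated use so that each voter type agrees with the resulting rankings in proportion to its population share. It shows that a simple rule, the angular mean, satisfies long-run proportionality, whereas the default arithmetic mean is severely majoritarian; exact per-batch proportionality is impossible for fixed linear rules, but the gap shrinks quickly with batch size. Experiments show that the angular mean substantially improves proportionality when voters disagree strongly.

Details
English abstract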

AI alignment and participatory design motivate a new democratic design problem: how to collectively choose a decision rule to use repeatedly. We study this problem for linear ranking rules, which repeatedly rank items $x_j$ within batches $X=(x_1,\dots,x_m)\in(\mathbb{R}^d)^m$, where each item's ranking is dictated by its score $\langle \theta^*,x_j\rangle$ according to a fixed scoring vector $\theta^*$. Given voters' preferred scoring vectors $\theta^{(1)},\dots,\theta^{(n)}$ and their population fractions $\alpha^{(1)},\dots,\alpha^{(n)}$, we ask how to choose a collective vector $\theta^*$ satisfying individual proportionality (IP): every voter type $i$ should agree with the resulting rankings to an $\alpha^{(i)}$-proportional degree, either on average over time (long-run IP) or even within each batch (per-batch IP). The default rule, the arithmetic mean of the $\theta^{(i)}$, has been shown to be severely majoritarian; more generally, it is not clear that any fixed linear rule can balance many voters' disparate opinions. Our main result is that, surprisingly, there is a simple rule that does satisfy long-run IP: the angular mean, the spherical analog of the arithmetic mean. We then show that exact per-batch IP is impossible for fixed linear rules, but that the gap between per-batch and long-run IP shrinks quickly with batch size. Experiments on three real-world preference datasets show that all rules perform similarly when voters' preferences are homogeneous, while the angular mean substantially improves proportionality in high-disagreement regimes.
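
To make the setup concrete, the Python sketch below ranks one batch by the score of each item under a scoring vector and builds the collective vector with the default arithmetic mean. It does not implement the angular mean, whose exact definition the abstract does not give; the voters, population fractions, and batch are made-up values, and the helper names are illustrative.

```python
def score(theta, x):
    """Inner product of a scoring vector and an item."""
    return sum(t * xi for t, xi in zip(theta, x))


def rank_batch(theta, batch):
    """Indices of the batch items, sorted by descending score."""
    return sorted(range(len(batch)),
                  key=lambda j: score(theta, batch[j]),
                  reverse=True)


def arithmetic_mean(thetas, alphas):
    """Population-weighted average of the voters' scoring vectors."""
    d = len(thetas[0])
    return [sum(a * th[k] for th, a in zip(thetas, alphas)) for k in range(d)]


# Two voter types with orthogonal preferences; the first is an 80% majority.
thetas = [(1.0, 0.0), (0.0, 1.0)]
alphas = [0.8, 0.2]
batch = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.6)]

theta_star = arithmetic_mean(thetas, alphas)
collective = rank_batch(theta_star, batch)
majority = rank_batch(thetas[0], batch)
minority = rank_batch(thetas[1], batch)
print(collective, majority, minority)
```

In this toy batch the collective ranking coincides with the majority's ranking and is the exact reverse of the minority's, a small instance of the majoritarian behavior the abstract attributes to the arithmetic mean.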

2605.12704 2026-05-14 cs.SC cs.AI cs.LG Updated version

FePySR: A Neural Feature Extraction Framework for Efficient and Scalable Symbolic Regression

Zhiming Yu, Wangtao Lu, Xin Lai

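Affiliations * School of Mathematical Sciences, Zhejiang University, Hangzhou, Zhejiang, China; Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland

AI summary A fundamental challenge in symbolic regression (SR) is efficiently recovering complex mathematical expressions from observational data. This paper proposes FePySR, a two-stage neural feature extraction framework that narrows the search space by extracting valid features before equation search, improving the efficiency and scalability of SR. It first uses a heterogeneous neural network to constrain the observational data to a set of candidate expressions, then runs structural optimization with PySR within that reduced expression space. Experiments show that FePySR outperforms existing methods on several benchmarks, particularly in recovery rate and computation time on complex equations.

Comments Data and Code Availability: https://github.com/laixn/FePySR

Details
English abstract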

A fundamental challenge in symbolic regression (SR) is efficiently recovering complex mathematical expressions from observational data. Although this problem is NP-hard, many expressions of practical interest decompose naturally into combinations of nonlinear feature modules, concentrating structural complexity into a small number of reusable components. Here, we introduce FePySR, a two-stage framework that reduces the SR search space by extracting valid features prior to equation search. FePySR first employs a heterogeneous neural network to constrain observational data to a set of candidate expressions, then performs structural optimization within this refined expression space using PySR. Across five standard benchmarks, FePySR outperforms state-of-the-art methods by achieving higher equation recovery rates. On a set of 75 highly complex synthesized equations, FePySR recovers 36 equations and produces substantially smaller mean squared errors on the remaining unrecovered cases, with reduced computation time compared to PySR. FePySR's first stage also maintains consistent performance under varying numbers of selected top features and increasing levels of noise in the observational data. Applied to ordinary differential equations governing biological systems, FePySR successfully identifies governing equations in 24 out of 100 tests where PySR recovers none. Taken together, FePySR is a generalizable framework that can enhance SR solvers, enabling the efficient and reliable recovery of symbolic expressions across scientific domains.

2605.12703 2026-05-14 cs.CV cs.AI Updated version

MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence

Yifan Chen, Fei Yin, Qingyan Bai, Zicheng Lin, Yujiu Yang

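Affiliations * University of Cambridge; HKUST (Hong Kong University of Science and Technology); Tsinghua University

AI summary This paper introduces MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. It contains 102 tasks in three categories (rule system application, procedural task execution, and empirical discovery and induction). Under strict rubric-based scoring, current frontier multimodal models remain far from robust multimodal context learning, which the benchmark identifies as an important capability bottleneck.

Details
English abstract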

We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manuals, videos, and frame sequences before they can reason over the learned context. MMCL-Bench contains 102 tasks spanning three categories: rule system application, procedural task execution, and empirical discovery and induction. We evaluate frontier multimodal models with strict rubric-based scoring and find that current systems remain far from robust multimodal context learning, with even the strongest model solving fewer than one-third of tasks under strict evaluation. Diagnostic ablations and error analysis show that failures arise throughout the context-to-answer pipeline, including context anchoring, visual evidence extraction, context reasoning, and response construction. MMCL-Bench thus highlights multimodal context learning as an important unsolved capability bottleneck for current multimodal models.

2605.12702 2026-05-14 cs.AI cs.HC Updated version

DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models

Eugenia Kim, Ioana Tanase, Christina Mallon

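Affiliations * Microsoft

AI summary This paper presents DisaBench, a participatory framework for evaluating disability-related harms in language models. It combines a taxonomy of twelve disability harm categories co-created with people with disabilities and red teaming experts, benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with 525 annotated prompt-response pairs. The study finds that harm rates vary sharply by disability type and will compound in non-text modalities, that terminology-driven harm is culturally and temporally bound, and that standard safety evaluation misses subtle harms. Disability harm is personal, intersectional, and community-defined, and general-purpose safety benchmarks systematically miss it.

Details
English abstract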

General-purpose safety benchmarks for large language models do not adequately evaluate disability-related harms. We introduce DisaBench: a taxonomy of twelve disability harm categories co-created with people with disabilities and red teaming experts, a taxonomy-driven evaluation methodology that pairs benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs. Annotation by four evaluators with lived disability experience reveals three findings: harm rates vary sharply by disability type and will compound in non-text modalities, terminology-driven harm is culturally and temporally bound rather than universally assessable, and standard safety evaluation catches overt failures while missing the subtle harms that only domain expertise can recognize. Disability harm is simultaneously personal, intersectional, and community-defined: it cannot be isolated from the full context of who a person is, and general-purpose benchmarks systematically miss it. We will release the dataset, taxonomy, and methodology via Hugging Face and an open-source red teaming framework for direct integration into existing safety pipelines with no additional infrastructure.

2605.12701 2026-05-14 cs.LG cs.AI cs.CE cs.CY Updated version

Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions

Gideon Popoola, John Sheppard

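AI summary In socially sensitive domains such as credit decisions, fair machine learning models can equalize predictive outcomes yet still reason differently about different groups, a "hidden procedural bias". This paper proposes Counterfactual Explanation Consistency (CEC), a framework that detects and mitigates this bias by aligning feature attributions between individuals and their counterfactual counterparts, and introduces a new procedural fairness metric and training loss. Experiments show that CEC substantially reduces hidden bias at a modest utility cost.

Details
English abstract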

Machine learning algorithms in socially sensitive domains (e.g., credit decisions) often focus on equalizing predictive outcomes. However, satisfying these metrics does not guarantee that models use the same reasoning for different groups. We show that existing outcome-fair models can still apply fundamentally different reasoning to individuals, a "hidden procedural bias" missed by standard fairness metrics and algorithms. We propose Counterfactual Explanation Consistency (CEC), a framework that detects and mitigates this bias by aligning feature attributions between individuals and their counterfactual counterparts. Key contributions include a nearest-neighbor counterfactual generation method, a modified baseline for integrated gradient comparisons, an individual-level procedural fairness metric, and a corresponding training loss. We introduce a taxonomy identifying "Regime B" (same outcome, different reasoning) as a critical blind spot. Experiments on synthetic data, German Credit, Adult Income, and HMDA mortgage data demonstrate that outcome-fair baselines exhibit substantial hidden bias, while CEC substantially reduces it with modest utility cost.

2605.12699 2026-05-14 cs.LG cs.AI Updated version

Modeling Heterophily in Multiplex Graphs: An Adaptive Approach for Node Classification

Kamel Abdous, Nairouz Mrabah, Mohamed Bouguessa

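Affiliations * Department of Computer Science, University of Quebec at Montreal

AI summary This paper studies how to model heterophily in multiplex graphs, where connected nodes may belong to different classes and have dissimilar attributes. Existing methods mostly assume homophily and struggle when homophilic and heterophilic interactions coexist across dimensions. The authors propose a new method that uses dimension-specific compatibility matrices and trainable low-pass and high-pass filters to adapt to the heterophilic characteristics of each dimension for node classification. Experiments on synthetic and real-world datasets indicate improved classification performance over existing methods.

Comments 38 pages, 7 figures, 4 tables, 1 algorithm. Published in Expert Systems with Applications

Journal ref Expert Systems with Applications, Volume 323, 2026, Article 132374

Details
English abstract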

Existing multiplex graph models often assume homophily, where connected nodes tend to belong to the same class or share similar attributes. Consequently, these models may struggle with graphs exhibiting heterophily, where connected nodes typically belong to different classes and have dissimilar attributes. While recent methods have been developed to learn reliable node representations from unidimensional graphs with heterophily, they do not fully address the complexities of multiplex graphs. In a multiplex graph, nodes are linked through multiple types of edges (referred to as dimensions), which can simultaneously exhibit homophilic and heterophilic interactions. To address this gap, we propose a novel method for node classification in multiplex graphs that adapts to both homophilic and heterophilic dimensions. The method introduces dimension-specific compatibility matrices to model varying degrees of homophily and heterophily across dimensions. A key innovation is its use of a product of trainable low-pass and high-pass filters, approximated via Chebyshev polynomials, to capture both smooth and abrupt changes in the graph signal. By composing these filters and optimizing label predictions using a proximal-gradient method, the method dynamically adjusts to the heterophilic characteristics of each dimension. Extensive experiments on synthetic and real-world datasets provide evidence that the method captures the complex interplay of homophilic and heterophilic interactions in multiplex graphs, and tends to yield improved node classification performance compared to state-of-the-art methods.

2605.12694 2026-05-14 cs.SE cs.AI cs.PL Updated version

Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis

Jacqueline L. Mitchell, Chao Wang

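Affiliations * University of Southern California, Los Angeles, California, USA

AI summary This paper proposes agentic interpretation, a framework that brings lattice-based static analysis to LLM-driven program analysis. It decomposes a high-level analysis goal into localized claims and tracks the LLM's judgment about each claim in a finite-height lattice, making the analysis more transparent and systematic. A worklist algorithm governs how the analysis progresses, and a worked example illustrates the approach on code that depends on opaque third-party components.

Comments 27 pages, 6 figures

Details
English abstract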

Large language models can consult information that fixed static analyzers cannot, such as documentation, current security advisories, version-specific metadata, and informal API contracts. This makes LLMs a compelling option for program analyses that depend on information beyond the source program, or that are otherwise not amenable to conventional static analyzers. However, directly asking an LLM for a one-shot whole-program analysis is brittle because it compresses many evidence-dependent judgments into a single opaque answer, rather than exposing which conclusions are supported or disputed and using intermediate findings to guide later, more focused searches. In this paper, we propose agentic interpretation, a framework that brings the discipline of lattice-based static analysis to LLM-driven program reasoning. At a high level, agentic interpretation decomposes a high-level analysis goal into localized claims, and tracks the LLM's judgment about each claim in a finite-height lattice. A worklist algorithm governs how claims and their judgments evolve during the analysis. We introduce a formal model of agentic interpretation, explore the design space it opens, and illustrate the approach with a worked example analyzing code that depends on opaque third-party components.
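
The general shape of a worklist loop over a finite-height lattice can be sketched in Python as below. This is not the paper's formal model: the four-point lattice, the claim names, and `stub_evaluate` (a stand-in for an LLM call) are all hypothetical, chosen only to show why such a loop terminates.

```python
from collections import deque

# Hypothetical lattice: UNKNOWN < {SUPPORTED, DISPUTED} < CONFLICTED.
UNKNOWN, SUPPORTED, DISPUTED, CONFLICTED = (
    "unknown", "supported", "disputed", "conflicted")


def join(a, b):
    """Least upper bound of two judgments."""
    if a == b or b == UNKNOWN:
        return a
    if a == UNKNOWN:
        return b
    return CONFLICTED  # two distinct non-bottom judgments


def run_worklist(claims, dependents, evaluate):
    """Re-evaluate claims until no judgment changes."""
    state = {c: UNKNOWN for c in claims}
    work = deque(claims)
    while work:
        claim = work.popleft()
        new = join(state[claim], evaluate(claim, state))
        if new != state[claim]:
            state[claim] = new
            # A changed judgment may affect claims that depend on it.
            work.extend(dependents.get(claim, []))
    return state


def stub_evaluate(claim, state):
    """Stand-in for an LLM call; a real system would query a model here."""
    if claim == "lib.parse validates input":
        return SUPPORTED
    if claim == "handler is injection-safe":
        # Mirrors the judgment already reached for the library claim.
        return state["lib.parse validates input"]
    return UNKNOWN


claims = ["handler is injection-safe", "lib.parse validates input"]
dependents = {"lib.parse validates input": ["handler is injection-safe"]}
result = run_worklist(claims, dependents, stub_evaluate)
print(result)
```

Judgments only move upward through `join`, and the lattice has finite height, so each claim can change a bounded number of times and the worklist eventually empties.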

2605.12691 2026-05-14 cs.AI Updated version

On the Size Complexity and Decidability of First-Order Progression

Jens Classen, Daxin Liu

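Affiliations * Department of People and Technology, Roskilde University, Denmark; State Key Laboratory for Novel Software Technology, Nanjing University, China

AI summary This paper studies the size complexity and decidability of progression in first-order logic. Within the Situation Calculus, the authors analyze the size of progression for local-effect, normal, and acyclic actions and show that, under reasonable assumptions, it grows only polynomially. Moreover, when the knowledge base belongs to a decidable fragment such as two-variable first-order logic or universal theories with constants, the progression stays within the same fragment, which ensures decidability and practical applicability.

Comments This is an extended version of an identically-titled paper accepted for publication at IJCAI 2026. This version contains an appendix with further proofs

Details
English abstract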

Progression, the task of updating a knowledge base to reflect action effects, generally requires second-order logic. Identifying first-order special cases, by restricting either the knowledge base or action effects, has long been a central topic in reasoning about actions. It is known that local-effect, normal, and acyclic actions, three increasingly expressive classes, admit first-order progression. However, a systematic analysis of the size of such progressions, crucial for practical applications, has been missing. In this paper, using the framework of Situation Calculus, we show that under reasonable assumptions, first-order progression for these action classes grows only polynomially. Moreover, we show that when the KB belongs to decidable fragments such as two-variable first-order logic or universal theories with constants, the progression remains within the same fragment, ensuring decidability and practical applicability.

2605.12685 2026-05-14 cs.LG cs.AI Updated version

A Unified Perspective for Learning Graph Representations Across Multi-Level Abstractions

Mohamed Mahmoud Amar, Nairouz Mrabah, Mohamed Bouguessa, Abdoulaye Baniré Diallo

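Affiliations * Department of Computer Science, University of Quebec at Montreal

AI summary This paper proposes a unified contrastive learning framework for learning representations of graph-structured data at the node, proximity, cluster, and graph levels of abstraction. To address the fact that most existing methods target a single level, it integrates multi-level information through a linear combination of similarity and dissimilarity scores, and introduces a parameter-free, fine-grained self-weighting mechanism that increases optimization flexibility. Experiments show that the method outperforms state-of-the-art approaches on several downstream tasks in both single-level and multi-level scenarios.

Comments Accepted for publication in IEEE Transactions on Knowledge and Data Engineering (TKDE). 18 pages, 8 figures

Details
English abstract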

Graph Self-Supervised Learning (GSSL) has emerged as a powerful paradigm for generating high-quality representations for graph-structured data. While multi-scale graph contrastive learning has received increasing attention, many existing methods still predominantly focus on a single graph abstraction level. To address this limitation, we propose a unified contrastive framework that can target node-level, proximity-level, cluster-level, and graph-level information and integrate them through a linear combination of similarity scores on positive pairs and dissimilarity scores (i.e., similarity scores on negative pairs). Furthermore, current approaches typically assign uniform penalty strengths to all examples, which reduces optimization flexibility and leads to ambiguous convergence status. To overcome this, we introduce a novel parameter-free fine-grained self-weighting mechanism that adaptively assigns weights to individual similarity and dissimilarity scores. The proposed mechanism emphasizes the scores that deviate significantly from their target values. Our approach not only enhances optimization flexibility but also eliminates the computational overhead of hyperparameter tuning in conventional multi-task GSSL methods. Comprehensive experiments on real-world datasets show that our methods consistently outperform state-of-the-art approaches across downstream tasks, including classification, clustering, and link prediction, in both single-level and multi-level scenarios.

2605.12684 2026-05-14 cs.CV cs.AI cs.HC Updated version

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Yichen Feng, Yuetai Li, Chunjiang Liu, Yuanyuan Chen, Fengqing Jiang, Yue Huang, Hang Hua, Zhengqing Yuan, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Xiangliang Zhang, Misha Sra, Zichen Chen, Radha Poovendran, Zhangchen Xu

Affiliations * Bake AI; University of Washington; University of California, Santa Barbara; Stanford University; University of Notre Dame; Carnegie Mellon University; MIT-IBM Watson AI Lab; Western Washington University; King Abdulaziz City for Science and Technology

AI summary This study examines how well frontier multimodal large language models judge visual aesthetics and finds a substantial gap. It introduces the Visual Aesthetic Benchmark (VAB), which evaluates models on expert-labeled comparative selection tasks; even the strongest model identifies the best and worst images far less reliably than human experts. Fine-tuning on a small set of expert examples markedly improves performance, which suggests that the comparative signal in VAB is transferable.

Comments Project page: https://vab.bakelab.ai. Code: https://github.com/BakeLab/Visual-Aesthetic-Benchmark. Dataset: https://huggingface.co/datasets/BakeLab/Visual-Aesthetic-Benchmark

Details
English abstract

Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
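
The headline number counts a task as solved only when both the best and the worst image are identified correctly under every permutation of the candidate order. A Python sketch of that strict criterion follows; the record layout and image IDs are hypothetical, and picks are assumed to have been mapped back to image IDs so that they do not depend on presentation order.

```python
def strict_accuracy(tasks):
    """Share of tasks where best and worst picks are right in every permutation."""
    solved = 0
    for task in tasks:
        target = (task["best"], task["worst"])
        if all(pick == target for pick in task["picks"]):
            solved += 1
    return solved / len(tasks)


# Each task holds three (best, worst) picks, one per shuffled candidate order.
tasks = [
    {"best": "a", "worst": "c", "picks": [("a", "c"), ("a", "c"), ("a", "c")]},
    # One wrong worst-image pick in the second permutation fails the task.
    {"best": "b", "worst": "d", "picks": [("b", "d"), ("b", "a"), ("b", "d")]},
]
acc = strict_accuracy(tasks)
print(acc)
```

Requiring agreement across permutations penalizes systems whose choice depends on where an image appears in the candidate list.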

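2605.12683 2026-05-14 cs.LG cs.AI cs.DC physics.comp-ph Updated version

Parallel-in-Time Training of Recurrent Neural Networks for Dynamical Systems Reconstruction

Florian Hess, Florian Götz, Daniel Durstewitz

Institutions * Dept. of Theoretical Neuroscience, Central Institute of Mental Health, Mannheim, Germany; Faculty of Physics and Astronomy, Heidelberg University, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Germany; Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Germany

AI summary This paper studies how parallel-in-time methods can make training recurrent neural networks for dynamical systems reconstruction more efficient. The authors examine two classes of algorithms built on parallel associative scans, one for models with linear non-autonomous dynamics and one for general nonlinear models, and find that the former is limited during training and struggles to learn nonlinear dynamics accurately. To address this, they bring Generalized Teacher Forcing (GTF) into the DEER framework, which improves learning on long sequences; experiments show that long trajectories significantly improve reconstruction of dynamical systems with long time scales.

Comments 29 pages, 6 figures, preprint

Details
English abstract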

Reconstructing nonlinear dynamical systems (DS) from data (DSR) is a fundamental challenge in science and engineering, but it inherently relies on sequential models. Recent breakthroughs for sequential models have produced algorithms that parallelize computation along sequence length $T$, achieving logarithmic time complexity, $\mathcal{O}(\log T)$. Since sequence lengths have been practically limited due to the linear runtime complexity $\mathcal{O}(T)$ of classical backpropagation through time, this opens new avenues for DSR. This paper studies two prominent classes of parallel-in-time algorithms for this task, both of which leverage parallel associative scans as their core computational primitive. The first class comprises models with linear yet non-autonomous dynamics and a nonlinear readout, such as modern State Space Models (SSMs), while the second consists of general nonlinear models which can be parallelized using the DEER framework. We find that the linear training-time recurrence of the first class of models imposes limitations that often hinder learning of accurate nonlinear dynamics. To address this, we augment DEER with Generalized Teacher Forcing (GTF), a novel variant within the more general nonlinear framework that ensures stable and effective learning of nonlinear dynamics across arbitrary sequence lengths. Using GTF-DEER, we investigate the benefits of training on extremely long sequences ($T>10^4$) for DSR. Our results show that access to such long trajectories significantly improves DSR if the data features long time scales. This work establishes GTF-DEER as a robust tool for data-driven discovery and underscores the largely untapped potential of long-sequence learning in modeling complex DS.
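
As a hedged illustration of the parallel associative scan that the abstract names as the core primitive, the sketch below evaluates a scalar linear non-autonomous recurrence $h_t = a_t h_{t-1} + b_t$ in $\lceil \log_2 T \rceil$ rounds and checks it against the sequential loop. The scalar setting and all function names are assumptions made for illustration; this is not the authors' GTF-DEER implementation, and the rounds run serially here even though the updates inside each round are independent.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def sequential_recurrence(a, b, h0):
    """Reference O(T) loop: h_t = a_t * h_{t-1} + b_t."""
    states, h = [], h0
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        states.append(h)
    return states


def scan_recurrence(a, b, h0):
    """Inclusive scan with the associative `combine` operator.

    After the loop, elems[t] holds the affine map that sends h0 to h_t.
    All updates inside one round are independent, so on parallel hardware
    each round costs O(1) depth and the whole scan O(log T) depth.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            elems[i] if i < step else combine(elems[i - step], elems[i])
            for i in range(len(elems))
        ]
        step *= 2
    return [a_t * h0 + b_t for a_t, b_t in elems]


if __name__ == "__main__":
    a = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 1.0]
    b = [1.0, 0.0, -2.0, 3.0, 0.5, 1.0, -1.0]
    print(scan_recurrence(a, b, 0.3))
    print(sequential_recurrence(a, b, 0.3))
```

The key design point is that composing affine maps is associative, which is what allows the recurrence to be regrouped into a tree of independent combinations.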

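2605.12682 2026-05-14 cs.AI Updated version

Learning Transferable Latent User Preferences for Human-Aligned Decision Making

Alina Hyk, Sandhya Saisubramanian

Institutions * Oregon State University

AI summary This study addresses a challenge that large language models face in producing human-aligned decisions: learning transferable latent user preferences from limited interactions. The authors propose CLIPR, a framework that learns actionable natural-language rules representing a user's latent preferences from a small amount of conversational input and refines these rules through adaptive feedback. Experiments show that CLIPR improves decision alignment and reduces inference cost across multiple tasks and environments.

Details
English abstract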

Large language models (LLMs) are increasingly used as reasoning modules in many applications. While they are efficient in certain tasks, LLMs often struggle to produce human-aligned solutions. Human-aligned decision making requires accounting for both explicitly stated goals and latent user preferences that shape how ambiguous situations should be resolved. Existing approaches to incorporating such preferences either rely on extensive and repeated user interactions or fail to generalize latent preferences across tasks and contexts, limiting their practical applicability. We consider a setting in which an LLM is used for high-level reasoning and is responsible for inferring latent user preferences from limited interactions, which guides downstream decision making. We introduce CLIPR (Conversational Learning for Inferring Preferences and Reasoning), a framework that learns actionable, transferable natural language rules that represent latent user preferences from minimal conversational input. These rules are iteratively refined through adaptive feedback and applied to both in-distribution and out-of-distribution ambiguous tasks across multiple environments. Evaluations on three datasets and a user study show that CLIPR consistently outperforms existing methods in improving alignment and reducing inference costs.

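2605.12674 2026-05-14 cs.AI cs.LG cs.RO Updated version

Revealing Interpretable Failure Modes of VLMs

Isha Chaudhary, Vedaant V Jain, Kavya Sachdeva, Sayan Ranu, Gagandeep Singh

Institutions * UIUC; Kumo AI; IIT Delhi

AI summary This paper proposes REVELIO, a framework for systematically uncovering interpretable failure modes in vision-language models (VLMs). It combines a diversity-aware beam search with a Gaussian-process Thompson Sampling strategy to explore efficiently the combinatorial space of conditions under which a VLM fails. Experiments on autonomous driving and indoor robotics tasks uncover latent vulnerabilities in existing VLMs and offer structured, interpretable directions for improving model safety.

Details
English abstract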

Vision-Language Models (VLMs) are increasingly used in safety-critical applications because of their broad reasoning capabilities and ability to generalize with minimal task-specific engineering. Despite these advantages, they can exhibit catastrophic failures in specific real-world situations, constituting failure modes. We introduce REVELIO, a framework for systematically uncovering interpretable failure modes in VLMs. We define a failure mode as a composition of interpretable, domain-relevant concepts (such as pedestrian proximity or adverse weather conditions) under which a target VLM consistently behaves incorrectly. Identifying such failures requires searching over an exponentially large discrete combinatorial space. To address this challenge, REVELIO combines two search procedures: a diversity-aware beam search that efficiently maps the failure landscape, and a Gaussian-process Thompson Sampling strategy that enables broader exploration of complex failure modes. We apply REVELIO to autonomous driving and indoor robotics domains, uncovering previously unreported vulnerabilities in state-of-the-art VLMs. In driving environments, the models often demonstrate weak spatial grounding and fail to account for major obstructions, leading to recommendations that would result in simulated crashes. In indoor robotics tasks, VLMs either miss safety hazards or behave excessively conservatively, producing false alarms and reducing operational efficiency. By identifying structured and interpretable failure modes, REVELIO offers actionable insights that can support targeted VLM safety improvements.

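2605.12673 2026-05-14 cs.AI cs.CR Updated version

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song

Institutions * UC Berkeley

AI summary This paper studies reward hacking in AI agent benchmarks, where an agent maximizes its score by unintended means instead of completing the task. The authors propose BenchJack, a system that uses automated red-teaming to audit benchmarks systematically and identify potential reward-hacking flaws. They also build an iterative generative-adversarial pipeline that keeps discovering and patching new flaws, substantially improving benchmark robustness. Experiments show that BenchJack finds a large number of flaws in several mainstream benchmarks and sharply reduces the share of hackable tasks.

Details
English abstract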

Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.

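2605.12653 2026-05-14 cs.LG cs.AI stat.ML Updated version

Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

Eun Go, Rohan Deb, Arindam Banerjee

Institutions * Siebel School of Computing and Data Science; University of Illinois Urbana-Champaign

AI summary This paper proposes FPILOT, an inference-time optimization framework for improving reinforcement learning in portfolio management. Inspired by Model Predictive Control, the method uses price forecasts to optimize the trading policy dynamically at inference time instead of relying on the fixed policy obtained from training. Without retraining the policy, FPILOT uses a price forecasting model to generate a multi-step price trajectory and optimizes the asset allocation at each step accordingly, significantly improving trading performance on several risk-adjusted metrics.

Details
English abstract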

Reinforcement learning agents for portfolio management are typically trained and deployed as static policies, with no mechanism for using price forecasts at inference time. We propose $\text{FPILOT}$ (**Fin**ancial **P**lugin **I**nference-time **L**earning for **O**ptimal **T**rading), a plugin inference-time optimization framework inspired by Model Predictive Control (MPC). Our key structural insight is that future prices mostly do not depend on one agent's portfolio allocation, so a suitable predictive model can produce a multi-step price trajectory without iterative action-conditioned rollouts as in typical reinforcement learning. At each decision step, we use the forecaster's predicted price trajectory to construct an allocation-based imagined return objective, and optimize the policy at inference time before executing one step of the trade. Our framework is compatible with any pre-trained agent and adapts the policy to the forecaster's predictions without any retraining. Evaluated across five policy learning algorithms on the TradeMaster DJ30 benchmark, $\text{FPILOT}$ produces consistent improvements in total return and return-based risk-adjusted metrics (Sharpe, Sortino, Calmar), with stochastic policies benefiting more than deterministic ones. Further, using synthetic forecasts at calibrated quality levels, we show that gains consistently improve with forecaster quality, suggesting that performance will improve further with advances in financial forecasting.

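2605.12645 2026-05-14 cs.CL cs.AI Updated version

Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering

Maryam Amirizaniani, Benjamin Charles Germain Lee, Jevin West, Nicholas Weber

Institutions * University of Washington

AI summary This study aims to improve the personalized question answering ability of language models in single-turn settings by understanding the user's implicit intent and generating answers that better match the user's underlying goal. The authors propose IAP, a reinforcement learning framework that infers user intent directly from a single-turn question and incorporates it into the reasoning process through a tag-based schema, producing more targeted answers. Experiments show that IAP significantly outperforms existing methods across multiple models, supporting the value of modeling implicit user intent during training.

Details
English abstract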

Effective personalized question answering (PQA) in language models requires grounding responses in the user's underlying intent, where intent refers to the implicit "why" behind a query beyond its explicit wording. However, existing approaches to intent-aware personalization rely on multi-turn conversational context or rich user profiles, and do not explicitly model user intent during the reasoning process. This limits their effectiveness in single-turn settings, where the user's latent goal must be inferred from minimal input and integrated into the thinking and reasoning process. To bridge this gap, we propose IAP (Intent-Aware Personalization), a reinforcement learning framework that trains models to infer implicit user intent directly from a single-turn question and incorporate it into thinking steps through a tag-based schema for generating personalized, intent-grounded answers. By optimizing intent-aware answer trajectories under a personalized reward function, IAP reinforces generation paths that make implicit user intent explicit and produce responses that better align with the user's underlying goal. Through experiments on the LaMP-QA benchmark across six models, IAP consistently outperforms all baselines, achieving an average macro-score gain of around 7.5% over the strongest competitor, demonstrating that modeling implicit user intent within the training objective is a promising direction for PQA.

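2605.11989 2026-05-14 cs.CV cs.AI Updated version

A Transfer Learning Evaluation of Deep Neural Networks for Image Classification

Nermeen Abou Baker, Nico Zengeler, Uwe Handmann

Institutions * Computer Science Institute, Ruhr West University of Applied Sciences, 46236 Bottrop

AI summary This paper investigates how to select the pre-trained model that best meets the target domain requirements for image classification, and examines how well transfer learning works in deep neural networks. The authors adapted the output layers and network parameters of eleven models pre-trained on ImageNet and applied them to five different target datasets. By measuring accuracy, accuracy density, training time, and model size, they compare the models when trained for one episode and for ten episodes, providing a reference for model selection in transfer learning.

Comments Published by Machine Learning and Knowledge Extraction Journal

Journal ref Machine Learning and Knowledge Extraction 4, no. 1: 22-41 (2022)

Details
English abstract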

Transfer learning is a machine learning technique that uses previously acquired knowledge from a source domain to enhance learning in a target domain by reusing learned weights. This technique is ubiquitous because of its great advantages in achieving high performance while saving training time, memory, and effort in network design. In this paper, we investigate how to select the best pre-trained model that meets the target domain requirements for image classification tasks. In our study, we refined the output layers and general network parameters to apply the knowledge of eleven image processing models, pre-trained on ImageNet, to five different target domain datasets. We measured the accuracy, accuracy density, training time, and model size to evaluate the pre-trained models in training sessions of one episode and of ten episodes.

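2605.11679 2026-05-14 cs.AI Updated version

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

ShiYing Huang, Liang Lin, Yuer Li, Kaiwen Luo, Zhenhong Zhou, An Zhang, Junhao Dong, Kun Wang, Zhigang Zeng

Institutions * Huazhong University of Science and Technology; Nanyang Technological University; Tsinghua University; Chongqing University

AI summary In multi-objective alignment of large language models, balancing different human preferences often appears as a zero-sum conflict. This paper offers a new perspective: the conflict among objectives stems from the prompt itself restricting the achievable multi-dimensional rewards. On this basis it proposes MORA (Multi-Objective Reward Assimilation), which expands the reward dimensions to improve the model on helpfulness, harmlessness, and other dimensions. Experiments show that MORA yields significant gains in both sequential and simultaneous alignment.

Details
English abstract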

In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions; and (2) in simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our code is available at https://github.com/Shiying-Huang/MORA-MPA.

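2605.11505 2026-05-14 cs.AI Updated version

Selective Off-Policy Reference Tuning with Plan Guidance

Duc Anh Le, Tien-Phat Nguyen, Thien Huu Nguyen, Linh Ngo Van, Trung Le

Institutions * Hanoi University of Science and Technology; University of Oregon; Monash University

AI summary This paper studies reinforcement learning with verifiable rewards for reasoning and proposes SORT to address the poor performance of GRPO-style methods on hard prompts. Without changing how the policy generates rollouts, the method introduces plan guidance: it derives a plan from the reference solution and uses it to reweight the policy update, improving the model's ability to learn from structural information. Experiments show that SORT outperforms existing methods on multiple reasoning benchmarks, with the largest gains on weaker models.

Details
English abstract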

Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with largest gains on weaker models.

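2605.11347 2026-05-14 cs.LG cs.AI cs.CV Updated version

Gradient-Free Noise Optimization for Reward Alignment in Generative Models

Jeongsol Kim, Hongeun Kim, Jian Wang, Jong Chul Ye

Institutions * KAIST AI; Snap Inc.

AI summary This paper proposes ZeNO, a gradient-free noise optimization method for reward alignment in generative models. It formulates noise optimization as a path-integral control problem that relies only on zeroth-order reward evaluations, avoiding the dependence on backpropagation of earlier methods. ZeNO performs well across a range of generators and reward functions and is especially suited to settings where backpropagation is infeasible, such as protein structure generation.

Details
English abstract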

Existing reward alignment methods for diffusion and flow models rely on multi-step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise-space optimization, but existing approaches require backpropagation through the generator and reward pipeline, limiting applicability to differentiable settings. To address this, here we present ZeNO (Zeroth-order Noise Optimization), a gradient-free framework that formulates noise optimization as a path-integral control problem, estimable from zeroth-order reward evaluations alone. When instantiated with an Ornstein-Uhlenbeck reference process, the update connects to Langevin dynamics implicitly targeting a reward-tilted distribution. ZeNO enables effective inference-time scaling and demonstrates strong performance across diverse generators and reward functions, including a protein structure generation task where backpropagation is infeasible.
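
To make the idea of gradient-free noise optimization concrete, here is a minimal sketch of a generic zeroth-order update: perturb the current noise, score each candidate with a black-box reward, and move to the softmax-weighted average of the candidates. This is a standard path-integral-style estimator chosen for illustration, not ZeNO's actual update rule; the toy `generator`, the `reward`, and every hyperparameter are hypothetical.

```python
import math
import random


def generator(z):
    """Hypothetical black-box generator: a fixed map from noise to a sample."""
    return [z[0] + 0.5 * z[1], z[1] - 0.5 * z[0]]


def reward(x):
    """Hypothetical black-box reward: negative squared distance to a target."""
    target = [2.0, -1.0]
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def zeroth_order_step(z, rng, num_samples=64, sigma=0.5, temperature=0.1):
    """One update that uses reward evaluations only (no gradients)."""
    candidates = [
        [zi + sigma * rng.gauss(0.0, 1.0) for zi in z] for _ in range(num_samples)
    ]
    rewards = [reward(generator(c)) for c in candidates]
    r_max = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - r_max) / temperature) for r in rewards]
    total = sum(weights)
    return [
        sum(w * c[d] for w, c in zip(weights, candidates)) / total
        for d in range(len(z))
    ]


def optimize_noise(z0, steps=50, seed=0):
    """Iterate the zeroth-order update from an initial noise vector."""
    rng = random.Random(seed)
    z = list(z0)
    for _ in range(steps):
        z = zeroth_order_step(z, rng)
    return z


if __name__ == "__main__":
    z_init = [0.0, 0.0]
    z_opt = optimize_noise(z_init)
    print("initial reward:", reward(generator(z_init)))
    print("optimized reward:", reward(generator(z_opt)))
```

Because only `reward(generator(c))` is ever called, the same loop applies unchanged when the generator or the reward is not differentiable.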

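2605.11033 2026-05-14 physics.plasm-ph cs.AI Updated version

TokaMind for Power Grid: Cross-Domain Transfer from Fusion Plasma

JC Wu, Norton Lee, Kai Siang Chen

Institutions * TaiScience Research Group; Fu Jen Catholic University; Center for Geometry and Physics; Institute for Basic Science (IBS)

AI summary This paper studies TokaMind, a multi-modal transformer foundation model originally pre-trained on fusion plasma diagnostics data, and tests whether its representations transfer across domains. Through experiments on industrial bearing degradation, turbofan engine degradation, and power grid PMU datasets, the study identifies the characteristics that explain TokaMind's strong performance on power grid data, and it reports a high F1 score for severe event classification. It also finds that the difficulty of power grid event classification is determined mainly by grid topology, not model capacity, and proposes an improved evaluation method based on Critical Slowing Down indicators.

Comments 8 pages, 5 figures

Details
English abstract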

TokaMind is a multi-modal transformer (MMT) foundation model pre-trained on tokamak plasma diagnostics data from MAST, where it was shown to outperform CNN-based approaches on fusion benchmarks. We investigate whether its learned representations generalize to physically distinct but structurally analogous domains. Through systematic experimentation across four domains (industrial bearing degradation, NASA CMAPSS turbofan degradation, and two independent power grid PMU datasets), we identify four transfer-favoring characteristics that help explain where TokaMind's pretrained representations are most effective. Power grid synchrophasor data matches this target-domain profile most directly, while industrial degradation datasets demonstrate that TokaMind can still yield useful performance under partial alignment, especially when task design and feature construction expose physically meaningful degradation structure. On the GESL/PNNL 500-event benchmark with provider-aware evaluation, TokaMind achieves test $\text{F1} = 0.837 \pm 0.040$ (3 seeds) for severe event classification. Our central finding, however, is not the aggregate score: classification difficulty is structurally determined by provider-level grid topology, not model capacity. In the single-window early-warning regime, TokaMind outperforms a CNN baseline (F1 0.889 vs. 0.878), a reversal that disappears as more event windows are provided. Furthermore, Critical Slowing Down (CSD) indicators, used as a confidence gate rather than a classification label, improve F1 from 0.696 to 0.750 at 63% coverage, outperforming the CNN baseline (0.636) at any coverage level. These results establish the first cross-domain validation of TokaMind outside nuclear fusion and propose a transferability framework and revised evaluation protocol for multi-source PMU datasets.
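
The "F1 at a given coverage" figures above come from a selective-prediction setup: predictions are scored only on the events that pass a confidence gate. The sketch below shows how such a number can be computed on toy data. The `confidence` values stand in for a hypothetical CSD-derived score, and the data, threshold, and function names are assumptions for illustration, not the paper's evaluation code.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def gated_f1(y_true, y_pred, confidence, threshold):
    """Score only the events whose confidence passes the gate.

    Returns (F1 on the kept events, coverage = fraction of events kept).
    """
    kept = [i for i, c in enumerate(confidence) if c >= threshold]
    coverage = len(kept) / len(y_true)
    f1 = f1_score([y_true[i] for i in kept], [y_pred[i] for i in kept])
    return f1, coverage


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
    confidence = [0.9, 0.2, 0.8, 0.3, 0.7, 0.95, 0.6, 0.1]  # hypothetical gate scores
    print("ungated F1:", f1_score(y_true, y_pred))
    print("gated (F1, coverage):", gated_f1(y_true, y_pred, confidence, 0.5))
```

In this toy case every misclassified event has low confidence, so gating at 0.5 keeps 5 of 8 events and all kept predictions are correct; real gates trade coverage for accuracy less cleanly.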

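2605.10983 2026-05-14 cs.LG cs.AI cs.CV Updated version

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Jiaming Li, Chenyu Zhu, Nanxi Yi, Youjun Bao, Li Sun, Quanying Lv, Xiang Fang, Daizong Liu, Jianjun Li, Kun He, Bowen Zhou, Zhiyuan Ma

Institutions * Huazhong University of Science and Technology; Kuaishou Technology; Nanyang Technological University; Wuhan University; Tsinghua University

AI summary To address reward hacking when aligning diffusion models to downstream tasks, this study proposes Trajectory Matching Policy Optimization (TMPO), which replaces conventional scalar reward maximization with trajectory-level reward distribution matching and thereby improves generative diversity and quality. TMPO introduces a Softmax Trajectory Balance objective that aligns policy probabilities with a reward-induced Boltzmann distribution, and the authors prove that it covers multiple trajectory modes. TMPO also incorporates Dynamic Stochastic Tree Sampling to make training of large-scale flow-matching models more efficient; experiments show that it improves generative diversity over existing methods while remaining competitive on task performance.

Details
English abstract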

Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most RL-based alignment methods still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining the probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1%, and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
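
The abstract does not give the Softmax-TB formula, so the sketch below only illustrates the general idea of matching K trajectory probabilities to a reward-induced Boltzmann distribution. It assumes a forward KL between the target softmax(r / beta) and the policy log-probabilities renormalized over the K sampled trajectories; the function names, the `beta` temperature, and this exact loss form are assumptions, not the paper's objective.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    v_max = max(values)
    log_norm = v_max + math.log(sum(math.exp(v - v_max) for v in values))
    return [v - log_norm for v in values]


def trajectory_matching_loss(log_probs, rewards, beta=1.0):
    """Forward KL(q || p) over K sampled trajectories.

    q: Boltzmann target, softmax(rewards / beta).
    p: policy trajectory log-probabilities renormalized over the K samples.
    """
    log_q = log_softmax([r / beta for r in rewards])
    log_p = log_softmax(log_probs)
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))


if __name__ == "__main__":
    rewards = [1.0, 2.0, 0.5]
    # Policy that already matches the target up to an additive constant.
    matched = [r / 0.5 - 7.0 for r in rewards]
    # Policy that spreads mass uniformly over the K trajectories.
    uniform = [0.0, 0.0, 0.0]
    print(trajectory_matching_loss(matched, rewards, beta=0.5))
    print(trajectory_matching_loss(uniform, rewards, beta=0.5))
```

Unlike maximizing expected reward, this loss is minimized when probability mass is spread in proportion to exp(r / beta), so it does not favor collapsing onto the single highest-reward trajectory.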

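2605.10906 2026-05-14 cs.LG cs.AI Updated version

DataMaster: Data-Centric Autonomous AI Research

Yaxin Du, Xiyuan Yang, Zhifan Zhou, Wanxu Liu, Zixing Lei, Zimeng Chen, Fenyi Liu, Haotian Wu, Yuzhu Cai, Zexi Liu, Xinyu Zhu, WenHao Wang, Linfeng Zhang, Chen Qian, Siheng Chen

Institutions * Shanghai Jiao Tong University; Carnegie Mellon University; Zhejiang University; Beijing University of Aeronautics and Astronautics

AI summary As models, training methods, and compute in machine learning systems become standardized, further performance gains depend increasingly on data. The study proposes DataMaster, an autonomous data engineering framework that aims to improve downstream task performance by optimizing data selection, composition, and processing while leaving the learning algorithm unchanged. The framework has three core components, a DataTree, a shared Data Pool, and a Global Memory, which let it explore the data space, reuse discovered data, and accumulate experience; experiments show that it outperforms baselines on multiple benchmarks.

Details
English abstract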

As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).

2605.10819 2026-05-14 cs.RO cs.AI cs.CV Updated version

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

Zuojin Tang, Haoyun Liu, Xinyuan Chang, Changjie Wu, Dongjie Huo, Yandan Yang, Bin Liu, Zhejia Cai, Feng Xiong, Mu Xu, jiachen Luo, De Ma, Zhiheng Ma, Gang Pan

Institutions * Zhejiang University; Amap, Alibaba Group; Nanjing University; Shenzhen University of Advanced Technology; Beijing University of Chemical Technology; Embodied Intelligence General Platform Laboratory, Chery Auto; Tsinghua University; Queen Mary University of London

AI summary Vision-language-action (VLA) models are constrained by the scarcity of action-labeled robot data, whereas action-free videos contain rich information about how the physical world changes. This paper proposes ALAM (Algebraically Consistent Latent Action Model), which learns structured latent action transitions from action-free video to give policy generation a consistent transition structure. ALAM uses frame triplets to learn latent transitions that satisfy reconstruction, composition, and reversal consistency, and couples them with policy generation through a joint flow-matching objective, yielding substantial gains on VLA tasks across multiple benchmarks.

Details
English abstract

Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM, an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.

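2605.10685 2026-05-14 cs.AI Updated version

GESR: A Genetic Programming-Based Symbolic Regression Method with Gene Editing

Yanjie Li, Liping Zhang, Min Wu, Weijun Li, Lina Yu, Jingyi Liu, Yusong Deng, Mingzhu Wan, Xin Ning

Institutions * AnnLab; Institute of Semiconductors, Chinese Academy of Sciences; Zhongguancun Academy

AI summary This paper proposes GESR, a symbolic regression method based on gene editing that aims to improve the efficiency and performance of traditional genetic programming (GP) on symbolic regression tasks. The method introduces two BERT models as "hands of God": one guides gene mutation and the other predicts the crossover point to guide gene crossover, enabling more targeted gene editing. Experiments show that GESR significantly improves on traditional GP methods in both computational efficiency and task performance.

Comments 70 pages

Details
English abstract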

Mathematical formulas serve as a language through which humans communicate with nature. Discovering mathematical laws from scientific data to describe natural phenomena has been a long-standing pursuit of humanity for centuries. In the field of artificial intelligence, this challenge is known as the symbolic regression problem. Among existing symbolic regression approaches, Genetic Programming (GP) based on evolutionary algorithms remains one of the most classical and widely adopted methods. GP simulates the evolutionary process across generations through genetic mutation and crossover. However, mutations and crossovers in GP are entirely random. While this randomness effectively mimics natural evolution, it inevitably produces both beneficial and detrimental variations. If there existed a metaphorical "God" capable of foreseeing which genetic mutations or crossovers would yield superior outcomes and performing targeted gene editing accordingly, the efficiency of evolution could be substantially improved. Motivated by this idea, we propose in this paper a symbolic regression approach based on gene editing, termed GESR. In GESR, we trained two "hands of God" (two BERT models). The first leverages BERT's masked language modeling capability to guide the mutation of genes (expression symbols). The second guides the crossover of individual genes by predicting the crossover point. Experimental results demonstrate that GESR significantly improves computational efficiency compared with traditional GP algorithms and achieves strong overall performance across multiple symbolic regression tasks.

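2605.10426 2026-05-14 cs.CV cs.AI Updated version

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

Minqing Huang, Yujiao Xiang, Zihan Liang, Jiajie Huang, Jingqi Wang, Zhi Xu, Feiyang Tan, Hangning Zhou, Mu Yang, Gong Che

Institutions * Afari Intelligent Drive; University of Electronic Science and Technology of China; Shanghai Jiao Tong University; Beijing University of Posts and Telecommunications; Tianjin University

AI summary This paper proposes CoWorld-VLA, a multi-expert world model framework for autonomous driving that addresses the shortcomings of existing vision-language-action (VLA) models in providing planning-oriented intermediate representations. The method extracts complementary world information through multi-source supervision and encodes it into expert tokens that serve as explicit conditions for the planner, guiding action generation more effectively. Experiments show that CoWorld-VLA performs well on future scene generation and planning, with particular strength in collision avoidance and trajectory accuracy.

Details
English abstract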

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/AFARI-Research/CoWorld-VLA.

2605.10267 2026-05-14 cs.AI version update

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, Yuyang Sheng, Changtong Zan, Xiaofeng Zhu, Yizhe Zhang, Jiru Li, Mingze Guo, Ling Zou, Yalong Li, Chengfu Huo, Liang Ding

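Affiliations * Multimodal and Industrial AI Team

AI summary This paper presents IndustryBench, a Chinese-language industrial procurement QA benchmark built from Chinese national standards and industrial product records, for evaluating large language models at the boundaries of industrial knowledge. The benchmark contains 2,049 items spanning seven capability dimensions and ten industry categories, and its external-verification stage filtered out 70.3% of unreliable candidate answers. It exposes marked weaknesses of current models in industrial safety and standards compliance. Even the best model scores low after safety adjustment, and safety violations significantly change model rankings, indicating that LLM evaluation in industrial settings needs to place more weight on safety and standards compliance.

Details
English abstract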

In industrial procurement, an LLM answer is useful only if it survives a standards check: recommended material must match operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at $κ_w = 0.798$ against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0--3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard -- GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
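The judge-validation figure above is a weighted Cohen's kappa between two raters on the 0-3 rubric. The sketch below shows one common way to compute such a statistic. It assumes quadratic disagreement weights, which the abstract does not specify, and the function name `weighted_kappa` and the sample ratings are illustrative only.

```python
def weighted_kappa(rater_a, rater_b, num_levels=4):
    """Quadratic-weighted Cohen's kappa for integer ratings in [0, num_levels).

    Assumption: quadratic weights (i - j)^2; the abstract does not state the scheme.
    Undefined (division by zero) when both raters use a single identical level.
    """
    n = len(rater_a)
    # Observed mean squared disagreement between the two raters.
    observed = sum((a - b) ** 2 for a, b in zip(rater_a, rater_b)) / n
    # Marginal rating distributions of each rater.
    hist_a = [rater_a.count(k) / n for k in range(num_levels)]
    hist_b = [rater_b.count(k) / n for k in range(num_levels)]
    # Expected squared disagreement if the raters were independent.
    expected = sum(
        hist_a[i] * hist_b[j] * (i - j) ** 2
        for i in range(num_levels)
        for j in range(num_levels)
    )
    return 1.0 - observed / expected


# Hypothetical ratings: perfect agreement and perfect reversal.
kappa_same = weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3])
kappa_reversed = weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0])
```

Quadratic weights penalize a 0-versus-3 disagreement more heavily than a 2-versus-3 one, which suits an ordinal rubric.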

2605.10005 2026-05-14 cs.PL cs.AI cs.LO cs.SE version update

Combining Mechanical and Agentic Specification Inference for Move

Wolfgang Grieskamp, Teng Zhang, Vineeth Kashyap

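Affiliations * Aptos Labs

AI summary This paper describes a specification inference tool for the Move Prover that combines weakest-precondition (WP) analysis over Move bytecode with an agentic coding tool such as Claude Code. The approach aims to reduce the tedium of writing specifications by hand: WP analysis provides a sound mechanical baseline, while the AI agent handles the parts where WP is weakest, such as loop invariants and high-level specifications. The tool has been applied to canonical Move code that uses higher-order functions, dynamic dispatch, global state, and other features, demonstrating its effectiveness and practicality.

Details
English abstract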

In this paper, we describe early work on a specification inference tool for the Move Prover that combines a weakest-precondition (WP) analysis over Move bytecode with an agentic coding CLI such as Claude Code. Specification inference reduces the boilerplate of writing specifications in Move: in order to verify a high-level property such as a global state invariant, pre- and post-conditions for the supporting functions typically have to be written by hand, which is tedious. In our setting, a Model Context Protocol (MCP) service exposes the WP analysis and the prover itself to the coding agent. The WP analysis provides a sound, mechanical baseline for inference; the AI is used precisely where WP is weakest -- for loop invariants and high-level idiomatic specifications such as monotonicity, conservation, and structural invariants. The Move Prover serves as the oracle that decides whether the generated specs are valid, and the agent is equipped to generate proof hints and to refine the inferred specification until verification succeeds. The tool has been applied to a corpus of canonical Move code, including code that uses higher-order functions, dynamic dispatch, global state, references, and various forms of loops.
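For readers unfamiliar with weakest-precondition analysis, the textbook rule for assignment is wp(x := e, Q) = Q[e/x], and sequences compose right to left. The sketch below illustrates only that rule on a toy two-statement program. It is not the Move Prover's analysis, which operates on Move bytecode, and the helper names `wp_assign` and `wp_seq` are hypothetical.

```python
def wp_assign(var, expr, post):
    """wp(var := expr, post): evaluate post in the state after the assignment."""
    return lambda state: post({**state, var: expr(state)})


def wp_seq(steps, post):
    """wp(S1; ...; Sn, post), folding the statement transformers right to left."""
    for step in reversed(steps):
        post = step(post)
    return post


# Toy program: x := x + 1; y := x * 2, with postcondition y > 4.
program = [
    lambda post: wp_assign("x", lambda s: s["x"] + 1, post),
    lambda post: wp_assign("y", lambda s: s["x"] * 2, post),
]
precondition = wp_seq(program, lambda s: s["y"] > 4)

# The derived precondition is (x + 1) * 2 > 4, which simplifies to x > 1.
holds_for_two = precondition({"x": 2, "y": 0})
holds_for_one = precondition({"x": 1, "y": 0})
```

Loops have no such closed-form rule, which is consistent with the abstract's point that loop invariants are where the agent is brought in.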

2605.09923 2026-05-14 cs.AI version update

EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

Mingxiong Lin, Zhangquan Gong, Maowen Tang, Qian Li, Chuangchuang Wang, Jian Ma, Sutian Huang, Kai Tang, Haonan Lu

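AI summary This paper targets the limited exploration efficiency of Group Relative Policy Optimization (GRPO), the mainstream algorithm in reinforcement learning with verifiable rewards (RLVR), and proposes Exploration-Prioritized Policy Optimization (EXPO). EXPO adds a dynamically adjusted KL-regularization module and a Gaussian curriculum sampling strategy, improving the model's exploration ability and training efficiency on mathematical reasoning tasks. Experiments show that EXPO clearly outperforms vanilla GRPO on multiple benchmarks, with larger gains on harder problems.

Comments Duplicate submission of arXiv:2605.11403

Details
English abstract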

Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, where Group Relative Policy Optimization (GRPO) serves as the mainstream algorithm. We point out two understudied inefficiencies existing in GRPO. First, the fixed KL penalty coefficient overly restricts policy exploration at stages where the model requires significant deviation from the reference policy. Second, uniform sampling of training questions ignores that moderately difficult problems provide the most informative gradient signals for optimization. We propose Exploration-Prioritized Policy Optimization (EXPO) with two lightweight plug-in modules. The Accuracy-Conditioned KL Scaling (AKL) dynamically adjusts KL regularization strength through a smooth nonlinear function of batch average accuracy, relaxing the penalty when the model underperforms and strengthening it when the model achieves good results. The Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at moderate accuracy around 0.5, focusing training on the model's learning frontier. We conduct extensive experiments on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base over six mathematical reasoning benchmarks. The results show EXPO steadily surpasses vanilla GRPO. It obtains an absolute gain of 13.34 on AIME 2025 pass@32, rising from 63.33 percent to 76.67 percent, and achieves an average pass@32 improvement of 2.66 on the 8B model. The much larger performance gains on pass@32 compared with pass@1 demonstrate that EXPO effectively enlarges the model's exploration boundary under a fixed inference cost budget.
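The two EXPO modules can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract says only that AKL is a smooth nonlinear function of batch accuracy and that GCS weights follow a Gaussian centered near 0.5. The sigmoid form and the values of `beta_min`, `beta_max`, `steepness`, and `sigma` are assumptions.

```python
import math


def akl_coefficient(batch_accuracy, beta_min=0.001, beta_max=0.04, steepness=10.0):
    """Hypothetical AKL: the KL weight rises smoothly with batch accuracy.

    Assumption: a sigmoid gate; the abstract does not give the functional form.
    """
    gate = 1.0 / (1.0 + math.exp(-steepness * (batch_accuracy - 0.5)))
    return beta_min + (beta_max - beta_min) * gate


def gcs_weights(accuracies, center=0.5, sigma=0.2):
    """Hypothetical GCS: normalized Gaussian sampling weights over per-question accuracy."""
    raw = [math.exp(-((a - center) ** 2) / (2 * sigma ** 2)) for a in accuracies]
    total = sum(raw)
    return [r / total for r in raw]


# Questions that are always failed, solved half the time, and always solved.
weights = gcs_weights([0.0, 0.5, 1.0])
low_accuracy_beta = akl_coefficient(0.1)
high_accuracy_beta = akl_coefficient(0.9)
```

Under these assumptions the moderately difficult question receives the largest sampling weight, and the KL penalty is relaxed when batch accuracy is low.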

2605.09423 2026-05-14 cs.AI version update

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

Haoqiang Kang, Xiaokang Ye, Yuhan Liu, Siddhant Hitesh Mantri, Lingjun Mao, James Fleming, Drishti Regmi, Lianhui Qin

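Affiliations * UC San Diego; New York University

AI summary This paper presents SimWorld Studio, an open-source platform built on Unreal Engine 5 that automatically generates interactive 3D learning environments for embodied agents. Its core is SimCoder, a tool- and skill-augmented coding agent that writes and executes engine-level code from language or image instructions to build physically grounded 3D worlds, and that evolves itself using verifier feedback. The platform enables co-evolution of environment generation and embodied learning, substantially improving agent performance and the reliability of environment generation.

Details
English abstract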

LLM/VLM-based digital agents have advanced rapidly thanks to scalable sandboxes for coding, web navigation, and computer use, which provide rich interactive training grounds. In contrast, embodied agents still lack abundant, diverse, and automatically generated 3D environments for interactive learning. Existing embodied simulators rely on manually crafted scenes or procedural templates, while recent LLM-based 3D generation systems mainly produce static scenes rather than deployable environments with verifiable tasks and standard learning interfaces. We introduce SimWorld Studio, an open-source platform built on Unreal Engine 5 for generating evolving embodied learning environments. At its core is SimCoder, a tool/skill-augmented coding agent that writes and executes engine-level code to construct physically grounded 3D worlds from language/image instructions. SimCoder self-evolves by using verifier feedback (e.g., compilation errors, physics checks, VLM critiques) to revise environments and autonomously add reusable tools and skills to its library. Generated worlds are exported as Gym-style environments for embodied agent learning. SimWorld Studio further enables co-evolution between environment generation and embodied learning: agent performance feedback guides SimCoder to generate adaptive curricula near the learner's capability frontier, so that environments become increasingly challenging as the embodied agent improves. Three case studies on embodied navigation show that self-evolution improves generation reliability, generated environments substantially improve embodied agent performance that generalizes to unseen benchmarks, and co-evolution yields an 18-point success-rate gain over fixed-environment learning and a 40-point gain over an untrained agent.

2605.07483 2026-05-14 cs.LG cs.AI version update

Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization

Leonel Aguilar, Jan Nagler, Christoph Hoelscher, Nino Antulov-Fantulin

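Affiliations * Chair of Cognitive Science, ETH Zürich; Centre for Human and Machine Intelligence, Frankfurt School; Aisot Technologies AG, D-GESS, ETH Zürich

AI summary This paper studies why deep neural networks fail to generalize out of distribution (OOD), arguing that the root cause is that features learned from the training data do not reflect the true data-generating process (DGP). The authors propose that a structural commitment, consisting of a feature map, a label map, and a model class (φ, ψ, M), makes the assumed DGP explicit and thereby improves OOD generalization. Experiments show that the right feature representation and model choice can sharply reduce OOD error, and the approach is validated on several natural-science and machine-learning tasks.

Details
English abstract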

Successful deep neural networks discover salient features of data. We show when and why they fail to learn out-of-distribution (OOD)-relevant representations from an in-distribution (ID) training window. This requires decoupling feature learning from data-generating-process (DGP) identifiability. From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are $\varepsilon$-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment, the feature map, label map, and model class $(φ, ψ, \mathcal{M})$, dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When architecture, pretraining, augmentation, input formats, or domain knowledge implicitly inject the missing commitment, the model succeeds. When it cannot infer OOD-relevant structure from ID evidence, it fails. Changing only the representation can make the same architecture, at the same in-distribution loss, differ by ${\sim}520\times$ out of distribution. When the commitment is correct and identifiable, OOD error vanishes. For example, Fourier coordinates turn periodic extrapolation into interpolation on $\mathbb{S}^1$. The same mechanism predicts outcomes in three natural-science settings (mass-action chemistry; Kepler's-third-law exoplanet prediction, $n=2{,}362$; and cross-species coding-DNA detection) and in a 264-run positional-encoding study across Transformer, Mamba, and S4D. Finally, a controlled study shows: correct features are necessary but not sufficient. The model class must express the target, and the transformed training data must cover the relevant representation space.
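The Fourier-coordinate claim can be checked on a toy problem. The sketch below fits the same linear model class on two representations of the input, raw x and (sin x, cos x), using one training window [0, 2π), then evaluates far outside it. The target function, the test point, and the helper `fit_two_features` are assumptions made for illustration and are not taken from the paper's experiments.

```python
import math


def fit_two_features(f1, f2, ys):
    """Least squares for y ~ a*f1 + b*f2 via the 2x2 normal equations."""
    s11 = sum(u * u for u in f1)
    s12 = sum(u * v for u, v in zip(f1, f2))
    s22 = sum(v * v for v in f2)
    t1 = sum(u * y for u, y in zip(f1, ys))
    t2 = sum(v * y for v, y in zip(f2, ys))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (s11 * t2 - s12 * t1) / det


def target(x):
    # Hypothetical periodic data-generating process.
    return math.sin(x + 0.3)


# Training window: one period, [0, 2*pi).
train_x = [2 * math.pi * i / 50 for i in range(50)]
train_y = [target(x) for x in train_x]

# Representation 1: raw coordinate plus a constant feature.
slope, intercept = fit_two_features(train_x, [1.0] * 50, train_y)
# Representation 2: Fourier coordinates, i.e. a point on the circle S^1.
a, b = fit_two_features(
    [math.sin(x) for x in train_x], [math.cos(x) for x in train_x], train_y
)

test_x = 40.0  # far outside the training window
raw_err = abs(slope * test_x + intercept - target(test_x))
fourier_err = abs(a * math.sin(test_x) + b * math.cos(test_x) - target(test_x))
```

In Fourier coordinates the test point lands on the same circle the training data already covers, so the extrapolation problem becomes interpolation, whereas the raw-coordinate fit keeps following its trend line.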

2605.07161 2026-05-14 cs.AI version update

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

Jackson Clark, Yiming Su, Saad Mohammad Rafid Pial, Yifang Tian, Lily Gniedziejko, Hans-Arno Jacobsen, Yinfang Chen, Tianyin Xu

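Affiliations * University of Illinois Urbana-Champaign; University of Toronto

AI summary This paper presents SREGym, a high-fidelity benchmark platform for evaluating AI Site Reliability Engineering (SRE) agents. Built on real-world cloud-native system stacks, SREGym simulates multi-layer faults, ambient noise, and diverse failure modes, and provides 90 realistic, challenging SRE problems. The platform is modular and extensible, with support for fault injection and noise control. Results show that current frontier agents differ markedly across failure types, with up to 40% differences in end-to-end results.

Details
English abstract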

AI agents are increasingly used to diagnose and mitigate failures in production systems, a practice known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to overly simplistic SRE tasks and are hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities vary significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.

2605.07147 2026-05-14 cs.LO cs.AI cs.LG version update

MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

Zixuan Xie, Xinyu Liu, Shangtong Zhang

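Affiliations * University of Virginia

AI summary This paper presents MathlibPR, a benchmark built from real Mathlib4 pull request (PR) histories, for evaluating how well large language models (LLMs) judge whether a PR to the mathematical library is ready to merge. The authors note that although LLM-assisted formal reasoning has advanced, LLMs do not yet contribute effectively to Mathlib itself, whose growth is constrained by human review. Using a staged evaluation protocol, the study finds that current mainstream LLMs and LLM agents still struggle to distinguish merge-ready PRs from PRs that pass the build but were not merged. MathlibPR provides a supervision signal for developing reviewer assistants and reward models.

Details
English abstract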

The ecosystem of Lean and Mathlib has become the de facto standard for large language model (LLM) assisted formal reasoning with remarkable successes in recent years. Those successes, however, only consume Mathlib as an essential dependency but do not directly contribute to it. In the meantime, the growth of Mathlib has recently been bottlenecked by the review process, which requires human reviewers to judge whether proposed pull requests (PRs) follow Mathlib's conventions and are worth integrating as part of a shared mathematical infrastructure. This leads to our central question: can LLMs help review Mathlib PRs? To this end, we introduce MathlibPR, a benchmark built from real Mathlib4 PR histories. We further propose a staged evaluation protocol and use it to evaluate both LLMs (e.g., DeepSeek, Qwen, Goedel, and Kimina) and LLM agents (e.g., Codex and Claude Code). Surprisingly, both LLMs and LLM agents struggle to distinguish merge-ready PRs from build-passing PRs that were revised or never merged. By turning Mathlib PR histories into a supervised signal, MathlibPR provides a step toward reviewer assistants and reward models that could help evaluate PRs and steer LLMs toward producing merge-ready Mathlib contributions.

2605.06869 2026-05-14 cs.AI version update

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Roger Creus Castanyer, Pablo Samuel Castro, Glen Berseth

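Affiliations * Mila Quebec AI Institute; Université de Montréal; Google DeepMind

AI summary This paper presents Agentick, a unified benchmark for general sequential decision-making agents, designed to compare fairly across approaches such as reinforcement learning agents that learn from scratch, language-model agents that rely on pre-trained knowledge, and hybrid agents. Agentick provides 37 procedurally generated tasks covering six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium interface, together with a coding API, reference policies, training datasets, and a live leaderboard. Experiments show that different approaches have different strengths across tasks and that substantial headroom remains, making Agentick a useful testbed for work on general autonomous agents.

Details
English abstract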

AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.
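A Gymnasium-compatible interface means every task is driven through the same `reset` and `step` calls, so an RL policy, an LLM wrapper, or a human can be plugged into one evaluation loop. The sketch below mirrors those call signatures with a stand-in environment. `ToyCorridorEnv` and `run_episode` are hypothetical and are not part of Agentick; a real task would be created through the benchmark's own API.

```python
class ToyCorridorEnv:
    """Hypothetical stand-in that follows Gymnasium's reset/step signatures."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self, seed=None, options=None):
        # Gymnasium convention: reset returns (observation, info).
        self.position = 0
        return self.position, {}

    def step(self, action):
        # Action 1 moves right; any other action moves left, clamped at 0.
        self.position = max(0, self.position + (1 if action == 1 else -1))
        terminated = self.position >= self.length
        reward = 1.0 if terminated else 0.0
        # Gymnasium convention: (observation, reward, terminated, truncated, info).
        return self.position, reward, terminated, False, {}


def run_episode(env, policy, max_steps=100):
    """Roll out any callable policy(observation) -> action and return the total reward."""
    observation, _info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        observation, reward, terminated, truncated, _info = env.step(policy(observation))
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward


episode_return = run_episode(ToyCorridorEnv(), lambda observation: 1)
```

Because the loop depends only on the policy being a callable, the same harness can score very different agent types on common ground.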

2605.06607 2026-05-14 physics.flu-dyn cs.AI version update

AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

Nithin Somasekharan, Rabi Pathak, Manushri Dhanakoti, Tingwen Zhang, Ling Yue, Andy Zhu, Shaowu Pan

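Affiliations * Rensselaer Polytechnic Institute

AI summary This work presents AI CFD Scientist, an open-source AI scientist system aimed at open-ended scientific discovery in computational fluid dynamics (CFD). The system combines physics-based visual verification, code generation and modification, and literature-grounded hypothesis generation, covering the full path from ideation to validated experiments within a single inspectable workflow. Its vision-language physics-verification mechanism detects errors that conventional solver-level checks miss, and across several tasks the system demonstrates automatic discovery of improved models and gains in simulation accuracy.

Comments 9 main pages and rest in appendix

Details
English abstract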

Recent LLM-based agents have closed substantial portions of the scientific discovery loop in software-only machine-learning research, in chemistry, and in biology. Extending the same loop to high-fidelity physical simulators is harder, because solver completion does not imply physical validity and many failure modes appear only in field-level imagery rather than in solver logs. We present AI CFD Scientist, an open-source AI scientist for computational fluid dynamics (CFD) that, to our knowledge, is the first to span literature-grounded ideation, validated execution, vision-based physics verification, source-code modification, and figure-grounded writing within a single inspectable workflow. Three coupled pathways cover parameter sweeps within a fixed solver, case-local C++ library compilation for new physical models, and open-ended hypothesis search against a reference comparator, all running on OpenFOAM through Foam-Agent. At the center of the framework is a vision-language physics-verification gate that inspects rendered flow fields before any result is accepted, rerun, or written into a manuscript. On five tasks under a shared GPT-5.5 backbone, AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction that reduces lower-wall Cf RMSE against DNS by 7.89% on the periodic hill at Reh=5600; under matched LLM cost, two strong general AI-scientist baselines (ARIS, DeepScientist) execute partial CFD workflows but lack the domain-specific validity gates needed to convert runs into defensible scientific claims; and a controlled planted-failure ablation shows that the vision-language gate detects 14 of 16 silent failures missed by solver-level checks. Code, prompts, and run artifacts are released at https://github.com/csml-rpi/cfd-scientist.

2605.06387 2026-05-14 cs.LG cs.AI version update

Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

Nan Jia, Haojin Yang, Xing Ma, Jiesong Lian, Shuailiang Zhang, Weipeng Zhang, Ke Zeng, Xunliang Cai, Zequn Sun

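Affiliations * Huazhong University of Science and Technology; Peking University; Meituan

AI summary This paper studies how to improve on-policy distillation so that exploitation and imitation are better combined at the token level. To address high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks in the standard advantage-weighted policy gradient, it proposes Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive-advantage regions while keeping positive reinforcement learning. Experiments show that AOPD outperforms the standard method on mathematical reasoning benchmarks, maintains higher policy entropy during training, and adapts better to tool use.

Details
English abstract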

On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage-weighted policy gradient suffers from three structural weaknesses: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.
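The asymmetry described above can be sketched per token. This is an illustration under stated assumptions, not the paper's loss: the advantage is taken to be the teacher-minus-student log-probability of the sampled token, and the divergence is taken to be KL(teacher || student) over the vocabulary at that position. Neither choice is given in the abstract, and `aopd_token_loss` is a hypothetical name.

```python
import math


def aopd_token_loss(student_probs, teacher_probs, token):
    """Hypothetical asymmetric token loss.

    Positive advantage: advantage-weighted negative log-likelihood of the sampled token.
    Non-positive advantage: a local KL(teacher || student) term instead of negative
    reinforcement. Both the advantage definition and the KL direction are assumptions.
    """
    advantage = math.log(teacher_probs[token]) - math.log(student_probs[token])
    if advantage > 0:
        # In a training framework the advantage would be treated as a constant weight.
        return -advantage * math.log(student_probs[token])
    return sum(
        t * math.log(t / s) for t, s in zip(teacher_probs, student_probs) if t > 0
    )


student = [0.5, 0.5]
teacher = [0.9, 0.1]
reinforced = aopd_token_loss(student, teacher, token=0)   # teacher prefers token 0
distilled = aopd_token_loss(student, teacher, token=1)    # teacher disfavors token 1
matched = aopd_token_loss(student, student, token=0)      # zero advantage
```

The point of the non-positive branch is that a zero or negative advantage still yields a useful signal whenever the two distributions differ, instead of a vanishing or purely suppressive update.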

2605.04759 2026-05-14 cs.CL cs.AI cs.ET cs.LG version update

Gyan: An Explainable Neuro-Symbolic Language Model

Venkat Srinivasan, Vishaal Jatav, Anushka Chandrababu, Geetika Sharma

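Affiliations * Innospark Ventures & Gyan AI; Gyan AI Inc.

AI summary This paper presents Gyan, an explainable neuro-symbolic language model based on a novel non-transformer architecture, which the authors say avoids the limitations of conventional large language models in interpretability, maintainability, and compute cost. Gyan draws on rhetorical structure theory, semantic role theory, and knowledge-based computational linguistics to capture complete compositional context, and builds a human-like "world model" to strengthen understanding. Experiments report strong performance on several data sets, pointing to its potential for building trustworthy, reliable language models for mission-critical tasks.

Comments also submitted to NeurIPS 2026

Details
English abstract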

Transformer-based pre-trained large language models have become ubiquitous. There is increasing evidence to suggest that even with large-scale pre-training, these models do not capture complete compositional context, and certainly not the full human-analogous context. Besides, by the very nature of the architecture, these models hallucinate, are difficult to maintain, are not easily interpretable, and require enormous compute resources for training and inference. Here, we describe Gyan, an explainable language model based on a novel non-transformer architecture, without any of these limitations. Gyan achieves SOTA performance on 3 widely cited data sets and superior performance on two proprietary data sets. The novel architecture decouples the language model from knowledge acquisition and representation. The model draws on rhetorical structure theory, semantic role theory, and knowledge-based computational linguistics. Gyan's meaning representation structure captures the complete compositional context and attempts to mimic humans by expanding the context to a 'world model'. AI model adoption critically depends on trust and transparency, especially in mission-critical use cases. Collectively, our results demonstrate that it is possible to create models that are trustworthy and reliable for mission-critical tasks. We believe our work has tremendous potential for guiding the development of transparent and trusted architectures for language models.

2605.04506 2026-05-14 cs.CV cs.AI version update

Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting

Binh Long Nguyen, Kien Nguyen, Sridha Sridharan, Clinton Fookes, Peyman Moghadam

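Affiliations * School of Electrical Engineering and Robotics, Queensland University of Technology; CSIRO Robotics, CSIRO

AI summary Ilov3Splat is a new framework built on 3D Gaussian Splatting (3D-GS) for instance-level open-vocabulary 3D scene understanding. It augments Gaussian splats with view-consistent feature fields and jointly optimizes scene geometry and semantic representations, improving cross-view consistency and instance-level reasoning. By training an instance feature field with multi-resolution hash embeddings and a contrastive loss, Ilov3Splat can identify and segment arbitrary objects in a 3D scene from natural-language descriptions without category supervision, outperforming existing open-vocabulary 3D understanding methods.

Comments The International Conference on Pattern Recognition (ICPR) 2026

Details
English abstract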

We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding. Project page: https://csiro-robotics.github.io/Ilov3Splat.
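The query-matching step at inference time can be pictured as a cosine-similarity lookup between a text embedding and per-Gaussian features. The sketch below shows only that lookup with hand-written two-dimensional vectors. Real CLIP embeddings are much higher-dimensional, the threshold value and function names are assumptions, and the paper's subsequent two-stage 3D clustering is omitted.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_gaussians(query_embedding, gaussian_features, threshold=0.8):
    """Return indices of Gaussians whose feature aligns with the query (hypothetical threshold)."""
    return [
        index
        for index, feature in enumerate(gaussian_features)
        if cosine_similarity(query_embedding, feature) >= threshold
    ]


# Toy features standing in for language-aligned per-Gaussian embeddings.
selected = select_gaussians([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [1.0, 0.2]])
```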

2605.03410 2026-05-14 cs.AI version update

Geometry over Density: Few-Shot Cross-Domain OOD Detection

Shawn Li, You Qin, Jiate Li, Charith Peris, Lisa Bauer, Roger Zimmermann, Yue Zhao

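Affiliations * University of Southern California; National University of Singapore; Amazon

AI summary This paper studies how to perform cross-domain out-of-distribution (OOD) detection with a pre-trained model when only a few samples are available. It proposes UFCOD, a unified framework that analyzes the information geometry of the diffusion process and extracts two features, Path Energy and Dynamics Energy, enabling OOD detection in arbitrary new domains without additional training. The method reaches 93.7% average AUROC across 12 cross-domain benchmarks, showing a clear advantage in sample efficiency.

Details
English abstract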

Out-of-distribution (OOD) detection identifies test samples that fall outside a model's training distribution, a capability critical for safe deployment in high-stakes applications. Standard OOD detectors are trained on a specific in-distribution (ID) dataset and detect deviations from that single domain. In contrast, we study few-shot cross-domain OOD detection: given a single pre-trained model, can we perform OOD detection on arbitrary new ID-OOD task pairs using only a handful of ID samples at inference time, with no additional training? We propose UFCOD, a unified framework that achieves this goal through information-geometric analysis of diffusion trajectories. Our key insight is that diffusion noise predictions are score functions (gradients of log-density), and we extract two energy features: Path Energy (integrated score magnitude) and Dynamics Energy (score smoothness), that form a discrete Sobolev norm capturing how samples interact with the learned diffusion process. The central contribution is a train-once, deploy-anywhere paradigm: a diffusion model trained on a single dataset (e.g., CelebA) serves as a universal feature extractor for OOD detection across semantically unrelated domains (e.g., CIFAR-10, SVHN, Textures). At deployment, each new task requires only ~100 unlabeled ID samples for inference: no retraining, no fine-tuning, no task-specific adaptation. Using 100 ID samples per task, UFCOD achieves 93.7% average AUROC across 12 cross-domain benchmarks, competitive with methods trained on 50k-163k samples, demonstrating ~500x improvement in sample efficiency. See our code at https://github.com/lili0415/UFCOD.
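The two energy features can be illustrated on a toy trajectory of score vectors, one per diffusion timestep. The definitions below, mean squared norm for Path Energy and mean squared first difference for Dynamics Energy, are one plausible reading of "integrated score magnitude" and "score smoothness". The exact formulas and normalizations used by UFCOD are not given in the abstract, so both are assumptions.

```python
def path_energy(scores):
    """Assumed Path Energy: mean squared norm of the score along the trajectory."""
    return sum(sum(v * v for v in s) for s in scores) / len(scores)


def dynamics_energy(scores):
    """Assumed Dynamics Energy: mean squared change between consecutive scores."""
    pairs = list(zip(scores[1:], scores[:-1]))
    return sum(
        sum((a - b) ** 2 for a, b in zip(later, earlier)) for later, earlier in pairs
    ) / len(pairs)


# Two hypothetical trajectories with the same magnitude but different smoothness.
steady = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
jumpy = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]

steady_features = (path_energy(steady), dynamics_energy(steady))
jumpy_features = (path_energy(jumpy), dynamics_energy(jumpy))
```

A zeroth-order term plus a first-difference term is what makes the pair resemble a discrete Sobolev norm: the two trajectories above are indistinguishable by magnitude alone but are separated by the smoothness term.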

2605.01750 2026-05-14 cs.MA cs.AI version update

Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation

Yiheng Yao, Chelsea Zou, Robert D. Hawkins

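Affiliations * Stanford University

AI summary This paper studies dynamic grounding failures and their repair in multi-agent negotiation, noting that current LLM-based multi-agent benchmarks focus on static tasks and overlook whether agents can repair grounding breakdowns through interaction. Using a multi-turn negotiation game, the authors find that agent dyads show four characteristic failure modes when trying to reach Pareto-optimal allocations. The difficulty of dynamic grounding stems mainly from coordination bottlenecks in joint plan formation, commitment, and execution, not from individual reasoning limits or insufficient information exchange.

Details
English abstract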

Grounding is the collaborative process of establishing mutual belief sufficient for a communicative goal. While static grounding maps language to a shared context, dynamic grounding requires agents to negotiate meaning across turns. Current multi-agent Large Language Model (LLM) benchmarks largely emphasize static, one-shot tasks, overlooking whether agents can repair grounding breakdowns through interaction. We introduce an iterated multi-turn negotiation game where two agents allocate shared resources to private projects with verifiable jointly optimal outcomes. Although individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across models. We identify four failure modes: (1) loss of shared interaction history, (2) stubborn anchoring to early proposals, (3) defaulting to equal splits over reward-maximizing coordination, and (4) referential binding errors across turns. Our baselines show that the coordination gap is not explained by individual reasoning limits or insufficient information exchange alone. Instead, the bottleneck lies in dynamic grounding: joint plan formation, commitment, and execution.
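What makes the game's outcomes verifiable is that Pareto-optimal allocations can be enumerated mechanically. The sketch below filters a list of joint payoffs down to the non-dominated ones. The payoff values are invented for illustration and do not come from the paper's game.

```python
def pareto_front(outcomes):
    """Keep the (payoff_a, payoff_b) pairs that no other outcome dominates."""

    def dominates(q, p):
        # q dominates p if it is at least as good for both agents and not identical.
        return q[0] >= p[0] and q[1] >= p[1] and q != p

    return [p for p in outcomes if not any(dominates(q, p) for q in outcomes)]


# Hypothetical joint payoffs for four candidate allocations.
candidate_payoffs = [(1, 1), (2, 1), (1, 2), (0, 0)]
front = pareto_front(candidate_payoffs)
```

In this toy case the equal split (1, 1) is dominated, which mirrors the third failure mode listed above: defaulting to equal splits can leave reward on the table.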

2605.01457 2026-05-14 cs.AI version update

CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

Guowei Zou, Haitao Wang, Beiwen Zhang, Boning Zhang, Hejun Wu

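AI summary This paper proposes CoFlow, a coordinated few-step flow method for offline multi-agent decision making. By introducing Coordinated Velocity Attention and Adaptive Coordination Gating, it preserves inter-agent coordination within single-pass generation, overcoming the coordination loss of existing few-step generative methods. Experiments show that CoFlow performs well across a range of tasks, reaching state-of-the-art coordination quality with only 1 to 3 denoising steps, and that its gains come mainly from improved inter-agent coordination.

Comments 34 pages, 15 figures, 10 tables. Project page: https://guowei-zou.github.io/coflow/

Details
English abstract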

Generative models have emerged as a promising paradigm for offline multi-agent reinforcement learning (MARL), but existing approaches require many iterative sampling steps. Recent few-step acceleration methods either distill a joint teacher into independent students or apply averaged velocity fields independently to each agent. Unfortunately, these few-step approaches hurt inter-agent coordination. We show that the efficiency-coordination trade-off is not inherent: single-pass multi-agent generation can preserve coordination when the velocity field is natively joint-coupled. We propose Coordinated few-step Flow (CoFlow), an architecture that combines Coordinated Velocity Attention (CVA) with Adaptive Coordination Gating. A finite-difference consistency surrogate further replaces memory-prohibitive Jacobian-vector product backpropagation through the averaged velocity field with two stop-gradient forward passes. Across 60 configurations spanning MPE, MA-MuJoCo, and SMAC, CoFlow matches or surpasses Gaussian policies, value-based methods, transformer policies, diffusion models, and prior flow baselines on episodic return. Three independent coordination probes confirm that CoFlow's improvements arise from inter-agent coordination rather than per-agent capacity. A denoising-step sweep shows that single-pass inference suffices on every configuration. CoFlow reaches state-of-the-art coordination quality in 1-3 denoising steps under both centralized and decentralized execution. Project Page: https://guowei-zou.github.io/coflow/
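The idea of trading a Jacobian-vector product for two forward passes can be shown with a plain forward difference: J(x)v is approximated by (f(x + εv) - f(x)) / ε. The sketch below applies this to a toy function. It illustrates the general finite-difference identity only. CoFlow's actual consistency surrogate, its stop-gradient placement, and its step size are not described in the abstract, and `toy_velocity` is a made-up stand-in for a velocity field.

```python
def finite_difference_jvp(func, point, direction, eps=1e-6):
    """Approximate J(point) @ direction using two evaluations of func."""
    shifted = [p + eps * d for p, d in zip(point, direction)]
    base_out = func(point)        # first forward pass
    shifted_out = func(shifted)   # second forward pass
    return [(s - b) / eps for s, b in zip(shifted_out, base_out)]


def toy_velocity(x):
    # Hypothetical field with Jacobian [[2*x0, 0], [x1, x0]].
    return [x[0] ** 2, x[0] * x[1]]


# At x = (1, 2) with direction (1, 0), the exact product is (2, 2).
approx_jvp = finite_difference_jvp(toy_velocity, [1.0, 2.0], [1.0, 0.0])
```

Neither pass needs to store a backward graph through the Jacobian, which is where the memory saving in such surrogates comes from.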

2604.27996 2026-05-14 cs.AI cs.GR cs.HC version update

Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

Jackson Vonderhorst, Kuangshi Ai, Haichao Miao, Shusen Liu, Chaoli Wang

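Affiliations * University of Notre Dame; Lawrence Livermore National Laboratory (LLNL)

AI summary This paper studies how different types of large language model (LLM) agents perform on scientific visualization tasks, where users generate visualization workflows from natural-language instructions. It compares three main interaction paradigms (domain-specific agents with structured tool use, computer-use agents, and general-purpose coding agents) by evaluating eight representative agents on 15 benchmark tasks for visualization quality, efficiency, robustness, and computational cost. It also analyzes the effect of different interaction modalities and persistent memory. The results show clear tradeoffs in flexibility, efficiency, and stability, suggesting that future scientific visualization systems should combine structured tool use, interactive capabilities, and adaptive memory to balance performance and flexibility.

Details
English abstract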

This paper examines how different types of large language model (LLM) agents perform on scientific visualization (SciVis) tasks, where users generate visualization workflows from natural-language instructions. We compare three primary interaction paradigms, including domain-specific agents with structured tool use, computer-use agents, and general-purpose coding agents, by evaluating eight representative agents across 15 benchmark tasks and measuring visualization quality, efficiency, robustness, and computational cost. We further analyze interaction modalities, including code scripts and model context protocol (MCP) or API calls for structured tool use, as well as command-line interfaces (CLI) and graphical user interfaces (GUI) for more general interaction, while additionally studying the effect of persistent memory in selected agents. The results reveal clear tradeoffs across paradigms and modalities. General-purpose coding agents achieve the highest task success rates but are computationally expensive, while domain-specific agents are more efficient and stable but less flexible. Computer-use agents perform well on individual steps but struggle with longer multi-step workflows, indicating that long-horizon planning is their primary limitation. Across both CLI- and GUI-based settings, persistent memory improves performance over repeated trials, although its benefits depend on the underlying interaction mode and the quality of feedback. These findings suggest that no single approach is sufficient, and future SciVis systems should combine structured tool use, interactive capabilities, and adaptive memory mechanisms to balance performance, robustness, and flexibility.

2604.27389 2026-05-14 cs.CV cs.AI updated version

COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Bingli Wang, Huanze Tang, Haijun Lv, Zhishan Lin, Lixin Gu, Lei Feng, Qipeng Guo, Kai Chen

Affiliations: Southeast University; Shanghai AI Laboratory

AI summary: This paper proposes the COHERENCE benchmark to evaluate how well multimodal large language models perform fine-grained image-text alignment in interleaved image-text contexts. Existing benchmarks mostly target single-image or multi-image understanding, whereas real-world information is often presented as interleaved images and text, which requires models not only to recognize image content but also to establish fine-grained image-text associations and reason over them. COHERENCE covers interleaved image-text content from four representative domains, contains 6,161 high-quality questions, and uses a six-type error analysis to expose the shortcomings of current models on this task.

Details
English abstract

In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension. In real-world scenarios such as document reading, information is often presented as interleaved multimodal contexts. This requires MLLMs not only to recognize the content of individual images, but also to identify relevant textual and visual evidence, establish fine-grained alignments between them, and reason over these aligned signals in interleaved contexts based on contextual evidence. However, there is still a lack of systematic benchmarks for quantifying the fine-grained understanding ability of MLLMs in interleaved image-text contexts. To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers interleaved image-text content from four representative domains and contains 6,161 high-quality questions. Moreover, we perform a six-type error analysis, enabling fine-grained attribution of failures in interleaved image-text understanding to the specific capabilities missing in current MLLMs.

2604.26969 2026-05-14 cs.IR cs.AI updated version

AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

Xidong Wu, Yue Zhuan, Ruoqiao Wei, Hangxin Chen, Di Bai, Jintao Liu, Xinyi Wang, Xue Wang, Luoshu Wang, Xinwu Cheng

Affiliations: Google

AI summary: Modern recommendation systems are usually built as multi-stage pipelines that include pre-ranking, ranking, and re-ranking. System-level configuration optimization is critical to overall performance, but because the systems are highly complex, the optimization is time-consuming and requires specialist knowledge. The paper proposes the AgenticRecTune framework, which comprises five specialized agents (Actor, Critic, Insight, Skill, and Online), uses the advanced reasoning of large language models to explore the configuration space automatically, and relies on a self-evolving Skillhub module to summarize historical results, extract task mechanics, and update optimization skills, enabling efficient, adaptive optimization of recommendation systems.

Details
English abstract

Modern large-scale recommendation systems are typically constructed as multi-stage pipelines, encompassing pre-ranking, ranking, and re-ranking phases. While traditional recommendation research typically focuses on optimizing a specific model, such as improving the pre-ranking model structure or the ranking model's training algorithm, system-level configuration optimization, which integrates the outputs of each model head to produce the final score in each stage, plays a crucial role. Due to the complexity of the system, configuration optimization is both highly important and challenging. Any model modification requires new optimal system-level configurations, but each experimental iteration requires significant tuning effort. Furthermore, the models in different stages operate within distinct contexts and optimize for different targets, requiring specialized domain expertise. In addition, optimization success depends on balancing multiple competing online metrics and on alignment with shifting production development objectives. To address these challenges, we propose AgenticRecTune, an agentic framework comprising five specialized agents (Actor, Critic, Insight, Skill, and Online) designed to manage the end-to-end configuration optimization workflow. By leveraging the advanced reasoning of Large Language Models (LLMs), specifically Gemini, AgenticRecTune explores the optimal configuration spaces. The Actor Agent proposes multiple candidates and the Critic Agent filters out suboptimal proposals. The Online Agent then autonomously prepares A/B tests based on the configuration set passed by the Critic Agent and captures the subsequent experimental results. We also introduce a self-evolving Skillhub, in which the Insight Agent and the Skill Agent collaborate to summarize historical results, extract the underlying mechanics of each task in the recommendation system, and update skills.

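2604.23887 2026-05-14 cs.CR cs.AI updated version

Evaluation of Prompt Injection Defenses in Large Language Models

Priyal Deep, Shane Emmons, Amy Fox, Kyle Bacon, Kelley McAllister, Peter Ortiz, Krisztian Flautner

Affiliations: Swept AI; University of Michigan

AI summary: The study evaluates defenses against prompt injection attacks in large language models and finds that most defenses that rely on the model to protect itself eventually fail. Using an adaptive attacker to run a large number of attacks, it shows that the only effective defense is application-layer output filtering, which checks the model's responses against hardcoded rules before they reach the user and achieves zero leaks. The study stresses that security boundaries should be enforced by application code rather than by the model itself, and recommends that, until such defenses are verified, sensitive operations be restricted to trusted internal personnel.

Comments 14 pages, 9 figures

Details
English abstract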
LLM-powered applications routinely embed secrets in system prompts, yet models can be tricked into revealing them. We built an adaptive attacker that evolves its strategies over hundreds of rounds and tested it against nine defense configurations across more than 20,000 attacks. Every defense that relied on the model to protect itself eventually broke. The only defense that held was output filtering, which checks the model's responses via hardcoded rules in separate application code before they reach the user, achieving zero leaks across 15,000 attacks. These results demonstrate that security boundaries must be enforced in application code, not by the model being attacked. Until such defenses are verified by tools like Swept AI, AI systems handling sensitive operations should be restricted to internal, trusted personnel.
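
The output-filtering defense described above can be illustrated with a minimal sketch. It assumes the secret is a known string that the application can compare against; `SECRET`, `normalize`, and `filter_output` are hypothetical names, not taken from the paper, and the normalization rule is only one simple choice.

```python
import re

# Hypothetical secret that the application embedded in its system prompt.
SECRET = "sk-demo-1234"

WITHHELD = "[response withheld by output filter]"


def normalize(text: str) -> str:
    """Lowercase and drop non-alphanumerics so spaced or punctuated leaks still match."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def filter_output(response: str, secret: str = SECRET) -> str:
    """Runs in application code, after the model answers and before the user sees it."""
    if normalize(secret) in normalize(response):
        return WITHHELD
    return response
```

A substring rule like this would miss encoded or paraphrased leaks (for example a base64 rendering of the secret), so a production filter would likely need broader rules. The point of the sketch is only that the check lives outside the model.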

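2604.21345 2026-05-14 cs.AI cs.CL updated version

Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

Philip Zhong, Don Wang, Jason Zhang

Affiliations: Cisco Systems, Inc.

AI summary: This paper presents a reusable cross-domain evaluation system for assessing the quality of AI meeting summaries. The system combines structured ground-truth construction, fixed candidate generation, claim-based scoring, persisted reporting, and a privacy-preserving online monitoring and nomination interface. In tests on data from 114 meetings, accuracy differences between models are not statistically significant, whereas on retention gpt-5.1 shows higher completeness and coverage. The work offers a standardized, extensible framework for evaluating AI meeting summaries.

Comments AI Application Feature Quality Evaluation (28 pages total)

Details
English abstract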
Industrial teams often deploy large language model features before stable regression or model selection evaluation exists. We present a reusable evaluation system for AI meeting summaries that combines structured ground-truth (GT) construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded online monitoring and nomination interface. The online evidence is not itself a benchmark: privacy-safe aggregate exports show active monitoring, hard regime detection, and directional movement without exposing customer data. We benchmark the offline path on 114 meetings across city_council, private_data, and whitehouse_press_briefings, yielding 340 completed meeting-model pairs and 680 judge runs for gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this fixed protocol, accuracy differences are not statistically significant under Holm correction (corrected p-values 0.053-0.448), although gpt-4.1-mini has the highest mean accuracy (0.583); the significant separation is on retention, where gpt-5.1 leads on completeness (0.886) and coverage (0.942). Typed slices isolate whitehouse_press_briefings as an accuracy-hard regime, and a later focused rerun over gpt-4.1, gpt-5-mini, and gpt-5.4 reuses the same stack under the same judges and metrics. This extended preprint keeps those core results aligned with the formal submission while adding a more detailed repository-level account of cross-domain reuse from the companion AI-search paper and an additional typed DeepEval contrastive analysis. Model naming note. Running text uses canonical model names on first introduction. Tables, filenames, and artifact IDs retain compact report labels for consistency with the packaged benchmark outputs. Table A maps the two conventions and is repeated in Section 4.3 where candidate-generation settings are defined.
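
The Holm correction mentioned above adjusts a family of p-values for multiple comparisons. The following is a minimal sketch of the standard step-down procedure. The function name `holm_adjust` is hypothetical and the input values in the usage note are illustrative, not figures from the paper.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Visit hypotheses from the smallest raw p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

For example, `holm_adjust([0.01, 0.04, 0.03])` multiplies the smallest value by 3, the next by 2, and the largest by 1, then carries the running maximum forward, so the last two hypotheses share the same adjusted value.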

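2604.17104 2026-05-14 cs.DC cs.AI cs.LG updated version

TStore: Rethinking AI Model Hub with Tensor-Centric Compression

Tingfeng Lan, Zirui Wang, Yunjia Zheng, Zhaoyuan Su, Juncheng Yang, Yue Cheng

Affiliations: University of Virginia; Harvard University

AI summary: As modern AI models grow rapidly in size and redundancy, model hubs face significant storage and distribution challenges. This paper proposes TStore, a tensor-centric system that reduces storage overhead through fine-grained deduplication and compression. TStore uses tensor-level fingerprinting and clustering to identify redundancy across models without annotations. Experiments show substantial storage savings on real-world model repositories while preserving model usability and performance.

Comments 12 pages, 6 figures. Systems paper on AI model storage

Details
English abstract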
Modern AI models are growing rapidly in size and redundancy, leading to significant storage and distribution challenges in model hubs. We present TStore, a tensor-centric system for reducing storage overhead through fine-grained deduplication and compression. TStore leverages tensor-level fingerprinting and clustering to identify redundancy across models without requiring annotations. Our design enables efficient storage reduction while preserving model usability and performance. Experiments on real-world model repositories demonstrate substantial storage savings with minimal overhead.
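
One way to picture tensor-level fingerprinting is exact-match deduplication by content hash. The sketch below assumes tensors are flat lists of floats and uses SHA-256 as the fingerprint. These choices, and the names `fingerprint` and `dedup_store`, are illustrative assumptions rather than TStore's actual design, and the clustering step for near-duplicate tensors is not shown.

```python
import hashlib
import struct


def fingerprint(tensor):
    """Hash a flat list of floats (a stand-in for a tensor's raw bytes)."""
    raw = struct.pack(f"{len(tensor)}d", *tensor)
    return hashlib.sha256(raw).hexdigest()


def dedup_store(models):
    """Store each distinct tensor once; models keep only fingerprint references."""
    blobs = {}      # fingerprint -> tensor payload, stored once
    manifests = {}  # model name -> {tensor name: fingerprint}
    for model_name, tensors in models.items():
        manifests[model_name] = {}
        for tensor_name, values in tensors.items():
            fp = fingerprint(values)
            blobs.setdefault(fp, values)
            manifests[model_name][tensor_name] = fp
    return blobs, manifests
```

With a base model and a fine-tune that share an unchanged weight tensor, the shared tensor is stored once and both manifests point to the same fingerprint.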

2604.15333 2026-05-14 cs.HC cs.AI updated version

Technically Love: The Evolution of Human-AI Romance Discourse on Reddit

Tyler Chang, Jina Huh-Yoo, Afsaneh Razi

Affiliations: Drexel University; Stevens Institute of Technology

AI summary: This paper studies how public discussion of human-AI romantic relationships has evolved on Reddit, analyzing 3,383 user-initiated posts from 2017 to 2025. Using topic modeling and temporal statistical analysis, it finds that the focus of discussion shifted from intimate relationships in the early period toward platform governance, technical issues, and real-world consequences. This reveals a shift in human-AI romance from private experience to technical mediation and regulation, with implications for the design and governance of companion AI systems.

Comments Accepted at ACM CUI 2026

Details
English abstract

Human-AI romantic relationships are increasingly common, yet little is understood about how public discourse around them emerges and shifts over time. Prior research has examined user experiences and ethical concerns, but lacks longitudinal analyses of user-initiated public discussions. We address this gap by analyzing a high-precision dataset of 3,383 self-disclosed romantic companion AI posts from Reddit (2017-2025), using topic modeling and temporal statistical analysis to identify dominant themes and their evolution over time. We find significant topic drift, with discussions moving away from positive intimate relationships toward platform governance, technical issues, and real-world consequences. These shifts highlight a transition in how human-AI romance is framed, moving from private experiences to technical mediation and regulation, with implications for the design and governance of companion AI systems.

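2604.09025 2026-05-14 cs.CV cs.AI updated version

Skill-Conditioned Visual Geolocation for Vision-Language Models

Chenjie Yang, Yutian Jiang, Yutong Deng, Chenyu Wu

Affiliations: Southwest Jiaotong University; The Hong Kong University of Science and Technology (Guangzhou); Zhejiang University

AI summary: To address the lack of structured geographic reasoning and autonomous self-evolution in vision-language models for geolocation, this study proposes GeoSkill, a training-free framework. The method is built on an evolving Skill-Graph: it distills human expert trajectories into natural-language skills and uses an inference model to perform guided reasoning. An autonomous evolution mechanism continually generates and refines skills from web-scale data, improving geolocation accuracy and reasoning faithfulness and markedly strengthening the model's understanding of, and generalization over, real-world geographic knowledge.

Details
English abstract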
Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system's cognition of real-world geographic knowledge beyond isolated case studies.

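2604.08039 2026-05-14 cs.CV cs.AI cs.LG updated version

LINE: LLM-based Iterative Neuron Explanations for Vision Models

Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula, Paweł Gelar, Przemysław Biecek

Affiliations: Centre for Credible AI; Warsaw University of Technology; University of Warsaw, Poland

AI summary: This paper proposes LINE, an LLM-based iterative neuron explanation method for open-vocabulary concept labeling of neurons in vision models. In a black-box setting, LINE uses a language model and an image generator to iteratively generate and refine concept descriptions without any model training. It discovers concepts missed by conventional predefined vocabularies and outperforms existing methods on multiple datasets. Beyond identifying each neuron's top concept, it provides a complete generation history, supporting polysemanticity evaluation and the generation of visual explanations.

Details
English abstract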
Interpreting individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.11 on ImageNet and 0.05 on Places365, while discovering, on average, 27% of new concepts missed by predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, enabling polysemanticity evaluation and producing visual explanations that rival gradient-dependent activation maximization methods. The source code will be made available soon.

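2604.04692 2026-05-14 cs.CL cs.AI cs.CV updated version

Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity

Jaeyoon Jung, Yejun Yoon, Kunwoo Park

Affiliations: School of AI Convergence, Soongsil University; MAUM AI Inc.; Department of Intelligent Semiconductors, Soongsil University

AI summary: This paper examines whether visual evidence should be used universally in multimodal fact-checking, challenging the assumption in existing work that visual evidence always improves performance. The authors propose the AMuFC framework, in which two collaborating vision-language models respectively decide whether visual evidence is needed and verify the claim based on the evidence, enabling adaptive use of visual evidence. Experiments show that the method substantially improves fact-checking accuracy on three datasets.

Comments preprint, 18 pages

Details
English abstract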
Automated fact-checking is a crucial task that supports a responsible information ecosystem. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that the indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact-checking framework that employs two collaborative vision-language models with distinct roles for the adaptive use of visual evidence: an Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer's assessment. Experimental results on three datasets show that incorporating the Analyzer's assessment of visual evidence necessity into the Verifier's prediction yields substantial improvements in verification performance. We will release all code and datasets at https://github.com/ssu-humane/AMuFC.

2604.03551 2026-05-14 cs.SE cs.AI cs.HC updated version

AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub

Daniel Ogenrwot, John Businge

Affiliations: University of Nevada Las Vegas

AI summary: This paper introduces AgenticFlict, a large-scale dataset for studying merge conflicts in pull requests (PRs) submitted by AI coding agents on GitHub. The dataset contains more than 142,000 AI-generated PRs. Of the more than 107,000 PRs successfully processed by merge simulation, more than 29,000 exhibit merge conflicts, a conflict rate of 27.67%. The study shows that AI-generated code can cause frequent and complex merge conflicts in collaborative development, underscoring the importance of understanding and resolving integration challenges in AI-assisted software development.

Comments Accepted at the 3rd ACM International Conference on AI-Powered Software (AIware 2026)

Details
English abstract

Software Engineering 3.0 marks a paradigm shift in software development, in which AI coding agents are no longer just assistive tools but active contributors. While prior empirical studies have examined productivity gains and acceptance patterns in AI-assisted development, the challenges associated with integrating agent-generated contributions remain less understood. In particular, merge conflicts, a fundamental aspect of collaborative software development, remain underexplored in this context. In this paper, we present AgenticFlict, a large-scale dataset of textual merge conflicts in AI coding agent pull requests (Agentic PRs). The dataset comprises 142K+ Agentic PRs collected from 59K+ repositories, of which 107K+ are successfully processed through deterministic merge simulation. Our pipeline identifies 29K+ PRs exhibiting merge conflicts, yielding a conflict rate of 27.67%, and extracts 336K+ fine-grained conflict regions across these instances. Our preliminary exploratory analysis indicates that merge conflicts are both frequent and often substantial in AI-generated contributions, with noticeable variation across agents, emphasizing the need to better understand and manage integration challenges in AI-assisted software development. The dataset, code, and supplementary materials are available on Zenodo: https://doi.org/10.5281/zenodo.19396916.

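2604.02022 2026-05-14 cs.AI updated version

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu

Affiliations: Shanghai AI Lab; Fudan University; Shanghai Jiao Tong University; Tsinghua University; KAUST; East China Normal University

AI summary: ATBench is a diverse and realistic trajectory benchmark for evaluating and diagnosing the safety of LLM-based agents. It organizes risk systematically along three dimensions (risk source, failure mode, and real-world harm) and uses heterogeneous tool pools and a long-context delayed-trigger mechanism to build trajectories in which realistic risk emerges over multiple stages. ATBench contains 1,000 trajectories covering a wide range of interaction scenarios and tool calls. The data is filtered by rules and LLMs and fully audited by humans. The benchmark can assess the safety of frontier models in long-horizon interactions and supports stratified analysis and cross-benchmark comparison.

Details
English abstract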
Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.

2604.01690 2026-05-14 cs.AI updated version

Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology

Tianhao Shi, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Chenyi Lei, Han Li, Wenwu Ou, Tian Yang, Yang Song, Yongdong Zhang, Fuli Feng

Affiliations: University of Science and Technology of China; National University of Singapore; Kuaishou Technology; The Chinese University of Hong Kong

AI summary: This study examines the impact of AI-generated content (AIGC) on online content ecology. By analyzing large-scale user data from a leading Chinese video platform, it reveals marked differences between AIGC and human-generated content (HGC) in creation and consumption behavior. Although users prefer HGC, AIGC creators achieve aggregate engagement comparable to HGC creators through a high-volume strategy, with the algorithmic recommendation mechanism playing a moderating role. The study recommends AIGC-sensitive recommendation algorithms and precise governance frameworks to safeguard the long-term health of the content ecology on online platforms.

Comments update authors in v2

Details
English abstract

The rapid proliferation of Artificial Intelligence-Generated Content (AIGC) is fundamentally restructuring online content ecologies, necessitating a rigorous examination of its behavioral and distributional implications. Leveraging a comprehensive longitudinal dataset comprising tens of millions of users from a leading Chinese video-sharing platform, this study elucidates the distinct creation and consumption behaviors characterizing AIGC versus Human-Generated Content (HGC). We identify a prevalent scale-over-preference dynamic, wherein AIGC creators achieve aggregate engagement comparable to HGC creators through high-volume production, despite a marked consumer preference for HGC. Deeper analysis uncovers the ability of the algorithmic content distribution mechanism to moderate these competing interests regarding AIGC. These findings advocate for the implementation of AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of online content platforms.

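2604.00001 2026-05-14 cs.LG cs.AI cs.CL updated version

Filter-then-Weight: Online Data Selection and Reweighting for LLM Fine-Tuning

Fangxin Wang, Peyman Baghershahi, Langzhou He, Henry Peng Zou, Sourav Medya, Philip S. Yu

Affiliations: Department of Computer Science

AI summary: This paper studies data selection and reweighting in online fine-tuning of large language models and proposes an optimizer-aware online data selection framework. The core idea is to treat data selection as shaping the next update direction under the current optimizer state. It designs a two-stage Filter-then-Weight algorithm that first filters geometrically useful samples and then optimizes their weight coefficients. By introducing a factorized gradient representation and optimized matrix computations, the method improves convergence and downstream task performance in online fine-tuning.

Comments 24 pages, 2 figures, 9 tables

Details
English abstract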
Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning, but existing methods are mostly designed for offline settings. They are therefore less suited to online fine-tuning, where data arrives sequentially, sample utility is step-dependent, and the effective update geometry is shaped by adaptive optimizers. We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the current optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data. Experiments show that our method consistently improves convergence and downstream performance over existing online data selection baselines under the same data budget.

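2603.22364 2026-05-14 cs.LG cs.AI cs.CV updated version

MCLR: Improving Conditional Modeling via Inter-Class Likelihood-Ratio Maximization and Unifying Classifier-Free Guidance with Alignment Objectives

Xiang Li, Yixuan Jia, Xiao Li, Jeffrey A. Fessler, Rongrong Wang, Qing Qu

Affiliations: University of Michigan; Michigan State University

AI summary: This paper proposes MCLR, a new training objective that improves the conditional generation ability of diffusion models by maximizing inter-class likelihood ratios. The method addresses the insufficient inter-class separation of standard denoising score matching (DSM) and introduces an alignment objective during training, so that the model achieves better conditional generation without inference-time guidance (CFG). Theoretical analysis shows that the CFG-guided score is exactly the optimal solution to a sample-adaptive weighted MCLR objective, revealing an intrinsic connection between CFG and alignment objectives.

Details
English abstract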
Diffusion models achieve strong performance in generative modeling, but their success often relies heavily on classifier-free guidance (CFG), an inference-time heuristic that modifies the sampling trajectory. In theory, diffusion models trained with standard denoising score matching (DSM) should recover the target data distribution, raising two fundamental questions: (i) why is inference-time guidance necessary in practice, and (ii) can its underlying effect be internalized into a principled training objective? In this work, we argue that a key limitation of standard DSM is insufficient inter-class separation. To address this issue, we propose MCLR, an alignment objective that explicitly maximizes inter-class likelihood-ratios during training. Fine-tuning diffusion models with MCLR induces CFG-like improvements under standard sampling, substantially improving guidance-free conditional generation and narrowing the gap to inference-time CFG. Beyond these empirical benefits, we show theoretically that the CFG-guided score is exactly the optimal solution to a sample-adaptive weighted MCLR objective. This result connects CFG to alignment-based objectives, providing a mechanistic interpretation of CFG as an implicit inference-time contrastive alignment procedure.
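
For readers unfamiliar with CFG, the standard guided score is a linear combination of the conditional and unconditional scores. The sketch below uses one common convention, in which a guidance weight `w = 1` recovers the conditional score and `w > 1` extrapolates away from the unconditional one. The function name and the list-of-floats representation are illustrative assumptions, and conventions for `w` vary across papers. It shows only the inference-time heuristic, not the MCLR objective.

```python
def cfg_score(score_cond, score_uncond, w):
    """Per-dimension guided score: uncond + w * (cond - uncond)."""
    return [u + w * (c - u) for c, u in zip(score_cond, score_uncond)]
```

For instance, with a conditional score of `[1.0, 2.0]`, an unconditional score of `[0.0, 0.0]`, and `w = 3.0`, the guided score is three times the conditional one in each dimension.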

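2603.20521 2026-05-14 cs.LG cs.AI math.OC stat.ML updated version

Delightful Distributed Policy Gradient

Ian Osband

Affiliations: Google DeepMind

AI summary: Distributed reinforcement learning that trains on data from stale, buggy, or mismatched actors is vulnerable to actions with high surprisal (negative log-probability), which degrade learning. The Delightful Policy Gradient (DG) proposed in this paper gates each update with the product of advantage and surprisal, suppressing high-surprisal failures while preserving high-surprisal successes, and thereby improves learning efficiency. Experiments show that DG has a significant sample-efficiency advantage over conventional methods in a range of complex settings, and that the advantage grows as task complexity increases.

Details
English abstract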
Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but negative learning from surprising data. High-surprisal failures can dominate finite-batch updates through large perpendicular components, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The Delightful Policy Gradient (DG) separates these cases by gating each update with delight, the product of advantage and surprisal, suppressing rare failures and preserving rare successes without behavior probabilities. In a tabular analysis, DG suppresses the perpendicular second moment of high-surprisal failures by a policy-overlap factor that vanishes as the learner improves. The advantage sign is essential for surprisal-based filtering: any learner-probability-only gate that suppresses rare failures also suppresses rare successes. On MNIST with simulated staleness, DG without off-policy correction outperforms importance-weighted PG with exact behavior probabilities. On a transformer sequence task with staleness, actor bugs, reward corruption, and rare discovery, DG often achieves nearly order-of-magnitude lower error. When all four frictions act simultaneously, its sample-efficiency advantage is order-of-magnitude and grows with task complexity.
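
The abstract defines delight as the product of advantage and surprisal but does not give the gate's functional form. The sketch below assumes a sigmoid gate with a hypothetical `temperature` parameter, purely to illustrate how the sign of the advantage separates rare failures from rare successes. The names `sigmoid` and `delight_gate` are illustrative and not taken from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def delight_gate(advantage, log_prob, temperature=1.0):
    """Weight in (0, 1) applied to one sample's policy-gradient term.

    log_prob is the learner's log-probability of the sampled action,
    so no behavior-policy probability is needed.
    """
    surprisal = -log_prob             # high when the learner finds the action unlikely
    delight = advantage * surprisal   # sign follows the advantage
    return sigmoid(delight / temperature)  # assumed gate shape, see lead-in
```

Under this assumed gate, a surprising action with negative advantage gets a weight near zero, a surprising action with positive advantage gets a weight near one, and an unsurprising action gets a weight near one half regardless of sign.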

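2603.15854 2026-05-14 cs.LG cs.AI cs.CL updated version

FlashSampling: Fast and Memory-Efficient Exact Sampling

Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang

Affiliations: LMU Munich; Princeton University

AI summary: This paper proposes FlashSampling, an efficient exact sampling method that addresses the extra memory traffic and compute overhead that sampling incurs in large-vocabulary decoding. The method fuses sampling directly into the matrix multiplication of the language model's output layer and avoids materializing the logits tensor in GPU memory, which significantly improves memory efficiency and speed. Experiments show kernel-level speedups on several datacenter GPUs, and in end-to-end vLLM runs the time per output token falls by up to 10%.

Comments Project Page: https://github.com/FlashSampling/FlashSampling

Details
English abstract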
Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. In tensor-parallel decoding, FlashSampling replaces the all-gather of logits with streaming peer-to-peer writes. This overlaps GPU-to-GPU communication with computation and HBM loads across up to 8 GPUs, with near-ideal scaling at large batch sizes. Our kernel is exact because argmax decomposes over partitions; grouped variants for online and tensor-parallel settings are exact by hierarchical factorization of the categorical distribution. FlashSampling demonstrates kernel-level speedups on decode workloads across 4 different datacenter GPUs (H100, H200, B200, B300), and in end-to-end vLLM experiments, it reduces time per output token by up to 10% on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, consolidating the bandwidth-bound sampling step in an efficient epilogue.
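
The claim that "argmax decomposes over partitions" can be checked with a small CPU-side sketch of the Gumbel-max trick. It is plain Python for a single row of logits, not the fused GPU kernel, and the function names and tile size are hypothetical. Given the same noise, the tiled version, which keeps one maximizer per tile and then reduces over tiles, picks the same token as a full pass over all perturbed logits.

```python
import math
import random


def gumbel(rng):
    """Draw standard Gumbel noise; the clamp guards against log(0)."""
    u = max(rng.random(), 1e-300)
    return -math.log(-math.log(u))


def sample_full(logits, rng):
    """Reference: perturb every logit, then take one global argmax."""
    perturbed = [logit + gumbel(rng) for logit in logits]
    return max(range(len(logits)), key=perturbed.__getitem__)


def sample_tiled(logits, tile_size, rng):
    """Keep one (value, index) maximizer per tile, then reduce over tiles."""
    best_value, best_index = -math.inf, -1
    for start in range(0, len(logits), tile_size):
        tile_value, tile_index = -math.inf, -1
        for offset, logit in enumerate(logits[start:start + tile_size]):
            perturbed = logit + gumbel(rng)
            if perturbed > tile_value:
                tile_value, tile_index = perturbed, start + offset
        # Small reduction: only one candidate per tile survives to this point.
        if tile_value > best_value:
            best_value, best_index = tile_value, tile_index
    return best_index
```

Because only one value and one index per tile reach the final reduction, the full vector of perturbed logits never has to be stored, which is the property the abstract attributes to the fused kernel.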

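2603.14066 2026-05-14 cs.MA cs.AI cs.LG updated version

A Benchmark for Multi-Party Negotiation Games from Real Negotiation Data

Leo Benac, Jonas Raedler, Zilin Ma, Finale Doshi-Velez

Affiliations: School of Engineering and Applied Sciences

AI summary: This paper proposes a multi-party negotiation game benchmark grounded in real negotiation data, for studying how binding commitments are reached step by step during negotiation. The benchmark combines a configurable negotiation game generator with document-grounded instances from a climate negotiation exercise and provides several baseline solvers. Experiments show that solver performance varies across negotiation scenarios, highlighting the close relationship between negotiation strategy and game structure and motivating research on new negotiation methods that handle diverse strategic settings robustly.

Details
English abstract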
Many real-world multi-party negotiations unfold as sequences of binding, action-level commitments rather than a single final outcome, yet this regime remains under-studied in existing benchmarks. We introduce a benchmark and evaluation framework for this setting, combining a configurable negotiation game generator with document-grounded instances derived from a climate negotiation exercise. We also provide several baseline solvers. Exact evaluation on small games and comparative evaluation on larger instances show that no solver dominates across regimes; performance depends on the structural properties of the game. These results motivate the creation of novel negotiation methods that value partial commitments robustly across diverse strategic regimes. Code and data for the benchmark are available at: https://anonymous.4open.science/r/negotiation_MARL-46B8

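2603.03295 2026-05-14 cs.CL cs.AI cs.CY updated version

Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task

Gaia Molinaro, Dave August, Danielle Perszyk, Anne G. E. Collins

Affiliations: University of California, Berkeley; Amazon AGI Lab

AI summary: This study investigates whether large language models (LLMs) select goals the way humans do in a self-directed learning task. Comparing five mainstream models with human participants, it finds substantial differences in goal selection: most models tend to rely on a single solution or show low learning flexibility, whereas humans show more exploration and greater individual diversity. Although chain-of-thought reasoning and persona steering slightly improve model behavior, current models still struggle to reflect the uniqueness of human goal selection, which calls for caution when substituting them for human decision-making in related applications.

Details
English abstract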
Whether in agentic workflows, social studies, or chat settings, large language models (LLMs) are increasingly being asked to replace humans in choosing which goals to pursue, rather than completing predefined tasks. However, the assumption that LLMs accurately reflect human preferences for goal setting remains largely untested. We assess the validity of LLMs as proxies for human goal selection in a controlled, self-directed learning task borrowed from cognitive science. Across five models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, Qwen3 32B, and Centaur), we find substantial divergence from human behavior. While people gradually explore and learn to achieve goals with diversity across individuals, most models exploit a single identified solution or show surprisingly low performance, with distinct patterns across models and little variability across instances of the same model. Chain-of-thought reasoning and persona steering provide limited improvements, and our conclusions hold across experimental settings. While they await confirmation in applied settings, these findings highlight the uniqueness of human goal selection and caution against its replacement with current models.

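2603.02337 2026-05-14 cs.LG cs.AI cs.CV updated version

Preconditioned Flow Matching

Shadab Ahamed, Eshed Gal, Md Shahriar Rahim Siddiqui, Simon Ghyselincks, Moshe Eliasof, Eldad Haber

Affiliations: University of British Columbia; University of Cambridge

AI summary: This paper studies a geometric optimization bottleneck in training flow matching: when the covariance matrix of the intermediate distribution is ill-conditioned, gradient descent converges at very different rates along different directions. The authors propose preconditioned flow matching, which transforms the target distribution into a more isotropic representation to improve the conditioning of the intermediate path, thereby improving training efficiency and generation quality. Experiments show significant gains on a range of distributions and on high-resolution image datasets.

Comments 34 pages, 16 figures, 5 tables

Details
English abstract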
Flow matching (FM) learns vector fields by regressing stochastic velocity targets along intermediate distributions $p_t$. We identify a geometric optimization bottleneck in this regression problem: when the covariance $\Sigma_t$ of $p_t$ is ill-conditioned, gradient-based training rapidly fits high-variance directions while making slow progress along low-variance ones. In an exactly solvable Gaussian setting, we prove that the excess risk is weighted by $\Sigma_t$, and that both gradient descent and stochastic gradient descent inherit condition-number-dependent convergence. We then extend the analysis to Gaussian mixtures, showing that multimodality does not average away this effect; instead, the slowest and worst-conditioned component can control optimization. Motivated by this analysis, we propose preconditioned flow matching, a precondition-then-match framework that transforms the target distribution into a more isotropic representation, trains the main flow in the transformed space, and maps generated samples back through the inverse transformation. We show theoretically that preconditioning reshapes the intermediate FM path and improves its conditioning. Across controlled Gaussian and Gaussian-mixture experiments, latent MNIST and other high-resolution image datasets up to $512{\times}512$ resolution, preconditioning improves path-conditioning diagnostics, low-eigenvalue recovery, FID, MMD, precision, and recall. Compute-matched baselines and preconditioner-quality ablations further show that the gains are not explained merely by additional preconditioner parameters, but by improved geometry of the downstream flow matching problem.
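The conditioning argument can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' code: it assumes a linear path $x_t = (1-t)x_0 + t x_1$, a standard normal source $x_0$ drawn independently of a zero-mean Gaussian target $x_1$ with diagonal covariance, and an idealized preconditioner that whitens the target exactly. Under those assumptions the covariance of $p_t$ is $(1-t)^2 I + t^2 \Sigma$, so its condition number can be read off the target eigenvalues. The function name and the eigenvalues are hypothetical.

```python
def path_condition_number(target_eigs, t):
    """Condition number of Cov(x_t) = (1 - t)^2 * I + t^2 * Sigma.

    Assumes a linear interpolation path, a standard normal source, an
    independent zero-mean Gaussian target, and diagonal Sigma given by
    its eigenvalues (all assumptions made for this illustration).
    """
    path_eigs = [(1.0 - t) ** 2 + (t ** 2) * lam for lam in target_eigs]
    return max(path_eigs) / min(path_eigs)


# Hypothetical ill-conditioned target: variances spanning four orders of magnitude.
raw_eigs = [100.0, 1.0, 0.01]
# Idealized preconditioner: exact whitening maps every eigenvalue to 1.
whitened_eigs = [1.0, 1.0, 1.0]

raw_kappa = path_condition_number(raw_eigs, t=0.9)
whitened_kappa = path_condition_number(whitened_eigs, t=0.9)
```

Near the data end of the path the raw condition number approaches that of the target itself, while the whitened path stays perfectly conditioned at every $t$. A learned preconditioner would only approximate this ideal.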

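2603.02175 2026-05-14 cs.CV cs.AI Version update

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou

Affiliations * Show Lab, National University of Singapore

AI summary This paper presents Kiwi-Edit, a general-purpose video editing method that achieves more precise visual control by jointly conditioning on instructions and reference images. To address the data scarcity that limits existing methods, the authors design a scalable data generation pipeline and build the large-scale RefVIE dataset and the RefVIE-Bench evaluation benchmark. Trained on this data, the unified Kiwi-Edit architecture fuses learnable queries with latent visual features to guide edits with reference semantics, yielding significant gains in instruction following and reference fidelity and reaching the state of the art in controllable video editing.

Comments Project page: https://showlab.github.io/Kiwi-Edit Huggingface Demo: https://huggingface.co/spaces/linyq/KiwiEdit

Details
Abstract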

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.

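2602.21204 2026-05-14 cs.LG cs.AI cs.CV Version update

Test-Time Training with KV Binding Is Secretly Linear Attention

Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li

Affiliations * NVIDIA, Toronto, Ontario, Canada; University of Toronto, Toronto, Ontario, Canada; Vector Institute, Toronto, Ontario, Canada; Technion -- Israel Institute of Technology, Haifa, Israel

AI summary This paper revisits the role of test-time training (TTT) with key-value binding in sequence modeling and argues that it is not simply test-time memorization but a learned linear attention mechanism. The study surfaces several previously hard-to-explain phenomena in TTT models and shows that a variety of TTT architectures can be unified as linear attention operators. Beyond explaining model behavior, this view brings practical benefits, including architectural simplification, parallel computation, and improved efficiency, and gives TTT a more systematic and efficient theoretical footing.

Comments ICML 2026, Webpage: https://research.nvidia.com/labs/sil/projects/tttla/

Details
Abstract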

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity. Project page: https://research.nvidia.com/labs/sil/projects/tttla/.
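A textbook special case shows why a TTT layer can coincide with linear attention. The sketch below is not the paper's derivation and covers far less than the "broad class" the abstract refers to. It assumes a linear fast-weight matrix `W` initialized to zero, an inner-product binding loss $-v^\top W k$, and one gradient step per token with learning rate `lr`; all function names are hypothetical. Under these assumptions the TTT readout `W q` equals unnormalized causal linear attention.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ttt_outputs(keys, values, queries, lr=1.0):
    """One gradient step per token on the loss -v^T W k, then read out W q."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # fast weights, assumed to start at zero
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # d/dW of (-v^T W k) is -outer(v, k), so a descent step adds lr * outer(v, k).
        W = [[w + lr * vi * kj for w, kj in zip(row, k)] for row, vi in zip(W, v)]
        outputs.append([dot(row, q) for row in W])
    return outputs


def linear_attention_outputs(keys, values, queries, lr=1.0):
    """Unnormalized causal linear attention: o_t = lr * sum_{i<=t} (k_i . q_t) v_i."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for k, v in zip(keys[: t + 1], values[: t + 1]):
            score = lr * dot(k, q)
            out = [o + score * vi for o, vi in zip(out, v)]
        outputs.append(out)
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
queries = [[1.0, 1.0], [2.0, 0.0], [0.0, -1.0]]
ttt = ttt_outputs(keys, values, queries, lr=0.5)
attn = linear_attention_outputs(keys, values, queries, lr=0.5)
```

With a squared-error inner loss the update becomes a delta rule instead, so the exact correspondence shown here depends on the loss chosen for this illustration.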

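2602.13215 2026-05-14 cs.AI Version update

When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models

Haoran Zheng, Chen Shani

Affiliations * The University of Chicago; Stanford University

AI summary This paper proposes AMOR, an adaptive hybrid architecture that invokes attention selectively and dynamically based on predictive uncertainty, improving computational efficiency while preserving model performance. Its entropy gate activates the attention modules only when the output entropy of the recurrent backbone exceeds a dynamic threshold, avoiding unnecessary computation. Experiments show that AMOR performs well across several large models, using attention at only about 22% of input positions, and is more robust on long-context and common-sense reasoning tasks.

Details
Abstract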

Recurrent-attention hybrids aim to combine the efficiency of recurrence with the expressivity of attention, but existing approaches typically apply attention uniformly across all positions, even when the recurrent state alone is sufficient for accurate prediction. We introduce AMOR (Adaptive Metacognitive Output Router), a post-hoc hybrid architecture that selectively invokes attention based on predictive uncertainty. A recurrent backbone is augmented with entropy-gated attention blocks that activate only when the model's output entropy exceeds a dynamic threshold derived from a running batch median and scaled standard deviation. This yields a simple, gradient-free routing mechanism inspired by uncertainty-driven computation and the System 1 / System 2 distinction. Across Mamba2 and Gated DeltaNet backbones (180M-1.5B), AMOR consistently matches or outperforms both pure recurrent models and fixed-schedule hybrid baselines while invoking attention on only ~22% of tokens. It achieves strong performance on common-sense reasoning benchmarks and maintains stable long-context performance on LongBench, where prior hybrid models degrade under distribution shift. These results suggest that when attention is applied matters as much as how much: selectively allocating attention based on predictive uncertainty improves both efficiency and robustness, offering a simple alternative to uniform or fixed routing strategies and pointing toward adaptive hybrid architectures that dynamically match computation to input difficulty.
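The gating rule described in the abstract can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the scale factor `k`, and the use of a single batch's median and standard deviation in place of a running estimate are all assumptions.

```python
import math
import statistics


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_mask(batch_probs, k=1.0):
    """Return one boolean per position; True means 'invoke attention here'.

    The threshold is median + k * std of the entropies in the batch.
    k is a hypothetical scale factor, not a value taken from the paper.
    """
    entropies = [entropy(p) for p in batch_probs]
    threshold = statistics.median(entropies) + k * statistics.pstdev(entropies)
    return [e > threshold for e in entropies]


# Hypothetical next-token distributions from a recurrent backbone.
batch = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.90, 0.05, 0.03, 0.02],  # confident prediction
    [0.85, 0.05, 0.05, 0.05],  # fairly confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain prediction
]
mask = attention_mask(batch, k=1.0)
```

Because the rule only compares entropies against batch statistics, it needs no gradient signal, which matches the abstract's description of the routing as gradient-free.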

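2602.12783 2026-05-14 cs.IR cs.AI Version update

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

Yuejie Li, Ke Yang, Yueying Hua, Berlin Chen, Jianhao Nie, Yueping He, Caixin Kang

Affiliations * Huazhong University of Science and Technology; The University of Hong Kong; Soochow University; University of Science and Technology of China; Wuhan University; Tsinghua University; The University of Tokyo

AI summary SQuTR is a benchmark for evaluating the robustness of spoken-query-to-text retrieval systems under complex acoustic noise. The work aggregates a large set of multilingual, multi-domain text queries and synthesizes speech mixed with many kinds of real-world environmental noise, producing a large-scale, reproducible evaluation dataset. Experiments show that retrieval performance drops markedly as noise increases and that systems differ considerably, underscoring both the importance of robustness in spoken retrieval and the shortcomings of current models. SQuTR provides a standardized test framework and diagnostic analysis tools for research in this area.

Comments Accepted by SIGIR 2026

Journal ref Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), July 20--24, 2026, Melbourne, VIC, Australia

Details
Abstract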

Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex acoustic perturbations. To address this limitation, we present SQuTR, a robustness benchmark for spoken query retrieval that includes a large-scale dataset and a unified evaluation protocol. SQuTR aggregates 37,317 unique queries from six commonly used English and Chinese text retrieval datasets, spanning multiple domains and diverse query types. We synthesize speech using voice profiles from 200 real speakers and mix 17 categories of real-world environmental noise under controlled SNR levels, enabling reproducible robustness evaluation from quiet to highly noisy conditions. Under the unified protocol, we conduct large-scale evaluations on representative cascaded and end-to-end retrieval systems. Experimental results show that retrieval performance decreases as noise increases, with substantially different drops across systems. Even large-scale retrieval models struggle under extreme noise, indicating that robustness remains a critical bottleneck. Overall, SQuTR provides a reproducible testbed for benchmarking and diagnostic analysis, and facilitates future research on robustness in spoken query to text retrieval.
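Mixing noise at a controlled SNR usually means rescaling the noise so that the speech-to-noise power ratio hits a target in decibels. The sketch below shows that standard calculation; it is not taken from the SQuTR code, and the function names and the synthetic signals are assumptions made for illustration.

```python
import math


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_scaled_noise) equals snr_db, then add it."""
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


# Synthetic stand-ins: a sine tone for speech, a deterministic pattern for recorded noise.
speech = [math.sin(2 * math.pi * 5 * i / 100) for i in range(100)]
noise = [((i * 37) % 11 - 5) / 5.0 for i in range(100)]

mixed, gain = mix_at_snr(speech, noise, snr_db=10.0)
achieved_snr_db = 10.0 * math.log10(power(speech) / power([gain * n for n in noise]))
```

Sweeping `snr_db` from high to low values reproduces the quiet-to-noisy range the abstract describes; real pipelines also need to match sample rates and trim or loop the noise to the speech length.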

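2602.11202 2026-05-14 cs.LO cs.AI Version update

interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification

Vishak K Bhat, Prateek Chanda, Vijval Ekbote, Ashmit Khandelwal, Maitreyi Swaroop, Vineeth N. Balasubramanian, Subbarao Kambhampati, Nagarajan Natarajan, Amit Sharma

Affiliations * Microsoft Research; IIT Bombay; Carnegie Mellon University; Arizona State University

AI summary This paper proposes interwhen, a general framework that steers the behavior of reasoning models through verification during inference. By monitoring the reasoning trace and running verifiers asynchronously, it detects and corrects errors promptly without adding significant computational overhead. interwhen also introduces a way to generate verifiers automatically from natural-language policy documents, which addresses the scarcity of verifiers and significantly improves task completion and policy compliance.

Comments 56 pages, 6 figures

Details
Abstract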

Reasoning models produce long traces of intermediate decisions and tool calls, making test-time verification important for ensuring correctness. Existing approaches either verify only the final answer, which misses early errors, or rely on branch-and-verify strategies that explore multiple trajectories. We introduce interwhen, a single-trajectory verification framework that steers model behavior by providing feedback on intermediate reasoning traces. It addresses two key challenges. First, given a set of verifiers, obtaining verifiable states from the reasoning trace typically requires prompt engineering or external task decomposition into fixed steps. Instead, we propose a monitoring system that periodically polls the reasoning trace and forks inference of the reasoning model to recover intermediate states. Verifiers are run asynchronously alongside generation, adding negligible overhead on correct executions and intervening only when violations occur. Second, beyond math and code, a central challenge for process verification is the scarcity of verifiers. interwhen addresses this through automatic verifier synthesis from natural-language policy documents. Given a policy, it can generate code-based verifiers, including provably correct verifiers in Lean and z3. Together, these contributions yield a plug-and-play test-time verification system that can improve task completion and policy compliance of any reasoning agent. On reasoning benchmarks where policies encode mathematical or logical constraints, interwhen achieves near-perfect accuracy for reasoning models using a fraction of the tokens of baselines. On agentic benchmarks with policy-based verifier generation, it enables improvements in task quality for SLMs without any finetuning, e.g., task completion rate of Qwen3-30B jumps from 32% to 87% on the telecom domain in tau2-bench. Code at https://github.com/microsoft/interwhen.

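2602.05242 2026-05-14 cs.SE cs.AI cs.LG Version update

EGSS: Entropy-guided Stepwise Scaling for Reliable Software Engineering

Chenhui Mao, Yuanting Lei, Zhixiang Wei, Ming Liang, Zhixiang Wang, Jingxuan Xu, Dajun Chen, Wei Jiang, Yong Li

Affiliations * Ant Group

AI summary This paper proposes EGSS, an entropy-guided stepwise scaling framework that targets the high computational overhead and limited gains of agentic test-time scaling (TTS) on software engineering tasks. Through entropy-guided adaptive search and robust test-suite augmentation, EGSS dynamically balances efficiency and effectiveness, lowering inference-time compute while improving performance on tasks such as code generation and bug fixing. Experiments show gains of 5-10% across several models and a reduction of more than 28% in token usage relative to existing methods.

Details
Abstract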

Agentic Test-Time Scaling (TTS) has delivered state-of-the-art (SOTA) performance on complex software engineering tasks such as code generation and bug fixing. However, its practical adoption remains limited due to significant computational overhead, primarily driven by two key challenges: (1) the high cost associated with deploying excessively large ensembles, and (2) the lack of a reliable mechanism for selecting the optimal candidate solution, ultimately constraining the performance gains that can be realized. To address these challenges, we propose Entropy-Guided Stepwise Scaling (EGSS), a novel TTS framework that dynamically balances efficiency and effectiveness through entropy-guided adaptive search and robust test-suite augmentation. Extensive experiments on SWE-Bench-Verified demonstrate that EGSS consistently boosts performance by 5-10% across all evaluated models. Specifically, it increases the resolved ratio of Kimi-K2-Instruct from 63.2% to 72.2%, and GLM-4.6 from 65.8% to 74.6%. Furthermore, when paired with GLM-4.6, EGSS achieves a new state-of-the-art among open-source large language models. In addition to these accuracy improvements, EGSS reduces inference-time token usage by over 28% compared to existing TTS methods, achieving simultaneous gains in both effectiveness and computational efficiency.

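2602.05000 2026-05-14 cs.LG cs.AI cs.CL Version update

Entropy Aware Reward Guidance for Diffusion Language Model Alignment

Atula Tejaswi, Litu Rout, Constantine Caramanis, Sanjay Shakkottai, Sujay Sanghavi

Affiliations * The University of Texas at Austin

AI summary This paper studies alignment of discrete diffusion language models with reward guidance. Because discrete outputs cannot be differentiated through directly, it proposes a new mechanism, EntRGi, that dynamically combines continuously relaxed tokens with sampled hard tokens, interpolating between them according to the model's predictive entropy, so that optimization stays accurate while the reward model remains reliable. Experiments show that the method outperforms existing approaches in both test-time adaptation and reward-guided reinforcement learning.

Comments Preprint

Details
Abstract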

Reward guidance, also known as posterior sampling, is a popular method for test-time adaptation and post-training in continuous diffusion models. In this paper, we study reward guidance for discrete diffusion language models; now, one cannot differentiate through the natural outputs of the model because they are discrete tokens. We introduce a novel mechanism called EntRGi (Entropy aware Reward Guidance) to address this issue. EntRGi dynamically interpolates between continuous token relaxations and sampled hard tokens, on a token-by-token basis, using the diffusion model's predictive entropy. We demonstrate that EntRGi maintains both reward model reliability and optimization accuracy, while existing approaches sacrifice one for the other. We empirically validate our approach on 7B-parameter diffusion language models across two settings: (1) test-time adaptation, and (2) RGRL (Reward Guided Reinforcement Learning), our recipe for post-training on reward-guided data, showing consistent improvements over state-of-the-art methods. Our code is available at https://atutej.github.io/entrgi-rgrl

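2602.04264 2026-05-14 cs.LG cs.AI cs.NA math.NA Version update

Exponential Approximation Rates and Parameter Efficiency of Learnable Bernstein Activations

Ibrahim Albool, Malak Gamal El-Din, Salma Elmalaki, Yasser Shoukry

Affiliations * Department of Electrical Engineering and Computer Science, University of California, Irvine

AI summary This paper studies the representational capacity and parameter efficiency of learnable Bernstein activations in deep neural networks. The authors prove that the approximation error of DeepBern-Nets, which use these activations, decays at an exponential rate, far faster than that of conventional ReLU architectures. Experiments on several scientific datasets show substantial parameter reductions and faster convergence for DBNs, supporting the structural advantage.

Comments 20 pages

Details
Abstract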

The choice of activation function fundamentally shapes the representational capacity and parameter efficiency of deep neural networks, yet most widely used activations lack rigorous theoretical guarantees on these properties. We provide a theoretical analysis of DeepBern-Nets (DBNs) -- networks employing learnable Bernstein polynomial activations -- showing that their approximation error decays with the network depth $L$ and the polynomial order $n$ with a rate of $\mathcal{O}(n^{-L})$, exponentially faster than the polynomial rate of ReLU architectures while remaining fully differentiable. We validate these predictions through $1{,}344$ experiments on large scientific datasets (HIGGS and SUSY), comparing DBNs against ReLU, Leaky ReLU, SELU, and GeLU. DBNs achieve over $70\%$ parameter reduction across the majority of architectures -- reaching $99.9\%$ at scale -- converge to ReLU's final loss in as few as $26\%$ of the training epochs, and attain up to $45\%$ lower final loss. These advantages hold over all tested activations, confirming that DBNs' gains stem from the learnable polynomial structure rather than mere smoothness.
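For readers unfamiliar with Bernstein polynomial activations, the following sketch evaluates one such activation. It shows only the standard Bernstein basis with learnable coefficients; the function name, the input interval `[lo, hi]`, and the coefficient values are assumptions, and the sketch does not reproduce how DeepBern-Nets choose or train these quantities.

```python
from math import comb


def bernstein_activation(x, coeffs, lo=0.0, hi=1.0):
    """Evaluate sum_k c_k * C(n, k) * t^k * (1 - t)^(n - k) with t = (x - lo) / (hi - lo).

    coeffs holds the n + 1 learnable coefficients c_0..c_n of a degree-n polynomial.
    The interval [lo, hi] is an assumed input range for this illustration.
    """
    n = len(coeffs) - 1
    t = (x - lo) / (hi - lo)
    return sum(c * comb(n, k) * t**k * (1.0 - t) ** (n - k) for k, c in enumerate(coeffs))


# Coefficients c_k = k / n reproduce the identity on [0, 1] (linear precision).
identity_like = bernstein_activation(0.3, [0.0, 0.25, 0.5, 0.75, 1.0])
# Constant coefficients give a constant output, since the basis sums to one.
constant_like = bernstein_activation(0.7, [1.0, 1.0, 1.0, 1.0])
```

Because the basis functions are non-negative and sum to one, the output always lies between the smallest and largest coefficient, which makes the coefficients easy to interpret as control points.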

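2602.02977 2026-05-14 cs.CV cs.AI cs.LG Version update

Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding

Byeongju Woo, Zilin Wang, Byeonghyun Pak, Sangwoo Mo, Stella X. Yu

Affiliations * Agency for Defense Development; University of Michigan; POSTECH

AI summary Targeting the difficulty vision-language models have with long, detail-rich image captions, this work proposes a hierarchical learning approach built on part-to-whole structure. Its CAFT model aligns local text with image regions at intermediate representations and aligns the whole image with the full text at the final representation, capturing fine-grained visual information more accurately. The model achieves state-of-the-art performance on several long-text retrieval tasks and localizes textual semantics in image regions without explicit region annotations.

Comments Preprint

Details
Abstract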

Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions, relying on dominant scene cues while overlooking fine-grained visual evidence. We propose a hierarchical vision-language learning principle for understanding scenes as part-to-whole compositions: before forming a whole-scene representation, a model should uncover what semantic parts appear where in the image. To this end, we propose CAFT (Cross-domain Alignment of Forests and Trees), a vision-language model that jointly learns local text-region alignment at intermediate representations and global image-text alignment at the final representation. Exploiting the organization of long captions, where local descriptions often correspond to scene parts, CAFT employs a fine-to-coarse image encoder and a part-whole text encoder to discover localized part semantics and progressively compose them into a global image-text representation. Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that CAFT learns fine-grained representations that localize textual semantics in image regions without explicit region-level supervision.

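2602.02350 2026-05-14 cs.AI cs.LG cs.MA Version update

Context Learning for Multi-Agent Discussion

Xingyuan Hua, Sheng Yue, Xinyi Li, Yizhe Zhao, Jinrui Zhang, Ju Ren

Affiliations * Department of Computer Science and Technology, Tsinghua University; School of Cyber Science and Technology, Sun Yat-sen University Shenzhen Campus; College of Computer Science, Northwest University; State Key Laboratory of Internet Architecture, Tsinghua University

AI summary In multi-agent discussion (MAD), multiple large language models solve problems collaboratively through structured discussion, but existing methods often produce uncoordinated discussions that fail to reach consensus because the agents' individual contexts are inconsistent. This paper proposes a multi-LLM context learning method (M2CL) that learns a context generator for each agent and dynamically generates context instructions for every discussion round, improving the consistency and accuracy of the discussion. Experiments show that M2CL significantly outperforms existing methods on several complex tasks, by 20% to 50%, while transferring well and remaining computationally efficient.

Details
Abstract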

Multi-Agent Discussion (MAD), in which multiple LLM instances collaboratively solve problems via structured discussion, has recently attracted increasing attention. However, we find that current MAD methods easily suffer from discussion inconsistency: LLMs fail to reach a coherent solution because of misalignment between their individual contexts. In this paper, we introduce a multi-LLM context learning method (M2CL) that learns a context generator for each agent, capable of dynamically generating context instructions per discussion round via automatic information organization and refinement. Specifically, inspired by our theoretical insights on the context instruction, M2CL trains the generators to control context coherence and output discrepancies via a carefully crafted self-adaptive mechanism. It enables LLMs to avoid premature convergence on majority noise and progressively reach the correct consensus. We evaluate M2CL on challenging tasks, including academic reasoning, embodied tasks, and mobile control. The results show that M2CL significantly surpasses existing methods by 20%--50%, while enjoying favorable transferability and computational efficiency.

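2602.02001 2026-05-14 cs.LG cs.AI Version update

Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs

Yoonjun Cho, Dongjae Jeon, Soeun Kim, Moongyu Jeon, Albert No

Affiliations * Department of Computer Science, Yonsei University; Department of Artificial Intelligence, Yonsei University

AI summary This paper studies how to reduce accuracy loss in post-training quantization (PTQ) of large language models. Its Preserve-Then-Quantize approach preserves the dominant singular subspace of the weight matrix before quantization, quantizes only the residual, and spends the remaining rank on error reconstruction. The resulting Structured Residual Reconstruction (SRR) framework uses a theory-guided criterion to balance quantization-exposed energy against unrecoverable error, improves post-quantization model quality, and supports efficient quantized parameter-efficient fine-tuning; experiments show significant gains across multiple tasks and quantization settings.

Comments Accepted at ICML 2026. Project page: https://ai-isl.github.io/srr

Details
Abstract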

Quantization Error Reconstruction (QER) reduces accuracy loss in Post-Training Quantization (PTQ) by approximating weights as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$, using a rank-$r$ correction to reconstruct quantization error. Prior methods devote the full rank budget to error reconstruction, which is suboptimal when $\mathbf{W}$ has intrinsic low-rank structure and quantization corrupts dominant directions. We propose Structured Residual Reconstruction (SRR), a rank-allocation framework that preserves the top-$k$ singular subspace of the activation-scaled weight before quantization, quantizes only the residual, and uses the remaining rank $r-k$ for error reconstruction. We derive a theory-guided criterion for selecting $k$ by balancing quantization-exposed energy and unrecoverable error under rank constraints. We further show that the resulting $\mathbf{Q} + \mathbf{L}\mathbf{R}$ parameterization naturally supports Quantized Parameter-Efficient Fine-Tuning (QPEFT), and stabilizes fine-tuning via gradient scaling along preserved directions. Experiments demonstrate consistent perplexity reductions across diverse models and quantization settings in PTQ, along with a 5.9 percentage-point average gain on GLUE under 2-bit QPEFT. The project page is available at https://ai-isl.github.io/srr.

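2602.00616 2026-05-14 cs.AI Version update

SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation

Minhyuk Lee, Hyekyung Yoon, Myungjoo Kang

Affiliations * Seoul National University

AI summary This paper studies how to suppress inappropriate content in text-to-image generation without modifying the pretrained generator. Its method, SPOT, selectively projects input prompts toward a set of safe prompts at inference time and uses total variation theory to bound how much risk can change, lowering the risk of generated content while preserving generation quality on benign prompts. Experiments show that SPOT improves safety across several datasets and diffusion architectures while remaining faithful to the original prompts.

Details
Abstract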

Text-to-Image (T2I) diffusion models enable high-quality, open-ended synthesis, but practical use requires suppressing unsafe generations while preserving behavior on benign prompts. We study this tension relative to the frozen generator, using its prompt-conditioned distribution as the preservation reference. Since T2I safety is commonly evaluated by bounded risk scores on generated images, total variation (TV) bounds how much expected risk can change from this reference. We call this fixed-reference constraint the Safety-Prompt Alignment Tradeoff (SPAT): reducing expected unsafety requires prompt-conditioned distributional deviation. To make this deviation selective and adjustable, we define the tau-safe set as prompts whose reference risk is at most tau, and cast intervention as projection toward nearby prompts in this set. We propose Selective Prompt prOjecTion (SPOT), an inference-time framework that approximates this projection without retraining the generator or learning a category-specific rewriter. SPOT uses an LLM to rank candidate rewrites and a safeguard VLM to accept generated images under the same tau. Across four datasets and three diffusion backbones, SPOT achieves relative inappropriate (IP) score reductions from 14.2% to 44.4% over strong safety alignment baselines while keeping benign prompt behavior close to the fixed reference.
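The total variation argument rests on a standard inequality, written out here for reference. The notation is illustrative and not the paper's: $r$ is a risk score bounded in $[0,1]$, $P$ is the frozen generator's image distribution for a prompt, and $Q$ is the distribution after intervention.

```latex
% Standard bound for a score r : \mathcal{X} \to [0,1]; notation assumed for illustration.
\[
\bigl|\,\mathbb{E}_{x\sim P}[r(x)] - \mathbb{E}_{x\sim Q}[r(x)]\,\bigr|
  \;\le\; \mathrm{TV}(P,Q),
\qquad
\mathrm{TV}(P,Q) \;=\; \sup_{A}\,\bigl|P(A)-Q(A)\bigr| .
\]
% Consequence: lowering the expected risk by \Delta forces \mathrm{TV}(P,Q) \ge \Delta.
```

Read in one direction, the bound says small deviations cannot change risk much; read in the other, any meaningful risk reduction requires a deviation of at least that size, which is the tradeoff the abstract names SPAT.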

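2601.23143 2026-05-14 cs.AI Version update

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Seanie Lee, Sangwoo Park, Yumin Choi, Gyeongman Kim, Minki Kang, Jihun Yun, Dongmin Park, Jongho Park, Sung Ju Hwang

Affiliations * KAIST; KRAFTON; UC Berkeley

AI summary When generating long chains of reasoning, large reasoning models tend to over-prioritize compliance with the task, which weakens their defenses against harmful prompts. The study proposes THINKSAFE, a self-generated safety alignment framework that restores safety without an external teacher model. Grounded in a KL-projection analysis, it uses lightweight refusal steering to improve safety significantly while preserving reasoning ability, and its effectiveness and efficiency are validated on several models.

Comments 17 pages, 13 figures

Details
Abstract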

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We formalize safety realignment as a KL projection onto the safe simplex and prove that the student's own safety-filtered distribution is the unique KL-optimal target, while any external teacher incurs an irreducible excess KL penalty. Guided by this analysis, we propose ThinkSafe, a self-generated alignment framework that restores safety without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, which preserves the KL-optimal target while increasing the acceptance rate. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency, and achieves superior safety and comparable reasoning to GRPO with roughly an order of magnitude less compute. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe and https://huggingface.co/Seanie-lee/collections.

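2601.22409 2026-05-14 cs.LG cs.AI stat.ML Version update

Optimization, Generalization and Differential Privacy Bounds for Gradient Descent on Kolmogorov-Arnold Networks

Puyu Wang, Junyu Zhou, Philipp Liznerski, Marius Kloft

Affiliations * RPTU Kaiserslautern-Landau

AI summary This paper studies the optimization dynamics, generalization, and differential privacy guarantees of gradient descent on Kolmogorov-Arnold networks (KANs). The authors derive general bounds on the training process, generalization error, and private utility, and show that under logistic loss a network of polylogarithmic width achieves optimization and generalization rates governed by the number of iterations and the sample size. In the differentially private setting, they characterize the required noise, obtain a utility bound that depends on the input dimension and the privacy parameter, and show that this width is not only sufficient but also necessary, revealing a fundamental difference between private and non-private training.

Comments 42 pages, 3 figures

Journal ref ICML 2026

Details
Abstract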

Kolmogorov--Arnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory for their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to logistic loss under an NTK-separable assumption, where we show that polylogarithmic network width suffices for GD to achieve an optimization rate of order $1/T$ and a generalization rate of order $1/n$, with $T$ denoting the number of GD iterations and $n$ the sample size. In the private setting, we characterize the noise required for $(\varepsilon,\delta)$-DP and obtain a utility bound of order $\sqrt{d}/(n\varepsilon)$ (with $d$ the input dimension), matching the classical lower bound for general convex Lipschitz problems. Our results imply that polylogarithmic width is not only sufficient but also necessary under differential privacy, revealing a qualitative gap between non-private (sufficiency only) and private (necessity also emerges) training regimes. Experiments further illustrate how these theoretical insights can guide practical choices, including network width selection and early stopping.

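2601.21892 2026-05-14 cs.CV cs.AI Version update

Improving Classifier-Free Guidance of Flow Matching via Manifold Projection

Jian-Feng Cai, Haixia Liu, Zhengyi Su, Chao Wang

Affiliations * Department of Mathematics, The Hong Kong University of Science IAS Center for AI for Scientific Discoveries, The Hong Kong University of Science School of Mathematics Statistics & Institute of Interdisciplinary Research for Mathematics Applied Science & Hubei Key Laboratory of Engineering Modeling Scientific Computing, Huazhong University of Science Department of Statistics Data Science, Southern University of Science

AI summary This paper studies how to improve classifier-free guidance (CFG) for flow matching models. It gives a principled account of CFG by relating the flow matching velocity field to the gradients of smoothed distance functions. On that basis, the authors recast CFG sampling as a homotopy optimization problem with a manifold constraint, implement the manifold projection with incremental gradient descent, and add Anderson acceleration for efficiency and stability. The method needs no additional training, improves generation quality, prompt alignment, and robustness to the guidance scale, and yields significant improvements on several large models.

Comments 26 pages, 14 figures

Details
Abstract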

Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
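The "heuristic linear extrapolation" and the "prediction gap" mentioned in the abstract refer to the standard CFG rule, sketched below for a single velocity vector. This shows only the baseline that the paper analyzes; the manifold projection and Anderson acceleration are not reproduced, and the function names and numeric values are hypothetical.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Standard CFG: start from the unconditional output and extrapolate along the gap."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def prediction_gap(v_cond, v_uncond):
    """Euclidean norm of the difference between conditional and unconditional outputs."""
    return sum((c - u) ** 2 for c, u in zip(v_cond, v_uncond)) ** 0.5


# Hypothetical model outputs at one sampling step.
v_cond = [1.0, -2.0]
v_uncond = [0.5, 0.0]

guided = cfg_velocity(v_cond, v_uncond, scale=3.0)
gap = prediction_gap(v_cond, v_uncond)
```

A scale of 1 recovers the conditional prediction and a scale of 0 the unconditional one; larger scales move the guided velocity away from both in proportion to the gap, which is one way to see why results are sensitive to the guidance scale.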

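2601.18842 2026-05-14 cs.CR cs.AI cs.CV Version update

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

Yanxi Wang, Zhiling Zhang, Wenbo Zhou, Weiming Zhang, Jie Zhang, Qiannan Zhu, Yu Shi, Shuxin Zheng, Jiyan He

Affiliations * Beijing Normal University; Zhongguancun Academy; University of Science and Technology of China; A*STAR Zhongguancun Institution of Artificial Intelligence

AI summary As GUI agents increasingly rely on screenshots to perceive and operate digital environments, they may inadvertently expose sensitive information such as identities, accounts, and locations. To address the limited ability of existing privacy benchmarks to assess privacy risk in the context of task trajectories, this paper introduces GUIGuard-Bench, a benchmark of 241 real GUI-agent trajectories and 4,080 screenshots that supports privacy recognition, evaluation of planning fidelity under protected screenshots, and utility analysis of different protection strategies. The study finds that current models detect the presence of private information fairly well but fall short on fine-grained localization, category recognition, risk assessment, and task-necessity judgment.

Details
Abstract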

As GUI agents increasingly rely on screenshots to perceive and operate digital environments, they may inadvertently expose sensitive information such as identities, accounts, locations, and behavioral traces. While existing benchmarks primarily focus on task completion, grounding, or defenses against third-party attacks, current visual privacy datasets remain largely restricted to static natural images, limiting their ability to capture the contextual dependence and task relevance of privacy risks in GUI task trajectories. To bridge this gap, we introduce GUIGuard-Bench, a first-step benchmark for studying privacy-preserving GUI agents in trajectory-based GUI workflows. GUIGuard-Bench contains 241 real GUI-agent trajectories with 4,080 screenshots across Android and PC environments. Each screenshot is annotated at the region level with privacy bounding boxes, semantic privacy categories, risk levels, and whether the private information is necessary for completing the task. Built on these annotations, GUIGuard-Bench supports three complementary evaluations: privacy recognition, offline planning fidelity under protected screenshots, and the utility impact of different protection strategies. Our results show that current models can often detect whether a screenshot contains private information, but they struggle with fine-grained localization, category recognition, risk assessment, and task-necessity judgment. We also find that closed-source models, exemplified by Claude Sonnet 4.6, can maintain largely consistent planner semantics in Android environments after privacy protection is applied. Our results highlight privacy recognition as a critical bottleneck for practical GUI agents. Project: https://futuresis.github.io/GUIGuard-page/

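2601.16806 2026-05-14 cs.AI cs.RO Updated version

An Efficient Insect-inspired Approach for Visual Point-goal Navigation

Yihe Lu, Barbara Webb

Affiliations * School of Informatics, University of Edinburgh

AI summary This paper proposes an efficient insect-inspired model for visual point-goal navigation that combines abstracted models of two insect brain structures associated with associative learning and path integration. On visual navigation tasks the method performs comparably to current state-of-the-art models at greatly reduced computational cost, and it is shown to be robust to perturbations in a more realistic simulated environment.

Comments This work has been submitted to the IEEE for possible publication

Details
English abstract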

In this work we develop a novel insect-inspired model for visual point-goal navigation. This combines abstracted models of two insect brain structures that have been implicated, respectively, in associative learning and path integration. We draw an analogy between the formal benchmark of the Habitat point-goal navigation task and the ability of insects to discover, learn, and refine visually guided paths around obstacles between a discovered food location and their nest. We demonstrate that the simple insect-inspired model exhibits performance comparable to recent state-of-the-art models at many orders of magnitude less computational cost. Testing in a more realistic simulated environment shows the approach is robust to perturbations.

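2601.15161 2026-05-14 cs.CL cs.AI Updated version

Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Yinzhu Chen, Abdine Maiga, Hossein A. Rahmani, Emine Yilmaz

Affiliations * AI Center, University College London

AI summary As large language models are increasingly used for clinical decision support, reliably evaluating their outputs has become a key problem. This paper proposes a retrieval-augmented multi-agent framework that automatically generates evaluation rubrics for specific dialogue instances, so that deviations in clinical intent and potential risks are identified more accurately. The method decomposes retrieved authoritative medical evidence and combines it with user interaction constraints to form verifiable, fine-grained evaluation criteria. It performs well on multiple medical dialogue datasets, significantly outperforms existing baseline models, and effectively guides the refinement of model responses.

Details
English abstract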

Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are hard to assess: subtle clinical errors are often missed by generic metrics and LLM judges using general criteria, while expert-authored fine-grained rubrics are expensive and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria. Evaluated on HealthBench and LLMEval-Med datasets, our framework achieves Clinical Intent Alignment (CIA) scores of 50.20% and 31.90%, significantly outperforming the GPT-4o baseline and demonstrating robust cross-lingual generalization. In discriminative tests on HealthBench, our rubrics yield a 7.8% higher win rate than the GPT-4o baseline with nearly double the score $Δ$, while ablation studies confirm its structural necessity. Beyond evaluation, our rubrics effectively guide response refinement, improving quality by 9.2%. This provides a scalable, cross-lingual foundation for both evaluating and improving medical LLMs. The code is available at https://github.com/AmbeChen/Automated-Rubric-Generation.

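2601.09636 2026-05-14 cs.AI cs.CV cs.HC cs.LG Updated version

PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records

Yibo Lyu, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie

Affiliations * Harbin Institute of Technology, Shenzhen; Shenzhen Loop Area Institute

AI summary This paper proposes PersonalAlign, a hierarchical implicit intent alignment task for personalized graphical user interface (GUI) agents, which requires agents to use users' long-term behavior records to resolve the implicit preferences behind vague instructions and to proactively anticipate users' latent actions. To this end, the researchers build the AndroidIntent benchmark and design the Hierarchical Intent Memory Agent (HIM-Agent), which continuously updates and organizes users' personalized preferences and behavior patterns. Experiments show that HIM-Agent improves execution and proactive-assistance performance by 15.7% and 7.3%, respectively.

Comments Accepted to ACL26 Main

Details
English abstract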

While GUI agents have shown strong performance under explicit and completion instructions, real-world deployment requires aligning with users' more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents' ability in resolving vague instructions and providing proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS. Further results show that HIM-Agent significantly improves execution and proactive performance by 15.7% and 7.3%, respectively.

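2601.00417 2026-05-14 cs.LG cs.AI cs.CL cs.CV Updated version

Deep Delta Learning

Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu

Affiliations * Princeton University; University of California Los Angeles

AI summary This paper proposes Deep Delta Learning (DDL), a residual update mechanism for improving the residual stream in Transformer models. Unlike conventional additive accumulation, DDL lets each layer selectively rewrite residual content: it reads the current state along a learned direction, compares it with a target value, and applies a gated correction along the same direction. Experiments show that DDL manages the residual stream in language models more effectively than conventional residual addition.

Comments Project Page: https://github.com/yifanzhang-pro/deep-delta-learning

Details
English abstract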

Transformer residual streams evolve by additive accumulation: each layer appends a feature update to a shared hidden state, but has no direct mechanism for replacing content that has become obsolete or conflicting. We introduce Deep Delta Learning (DDL), a residual update rule that preserves the identity path while giving every layer the ability to selectively rewrite residual content. DDL reads the current state along a learned direction, compares it with a learned target value, and writes back a gated correction along the same direction. When the gate is closed, the update reduces to the identity; when the gate is fully open, the selected component is overwritten, yielding a depth-wise delta-rule generalization of standard residual addition. We integrate DDL in decoder-only language models with both scalar and expanded residual states, while keeping attention and MLP sublayers at the original compute width. Controlled pretraining and downstream evaluations show that residual rewrite operations improve language modeling quality relative to pure additive accumulation introduced in ResNet, suggesting that a learned delta-rule update is an effective mechanism for managing Transformer residual streams.
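
The read, compare, and write-back steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `delta_update`, the unit-normalised direction `k`, the scalar target `v`, and the scalar gate `beta` in [0, 1] are all hypothetical choices made to match the prose of the abstract.

```python
import math


def delta_update(x, k, v, beta):
    """Gated delta-rule residual update (illustrative sketch, not the paper's code).

    x    : current residual state, a list of floats
    k    : direction to read and write along (normalised to unit length here)
    v    : target value for the component of x along k
    beta : gate in [0, 1]; 0 leaves x unchanged, 1 overwrites the component
    """
    norm = math.sqrt(sum(ki * ki for ki in k))
    k_hat = [ki / norm for ki in k]
    # Read: project the current state onto the direction.
    read = sum(xi * ki for xi, ki in zip(x, k_hat))
    # Compare: gated difference between the target and what was read.
    correction = beta * (v - read)
    # Write: add the correction back along the same direction.
    return [xi + correction * ki for xi, ki in zip(x, k_hat)]


closed = delta_update([1.0, 2.0], [1.0, 0.0], 5.0, 0.0)  # gate closed
opened = delta_update([1.0, 2.0], [1.0, 0.0], 5.0, 1.0)  # gate fully open
```

With the gate closed the state passes through unchanged, and with the gate fully open the component along `k` is replaced by `v` while the orthogonal component is untouched, which mirrors the two limiting cases named in the abstract.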

2512.13399 2026-05-14 cs.AI cs.CL Updated version

Differentiable Evolutionary Reinforcement Learning

Sitao Cheng, Tianle Li, Xuhan Huang, Xunjian Yin, Difan Zou

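AI summary This paper studies the core reinforcement learning problem of designing effective reward signals and proposes Differentiable Evolutionary Reinforcement Learning (DERL), a framework that introduces a meta-optimizer to couple the structural optimization of the reward function with policy learning. The method feeds back the validation performance of the inner-loop policy through policy gradients, progressively optimizing the reward structure and improving the system's autonomous learning and generalization on complex tasks. Experiments show that DERL outperforms conventional non-differentiable methods across multiple reasoning domains, most notably in out-of-distribution generalization.

Comments Work in Progress. We release our code and model at https://github.com/sitaocheng/DERL

Details
English abstract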

Crafting effective reward signals remains a central challenge in Reinforcement Learning (RL), especially for complex reasoning tasks. Existing automated reward optimization methods typically rely on derivative-free search heuristics that treat the reward function as a black box, failing to exploit the causal dynamics between reward structure modifications and policy performance. We introduce Differentiable Evolutionary Reinforcement Learning (DERL), a bi-level framework for the autonomous discovery of optimal reward structures. DERL employs a Meta-Optimizer that evolves a reward function through the composition of structured atomic primitives to guide an inner-loop policy. Unlike prior black-box methods, DERL introduces differentiability into the meta-optimization process by updating the Meta-Optimizer using policy gradients derived from inner-loop validation performance. This allows for the progressive learning of a "meta-gradient" for task success, providing the system with dense, actionable feedback. We validate DERL across diverse reasoning domains: embodied agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8K, MATH). Results show that DERL achieves state-of-the-art performance on agent benchmarks, substantially outperforming non-differentiable baselines, especially in out-of-distribution generalization. Trajectory analyses confirm that DERL captures the intrinsic causal structure of tasks, enabling fully autonomous, self-improving agent alignment.

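2512.10857 2026-05-14 cs.LG cs.AI stat.ML Updated version

Generative Modeling from Black-box Corruptions via Self-Consistent Stochastic Interpolants

Chirag Modi, Jiequn Han, Eric Vanden-Eijnden, Joan Bruna

Affiliations * New York University; Flatiron Institute; Machine Learning Lab, Capital Fund Management

AI summary This paper studies how to build generative models from data corrupted by a black-box noise channel. The authors propose the self-consistent stochastic interpolant (SCSI), which iteratively updates the map between corrupted and clean data using only the corrupted dataset and black-box access to the corruption channel, thereby inverting the channel at the level of the original data distribution. The method offers computational efficiency, flexibility, and theoretical guarantees, and shows superior performance on tasks such as image processing and scientific reconstruction.

Comments Accepted at ICLR 2026

Details
English abstract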

Transport-based methods have emerged as a leading paradigm for building generative models from large, clean datasets. However, in many scientific and engineering domains, clean data are often unavailable: instead, we only observe measurements corrupted through a noisy, ill-conditioned channel. A generative model for the original data thus requires solving an inverse problem at the level of distributions. In this work, we introduce a novel approach to this task based on Stochastic Interpolants: we iteratively update a transport map between corrupted and clean data samples using only access to the corrupted dataset as well as black-box access to the corruption channel. Under appropriate conditions, this iterative procedure converges towards a self-consistent transport map that effectively inverts the corruption channel, thus enabling a generative model for the clean data. We refer to the resulting method as the self-consistent stochastic interpolant (SCSI). It (i) is computationally efficient compared to variational alternatives, (ii) is highly flexible, handling arbitrary nonlinear forward models with only black-box access, and (iii) enjoys theoretical guarantees. We demonstrate superior performance on inverse problems in natural image processing and scientific reconstruction, and establish convergence guarantees of the scheme under appropriate assumptions. Our source code is publicly available at https://github.com/modichirag/SCSI

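2512.08411 2026-05-14 cs.AI cs.RO Updated version

Prismatic World Model: Learning Compositional Dynamics for Planning in Hybrid Systems

Mingwei Li, Xiaoyuan Zhang, Chengwei Yang, Zilong Zheng, Yaodong Yang

Affiliations * Beijing Institute of Technology; Peking University; NLCo Lab, Beijing Institute for General Artificial Intelligence

AI summary In robotic planning tasks, the hybrid dynamics of physical systems, in which continuous motion alternates with discrete events, pose a challenge for model-based planning. This paper proposes PRISM-WM, a structured world model that decomposes complex hybrid dynamics into composable primitives, addressing the tendency of conventional models to over-smooth distinct dynamic modes. The model uses a context-aware Mixture-of-Experts framework, combined with implicit mode identification and an expert-diversity constraint, to improve long-horizon prediction accuracy, and it shows superior trajectory optimization performance on multiple continuous control tasks.

Details
English abstract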

Model-based planning in robotic domains is challenged by the hybrid nature of physical dynamics, where continuous motion is punctuated by discrete events such as contacts and impacts. Conventional latent world models typically employ monolithic neural networks that enforce global continuity, which over-smooths distinct dynamic modes (e.g., sticking vs. sliding, flight vs. stance). For a planner, this smoothing results in compounding errors during long-horizon lookaheads, rendering the search process unreliable at physical boundaries. To address this, we introduce the Prismatic World Model (PRISM-WM), a structured architecture designed to decompose complex hybrid dynamics into composable primitives. PRISM-WM uses a context-aware Mixture-of-Experts (MoE) framework where a gating mechanism implicitly identifies the current physical mode, and specialized experts predict the associated transition dynamics. We further introduce a latent orthogonalization objective to ensure expert diversity, preventing mode collapse. By modeling the mode transitions in system dynamics, PRISM-WM reduces rollout drift. Experiments on continuous control benchmarks, including high-dimensional humanoids and multi-task settings, demonstrate that PRISM-WM provides a high-fidelity substrate for trajectory optimization algorithms (e.g., TD-MPC), indicating its potential as a foundational model for model-based agents.
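
The gating idea in this abstract, where a gate weighs specialized experts according to the current physical mode, can be illustrated with a generic soft Mixture-of-Experts step. This is only a sketch under assumptions: the names `softmax` and `moe_step`, the two toy experts, and the practice of passing gate logits in directly (rather than computing them from context with a learned network) are hypothetical, and the latent orthogonalization objective is not shown because the abstract does not specify its form.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_step(state, gate_logits, experts):
    """Blend the experts' next-state predictions using softmax gate weights.

    state       : current state, a list of floats
    gate_logits : one score per expert (assumed to come from a gating network)
    experts     : list of functions mapping a state to a predicted next state
    """
    weights = softmax(gate_logits)
    predictions = [expert(state) for expert in experts]
    return [
        sum(w * pred[i] for w, pred in zip(weights, predictions))
        for i in range(len(state))
    ]


def slide(state):
    """Toy 'sliding' mode: position advances by velocity, velocity is kept."""
    position, velocity = state
    return [position + velocity, velocity]


def stick(state):
    """Toy 'sticking' mode: position is held and velocity drops to zero."""
    position, _ = state
    return [position, 0.0]


confident = moe_step([1.0, 0.5], [50.0, -50.0], [slide, stick])  # gate picks slide
uncertain = moe_step([1.0, 0.5], [0.0, 0.0], [slide, stick])     # equal blend
```

When the gate is confident, the output is essentially one expert's prediction, so a sharp mode switch stays sharp; an undecided gate averages the two modes, which is the kind of over-smoothing the abstract attributes to monolithic models.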

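2512.07112 2026-05-14 cs.LG cs.AI Updated version

FOAM: Blocked State Folding for Memory-Efficient LLM Training

Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun

Affiliations * National Key Laboratory of Parallel and Distributed Computing; College of Computer Science and Technology; National University of Defense Technology

AI summary This paper proposes FOAM, an optimization method that improves memory efficiency in large language model training. It compresses optimizer states by computing block-wise gradient means and adds a residual correction to recover lost information, greatly reducing memory usage while preserving model performance. Experiments show that FOAM cuts optimizer-state memory overhead by up to 90% without sacrificing convergence speed, is compatible with other memory-efficient optimizers, and outperforms existing methods.

Details
English abstract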

Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM eliminates up to 90% of the memory overhead of optimizer states and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines. Code is available at https://github.com/zqOuO/FOAM.
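
To make "block-wise gradient means" and "residual correction" concrete, the following sketch folds a flat gradient into per-block means, unfolds it again by repetition, and computes what the folding discarded. It is a hedged illustration only: the function names `fold`, `unfold`, and `residual` are hypothetical, and how FOAM feeds these quantities into the Adam moment estimates is not stated in the abstract and is not modelled here.

```python
def fold(grad, block):
    """Compress a flat gradient into one mean per block of `block` entries."""
    means = []
    for start in range(0, len(grad), block):
        chunk = grad[start:start + block]
        means.append(sum(chunk) / len(chunk))
    return means


def unfold(means, block, n):
    """Expand block means back to length n by repeating each mean."""
    return [means[i // block] for i in range(n)]


def residual(grad, block):
    """Information lost by folding: the gradient minus its block-mean approximation."""
    approx = unfold(fold(grad, block), block, len(grad))
    return [g - a for g, a in zip(grad, approx)]


grad = [1.0, 3.0, 2.0, 6.0]
folded = fold(grad, 2)          # one stored value per block of two entries
lost = residual(grad, 2)        # what a residual correction would need to restore
```

In this toy setting a state kept at block granularity holds `len(grad) / block` values instead of `len(grad)`, which is the source of the memory saving; the residual is exactly the part of the gradient that the folded state cannot represent.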

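2512.01707 2026-05-14 cs.CV cs.AI cs.CL Updated version

StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Mohit Bansal

Affiliations * University of North Carolina, Chapel Hill; Adobe Research

AI summary StreamGaze is a new benchmark for evaluating how well multimodal large language models use human gaze signals for temporal reasoning and proactive understanding in streaming videos. By introducing gaze-guided past, present, and proactive reasoning tasks, the study comprehensively assesses models' ability to process video streams in real time and anticipate user intent. The authors build a question-answering generation pipeline that combines gaze trajectories with video content to produce spatio-temporally grounded QA pairs, and they show that current models still fall clearly short in gaze-based temporal reasoning and proactive prediction.

Comments Accepted to CVPR 2026 with strong scores (5/5/5) but desk-rejected after the camera-ready due to not completing all reviewing duties

Details
English abstract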

Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications such as Augmented Reality (AR) glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs utilize gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively assess streaming video understanding. These tasks evaluate whether models can use real-time gaze signals to follow shifting attention and infer user intentions based only on past and currently observed frames. To build StreamGaze, we develop a gaze-video Question Answering (QA) generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, highlighting key limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze prompting strategies, reasoning behaviors, and task-specific failure modes, offering insights into current limitations and directions for future research. All data and code are publicly available to support continued research in gaze-guided streaming video understanding.

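2512.01242 2026-05-14 cs.CV cs.AI cs.CL Updated version

When Diffusion Breaks Constraints: Sequential Autoregressive Generation with RL and MCTS

Zirui Zhao, Boye Niu, Harold Soh, David Hsu, Wee Sun Lee

Affiliations * Salesforce AI Research; University of Sydney; National University of Singapore

AI summary This paper studies the limitations of diffusion models on constrained generation tasks such as multi-robot path planning, molecular generation, and scene synthesis, which must satisfy strict geometric or physical constraints. To address this, the authors propose a sequential autoregressive generation method based on reinforcement learning and Monte Carlo tree search, recasting constrained generation as a discrete sequence generation task so that complex constraints are satisfied more effectively. Experiments show that the method outperforms conventional diffusion models in feasibility and task success rate, offering a new approach to this class of constrained generation problems.

Details
English abstract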

Data-driven generative models excel in language and vision, but diffusion models often fail in constrained planning and design tasks, exhibiting severe constraint violations in engineering inverse design, molecular generation, multi-robot planning, and floorplan/scene synthesis even with projection or guidance. Such tasks combine hard-to-specify semantic goals with strict geometric or physical constraints (e.g., non-overlap, connectivity), yielding feasible solutions that lie on low-dimensional, small, and sometimes disconnected regions of the output space. This paper studies the failure mode through tangram generation from language, where seven fixed shapes must form a text-described silhouette while remaining connected and non-overlapping, and a simplified rectangle composition task with a learned bounding-box constraint. We find diffusion models struggle to satisfy constraints, consistent with difficulty generating samples near low-dimensional submanifolds. Motivated by locally feasible reparameterizations, we reformulate constrained generation as discrete autoregressive sequential generation. Reinforcement learning improves feasibility and task success, and Monte Carlo tree search quantifies the value of look-ahead when feasible regions shrink. Overall, the empirical, theoretical, and prior-work evidence points to a structural limitation of continuous density matching on this class of constrained-generation problems, and suggests sequential constraint-aware generation as a promising alternative.

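2511.14056 2026-05-14 cs.LG cs.AI cs.IT math.DG math.IT stat.ML Updated version

Radial Compensation: Fixing Radius Distortion in Chart-Based Generative Models on Riemannian Manifolds

Marios Papamichalis, Regina Ruane

Affiliations * Human Nature Lab, Yale University; Department of Statistics and Data Science, The Wharton School, University of Pennsylvania

AI summary This paper studies the base distribution in chart-based generative models on Riemannian manifolds. Conventional methods sample in the Euclidean tangent space and then map to the manifold, but this distorts geodesic distance: the same tangent-space scale can correspond to different geodesic radii under different charts, curvatures, and dimensions. The authors therefore propose Radial Compensation (RC), which uses a specifically designed base distribution so that the model realizes a user-specified distribution of the geodesic radius, improving training stability and yielding cleaner curvature estimates. The paper also introduces balanced exponential charts, which further improve the model's numerical conditioning, decoupling statistical meaning from numerical computation and improving interpretability and practicality.

Details
English abstract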

We study the base distribution in chart-based generative models on Riemannian manifolds. Standard methods sample in Euclidean tangent space and then map the sample to the manifold with a chart. This is convenient, but it changes the meaning of distance: the same tangent-space scale can correspond to different geodesic radii, i.e. shortest-path distances from a reference point on the manifold, under different charts, curvatures, and dimensions. Within isotropic, scalar-Jacobian azimuthal charts, we show that no base distribution can simultaneously preserve geodesic-radial likelihoods, chart-invariant radial Fisher information, and tangent-space isotropy unless it has a specific form, which we call Radial Compensation (RC). RC chooses the tangent-space base so that the model realizes a user-specified one-dimensional law for the geodesic radius, and leaves the chart available as a numerical preconditioner. This gives more stable training and cleaner curvature estimates, because curvature no longer has to compensate for distortions introduced by the chart. We also introduce balanced exponential charts, which improve conditioning without changing the realized manifold density under RC. This decouples the statistical meaning of the model, the law of the geodesic radius, from its numerical conditioning, which is governed by the chart Jacobian: chart choice becomes a numerical preconditioner rather than a hidden modeling decision. Across manifold variational autoencoders and continuous normalizing flows, RC matches the intended radius behavior, improves numerical stability, and makes learned curvature easier to interpret.

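2511.00839 2026-05-14 cs.SE cs.AI Updated version

CodeClash: Benchmarking Goal-Oriented Software Engineering

John Yang, Kilian Lieret, Joyce Yang, Carlos E. Jimenez, Muhtasham Oblokulov, Aryan Siddiqui, Ofir Press, Ludwig Schmidt, Diyi Yang

Affiliations * Stanford University; Princeton University; Cornell University; Technical University of Munich

AI summary Current coding benchmarks mainly evaluate language models on concrete, well-specified tasks such as fixing a specific bug or writing targeted tests, and they fail to reflect the iterative, goal-driven process of real software development. The study therefore proposes CodeClash, a benchmark in which language models optimize codebases toward competitive objectives across multi-round tournaments, using simulated code battles to assess strategic planning and long-term maintenance. Experiments show that, although models differ in development style, they have significant limitations in strategic reasoning and long-term codebase maintenance, and top models lose every round against expert human programmers.

Details
English abstract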

Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it's writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.
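
The two-phase round structure and the reported totals can be sketched as a small loop. This is an illustrative skeleton, not the CodeClash harness: the `run_tournament` function, the `edit(codebase, round_index)` agent interface, and the toy arena are hypothetical, and the rounds-per-tournament figure is derived from the abstract's totals on the assumption that every tournament has the same length.

```python
# Totals stated in the abstract.
TOURNAMENTS = 1680
TOTAL_ROUNDS = 25_200
# Derived value, assuming all tournaments run the same number of rounds.
ROUNDS_PER_TOURNAMENT = TOTAL_ROUNDS // TOURNAMENTS


def run_tournament(agents, arena, rounds):
    """Play `rounds` rounds; each round is an edit phase, then a competition phase.

    agents : dict mapping a player name to a function
             edit(codebase, round_index) -> new codebase   (hypothetical interface)
    arena  : function taking {name: codebase} and returning the winner's name
    """
    codebases = {name: "" for name in agents}
    wins = {name: 0 for name in agents}
    for round_index in range(rounds):
        # Phase 1: every agent edits its own codebase.
        for name, edit in agents.items():
            codebases[name] = edit(codebases[name], round_index)
        # Phase 2: the codebases compete head-to-head in the arena.
        wins[arena(codebases)] += 1
    return wins


# Toy example: one agent keeps improving, the other never edits,
# and the arena rewards the longer codebase.
toy_agents = {
    "improver": lambda codebase, r: codebase + "x",
    "idler": lambda codebase, r: codebase,
}
toy_arena = lambda codebases: max(codebases, key=lambda name: len(codebases[name]))
result = run_tournament(toy_agents, toy_arena, ROUNDS_PER_TOURNAMENT)
```

Because the codebase persists across rounds, each edit builds on everything the agent did earlier, which is what makes long-term maintenance part of what the benchmark measures.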

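2510.21060 2026-05-14 cs.LG cs.AI Updated version

On the Sample Complexity of Differentially Private Policy Optimization

Yi He, Xingyu Zhou

Affiliations * Wayne State University

AI summary This paper studies the sample complexity of differentially private policy optimization (DP-PO), examining the theoretical foundations of policy optimization in reinforcement learning under privacy constraints. The authors propose a definition of differential privacy suited to policy optimization, analyze the sample complexity of widely used policy optimization algorithms (such as policy gradient and natural policy gradient) under privacy constraints, and show that the cost of privacy often appears as a lower-order term in the sample complexity. The work provides theoretical guidance for designing privacy-preserving policy optimization algorithms.

Comments Accepted at NeurIPS 2025

Details
English abstract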

Policy optimization (PO) is a cornerstone of modern reinforcement learning (RL), with diverse applications spanning robotics, healthcare, and large language model training. The increasing deployment of PO in sensitive domains, however, raises significant privacy concerns. In this paper, we initiate a theoretical study of differentially private policy optimization, focusing explicitly on its sample complexity. We first formalize an appropriate definition of differential privacy (DP) tailored to PO, addressing the inherent challenges arising from on-policy learning dynamics and the subtlety involved in defining the unit of privacy. We then systematically analyze the sample complexity of widely-used PO algorithms, including policy gradient (PG), natural policy gradient (NPG) and more, under DP constraints and various settings, via a unified framework. Our theoretical results demonstrate that privacy costs can often manifest as lower-order terms in the sample complexity, while also highlighting subtle yet important observations in private PO settings. These offer valuable practical insights for privacy-preserving PO algorithms.

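2510.18245 2026-05-14 cs.LG cs.AI Updated version

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

Song Bian, Tao Yu, Shivaram Venkataraman, Youngsuk Park

Affiliations * UW-Madison; Amazon Web Services

AI summary This paper studies how model architecture affects inference efficiency and accuracy in large language models (LLMs), examining key factors such as hidden size, the allocation of parameters between MLP and attention modules, and grouped-query attention (GQA). The authors propose a conditional scaling law that incorporates architectural information and design a search framework for finding architectures that are both accurate and inference-efficient. Experiments show that, under the same training budget, the method achieves higher accuracy and inference throughput than existing open-source baselines.

Comments 32 pages, 27 figures

Journal ref ICLR 2026

Details
English abstract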

Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how key architectural factors, hidden size, the allocation of parameters between MLP and attention (mlp-to-attention ratio), and grouped-query attention (GQA), influence both inference cost and accuracy. We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.
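
For readers unfamiliar with the Chinchilla framework that this abstract builds on, its parametric loss has the form L(N, D) = E + A / N^alpha + B / D^beta, where N is the parameter count and D the number of training tokens. The sketch below evaluates that baseline form only. The coefficient values are illustrative placeholders, not fitted values from this paper, and the way the conditional law adds hidden size, mlp-to-attention ratio, and GQA is not given in the abstract, so it is not modelled.

```python
def chinchilla_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Chinchilla-style parametric loss: E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


# Illustrative placeholder coefficients (assumptions, not fitted values).
coeffs = dict(E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28)

# The two ends of the model and data range mentioned in the abstract.
small = chinchilla_loss(80e6, 8e9, **coeffs)    # 80M parameters, 8B tokens
large = chinchilla_loss(3e9, 100e9, **coeffs)   # 3B parameters, 100B tokens
```

The baseline law depends only on N and D, so two architectures with the same parameter count and token budget receive the same predicted loss; conditioning on architectural variables is what lets a law distinguish them.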

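2510.18114 2026-05-14 cs.LG cs.AI stat.ML Updated version

Latent-Augmented Discrete Diffusion Models

Dario Shariatian, Alain Durmus, Umut Simsekli, Stefano Peluchetti

Affiliations * Inria; PSL Research University; CMAP; Ecole Polytechnique, Palaiseau, France; Sakana AI, Tokyo, Japan

AI summary Discrete diffusion models show strong potential for language generation, but existing methods often ignore cross-token dependencies, which hurts few-step generation. This paper proposes Latent-Augmented Discrete Diffusion (LADD), which introduces learnable auxiliary latent variables and performs diffusion over the joint (token, latent) space, capturing joint structure while keeping the parameterization tractable. Experiments show that LADD outperforms state-of-the-art baselines on unconditional generation, especially at low sampling budgets.

Details
English abstract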

Discrete diffusion models have emerged as a powerful class of models and a promising route to fast language generation, but practical implementations typically rely on factored reverse transitions ignoring cross-token dependencies and degrading few-step performance. We propose Latent-Augmented Discrete Diffusion (LADD), which introduces a learnable auxiliary latent channel and performs diffusion over the joint (token, latent) space. The latent variables provide an intermediate representation expressing joint structure while preserving tractable parameterizations. We instantiate LADD with continuous latents (Co-LADD) and discrete latents (Di-LADD), and study two inference schedules: a joint diffusion that denoises data and latents together, and a sequential diffusion that first resolves latents and then samples tokens conditionally. We derive ELBO-style objectives and analyze design choices that balance latent expressivity with diffusion compatibility. In experiments, LADD models yield improvements on unconditional generation metrics as compared to state-of-the-art masked discrete diffusion baselines, and are effective at lower sampling budgets, where unmasking many tokens per step is desirable.

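2510.16253 2026-05-14 cs.LG cs.AI q-bio.BM q-bio.QM stat.ML Updated version

Protein Folding with Neural Ordinary Differential Equations

Arielle Sanford, Shuo Sun, Christian B. Mendl

Affiliations * University of Chicago, Department of Computer Science; Technical University of Munich, School of CIT, Department of Computer Science; TUM Institute for Advanced Study, Lichtenbergstraße 2a

AI summary This paper proposes a continuous-depth Evoformer based on Neural Ordinary Differential Equations (Neural ODEs) for protein structure prediction. The method replaces the 48 discrete blocks of the conventional Evoformer with a continuous-time parameterization, substantially reducing computational cost while preserving the core attention mechanism. Experiments show that the model still produces structurally plausible predictions with far fewer resources and captures some secondary structure elements, demonstrating the potential of continuous-depth models for biomolecular modeling.

Journal ref Mach. Learn.: Sci. Technol. 7, 035008 (2026)

Details
English abstract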

Recent advances in protein structure prediction, such as AlphaFold, have demonstrated the power of deep neural architectures like the Evoformer for capturing complex spatial and evolutionary constraints on protein conformation. However, the depth of the Evoformer, comprising 48 stacked blocks, introduces high computational costs and rigid layerwise discretization. Inspired by Neural Ordinary Differential Equations (Neural ODEs), we propose a continuous-depth formulation of the Evoformer, replacing its 48 discrete blocks with a Neural ODE parameterization that preserves its core attention-based operations. This continuous-time Evoformer achieves constant memory cost (in depth) via the adjoint method, while allowing a principled trade-off between runtime and accuracy through adaptive ODE solvers. Benchmarking on protein structure prediction tasks, we find that the Neural ODE-based Evoformer produces structurally plausible predictions and reliably captures certain secondary structure elements, such as alpha-helices, though it does not fully replicate the accuracy of the original architecture. However, our model achieves this performance using dramatically fewer resources, just 17.5 hours of training on a single GPU, highlighting the promise of continuous-depth models as a lightweight and interpretable alternative for biomolecular modeling. This work opens new directions for efficient and adaptive protein structure prediction frameworks.
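
The continuous-depth idea can be illustrated on a toy problem. A residual block h + f(h) is one Euler step of size 1 for the equation dh/dt = f(h, t), so a stack of blocks can be replaced by integrating a single vector field over "depth" t. The sketch below is a simplification under stated assumptions: it uses a fixed-step Euler solver rather than the adaptive solvers and adjoint method mentioned in the abstract, and `toy_dynamics` is a hypothetical stand-in for the learned Evoformer block.

```python
import math


def euler_integrate(f, h0, steps, t0=0.0, t1=1.0):
    """Integrate dh/dt = f(h, t) from t0 to t1 with fixed-step Euler."""
    h = list(h0)
    dt = (t1 - t0) / steps
    for i in range(steps):
        t = t0 + i * dt
        dh = f(h, t)
        h = [hi + dt * di for hi, di in zip(h, dh)]
    return h


def toy_dynamics(h, t):
    """Stand-in for the learned block: simple exponential decay dh/dt = -h."""
    return [-hi for hi in h]


def error_at(steps):
    """Absolute error against the exact solution h(1) = h(0) * exp(-1)."""
    approx = euler_integrate(toy_dynamics, [1.0], steps)[0]
    return abs(approx - math.exp(-1.0))


coarse_error = error_at(10)     # few function evaluations, larger error
fine_error = error_at(1000)     # many function evaluations, smaller error
```

The number of solver steps plays the role that the number of stacked blocks plays in the discrete model: more steps cost more function evaluations and reduce the integration error, which is the runtime versus accuracy trade-off the abstract refers to.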

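2510.15297 2026-05-14 cs.CY cs.AI cs.HC cs.SI Updated version

VERA-MH Concept Paper

Luca Belli, Kate H. Bentley, Will Alexander, Emily Ward, Matt Hawrilenko, Kelly Johnston, Mill Brown, Adam M. Chekroud

Affiliations * Spring Health; Yale University School of Medicine

AI summary This paper introduces VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated system for evaluating the safety of AI chatbots in mental health settings, with a focus on identifying and responding to suicide risk. Drawing on clinical expertise, the team designed an evaluation rubric and uses two ancillary AI agents, a user-agent and a judge-agent, to simulate user conversations and to score the chatbot's performance, respectively. The system is under development and clinical validation, has been applied in preliminary tests of models including GPT-5 and Claude Opus, and will be refined further in its evaluation rubric and clinical utility.

Details
English abstract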

We introduce VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated evaluation of the safety of AI chatbots used in mental health contexts, with an initial focus on suicide risk. Practicing clinicians and academic experts developed a rubric informed by best practices for suicide risk management for the evaluation. To fully automate the process, we used two ancillary AI agents. A user-agent model simulates users engaging in a mental health-based conversation with the chatbot under evaluation. The user-agent role-plays specific personas with pre-defined risk levels and other features. Simulated conversations are then passed to a judge-agent who scores them based on the rubric. The final evaluation of the chatbot being tested is obtained by aggregating the scoring of each conversation. VERA-MH is actively under development and undergoing rigorous validation by mental health clinicians to ensure user-agents realistically act as patients and that the judge-agent accurately scores the AI chatbot. To date we have conducted preliminary evaluation of GPT-5, Claude Opus and Claude Sonnet using initial versions of the VERA-MH rubric and used the findings for further design development. Next steps will include more robust clinical validation and iteration, as well as refining actionable scoring. We are seeking feedback from the community on both the technical and clinical aspects of our evaluation.

2510.14244 2026-05-14 eess.IV cs.AI cs.CV Updated version

Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation

Arnaud Judge, Nicolas Duchateau, Thierry Judge, Roman A. Sandler, Joseph Z. Sokol, Christian Desrosiers, Olivier Bernard, Pierre-Marc Jodoin

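Affiliations: Department of Computer Science, University of Sherbrooke; INSA, Universite Claude Bernard Lyon 1, CNRS UMR 5220, Inserm U1206, CREATIS; Dep. of Software and Information Technology Engineering, École de technologie supérieure; Institut Universitaire de France (IUF)

AI summary: This study addresses domain adaptation for echocardiography segmentation and proposes RL4Seg3D, a reinforcement-learning-based unsupervised domain adaptation framework. By introducing novel reward functions and a fusion scheme, the method improves the precision of key anatomical landmarks in its segmentations and maintains temporal consistency while processing full-sized input videos. Experiments show that, without any target-domain labels, it significantly outperforms standard domain adaptation techniques and provides a robust uncertainty estimate that can further improve segmentation performance.

Comments 13 pages, accepted for publication in IEEE TMI

Abstract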

Domain adaptation methods aim to bridge the gap between datasets by enabling knowledge transfer across domains, reducing the need for additional expert annotations. However, many approaches struggle with reliability in the target domain, an issue particularly critical in medical image segmentation, where accuracy and anatomical validity are essential. This challenge is further exacerbated in spatio-temporal data, where the lack of temporal consistency can significantly degrade segmentation quality, and particularly in echocardiography, where the presence of artifacts and noise can further hinder segmentation performance. To address these issues, we present RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation. RL4Seg3D integrates novel reward functions and a fusion scheme to enhance key landmark precision in its segmentations while processing full-sized input videos. By leveraging reinforcement learning for image segmentation, our approach improves accuracy, anatomical validity, and temporal consistency while also providing, as a beneficial side effect, a robust uncertainty estimator, which can be used at test time to further enhance segmentation performance. We demonstrate the effectiveness of our framework on over 30,000 echocardiographic videos, showing that it outperforms standard domain adaptation techniques without the need for any labels on the target domain. Code is available at https://github.com/arnaudjudge/RL4Seg3D.

2510.13999 2026-05-14 cs.LG cs.AI Updated version

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa

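Affiliations: Cerebras Systems Inc.; Schulich School of Engineering, University of Calgary

AI summary: This paper studies effective expert compression for sparsely-activated Mixture-of-Experts (SMoE) models on generative tasks and finds that, unlike expert merging, which has recently performed well on discriminative tasks, expert pruning is the stronger choice for generation. The authors propose REAP, a pruning criterion based on router gate-values and expert activation norms that reduces reconstruction error. Experiments show that the method significantly outperforms existing approaches on several large SMoE models, especially at 50% compression, and achieves near-lossless compression on code generation tasks.

Comments 30 pages, 9 figures, 12 tables

Abstract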

Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we find that expert pruning is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
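
As a rough illustration of the kind of criterion this abstract describes, the sketch below scores each expert by the average, over the tokens routed to it, of the router gate-value times the L2 norm of the expert's output, and then marks the lowest-scoring experts for pruning. The abstract does not give the exact REAP formula, so the scoring rule, the function names `expert_saliency` and `experts_to_prune`, and the toy inputs are assumptions made for illustration only.

```python
import math


def expert_saliency(gates, outputs):
    """Score experts by mean(gate * ||expert output||) over routed tokens.

    gates[t][e]   : router gate value of expert e for token t (0.0 if not routed)
    outputs[t][e] : output vector (list of floats) of expert e for token t
    This scoring rule is an illustrative assumption, not the paper's formula.
    """
    num_experts = len(gates[0])
    scores = []
    for e in range(num_experts):
        total, count = 0.0, 0
        for t in range(len(gates)):
            g = gates[t][e]
            if g > 0.0:  # only tokens actually routed to this expert
                norm = math.sqrt(sum(x * x for x in outputs[t][e]))
                total += g * norm
                count += 1
        scores.append(total / count if count else 0.0)
    return scores


def experts_to_prune(scores, fraction):
    """Return the indices of the lowest-scoring `fraction` of experts."""
    k = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda e: scores[e])
    return sorted(order[:k])


# Toy example: 2 tokens, 3 experts, 2-dimensional expert outputs.
gates = [[0.7, 0.3, 0.0],
         [0.0, 0.5, 0.5]]
outputs = [[[3.0, 4.0], [0.0, 1.0], [9.0, 9.0]],
           [[1.0, 1.0], [0.0, 2.0], [6.0, 8.0]]]
scores = expert_saliency(gates, outputs)
pruned = experts_to_prune(scores, 0.5)
```

In this toy case expert 1 has small gate-weighted outputs on both of its tokens, so it is the one selected for removal; an expert that is rarely routed to but contributes large outputs when used (expert 2) is kept.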

2510.10642 2026-05-14 cs.RO cs.AI Updated version

UniJEPA: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning

Jianke Zhang, Yucheng Hu, Yanjiang Guo, Xiaoyu Chen, Yichen Liu, Wenna Chen, Chaochao Lu, Jianyu Chen

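Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; Shanghai Qi Zhi Institute, Shanghai, China; Peking University, Beijing, China; Shanghai AI Lab, Shanghai, China

AI summary: This paper proposes UniJEPA, a new robot policy learning framework aimed at improving robots' ability to handle diverse tasks in open-ended environments. The method learns unified continuous and discrete visual representations and combines large-scale pretraining with fine-tuning on robot-embodiment data, enabling dynamic modeling of high-dimensional visual features and learning of the mapping from predictive representations to actions. Experiments show that UniJEPA outperforms existing baselines both in simulation and on real-world out-of-distribution tasks.

Journal ref ICML 2026

Abstract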

Building generalist robot policies that can handle diverse tasks in open-ended environments is a central challenge in robotics. To leverage knowledge from large-scale pretraining, prior work on vision-language-action (VLA) models has typically built generalist policies either on top of vision-language understanding models (VLMs) or generative models. However, both semantic understanding from vision-language pretraining and visual dynamics modeling from visual-generation pretraining are crucial for embodied robots. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We posit that robotic policy learning can likewise benefit from the combined strengths of understanding, planning, and continuous future representation learning. Building on this insight, we introduce UniJEPA, which acquires the ability to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos. Subsequently, UniJEPA is fine-tuned on data collected from the robot embodiment, enabling the learning of mappings from predictive representations to action tokens. Extensive experiments show that our approach consistently outperforms baseline methods by 9% and 12% across simulation environments and real-world out-of-distribution tasks.

2510.03548 2026-05-14 cs.CV cs.AI Updated version

Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing

Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm

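Affiliations: Drexel University; NVIDIA

AI summary: This paper studies impersonation attacks on AI-based videoconferencing systems, in which an attacker manipulates the transmitted latent representation to hijack a user's likeness in real time. As a defense, the authors exploit the biometric information inherent in the latent and design a pose-conditioned contrastive encoder that isolates identity features while cancelling the effects of pose and expression, so that impersonation can be detected without relying on the reconstructed video. Experiments on multiple generation models show superior detection performance, real-time operation, and good generalization.

Abstract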

AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.
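
The "simple cosine test" mentioned in this abstract can be pictured with a minimal sketch: compare an identity embedding of the enrolled speaker with the embedding recovered from the transmitted latent, and flag the frame when their cosine similarity falls below a threshold. The encoder itself is not shown; the function names, the two-dimensional toy embeddings, and the threshold value of 0.5 are hypothetical choices for illustration, not values from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_puppeteered(enrolled_embedding, frame_embedding, threshold=0.5):
    """Flag a frame whose identity embedding drifts away from the enrolled one.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    return cosine_similarity(enrolled_embedding, frame_embedding) < threshold


same_speaker = is_puppeteered([1.0, 0.0], [1.0, 0.1])   # nearly parallel embeddings
other_speaker = is_puppeteered([1.0, 0.0], [0.0, 1.0])  # orthogonal embeddings
```

Because the test is a single dot product per frame, it adds very little cost on top of the encoder, which fits the real-time operation the abstract reports.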

2509.25781 2026-05-14 cs.AI cs.LO Updated version

Deontic Argumentation

Guido Governatori, Antonino Rotolo

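Affiliations: School of Engineering and Technology, Central Queensland University; Alma AI and Department of Legal Studies, University of Bologna

AI summary: This paper studies how to define a semantics for deontic argumentation that supports weak permission. The authors point out that current approaches based on grounded semantics cannot support weak permission when obligations conflict, and they propose a new deontic argumentation theory that handles weak permission correctly, strengthening the semantic foundations of deontic argumentation.

Abstract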

We address the issue of defining a semantics for deontic argumentation that supports weak permission. Some recent results show that grounded semantics do not support weak permission when there is a conflict between two obligations. We provide a definition of Deontic Argumentation Theory that accounts for weak permission, and we recall the result about grounded semantics. Then, we propose a new semantics that supports weak permission.

2509.23597 2026-05-14 cs.LG cs.AI Updated version

Characteristic Root Analysis and Regularization for Linear Time Series Forecasting

Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li, Tobias Schlagenhauf

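Affiliations: Bosch Center for AI (BCAI) & Bosch (China) Investment Co., Ltd.; Robert Bosch GmbH

AI summary: This paper systematically studies linear models for time series forecasting, focusing on the role of characteristic roots in temporal dynamics, and shows that models tend to produce spurious characteristic roots in noisy settings. The authors propose two complementary regularization strategies: one recovers the latent dynamics using low-rank regression techniques, and the other, a new method called "Root Purge", encourages the model to learn a noise-suppressing null space. Experiments show that both perform well on multiple benchmark datasets, validating the theoretical analysis and reaching state-of-the-art results in some settings.

Comments Accepted for publication at ICLR 2026

Abstract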

Time series forecasting remains a critical challenge across numerous domains, yet the effectiveness of complex models often varies unpredictably across datasets. Recent studies highlight the surprising competitiveness of simple linear models, suggesting that their robustness and interpretability warrant deeper theoretical investigation. This paper presents a systematic study of linear models for time series forecasting, with a focus on the role of characteristic roots in temporal dynamics. We begin by analyzing the noise-free setting, where we show that characteristic roots govern long-term behavior and explain how design choices such as instance normalization and channel independence affect model capabilities. We then extend our analysis to the noisy regime, revealing that models tend to produce spurious roots. This leads to the identification of a key data-scaling property: mitigating the influence of noise requires disproportionately large training data, highlighting the need for structural regularization. To address these challenges, we propose two complementary strategies for robust root restructuring. The first uses rank reduction techniques, including Reduced-Rank Regression (RRR) and Direct Weight Rank Reduction (DWRR), to recover the low-dimensional latent dynamics. The second, a novel adaptive method called Root Purge, encourages the model to learn a noise-suppressing null space during training. Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings. Our findings underscore the potential of integrating classical theories for linear systems with modern learning techniques to build robust, interpretable, and data-efficient forecasting models. The code is publicly available at: https://github.com/Wangzzzzzzzz/RootPurge.
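
To make the statement that characteristic roots govern long-term behavior concrete, here is a small textbook-style sketch for a second-order linear recurrence y_t = w1*y_(t-1) + w2*y_(t-2), whose characteristic polynomial is z^2 - w1*z - w2. Roots inside the unit circle give forecasts that decay, and a root outside it gives forecasts that grow. This is the classical linear-systems fact only, not the paper's analysis or code; the function names and coefficients are chosen for illustration.

```python
import cmath


def characteristic_roots_ar2(w1, w2):
    """Roots of z^2 - w1*z - w2 = 0 for y_t = w1*y_{t-1} + w2*y_{t-2}."""
    disc = cmath.sqrt(w1 * w1 + 4.0 * w2)
    return ((w1 + disc) / 2.0, (w1 - disc) / 2.0)


def rollout(w1, w2, y0, y1, steps):
    """Iterate the recurrence for `steps` further values after y0 and y1."""
    ys = [y0, y1]
    for _ in range(steps):
        ys.append(w1 * ys[-1] + w2 * ys[-2])
    return ys


# Complex roots with modulus below 1: a damped oscillation.
damped_roots = characteristic_roots_ar2(1.0, -0.5)
damped_path = rollout(1.0, -0.5, 1.0, 1.0, 40)

# One real root above 1 (the golden ratio): the Fibonacci recurrence grows.
growing_roots = characteristic_roots_ar2(1.0, 1.0)
growing_path = rollout(1.0, 1.0, 1.0, 1.0, 10)
```

A spurious root in the abstract's sense would show up here as an extra root with no counterpart in the true dynamics, which is why the long-horizon behavior of a fitted model is sensitive to noise in its weights.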

2509.19538 2026-05-14 cs.LG cs.AI Updated version

DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions

Zongyue Li, Xiao Han, Yusong Li, Niklas Strauss, Matthias Schubert

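Affiliations: Department of Computer Science, University of Munich, Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany

AI summary: DAWM is a diffusion-based world model designed to improve offline reinforcement learning. It generates future state-reward trajectories conditioned on the current state, action, and return-to-go, and pairs them with an inverse dynamics model for efficient action inference, producing complete synthetic transitions suitable for offline RL algorithms based on one-step temporal difference learning. Experiments show that DAWM significantly improves conservative offline RL algorithms such as TD3BC and IQL on the D4RL benchmark, outperforming existing diffusion-based baselines.

Comments ICML 2025 workshop Building Physically Plausible World Models

Abstract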

Diffusion-based world models have demonstrated strong capabilities in synthesizing realistic long-horizon trajectories for offline reinforcement learning (RL). However, many existing methods do not directly generate actions alongside states and rewards, limiting their compatibility with standard value-based offline RL algorithms that rely on one-step temporal difference (TD) learning. While prior work has explored joint modeling of states, rewards, and actions to address this issue, such formulations often lead to increased training complexity and reduced performance in practice. We propose DAWM, a diffusion-based world model that generates future state-reward trajectories conditioned on the current state, action, and return-to-go, paired with an inverse dynamics model (IDM) for efficient action inference. This modular design produces complete synthetic transitions suitable for one-step TD-based offline RL, enabling effective and computationally efficient training. Empirically, we show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories, consistently outperforming prior diffusion-based baselines across multiple tasks in the D4RL benchmark.

2509.08461 2026-05-14 cs.LG cs.AI cs.CV hep-ex Updated version

Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi

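Affiliations: Department of Computer Science, University of California, Irvine, CA, USA; Department of Physics, University of California, Irvine, CA, USA

AI summary: This paper applies vision-language models (VLMs) to neutrino event classification in high-energy physics experiments, proposing a VLM approach based on a fine-tuned LLaMA 3.2 and comparing it with a convolutional neural network (CNN) and a Vision Transformer (ViT). Experiments show that transformer-based models outperform conventional CNNs in classification accuracy and robustness, while the VLM, by incorporating textual or semantic information, further improves the interpretability and reasoning of its predictions. The study shows the potential of VLMs as a general-purpose framework for physics event classification and offers a new direction for multimodal reasoning in neutrino physics experiments.

Comments Accepted for publication in Communications Physics (Nature Portfolio)

Abstract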

Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMA 3.2 to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in major neutrino experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events, and also a Vision Transformer (ViT-h/14), which is the same architecture inside the VLM's vision encoder. Our evaluation considers both classification performance and interpretability of the model predictions, comparing a VLM with a vision-only transformer (ViT) and a convolutional neural network (CNN) baseline. We find that transformer-based architectures outperform conventional CNNs in classification accuracy and robustness, with the VLM providing additional flexibility through the integration of auxiliary textual or semantic information and enabling more interpretable, reasoning-based predictions. These results highlight the potential of large transformer models, particularly vision-language models, as general-purpose backbones for physics event classification, combining strong performance, robustness, and interpretability, and opening new avenues for multimodal reasoning in experimental neutrino physics.

2509.00626 2026-05-14 cs.CV cs.AI Updated version

Towards Methane Detection Onboard Satellites

Maggie Chen, Hala Lamdouar, Luca Marini, Laura Martínez-Ferrer, Chris Bridges, Giacomo Acciarini

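Affiliations: University of Oxford; Delft University of Technology; Universitat de València; University of Surrey; European Space Agency (ESA)

AI summary: This paper studies how machine learning deployed onboard satellites can rapidly detect methane to support timely responses to climate change. It proposes a new approach that skips conventional image preprocessing and trains directly on unorthorectified hyperspectral data, achieving detection performance comparable to that of conventional approaches. It also shows that models trained on orthorectified data outperform the conventional matched filter baseline, and it releases datasets and code as a resource for related research.

Abstract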

Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using unorthorectified data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS, along with code at https://github.com/spaceml-org/plume-hunter.

2509.00072 2026-05-14 cs.AI Updated version

Test of Time: Rethinking Temporal Signal of Benchmark Contamination

Terry Jingchen Zhang, Gopal Dev, Ning Wang, Max Obreiter, Punya Syon Pandey, Keenan Samway, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin

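Affiliations: Jinesis Lab, University of Toronto & Vector Institute; ETH Zürich & ETH AI Center; Max Planck Institute for Intelligent Systems, Tübingen, Germany; ELLIS Institute Tübingen

AI summary: This paper re-examines the use of LLM performance decay after the training cutoff as a temporal signal of benchmark contamination. It shows that this signal depends heavily on how benchmark questions are constructed: even with the source material unchanged, different question formats can yield very different temporal patterns. Experiments demonstrate that an LLM-driven transformation of the same problems can effectively remove the temporal pattern, and an influence function analysis sheds light on the mechanism. The signal is therefore sensitive to question construction, and more robust methods are needed to assess contamination.

Comments ACL 2026

Abstract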

Post-cutoff performance decay of LLMs has been widely interpreted as a temporal signal for benchmark contamination, where public information released before the training cutoff may have been included into training corpora and inflated model performance by memorization. We critically examine this view and demonstrate that this temporal signal is highly sensitive to how benchmark questions are constructed, even if the underlying source material remains invariant. Specifically, we show that LLM-transformed questions can produce remarkably different temporal patterns compared to fill-in-the-blank (cloze) questions directly retrieved from the very same documents. We validate this effect on prior benchmarks that report clear post-cutoff decay (LiveCodeBench), and show that a simple LLM-driven transformation of the same problems can effectively remove the temporal pattern. We further provide a mechanistic understanding of this phenomenon using influence function analysis. Overall, our results suggest that post-cutoff performance decay is a sensitive contamination signal, motivating more robust contamination probes for reliable LLM evaluation.

2508.14302 2026-05-14 cs.LG cs.AI cs.CL Updated version

GLASS: Global-Local Aggregation for Inference-time Sparsification of LLMs

Amirmohsen Sattarifard, Sepehr Lavasani, Kunlin Zhang, Amirhossein Rajabpour, Hanlin Xu, Fengyu Sun, Negar Hassanpour, Chao Gao

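Affiliations: Huawei Technologies Canada Co., Ltd.; Huawei Technologies Ltd.

AI summary: This paper proposes GLASS, an inference-time sparsification framework for efficiently deploying large language models on resource-constrained devices. The method stabilizes dynamic pruning by combining local activation information from the input prompt with a global model-intrinsic prior, improving generation quality. Experiments show that GLASS significantly outperforms existing training-free methods in short-prompt, long-generation scenarios, lowering perplexity and KL divergence while speeding up on-device inference.

Abstract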

Inference-time sparsification is a promising path to deploy large language models (LLMs) on resource-constrained devices, yet existing training-free methods typically estimate feedforward network (FFN) neuron importance from the input prompt alone. We show this prompt-only signal is often unreliable, especially for short prompts and long-form decoding, leading to inaccurate masks and degraded generation fidelity. We propose GLASS, a plug-and-play, training-free framework that stabilizes dynamic FFN pruning by aggregating two complementary views of neuron criticality: local prompt-specific activations and a global model-intrinsic prior. GLASS fuses global and local signals via rank aggregation, yielding robust critical-neuron selection even when the prompt is short. We interpret GLASS as the maximum-a-posteriori consensus ranking under a permutation-based probabilistic model, providing a principled foundation for its weighted rank-aggregation rule. We apply GLASS to a diverse set of open-source LLMs, and show that it yields substantial improvements over prior training-free baselines in the challenging short-prompt, long-generation scenarios, achieving up to 45.10% lower perplexity and 25.73% lower KL divergence, while delivering significant on-device decoding speedup.
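
The idea of fusing a local and a global view through rank aggregation can be sketched as a weighted Borda-style rule: rank the neurons under each importance signal, mix the two ranks with a weight, and keep the neurons with the best fused rank. The abstract confirms only that GLASS uses a weighted rank-aggregation rule; the specific formula below, the function names `ranks` and `select_neurons`, the `weight` parameter, and the toy scores are assumptions for illustration.

```python
def ranks(scores):
    """Rank positions per neuron: 0 is the most important (highest score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order):
        result[index] = position
    return result


def select_neurons(local_scores, global_scores, keep, weight=0.5):
    """Keep the `keep` neurons with the best weighted average rank.

    weight = 1.0 trusts only the prompt-specific (local) signal,
    weight = 0.0 trusts only the model-intrinsic (global) prior.
    """
    local_ranks = ranks(local_scores)
    global_ranks = ranks(global_scores)
    fused = [weight * l + (1.0 - weight) * g
             for l, g in zip(local_ranks, global_ranks)]
    order = sorted(range(len(fused)), key=lambda i: fused[i])
    return sorted(order[:keep])


# Toy FFN layer with 4 neurons.
local_scores = [0.9, 0.1, 0.5, 0.2]    # activations on a short prompt
global_scores = [0.2, 0.8, 0.9, 0.1]   # prior estimated offline
fused_choice = select_neurons(local_scores, global_scores, keep=2)
global_only = select_neurons(local_scores, global_scores, keep=2, weight=0.0)
```

Working on ranks rather than raw scores keeps the two signals comparable even when their scales differ, which is one plausible reason to aggregate ranks instead of averaging activations directly.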

2508.07642 2026-05-14 cs.AI cs.CL cs.CV Updated version

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi

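Affiliations: Michigan State University; ESAT-PSI, KU Leuven

AI summary: Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and navigate complex 3D environments, and current methods still struggle in unseen scenarios that require complex spatial and temporal reasoning. This paper proposes SkillNav, a framework that decomposes navigation into a set of interpretable atomic skills, each handled by a specialized agent, introducing structured skill-based reasoning. The authors also build a synthetic data generation pipeline to support skill training without manual annotation, and design a vision-language-model-based router that dynamically selects the most suitable agent, significantly improving generalization to novel instruction styles and unseen environments.

Comments Accepted by ACL 2026 Main Conference

Abstract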

Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization to the GSA-R2R, a benchmark with novel instruction styles and unseen environments.

2507.19247 2026-05-14 cs.LG cs.AI cs.CL Updated version

A Markov Categorical Framework for Language Modeling

Yifan Zhang

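Affiliations: Princeton University

AI summary: This paper proposes an analytical framework for language modeling based on Markov categories, aiming to give a unified account of the internal mechanisms of autoregressive language models, how training shapes representation learning, and how these representations support complex behavior. The framework models the single-step generation process as a composition of information-processing stages, connecting the training objective, the geometry of the representation space, and model capabilities. The study also shows how the negative log-likelihood objective learns both the next token and the data's conditional uncertainty, and a spectral analysis shows that, under specific conditions, the optimized loss aligns representation directions with predictive prototypes, offering a new perspective on information flow and internal model structure.

Abstract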

Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes representations, and why these representations support complex behavior remains incomplete. We introduce an analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective connects three aspects of language modeling that are often studied separately: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework gives an information-theoretic rationale for parallel drafting methods such as speculative decoding by quantifying the information surplus a hidden state contains about future tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective learns not only a most likely next token, but also the data's intrinsic conditional uncertainty, formalized through categorical entropy. Our main spectral result is conditional: for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces, after whitening or variance normalization, a generalized CCA/eigenproblem aligning representation directions with predictive prototypes. This gives a compositional lens for understanding how information flows through a model and how likelihood training can shape its internal geometry.
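
The claim that NLL training captures the data's conditional uncertainty, and not only the most likely next token, can be related to a standard identity. The notation here is an assumption: p(· | x) is the true next-token distribution given context x, and q_θ(· | x) is the model's prediction. This is the usual cross-entropy decomposition, offered as background rather than as the paper's categorical formulation.

```latex
\mathbb{E}_{y \sim p(\cdot \mid x)}\bigl[-\log q_\theta(y \mid x)\bigr]
  = H\bigl(p(\cdot \mid x)\bigr)
  + D_{\mathrm{KL}}\bigl(p(\cdot \mid x) \,\|\, q_\theta(\cdot \mid x)\bigr)
```

The entropy term does not depend on θ, so minimizing the expected NLL is the same as driving the KL term to zero. The optimum therefore reproduces the whole conditional distribution, including how spread out it is, and the smallest achievable loss equals the conditional entropy of the data.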

2507.15867 2026-05-14 cs.LG cs.AI cs.CL cs.MA Updated version

RDMA: Cost Effective Agent-Driven Rare Disease Mining from Electronic Health Records

John Wu, Adam Cross, Jimeng Sun

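Affiliations: Department of Computer Science, University of Illinois Urbana-Champaign; Department of Pediatrics, University of Illinois College of Medicine Peoria

AI summary: To address the underdocumentation of rare diseases in electronic health records, this study proposes RDMA, an agent-based rare disease mining framework. The method equips small quantized language models with specialized tools for abbreviation resolution, implicit phenotype reasoning, and ontology mapping, and it significantly outperforms existing methods on multiple datasets without task-specific training. RDMA substantially reduces inference and hardware costs, and its uncertainty-flagging mechanism lightens the expert annotation burden, offering a practical path to scalable rare disease documentation in clinical practice.

Abstract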

Rare diseases affect 1 in 10 Americans yet remain systematically underdocumented in clinical records. ICD-based systems cannot capture their breadth: over 50% of Orphanet codes lack a direct ICD mapping and only 2.2% of HPO codes have matching ICD codes, leaving patient populations invisible and delaying diagnosis. Mining unstructured clinical notes offers a direct path forward, but real notes are long, noisy, and abbreviation-dense, and limited annotations make fine-tuning infeasible, demanding approaches that generalize without task-specific training. We present Rare Disease Mining Agents (RDMA), an agentic framework equipping smaller quantized LLMs with tools for abbreviation resolution, implicit phenotype reasoning, and ontology grounding against Orphanet and HPO. RDMA substantially outperforms fine-tuned and RAG-based baselines across benchmarks with different data characteristics, without any task-specific training. A small quantized model achieves maximal performance, reducing inference costs by up to 10x and local hardware costs by up to 17x, enabling private deployment on standard hardware without cloud-based PHI exposure. RDMA's uncertainty-flagging mechanism further reduces expert annotation burden while preserving agreement quality, supporting scalable rare disease documentation in clinical practice. Available at https://github.com/jhnwu3/RDMA.

2507.03167 2026-05-14 cs.CL cs.AI cs.LG Updated version

Where Do Reasoning Models Refuse?

Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi

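Affiliations: The Alan Turing Institute; University of Oxford; Northeastern University

AI summary: This paper studies when, during generation, reasoning models decide to refuse harmful requests. Analyzing four open-source reasoning models, it finds that the chain of thought causally influences refusal decisions: fixing a specific reasoning trace substantially reduces variance in whether the model ultimately refuses or complies. In distilled models, subtle differences at the start of the reasoning chain can fully determine the refusal decision, and these patterns transfer across models distilled from the same teacher. The study also extracts refusal directions from model activations and verifies their effect on harmful compliance.

Comments v1 accepted to the ICML 2025 Workshop on Reliable and Responsible Foundation Models (R2FM). 20 pages, 12 figures

Abstract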

Chat models without chain-of-thought (CoT) reasoning must decide whether to refuse a harmful request before generating their first response token. Reasoning models, by contrast, produce extended chains of thought before their final output, raising a natural question: where in this process does the decision to refuse occur? We investigate this across four open-source reasoning models. We first show that the CoT causally influences refusal outcomes; fixing a specific reasoning trace substantially reduces variance in whether the model ultimately refuses or complies. Zooming into the reasoning trace, we find that in distilled models, subtle differences in the opening sentence of the CoT can fully determine the model's refusal decision, and that these patterns transfer across models distilled from the same teacher. Finally, we extract linear refusal directions from model activations and show that ablating them increases harmful compliance, though less reliably than the same technique achieves on non-reasoning models, and with non-negligible degradation to general capabilities.

2507.00990 2026-05-14 cs.RO cs.AI cs.CV Updated version

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, Yunzhu Li

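Affiliations: UIUC; UC Irvine; Columbia University

AI summary: This paper presents RIGVid, a system that lets robots perform complex manipulation tasks such as pouring, wiping, and mixing by imitating AI-generated videos, without any physical demonstrations or robot-specific training. Given a language command and an initial scene image, the system generates candidate demonstration videos, uses a vision-language model to filter for videos that follow the command, and then extracts object trajectories with 6D pose tracking and retargets them to the robot. Experiments show that the generated videos are effective on real tasks, that performance improves with generation quality, and that the approach outperforms more compact alternatives such as keypoint prediction.

Comments In ICLR 2026. Website: https://rigvid-robot.github.io/

Abstract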

This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks--such as pouring, wiping, and mixing--purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.

2507.00029 2026-05-14 cs.LG cs.AI Updated version

LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing

Wenbing Li, Zikai Song, Hang Zhou, Yunyao Zhang, Junqing Yu, Wei Yang

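Affiliations: Huazhong University of Science and Technology

AI summary: This paper proposes LoRA-Mixer, a modular mixture-of-experts framework aimed at improving parameter efficiency and task specialization in multi-task adaptation of large language models. Unlike existing methods, LoRA-Mixer places task-specific LoRA experts in the core projection matrices of the attention module rather than primarily targeting FFN blocks, enabling finer-grained token-level specialization. With an adaptive Routing Specialization Loss (RSL), the method trains robust routing policies from limited data, improves the stability of expert selection and expert reuse, and outperforms existing methods on multiple benchmarks with fewer trainable parameters.

Abstract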

Recent attempts to combine low-rank adaptation (LoRA) with mixture-of-experts (MoE) for multi-task adaptation of Large Language Models (LLMs) often replace whole attention/FFN layers with switch experts or append parallel expert branches, undermining parameter efficiency and limiting task specialization. We introduce LoRA-Mixer, a modular MoE framework that routes task-specific LoRA experts into the core projection matrices of the attention module, namely input and output linear layers, rather than primarily targeting FFN blocks. The design delivers fine-grained token-level specialization by fully exploiting the attention mechanism, while remaining drop-in compatible with Transformers and state-space models (SSMs), since linear projection layers are ubiquitous. To train robust routers from limited data while promoting stable, selective decisions and high expert reuse, LoRA-Mixer employs an adaptive Routing Specialization Loss (RSL) that jointly enforces global load balance and input-aware specialization via an entropy-shaping objective. The framework supports two regimes: (i) joint optimization of adapters and router with a differentiable hard-soft top-k routing scheme, and (ii) plug-and-play routing over frozen, pre-trained LoRA modules sourced from public repositories. Across 15 benchmarks, including MedQA, GSM8K, HumanEval, and GLUE, RSL-optimized LoRA-Mixer outperforms state-of-the-art routing and LoRA-MoE baselines while using 48 percent of their trainable parameters, with gains of 3.79, 2.90, and 3.95 percentage points on GSM8K, CoLA, and ARC-C, respectively. Cross-model transfer and adapter reuse experiments further demonstrate the approach's versatility and data efficiency. Our code is available at https://github.com/hustcselwb/LoRA-Mixer.

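2506.13456 2026-05-14 cs.AI cs.RO updated version

Block-wise Adaptive Caching for Accelerating Diffusion Policy

Kangye Ji, Yuan Meng, Hanyun Cui, Ye Li, Jianbo Zhou, Shengjia Hua, Lei Chen, Zhi Wang

Affiliations Tsinghua Shenzhen International Graduate School, Tsinghua University; Department of Computer Science and Technology, Tsinghua University

AI summary Diffusion Policy excels at visuomotor control modeling, but its high computational cost makes it hard to use for real-time robotic control. This paper proposes Block-wise Adaptive Caching (BAC), which accelerates action generation losslessly by caching intermediate action features and adaptively updating and reusing them. BAC introduces an Adaptive Caching Scheduler and the Bubbling Union Algorithm, which mitigate the propagation of caching errors between blocks, and it delivers up to 3x inference speedup for a range of transformer-based diffusion policies and vision-language-action models without changing the model architecture.

Details
English abstract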

Diffusion Policy has demonstrated strong visuomotor modeling capabilities, but its high computational cost renders it impractical for real-time robotic control. Despite huge redundancy across repetitive denoising steps, existing diffusion acceleration techniques fail to generalize to Diffusion Policy due to fundamental architectural and data divergences. In this paper, we propose $\textbf{B}$lock-wise $\textbf{A}$daptive $\textbf{C}$aching ($\textbf{BAC}$), a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves lossless action generation acceleration by adaptively updating and reusing cached features at the block level, based on a key observation that feature similarities exhibit non-uniform temporal dynamics and distinct block-specific patterns. To operationalize this insight, we first design an Adaptive Caching Scheduler to identify optimal update timesteps by maximizing the global feature similarities between cached and skipped features. However, applying this scheduler for each block leads to significant error surges due to the inter-block propagation of caching errors, particularly within Feed-Forward Network (FFN) blocks. To mitigate this issue, we develop the Bubbling Union Algorithm, which truncates these errors by updating the upstream blocks with significant caching errors before downstream FFNs. As a training-free plugin, BAC is readily integrable with existing transformer-based Diffusion Policy and vision-language-action models. Extensive experiments on multiple robotic benchmarks demonstrate that BAC achieves up to 3$\times$ inference speedup for free. Project page: https://block-wise-adaptive-caching.github.io.

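2506.09522 2026-05-14 cs.CV cs.AI cs.CL updated version

Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding

Beomsik Cho, Jaehyung Kim

Affiliations Yonsei University

AI summary This study examines the role of visual information in the decoding process of large vision-language models (LVLMs). It finds that vision tokens carry meaningful visual information even when hallucinations occur, and that their semantics can be made explicit in the text space. Building on this, the study proposes ReVisiT, a training-free decoding method that projects vision tokens into the text distribution and dynamically selects the most relevant vision token during decoding to guide text generation, improving the model's incorporation of visual semantics. Experiments show that ReVisiT performs strongly on multiple benchmarks while reducing computational cost.

Comments ACL 2026 Main Conference (Oral). 30 pages, 10 figures. Code: https://github.com/bscho333/ReVisiT

Details
English abstract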

Large Vision Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that guides text generation in LVLMs by Referencing Vision Tokens. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization. Then, ReVisiT uses its constrained projection to refine the output distribution to better incorporate visual semantics. Across five benchmarks on recent LVLMs, ReVisiT achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to $2\times$.

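2505.22445 2026-05-14 cs.CV cs.AI updated version

NFR: Neural Feature-Guided Non-Rigid Shape Registration

Zhangquan Chen, Puhua Jiang, Mingze Sun, Ruqi Huang

Affiliations Tsinghua Shenzhen International Graduate School

AI summary This paper proposes a new neural feature-guided framework for non-rigid shape registration that handles significant non-rigid deformation and partiality between input shapes without correspondence annotations. The method incorporates neural features extracted by deep learning-based shape matching networks into an iterative geometric registration pipeline, which both improves the accuracy and semantic meaningfulness of the correspondences and increases robustness through dynamic updates and consistency-prior filtering. Experiments show that, even with only a small number of training shapes, the method achieves state-of-the-art performance on several non-rigid point cloud matching and partial shape matching benchmarks and handles complex deformations that traditional methods struggle with.

Comments 18 pages, 16 figures. arXiv admin note: substantial text overlap with arXiv:2311.04494

Details
English abstract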

In this paper, we propose a novel learning-based framework for 3D shape registration, which overcomes the challenges of significant non-rigid deformation and partiality among input shapes and, remarkably, requires no correspondence annotation during training. Our key insight is to incorporate neural features learned by deep learning-based shape matching networks into an iterative, geometric shape registration pipeline. The advantage of our approach is two-fold. On the one hand, neural features provide more accurate and semantically meaningful correspondence estimation than spatial features (e.g., coordinates), which is critical in the presence of large non-rigid deformations. On the other hand, the correspondences are dynamically updated according to the intermediate registrations and filtered by a consistency prior, which markedly robustifies the overall pipeline. Empirical results show that, with as few as dozens of training shapes of limited variability, our pipeline not only achieves state-of-the-art results on several benchmarks of non-rigid point cloud matching and partial shape matching across varying settings, but also delivers high-quality correspondences between unseen challenging shape pairs that undergo both significant extrinsic and intrinsic deformations, in which case neither traditional registration methods nor intrinsic methods work. Our code is available at https://github.com/rqhuang88/NFR.

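2505.17469 2026-05-14 cs.LG cs.AI cs.IT math.IT math.OC math.ST stat.TH updated version

Efficient compression of neural networks and datasets

Lukas Silvester Barth, Paulo von Petersenn

Affiliations Max Planck Institute for Mathematics in the Sciences

AI summary This paper studies efficient compression of neural networks and datasets. Combining algorithmic information theory with neural network pruning, it proposes an approach to improving model generalization based on the minimum description length (MDL) principle. By using parameter sparsity as a computable approximation of model description length and refining sparse optimization algorithms, the authors achieve substantial model compression on image and text datasets while maintaining high accuracy. The experiments also confirm the advantages of compressed models in sample efficiency and generalization, supporting the predictions of Solomonoff induction.

Comments 10 pages plus appendix, 9 Figures, 6 Tables

Details
English abstract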

Compression and generalization are fundamentally related through Solomonoff induction and the minimum description length principle (MDL), which predict that simpler models generalize better when data arises from low-complexity distributions. In this article, we combine insights from algorithmic information theory and techniques from neural network pruning to improve model generalization by identifying the most effective data compression method. Since exact MDL optimization is intractable, we cast it as $\ell_0$ regularized learning and explain why parameter sparsity provides an effective computable approximation of model description length. To identify the best practical approach, we systematically compare and refine complementary sparse optimization methods. In particular, we improve probabilistic pruning through a procedure that does not require Monte Carlo sampling and refine smooth $\ell_0$ approximations with a binary search routine that reduces hyperparameter complexity. Across convolutional networks and transformers evaluated on image and text datasets, our refined methods improve upon their predecessors, achieve substantial model compression with minimal accuracy loss, and yield short data description lengths. Finally, we use these methods in a controlled teacher-student setting to empirically verify the prediction of Solomonoff induction that compressed models learn more sample-efficiently and generalize better.
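
The $\ell_0$-regularized formulation mentioned in this abstract can be sketched as a data-fit term plus a penalty on the number of nonzero parameters. The function names, the weight `lam`, and the toy numbers below are assumptions for illustration; the paper's actual optimizers (probabilistic pruning and smooth $\ell_0$ approximations) are not reproduced here.

```python
def l0_norm(params, tol=0.0):
    """Count parameters whose magnitude exceeds tol (the l0 'norm')."""
    return sum(1 for p in params if abs(p) > tol)

def l0_regularized_objective(data_loss, params, lam):
    """Data-fit term plus a penalty proportional to the number of nonzero parameters.

    The penalty serves as a computable stand-in for model description length:
    fewer nonzero parameters means fewer values to encode.
    """
    return data_loss + lam * l0_norm(params)

# Hypothetical comparison: a dense model with slightly lower loss versus a sparse one.
dense = [0.5, -0.1, 0.02, 0.3]
sparse = [0.5, 0.0, 0.0, 0.3]
dense_objective = l0_regularized_objective(1.00, dense, lam=0.1)
sparse_objective = l0_regularized_objective(1.05, sparse, lam=0.1)
print(dense_objective, sparse_objective)
```

With these made-up numbers the sparse model wins under the combined objective even though its data loss is slightly higher, which is the trade-off the MDL view formalizes.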

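2505.12942 2026-05-14 cs.CL cs.AI cs.LG updated version

A3: an Analytical Low-Rank Approximation Framework for Attention

Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, Christos-Savvas Bouganis, George A. Constantinides, Wayne Luk, Yiren Zhao

Affiliations Department of Electrical and Electronic Engineering, Imperial College London

AI summary Large language models perform very well, but their massive parameter counts make deployment expensive. This paper proposes $A^3$, a post-training low-rank approximation framework that splits a Transformer layer into three functional components and optimizes analytically within each one, reducing model parameters, KV cache size, and computation without introducing runtime overhead. Experiments show that $A^3$ preserves model performance better than existing methods; for example, under the same compute and memory compression budget, its approximated LLaMA 3.1-70B achieves substantially better perplexity on WikiText-2 than the previous state-of-the-art method.

Details
English abstract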

Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) they focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches and memory operations for decomposed small matrices. To address these limitations, we propose $A^3$, a post-training low-rank approximation framework. $A^3$ splits a Transformer layer into three functional components, namely $\texttt{QK}$, $\texttt{OV}$, and $\texttt{MLP}$, and provides analytical solutions that reduce the hidden dimension size inside each component while minimizing the component's functional loss. This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overhead. Through extensive experiments, we show that $A^3$ maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also show versatile applications of $A^3$ in KV cache compression, integration with quantization, fine-tuning and mixed-rank assignments. We open-sourced our framework and code at https://github.com/DeepWok/a3.

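2505.12415 2026-05-14 cs.CL cs.AI updated version

Table-R1: Region-based Reinforcement Learning for Table Understanding

Zhenhe Wu, Jian Yang, Zhongjiang He, Changzai Pan, Jie Zhang, Jiaheng Liu, Xianjie Wu, Yu Zhao, Shuangyong Song, Yongxiang Li, Zhoujun Li, Xueling Li

Affiliations Beihang University; Xingchen AGI Lab; China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd; Institute of Artificial Intelligence (TeleAI), China Telecom

AI summary Table understanding poses unique challenges for language models, because structured row-column interactions call for specialized methods. This paper proposes Table-R1, a region-based reinforcement learning method that improves table question answering through Region-Enhanced Supervised Fine-Tuning and Table-Aware Group Relative Policy Optimization. The method combines textual, symbolic, and program-based reasoning to make precise use of table region information, and experiments show that it significantly outperforms baseline models with far more parameters on multiple benchmark datasets.

Details
English abstract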

Tables present unique challenges for language models due to their structured row-column interactions, necessitating specialized approaches for effective comprehension. While large language models (LLMs) have demonstrated potential in table reasoning through prompting and techniques like chain-of-thought (CoT) and program-of-thought (PoT), optimizing their performance for table question answering remains underexplored. In this paper, we introduce region-based Table-R1, a novel reinforcement learning approach that enhances LLM table understanding by integrating region evidence into reasoning steps. Our method employs Region-Enhanced Supervised Fine-Tuning (RE-SFT) to guide models in identifying relevant table regions before generating answers, incorporating textual, symbolic, and program-based reasoning. Additionally, Table-Aware Group Relative Policy Optimization (TARPO) introduces a mixed reward system to dynamically balance region accuracy and answer correctness, with decaying region rewards and consistency penalties to align reasoning steps. Experiments show that Table-R1 achieves an average performance improvement of 14.36 points across multiple base models on three benchmark datasets, even outperforming baseline models with ten times the parameters, while TARPO reduces response token consumption by 67.5% compared to GRPO, significantly advancing LLM capabilities in efficient tabular reasoning.

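2505.11556 2026-05-14 cs.CL cs.AI cs.MA updated version

Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent LLMs

Yuxuan Li, Aoi Naito, Hirokazu Shirado

Affiliations School of Computer Science, Carnegie Mellon University, Pittsburgh, USA; School of Environment and Society, Institute of Science Tokyo, Tokyo, Japan

AI summary This study examines the collective reasoning ability of LLM-based multi-agent systems under distributed information and finds systematic failures. It introduces the HiddenBench benchmark, which uses the Hidden Profile paradigm to isolate collective reasoning ability. Experiments show that multi-agent systems reach only 30.1% accuracy under distributed information, far below the 80.7% achieved by single agents with complete information. The study attributes this gap to agents' inability to recognize and act on latent information asymmetry, which leads them to converge prematurely on shared evidence and overlook critical distributed information. The problem persists across models and strategies but can be substantially improved with a structured communication protocol.

Comments Accepted to ICML 2026

Details
English abstract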

Multi-agent systems built on large language models (LLMs) are expected to enhance decision-making by pooling distributed information, yet systematically evaluating this capability has remained challenging. We introduce HiddenBench, a 65-task benchmark grounded in the Hidden Profile paradigm, which isolates collective reasoning under distributed information from individual reasoning ability. Evaluating 15 frontier LLMs, we find that multi-agent LLMs achieve only 30.1% accuracy under distributed information, compared to 80.7% accuracy for single agents given complete information. We trace this gap to a systematic failure mode: agents cannot recognize or act under latent information asymmetry -- they fail to reason about what others might know but have not yet expressed, leading to premature convergence on shared evidence while critical distributed facts remain unexplored. These failures persist across prompting strategies, communication depths, and group sizes -- and worsen as groups scale. While some models (e.g., Gemini-2.5-Flash/Pro) outperform others, neither model scale nor individual reasoning accuracy reliably predicts collective performance. We further show that this bottleneck is actionable: a lightweight structured communication protocol substantially improves collective reasoning across model families. Our results identify failures in collective information exploration in decision-making as a key limitation of multi-agent LLMs, and provide a theory-grounded, reproducible framework for diagnosing collective reasoning failures.

2504.16584 2026-05-14 cs.CR cs.AI updated version

Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code

Md. Azizul Hakim Bappy, Hossen A Mustafa, Prottoy Saha, Rajinus Salehat

Affiliations Institute of Information and Communication Technology, Bangladesh University of Engineering and Technology; Hajee Mohammad Danesh Science and Technology University

AI summary This paper studies the feasibility of using small language models (SLMs) for accurate, privacy-preserving CWE vulnerability detection in Python code. The authors build a 500-example dataset with a semi-supervised approach and apply instruction-following fine-tuning to a 350-million-parameter pre-trained code model, which reaches approximately 99% accuracy and 100% recall on the test set. The experiments show that a fine-tuned SLM can detect CWE vulnerabilities efficiently and precisely in a local environment, offering a privacy-friendly solution for security analysis.

Comments 11 pages, 2 figures, 3 tables. Dataset available at https://huggingface.co/datasets/floxihunter/synthetic_python_cwe. Model available at https://huggingface.co/floxihunter/codegen-mono-CWEdetect. Keywords: Small Language Models (SLMs), Vulnerability Detection, CWE, Fine-tuning, Python Security, Privacy-Preserving Code Analysis

Details
English abstract

Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or proprietary codebases due to privacy concerns and inference costs. This work explores the potential of Small Language Models (SLMs) as a viable alternative for accurate, on-premise vulnerability detection. We investigated whether a 350-million parameter pre-trained code model (codegen-mono) could be effectively fine-tuned to detect the MITRE Top 25 CWEs specifically within Python code. To facilitate this, we developed a targeted dataset of 500 examples using a semi-supervised approach involving LLM-driven synthetic data generation coupled with meticulous human review. Initial tests confirmed that the base codegen-mono model completely failed to identify CWEs in our samples. However, after applying instruction-following fine-tuning, the specialized SLM achieved remarkable performance on our test set, yielding approximately 99% accuracy, 98.08% precision, 100% recall, and a 99.04% F1-score. These results strongly suggest that fine-tuned SLMs can serve as highly accurate and efficient tools for CWE detection, offering a practical and privacy-preserving solution for integrating advanced security analysis directly into development workflows.

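2504.11944 2026-05-14 cs.LG cs.AI updated version

VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning

Xuyang Chen, Keyu Yan, Guojian Wang, Lin Zhao

Affiliations Department of Electrical and Computer Engineering, National University of Singapore, Singapore

AI summary VIPO is a model-based offline reinforcement learning algorithm that addresses the conservatism traditional methods introduce because of model errors. It adds a value function inconsistency penalty, using self-supervised feedback to improve model training and thus model accuracy. Experiments show that VIPO performs strongly on multiple benchmarks, significantly outperforming existing methods, and provides a general and efficient framework for improving model-based offline reinforcement learning.

Details
English abstract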

Offline reinforcement learning (RL) learns effective policies from pre-collected datasets, offering a practical solution for applications where online interactions are risky or costly. Model-based approaches are particularly advantageous for offline RL, owing to their data efficiency and generalizability. However, due to inherent model errors, model-based methods often artificially introduce conservatism guided by heuristic uncertainty estimation, which can be unreliable. In this paper, we introduce VIPO, a novel model-based offline RL algorithm that incorporates self-supervised feedback from value estimation to enhance model training. Specifically, the model is learned by additionally minimizing the inconsistency between the value learned directly from the offline data and the value estimated from the model. We perform comprehensive evaluations from multiple perspectives to show that VIPO can learn a highly accurate model efficiently and consistently outperform existing methods. In particular, it achieves state-of-the-art performance on almost all tasks in both D4RL and NeoRL benchmarks. Overall, VIPO offers a general framework that can be readily integrated into existing model-based offline RL algorithms to systematically enhance model accuracy.
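
One way to read the training signal described in this abstract is as a dynamics-fitting loss with an added penalty on the gap between two value estimates. The sketch below is a hypothetical form under that reading: the squared-error terms, the weight `lam`, and all names are assumptions, and the abstract does not specify the exact loss.

```python
def value_inconsistency_model_loss(pred_errors, v_data, v_model, lam=1.0):
    """Illustrative model-training loss: fit error plus a value inconsistency penalty.

    pred_errors: per-sample prediction errors of the learned dynamics model.
    v_data:      values learned directly from the offline data.
    v_model:     values estimated through the learned model for the same samples.
    lam:         hypothetical weight on the inconsistency penalty.
    """
    fit = sum(e * e for e in pred_errors) / len(pred_errors)
    inconsistency = sum((a - b) ** 2 for a, b in zip(v_data, v_model)) / len(v_data)
    return fit + lam * inconsistency

# A model that predicts transitions perfectly can still be penalized
# when its implied values disagree with the data-derived values.
loss = value_inconsistency_model_loss(
    pred_errors=[0.0, 0.0], v_data=[1.0, 2.0], v_model=[1.0, 4.0], lam=0.5
)
print(loss)
```

The penalty gives the model a second source of feedback beyond one-step prediction error, which matches the abstract's description of self-supervised feedback from value estimation.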

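2503.19719 2026-05-14 cs.LG cs.AI cs.CV updated version

On What Depends the Robustness of Multi-source Models to Missing Data in Earth Observation?

Francisco Mena, Diego Arenas, Miro Miranda, Andreas Dengel

Affiliations University of Kaiserslautern-Landau (RPTU); German Research Center for Artificial Intelligence (DFKI)

AI summary This paper studies the factors that influence the robustness of multi-source models to missing data in Earth observation. Evaluating the predictive performance of six state-of-the-art multi-source models when a single data source is missing or only one source is available, it finds that model effectiveness is closely tied to the nature of the task, the complementarity among data sources, and the model design. The study also finds that removing certain data sources can actually improve predictive performance, which challenges the assumption that more data is always better and prompts reflection on model complexity and the necessity of every data source.

Comments Accepted at IEEE International Geoscience and Remote Sensing Symposium 2025

Journal ref 2025 IEEE International Geoscience and Remote Sensing Symposium

Details
English abstract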

In recent years, robust multi-source models have emerged in the Earth Observation (EO) field. These are models that leverage data from diverse sources to improve predictive accuracy when there is missing data. Despite these advancements, the factors influencing the varying effectiveness of such models remain poorly understood. In this study, we evaluate the predictive performance of six state-of-the-art multi-source models in scenarios where either a single data source is missing or only a single source is available. Our analysis reveals that the efficacy of these models is intricately tied to the nature of the task, the complementarity among data sources, and the model design. Surprisingly, we observe instances where the removal of certain data sources leads to improved predictive performance, challenging the assumption that incorporating all available data is always beneficial. These findings prompt critical reflections on model complexity and the necessity of all collected data sources, potentially paving the way for more streamlined approaches in EO applications.

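2502.15761 2026-05-14 cs.DC cs.AI cs.GR cs.HC updated version

AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results

Dawar Khan, Xinyu Liu, Omar Mena, Donggang Jia, Alexandre Kouyoumdjian, Ivan Viola

Affiliations King Abdullah University of Science and Technology (KAUST)

AI summary This paper presents AIvaluateXR, a comprehensive framework for evaluating large language models (LLMs) running on extended reality (XR) devices. The study deploys 17 LLMs on four XR devices, evaluates them systematically on performance consistency, processing speed, memory usage, and battery consumption, and proposes a unified evaluation method based on 3D Pareto optimality theory to select the best device-model pairs. The work offers useful guidance for deploying LLMs on XR devices and lays groundwork for future optimization and research.

Comments AIvaluateXR is an updated version of LoXR

Details
English abstract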

The deployment of large language models (LLMs) on extended reality (XR) devices has great potential to advance the field of human-AI interaction. In the case of direct, on-device model inference, selecting the appropriate model and device for specific tasks remains challenging. In this paper, we present AIvaluateXR, a comprehensive evaluation framework for benchmarking LLMs running on XR devices. To demonstrate the framework, we deploy 17 selected LLMs across four XR platforms: Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro, and conduct an extensive evaluation. Our experimental setup measures four key metrics: performance consistency, processing speed, memory usage, and battery consumption. For each of the 68 model-device pairs, we assess performance under varying string lengths, batch sizes, and thread counts, analyzing the trade-offs for real-time XR applications. We propose a unified evaluation method based on the 3D Pareto Optimality theory to select the optimal device-model pairs from quality and speed objectives. Additionally, we compare the efficiency of on-device LLMs with client-server and cloud-based setups, and evaluate their accuracy on two interactive tasks. We believe our findings offer valuable insight to guide future optimization efforts for LLM deployment on XR devices. Our evaluation method can be used as standard groundwork for further research and development in this emerging field. The source code and supplementary materials are available at: www.nanovis.org/AIvaluateXR.html
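
The Pareto-based selection step mentioned in this abstract can be sketched generically: a device-model pair stays on the front if no other pair is at least as good on every objective and strictly better on one. The pair names, the three objectives, and the scores below are made up for illustration, and every objective is assumed to be scaled so that higher is better.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, score in candidates.items()
        if not any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]

# Hypothetical (objective_1, objective_2, objective_3) scores for three device-model pairs.
pairs = {
    "pair_a": (0.9, 10.0, 0.8),
    "pair_b": (0.7, 30.0, 0.9),
    "pair_c": (0.6, 9.0, 0.7),
}
print(pareto_front(pairs))
```

Here `pair_c` is dominated by `pair_a`, while `pair_a` and `pair_b` trade off against each other and both remain on the front. The quadratic scan is adequate for a set the size of the 68 pairs the abstract describes.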

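2412.06341 2026-05-14 cs.CV cs.AI updated version

Visual Accommodation: Rethinking Image Scale as a Learnable Variable for Object Detection

Daeun Seo, Hoeseok Yang, Sihyeong Park, Hyungshin Kim

Affiliations Chungnam National University; Santa Clara University; Korea Electronics Technology Institute

AI summary This paper proposes Ciliary-DETR, a framework that improves the test-time adaptability of object detection by treating image scale as a learnable variable, analogous to accommodation in biological vision. The method introduces a lightweight scale predictor that dynamically estimates the optimal test-time scale factor across different input scales, improving detection flexibility and robustness. A parametric scale-optimization objective addresses the fact that the optimal input scale is unobservable under standard training setups, enabling efficient single-pass inference.

Comments 23 pages, 11 figures

Details
English abstract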

We propose Ciliary-DETR (previous name: Elastic-DETR), a framework for test-time resolution adjustment analogous to biological accommodation. While multi-scale data augmentation improves robustness to scale variation, modern detectors rely on fixed inference resolutions, potentially limiting flexibility and robustness. Similar to the ciliary muscle, we introduce a lightweight scale predictor that dynamically estimates test-time scale factors across a wide range of input scales. The core challenge is that the optimal input scale is inherently unobservable under standard training setups. To address this challenge, we introduce a parametric formulation of desired scaling behavior, leading to loss-driven objectives that guide scale optimization. Overall, our method enables flexible and efficient single-pass inference, bridging the gap between training-time robustness and test-time adaptation.

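2411.15913 2026-05-14 cs.SD cs.AI cs.LG eess.AS updated version

Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms

Heehwan Wang, Joonwoo Kwon, Sooyoung Kim, Jungwoo Seo, Shinjae Yoo, Yuewei Lin, Jiook Cha

Affiliations Seoul National University; Michigan State University; Rutgers University; Brookhaven National Laboratory

AI summary This study proposes Stylus, a training-free music style transfer method that reuses pretrained image diffusion models to perform style transfer in the Mel-spectrogram domain. The method treats audio as structured time-frequency images and manipulates self-attention by injecting style keys and values while preserving the structural queries of the source audio, so that style is transferred while content structure is retained. Experiments show that Stylus outperforms existing methods in both content preservation and perceptual quality, confirming that generic image priors are effective for training-free transformation of structured Mel-spectrograms.

Comments Accepted by ICIP 2026

Details
English abstract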

Music style transfer blends source structure with reference style to enable personalized music creation. However, existing zero-shot methods often struggle to capture fine-grained audio nuances, relying on coarse text descriptions or requiring expensive task-specific training. We propose Stylus, a training-free framework that repurposes pretrained image diffusion models for music style transfer in the Mel-spectrogram domain. By treating audio as structured time-frequency images, Stylus manipulates self-attention by injecting style keys and values while preserving source structural queries. To ensure high fidelity, we introduce a phase-preserving reconstruction strategy to mitigate spectrogram inversion artifacts, alongside a classifier-free-guidance-inspired control for adjustable stylization. Extensive evaluations including 2,925 human ratings demonstrate that Stylus outperforms state-of-the-art baselines, achieving 34.1% higher content preservation and 25.7% better perceptual quality. Our work validates that generic image priors can be effectively leveraged for the training-free transformation of structured Mel-spectrograms. Code and materials are available at https://github.com/Sooyyoungg/Stylus.git.
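
The attention manipulation described in this abstract (source queries attending to style keys and values) can be sketched with plain scaled dot-product attention. This is a minimal illustration on toy vectors, not the Stylus implementation: the function names are hypothetical, and the diffusion model, the phase-preserving reconstruction, and the guidance control are all omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs

def style_injected_attention(source_queries, style_keys, style_values):
    """Keep the source queries but swap in keys and values taken from the style input."""
    return attention(source_queries, style_keys, style_values)

# One output per source position (structure), each a mixture of style values (texture).
out = style_injected_attention(
    source_queries=[[1.0, 0.0], [0.0, 1.0]],
    style_keys=[[1.0, 0.0], [0.0, 1.0]],
    style_values=[[0.0], [1.0]],
)
print(len(out))
```

The output has one row per source query, so the layout of the source is kept, while every row is a convex combination of style values, which is the intuition behind preserving structure and transferring style.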

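2408.12935 2026-05-14 cs.AI updated version

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions

Chen Chen, Xueluan Gong, Ziyao Liu, Weifeng Jiang, Si Qi Goh, Kwok-Yan Lam

Affiliations College of Computing and Data Science

AI summary This paper discusses key issues in AI Safety in the context of large language models (LLMs) and proposes a new framework for understanding AI Safety from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. Reviewing current research progress and challenges, and drawing on examples from state-of-the-art technologies, it summarizes innovative methods for designing and testing AI safety, with the aim of advancing research in the field and strengthening people's trust in digital transformation.

Details
English abstract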

AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI and especially with the recent advancement of Generative AI (or GAI), the technology ecosystem behind the design, development, adoption, and deployment of AI systems has drastically changed, broadening the scope of AI Safety to address impacts on public safety and national security. In this paper, we propose a novel architectural framework for understanding and analyzing AI Safety, defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. We provide an extensive review of current research and advancements in AI safety from these perspectives, highlighting their key challenges and mitigation approaches. Through examples from state-of-the-art technologies, particularly Large Language Models (LLMs), we present innovative mechanisms, methodologies, and techniques for designing and testing AI safety. Our goal is to promote advancement in AI safety research, and ultimately enhance people's trust in digital transformation.

2407.15512 2026-05-14 cs.LG cs.AI cs.CV updated version

Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation

Francisco Mena, Diego Arenas, Andreas Dengel

Affiliations University of Kaiserslautern-Landau, Kaiserslautern, Germany; German Research Center for Artificial Intelligence, Kaiserslautern, Germany

AI summary This study aims to make the predictions of multi-sensor machine learning models for Earth observation more robust to missing sensors. The authors propose two new methods, Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI), and validate them experimentally on three multi-sensor temporal datasets. The study finds that ensemble multi-sensor models are the most robust to missing sensors, and that the sensor dropout mechanism in ISensD also shows good robustness.

Comments Accepted at the MACLEAN workshop in the ECML/PKDD 2024

Journal ref Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2024

Details
English abstract

Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the issue of generalization to missing data. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
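
The general idea of dropping whole sensors at the input during training can be sketched as follows. This is an illustration of the concept only: the sensor names, the zero fill value, the per-sensor drop probability, and the rule of always keeping at least one sensor are assumptions, and the abstract does not give the exact ISensD procedure.

```python
import random

def input_sensor_dropout(sample, drop_prob, rng, fill_value=0.0):
    """Randomly mask whole sensors in one training sample.

    sample maps a sensor name to its feature list. Each sensor is dropped
    independently with probability drop_prob; if every sensor would be dropped,
    one is restored at random so the model always sees some input (an assumption
    made for this sketch).
    """
    names = list(sample)
    dropped = [n for n in names if rng.random() < drop_prob]
    if len(dropped) == len(names):
        dropped.remove(rng.choice(names))
    return {
        name: ([fill_value] * len(feats) if name in dropped else list(feats))
        for name, feats in sample.items()
    }

# Hypothetical multi-sensor sample with three sources.
rng = random.Random(0)
sample = {"optical": [0.2, 0.4], "radar": [1.5, 1.1], "weather": [0.7]}
augmented = input_sensor_dropout(sample, drop_prob=0.5, rng=rng)
print(augmented)
```

Training on such masked samples exposes the model to the missing-sensor conditions it may face at inference time, which is the motivation the abstract gives for sensor dropout.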

2406.05410 2026-05-14 cs.AI cs.CL updated version

ChatSR: Multimodal Large Language Models for Scientific Formula Discovery

Yanjie Li, Lina Yu, Weijun Li, Min Wu, Liping Zhang, Jingyi Liu, Yusong Deng, Mingzhu Wan, Xin Ning

Affiliations AnnLab, Institute of Semiconductors, Chinese Academy of Sciences, Beijing, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China; Zhongguancun Academy, Beijing, China; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing 101408, China; College of Materials Science and Opto-Electronic Technology, University of Chinese Academy of Sciences, Beijing, 100049, China; School of Integrated Circuits, University of Chinese Academy of Sciences, Beijing 100049, China

AI summary Current multimodal large language models focus mainly on understanding and processing perceptual modalities such as images and videos, and remain weak at understanding scientific data. To address this, the study proposes ChatSR, a new multimodal large language model designed for scientific data understanding. The model treats scientific data as a new modality analogous to visual content and, through carefully designed encoders and modality alignment mechanisms, maps it into a representation space that large language models can process, capturing the structural characteristics and underlying regularities of the data. Based on user-specified prior constraints and preferences, it automatically generates mathematical formulas that conform to domain knowledge, advancing the automation of scientific discovery. Experiments show that ChatSR achieves state-of-the-art performance on 13 datasets and exhibits strong zero-shot ability.

Comments 14 pages

Details
English abstract

Current multimodal large language models (MLLMs) are mainly focused on the understanding and processing of perceptual modalities such as images and videos, while their capability for scientific data understanding remains insufficient. To this end, we propose ChatSR, a novel multimodal large language model tailored for scientific data understanding. ChatSR treats scientific data as a new modality analogous to visual content and, through carefully designed encoders and modality alignment mechanisms, maps scientific data into a representation space that can be processed by large language models, enabling the model to grasp the structural characteristics and underlying regularities of scientific data. Building on this foundation, ChatSR further exploits the rich domain knowledge and strong reasoning abilities of large language models to emulate a knowledgeable human scientist: based on user-specified prior constraints and preferences (such as requirements on periodicity or symmetry), it automatically generates mathematical formulas that not only accurately fit the observed data but also conform to domain priors, thereby characterizing the latent laws embodied in scientific data and promoting the automation of scientific discovery. Experiments on 13 datasets show that ChatSR achieves state-of-the-art performance on traditional symbolic regression benchmarks. In addition, ChatSR exhibits a promising zero-shot ability to understand and utilize types of prior knowledge that are not present in its training data.

2304.11193 2026-05-14 cs.RO cs.AI cs.CV Version update

Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy

Willow Mandil, Amir Ghalamzan-E

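Affiliations * University of Lincoln; University of Sheffield

AI summary This paper studies world-model prediction methods that fuse visual and tactile information in physical robot interaction, aiming to improve the accuracy of predicting the outcomes of robot manipulation in complex environments. By introducing two new robot-pushing datasets, the authors show that combining visual and tactile information markedly improves prediction in scenarios with high physical uncertainty, while tactile input brings limited gains when visual information is already sufficiently unambiguous. The work offers new data and methodological insights for building more robust robot world models.

Comments This paper is accepted for publication in Robotics and Autonomous Systems

Details
English abstract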

Predicting the outcomes of robotic actions, often referred to as learning a world model, in complex environments remains a fundamental challenge in robotics. Existing approaches primarily rely on visual observations and action inputs to generate video-based predictions, frequently overlooking the critical role of tactile feedback in understanding physical interactions. In this work, we investigate the integration of tactile and visual information within predictive perception systems for physical robot interaction. We demonstrate that visuo-tactile prediction provides the greatest benefits in physically ambiguous interaction regimes, while improvements are naturally limited when object dynamics are visually inferable. Furthermore, we introduce two novel robot-pushing datasets collected using a magnetic-based tactile sensor for unsupervised learning. The first dataset comprises visually identical objects with varying physical properties, explicitly isolating physical ambiguity, while the second mirrors existing robot-pushing benchmarks involving clusters of household objects. Our results show that tactile-visual integration improves prediction accuracy and robustness under physical ambiguity, while offering limited gains in visually unambiguous settings. Code and datasets are publicly available.

2605.12620 2026-05-14 cs.AI Version update

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

Nishad Singhi, Christian Bialas, Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki, Marcus Rohrbach, Anna Rohrbach

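Affiliations * Technical University of Darmstadt

AI summary This paper proposes VeGAS, a verifier-guided action selection framework that aims to improve the robustness of embodied agents based on multimodal large language models (MLLMs) on complex real-world tasks. At inference time, the method generates multiple candidate actions and uses a generative verifier to select the most reliable one, without modifying the original policy. Using an LLM-driven data synthesis strategy, VeGAS automatically generates diverse failure cases at training time, improving the verifier's generalization. Experiments show that VeGAS markedly improves performance on several embodied reasoning benchmarks, with a relative gain of up to 36% over baselines on multi-object, long-horizon tasks.

Comments CVPR 2026 (Findings)

Details
English abstract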

Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an MLLM off-the-shelf as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time. Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.
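
The inference-time step described in the abstract (sample several candidate actions, then let a verifier pick one) can be sketched as best-of-N selection. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_action`, `sample_action`, and `verifier_score` are hypothetical names, and the toy policy and scores are made up for the example.

```python
import random
from typing import Callable


def select_action(
    sample_action: Callable[[], str],
    verifier_score: Callable[[str], float],
    num_candidates: int = 5,
) -> str:
    """Sample candidate actions and return the one the verifier scores highest.

    The policy behind `sample_action` is left untouched; only the choice
    among its samples is delegated to the verifier.
    """
    candidates = [sample_action() for _ in range(num_candidates)]
    # Deduplicate while keeping first-seen order so each action is scored once.
    unique = list(dict.fromkeys(candidates))
    return max(unique, key=verifier_score)


# Toy usage: a random policy over three actions and a lookup-table verifier.
rng = random.Random(0)
actions = ["open drawer", "pick up mug", "go to sink"]
toy_scores = {"open drawer": 0.2, "pick up mug": 0.9, "go to sink": 0.5}
best = select_action(lambda: rng.choice(actions), toy_scores.__getitem__, 8)
print(best)
```

In a real system the verifier would be a trained model scoring each candidate in context; the lookup table only stands in for it here.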

2605.12586 2026-05-14 cs.CV cs.AI cs.DB Version update

3D Primitives are a Spatial Language for VLMs

Junze Liu, Kun Qian, Florian Dubost, Kai Zhong, Arvind Srinivasan, Nan Chen, Anping Wang, Sam Zhang, Alejandro Mottini, Qingjun Cui, Tian Wang

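Affiliations * Unity Technologies

AI summary This study examines the contradictory spatial-understanding behavior of vision-language models (VLMs) and proposes 3D geometric primitives (such as cubes and spheres) as an intermediate representation to improve their spatial reasoning. It introduces the SpatialBabel benchmark, evaluates multiple VLMs on primitive-based 3D scene reconstruction, and proposes two new methods: Code-CoT, a training-free inference strategy, and S³-FT, a self-supervised fine-tuning method. Both markedly improve model performance on several spatial-understanding tasks, supporting the diagnostic and transfer value of geometric primitives expressed in code.

Details
English abstract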

Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to 5.7× across languages. Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S³-FT (Self-Supervised Spatial Fine-Tuning), which distills primitive spatial knowledge into general visual reasoning in a self-supervised way by parsing the model's own Three.js primitive reconstructions into structured annotations and fine-tuning on the result, with no human labels and no teacher model. Training on primitive images alone, S³-FT improves Qwen3-VL-8B by +4.6 to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.

2605.12584 2026-05-14 cs.LG cs.AI Version update

Towards Robust Federated Multimodal Graph Learning under Modality Heterogeneity

Sirui Zhang, Haonan Wang, Xunkai Li, Zekai Chen, Shumeng Li, Hongchao Qin, Rong-Hua Li, Guoren Wang

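Affiliations * Beijing Institute of Technology

AI summary This paper studies robust federated multimodal graph learning under modality heterogeneity. To address the shortcomings of existing methods in handling missing modalities and in federated collaboration, it identifies a two-stage pipeline in which missing modalities are reconstructed on the client side and parameters are aggregated on the server side. On this basis the authors design FedMPO, which improves robustness and performance through topology-aware cross-modal generation, missing-aware expert routing, and reliability-aware aggregation. Experiments show that FedMPO outperforms existing methods on multiple datasets, especially under high missing-modality rates and non-IID data distributions.

Details
English abstract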

Recently, multimodal graph learning (MGL) has garnered significant attention for integrating diverse modality information and structured context to support various network applications. However, real-world graphs are often isolated due to data-sharing limitations across multiple parties, and their modalities are frequently incomplete. This highlights an urgent need to develop a robust federated approach, yet we find that existing methods remain insufficient. On the one hand, centralized MGL methods that handle missing modalities overlook the knowledge sharing and generalization in federated scenarios. On the other hand, while federated MGL methods have become increasingly mature, they primarily target non-graph data. Based on these technologies, we identify a two-stage pipeline wherein client-side completion reconstructs missing modalities, and server-side aggregation integrates the client-updated parameters of both the modality generator and the backbone models. Although this serves as a general solution, we identify two primary challenges in achieving greater robustness: (1) Topology-Isolated Local Completion: Client-side modality generation struggles to effectively leverage global semantics. (2) Reliability-Imbalanced Global Aggregation: Server-side multi-party collaboration is hindered by client updates with varying modality availability and recovery reliability. To address these challenges, we propose FedMPO, which utilizes topology-aware cross-modal generation to recover missing features using comprehensive graph context, missing-aware expert routing to locally filter out noisy recovered signals, and reliability-aware aggregation to appropriately down-weight unreliable updates. Extensive experiments on 3 tasks across 6 datasets demonstrate that FedMPO outperforms baselines, achieving performance gains of up to 4.10% and 5.65% in high-missing and non-IID settings.
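
The idea of down-weighting unreliable client updates during server-side aggregation can be illustrated with a reliability-weighted average. This is only a sketch under assumptions: the abstract does not say how FedMPO computes reliability or combines updates, so the `aggregate` function and the scalar reliability scores below are hypothetical.

```python
from typing import Dict, List


def aggregate(
    updates: List[Dict[str, float]],
    reliabilities: List[float],
) -> Dict[str, float]:
    """Average client parameters with weights proportional to reliability.

    Assumes every client reports the same parameter names and that
    reliabilities are non-negative with a positive sum.
    """
    if not updates or len(updates) != len(reliabilities):
        raise ValueError("need one reliability score per client update")
    total = sum(reliabilities)
    if total <= 0:
        raise ValueError("reliabilities must sum to a positive value")
    weights = [r / total for r in reliabilities]
    return {
        name: sum(w * update[name] for w, update in zip(weights, updates))
        for name in updates[0]
    }


# Toy usage: the second client has many missing modalities, so it is trusted less.
merged = aggregate([{"w": 1.0}, {"w": 3.0}], reliabilities=[3.0, 1.0])
print(merged)
```

With equal reliabilities this reduces to a plain unweighted mean of the client parameters.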

2605.12575 2026-05-14 eess.IV cs.AI cs.CV Version update

Are Compact Rationales Free? Measuring Tile Selection Headroom in Frozen WSI-MIL

Hyun Do Jung, Jungwon Choi, Soojung Choi, Yujin Oh, Hwiyoung Kim

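Affiliations * Department of Artificial Intelligence, Yonsei University; Kim Jaechul Graduate School of AI, KAIST; Department of Integrative Medicine, College of Medicine, Yonsei University; Department of Biomedical Systems Informatics, College of Medicine, Yonsei University; H-Data Strategy Center, Hallym University Chuncheon Sacred Heart Hospital

AI summary This paper studies whether, for a frozen whole-slide image (WSI) multiple instance learning (MIL) classifier, the slide-level prediction can be recovered from a small set of output-consistent tiles, yielding compact post-hoc explanations. The authors propose FOCI, a lightweight explanation layer trained to extract sufficient information from kept or dropped tile subsets, and introduce the Selection Headroom Index (SHI) for evaluation. Experiments show that MIL models differ in how well they support compact explanations, that FOCI can reduce the number of tiles required, and that it offers a new tool for model interpretation and auditing.

Details
English abstract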

Whole-slide image (WSI) multiple instance learning (MIL) classifiers can achieve strong slide-level AUC while leaving the full-bag prediction opaque. Attention scores are widely reused as post-hoc explanations, but high attention can reflect aggregation preference rather than a compact, model-sufficient rationale. We study post-hoc rationale highlighting for frozen WSI-MIL: given a trained classifier, can its slide-level prediction be recovered from a compact, output-consistent tile subset without retraining the backbone? We instantiate this with Finding Optimal Contextual Instances (FOCI), a lightweight rationale-readout layer over a frozen MIL backbone. FOCI is trained with model-output sufficiency and exclusion objectives over keep/drop tile subsets, evaluated with an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL, and summarized by the Selection Headroom Index (SHI). Across three WSI benchmarks and seven MIL backbones, FOCI reveals that compact rationales are selection-headroom dependent: transformer and multi-branch attention aggregators can admit compact rationales, near-minimal attention-pooling baselines enter a selection-saturation regime, and hard-selection backbones can conflict with an external readout. For TransMIL, relative to its documented CLS-proxy ranking, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% across benchmarks, while ACMIL+FOCI attains the highest mean SHI (+0.465). Deletion-based perturbation and selected-only downstream evaluation provide complementary checks. These results position FOCI as a model-level interpretability and audit layer: selected tiles are not claims of clinical or pathologist-level diagnostic sufficiency, but candidate rationales that offer a compact, reviewable view of when a frozen MIL prediction can be localized to a small output-consistent subset.
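
An insertion-style evaluation of the kind the abstract describes can be sketched as follows: reveal tiles in ranked order and record the smallest count at which the frozen classifier's output matches its full-bag output. The exact definitions of SRP and MSK are not given in the abstract, so this is an assumed reading; `minimum_sufficient_k`, the toy majority-vote classifier, and the integer "tiles" are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def minimum_sufficient_k(
    tiles: Sequence[int],
    ranking_scores: Sequence[float],
    predict: Callable[[List[int]], int],
) -> Optional[int]:
    """Smallest K such that the top-K ranked tiles reproduce the full-bag label."""
    full_label = predict(list(tiles))
    # Highest-ranked tiles are revealed first; ties keep their original order.
    order = sorted(range(len(tiles)), key=lambda i: ranking_scores[i], reverse=True)
    revealed: List[int] = []
    for k, index in enumerate(order, start=1):
        revealed.append(tiles[index])
        if predict(revealed) == full_label:
            return k
    return None


# Toy frozen classifier: label 1 when more than half of the revealed tiles are positive.
def majority_vote(bag: List[int]) -> int:
    return 1 if sum(bag) * 2 > len(bag) else 0


tiles = [0, 1, 0, 1, 1]
good_k = minimum_sufficient_k(tiles, [float(t) for t in tiles], majority_vote)
poor_k = minimum_sufficient_k(tiles, [-float(t) for t in tiles], majority_vote)
print(good_k, poor_k)
```

A ranking that surfaces decisive tiles early gives a small K, while a poor ranking needs most of the bag; that difference is what a headroom-style comparison would measure.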

2605.12574 2026-05-14 cs.CV cs.AI Version update

DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

Hongyi Tang, Zhihao Zhu, Yi Yang

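Affiliations * The Hong Kong University of Science and Technology

AI summary This paper studies how to use semantic distraction to mount membership inference attacks on the training data of vision-language models in a black-box setting where only the generated text output is accessible. The proposed method, DistractMIA, inserts a known semantic distractor into the input image and analyzes how the generated text changes in order to judge whether a sample belongs to the training data. The method needs no access to model internals and relies only on outputs. Experiments show that it outperforms existing methods across multiple vision-language models and benchmark datasets and generalizes well to a medical imaging task.

Comments 23 pages, 8 figures

Details
English abstract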

Vision-language models (VLMs) are trained on large-scale image-text corpora that may contain private, copyrighted, or otherwise sensitive data, motivating membership inference as a tool for training-data auditing. This is especially challenging for deployed VLMs, where auditors typically observe only generated textual responses. Existing VLM membership inference attacks either rely on probability-level signals unavailable in such settings, or use mask-based semantic prediction tasks whose effectiveness depends on object-centric visual assumptions. To address these limitations, we propose DistractMIA, an output-only black-box framework based on semantic distraction. Rather than removing visual evidence, DistractMIA preserves the original image, inserts a known semantic distractor, and measures how generated responses change. This design is motivated by the intuition that member samples remain more anchored to the original image semantics, while non-member samples are more easily redirected toward the distractor. To make this signal reliable, DistractMIA calibrates distractor configurations on a reference set and derives membership scores from repeated textual generations, capturing response stability and distractor uptake without accessing logits, probabilities, or hidden states. Experiments across multiple VLMs and benchmarks show that DistractMIA consistently outperforms both output-only and stronger-access baselines. Its performance on a medical benchmark further demonstrates applicability beyond object-centric natural images.
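
The intuition that members stay anchored to the original image while non-members drift toward the distractor can be turned into a toy output-only score. The abstract does not specify the scoring rule, so the keyword matching and the `membership_score` function below are purely illustrative assumptions, not DistractMIA's actual procedure.

```python
from typing import List


def _mentions(text: str, terms: List[str]) -> bool:
    """True when any of the given terms appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def membership_score(
    responses: List[str],
    original_terms: List[str],
    distractor_terms: List[str],
) -> float:
    """Anchoring rate minus distractor uptake rate over repeated generations.

    Higher values mean responses stayed on the original semantics, which the
    abstract associates with member samples.
    """
    if not responses:
        raise ValueError("at least one generated response is required")
    anchored = sum(_mentions(r, original_terms) for r in responses) / len(responses)
    uptake = sum(_mentions(r, distractor_terms) for r in responses) / len(responses)
    return anchored - uptake


# Toy usage: three generations for an image of a dog with an inserted balloon.
score = membership_score(
    ["A dog on the grass", "a dog running", "A red balloon"],
    original_terms=["dog"],
    distractor_terms=["balloon"],
)
print(score)
```

Only generated text is inspected, in keeping with the output-only setting; a practical version would need a more robust semantic comparison than substring matching.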

2605.12573 2026-05-14 cs.CV cs.AI cs.LG Version update

Improving Diffusion Posterior Samplers with Lagged Temporal Corrections for Image Restoration

Davide Evangelista, Elena Morotti, Francesco Pivi, Maurizio Gabbrielli

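Affiliations * Dept. of Computer Science and Engineering; University of Bologna; Dept. of Political and Social Sciences

AI summary This paper studies how to improve diffusion-based posterior sampling (PS) for image restoration. The authors reinterpret PS from a dynamical perspective and propose LAMP, a method that combines a second-order discretization with a residual correction and introduces a lagged temporal correction to improve the stability and accuracy of the sampling process. Experiments show that LAMP outperforms existing methods on multiple image restoration tasks without increasing the number of denoising evaluations.

Comments 9 Figures, 9 Tables, Submitted to a conference

Details
English abstract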

Diffusion-based posterior sampling (PS) is a leading framework for imaging inverse problems, combining learned priors with measurement constraints. Yet, its standard formulations rely on instantaneous data-consistent estimates, which induce temporal variability in the reverse dynamics. We reinterpret PS from a dynamical perspective, showing that the standard PS update corresponds to a first-order discretization of the diffusion dynamics plus a residual correction capturing the mismatch between the denoised prediction and the data-consistent estimate. A second-order discretization, however, naturally introduces a temporal correction based on the variation of consecutive estimates. Building on this, we propose LAMP, combining the second-order update with the residual correction characterizing a PS technique. LAMP thus inherits a lagged temporal correction, and it can be implemented as a modular plug-in over the PS backbone. We show that LAMP preserves the structure of a posterior sampler, and we perform a one-step risk analysis to characterize when LAMP improves the reverse transition via a bias-variance trade-off. Experiments across multiple imaging tasks demonstrate consistent improvements over strong baselines such as DiffPIR and DDRM, without increasing the number of denoising evaluations.

2605.12571 2026-05-14 cs.CV cs.AI Version update

VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority

Chenhao Qiu, Yechao Zhang, Xin Luo, Shien Song, Xusheng Liu

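Affiliations * Nanyang Technological University, Singapore

AI summary This paper studies the performance problems caused by evidence misalignment in long video question answering and proposes VideoSEAL, a decoupled framework that separates planning from answer authority and improves both answer accuracy and evidence alignment. The work introduces two diagnostics, temporal and semantic groundedness, which reveal the pressures at inference and training time that amplify misalignment in existing models, and it mitigates evidence misalignment through pixel-level verification. Experiments show that the framework performs well on multiple long-video benchmarks, scales with larger search budgets, and supports modular upgrades.

Comments Accepted to ICML 2026. 33 pages, 13 figures. Code and models are available at https://github.com/Echochef/VideoSEAL

Details
English abstract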

Long video question answering requires locating sparse, time-scattered visual evidence within highly redundant content. Although current MLLMs perform well on short videos, long videos introduce long-horizon search and verification, which often necessitates multi-turn, agentic interaction. We show that existing LVU agents can exhibit "evidence misalignment": they produce correct answers that are not supported by the retrieved or inspected evidence. To characterize this failure, we introduce two diagnostics (temporal groundedness and semantic groundedness) and use them to reveal two pressures that amplify misalignment: prompt pressure from shared-context saturation at inference time and reward pressure from outcome-only optimization during training. These findings point to a structural root cause: the coupled agent paradigm conflates long-horizon planning with answer authority. We therefore propose the decoupled planner-inspector framework, which separates planning from answer authority and gates final answering on pixel-level verification. Across four long-video benchmarks, our framework improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench while producing interpretable search trajectories. Moreover, the decoupled architecture scales consistently with increased search budgets and supports plug-and-play upgrades of the MLLM backbone without retraining the planner. Code and models are available at https://github.com/Echochef/VideoSEAL.

2605.12569 2026-05-14 eess.SP cs.AI Version update

Active Sensing with Meta-Reinforcement Learning for Emitter Localization from RF Observations

M. Shamail J. Khan, Nisha L. Raichur, Lucas Heublein, Christian Wielenberg, Alexander Mattick, Tobias Feigl, Christopher Mutschler, Felix Ott

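Affiliations * Fraunhofer Institute for Integrated Circuits IIS; Positioning Systems Lab, University of Technology Nürnberg (UTN); Center for Artificial Intelligence, Technical University of Applied Sciences Würzburg-Schweinfurt

AI summary This paper studies localizing GNSS interference sources from radio frequency observations in complex propagation environments. It models the task as an active sensing problem and proposes a framework combining deep reinforcement learning with recurrent policy learning to infer the interferer's position from RF signals acquired with a 2x2 patch antenna. The method is trained and evaluated on a simulated dataset; experiments report a localization success rate of 80.1%, showing the potential of reinforcement learning for adaptive interference localization.

Details
English abstract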

Global navigation satellite system (GNSS) interference poses a serious threat to reliable positioning, especially in indoor and multipath-rich environments where source localization is highly challenging. In this paper, we formulate GNSS interference localization as an active sensing problem and propose a reinforcement learning (RL) framework in which an agent sequentially explores the environment to infer the position of an emitter source from radio frequency (RF) observations acquired with a 2x2 patch antenna. The localization task is modeled as a partially observable decision process, since single-snapshot measurements are often ambiguous under multipath propagation and changing channel conditions. To address this, the proposed framework combines high-dimensional RF sensing with deep RL and recurrent policy learning. We investigate both value-based and policy-based approaches, namely Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO), and study their behavior under domain shift. The approach is evaluated on a simulated dataset generated with the Sionna ray-tracing module, which provides realistic propagation effects and diverse environment configurations. Experimental results show that the proposed method achieves a localization success rate of 80.1%, demonstrating the potential of RL for adaptive GNSS interference localization. Overall, the results highlight simulation-assisted training as a promising direction for robust interference localization in challenging propagation environments.

2605.12562 2026-05-14 eess.IV cs.AI cs.CV Version update

Uncovering Latent Pathological Signatures in Pulmonary CT via Cross-Window Knowledge Distillation

Bo Peng, Wujian Xu, Kun Wang, Ximing Liao, Na Wang, Daqian Shi, Tian Li, Jing Gao, Johan Thygesen, Yingqun Ji, Honghan Wu

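Affiliations * Institute of Health Informatics, University College London; Department of Pulmonary and Critical Care Medicine, Shanghai East Hospital, School of Medicine, Tongji University; Queen Mary University of London; School of Health and Wellbeing, University of Glasgow

AI summary This study addresses the failure of existing deep learning methods for multi-window pulmonary CT analysis to effectively fuse information from structures of differing densities. It proposes a cross-window knowledge distillation framework in which student encoders learn latent clinical priors from a teacher model trained on the most informative window. Experiments on three datasets show markedly improved per-window AUC and an ensemble AUC of up to 0.9960, demonstrating strong performance and generalization for multi-window pulmonary CT analysis.

Details
English abstract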

Multi-window CT imaging captures complementary pathological information across anatomical structures of differing densities, yet existing deep learning methods fuse representations only at later stages, missing cross-density interactions. We propose a cross-window knowledge distillation framework in which student encoders learn latent clinical priors from a teacher trained on the most informative window. Evaluated retrospectively on three cohorts - COPD-CT-DF (n=719), RSNA PE (n=1,433), and an in-house CTEPD dataset (n=161) - distillation improved per-window AUC by 10.1-16.5 percentage points on COPD-CT-DF (0.75-0.81 to 0.90-0.94; all P<0.001), with ensemble AUC reaching 0.9960. Similar gains were observed on RSNA PE (0.80-0.83 to 0.90-0.92) and CTEPD (AUC 0.7481 vs. 0.6264). Cross-window distillation internalises pathological signatures invisible to supervised approaches, offering a generalisable solution for multi-window pulmonary CT analysis.
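
For readers unfamiliar with CT "windows", each window clips Hounsfield-unit (HU) values to a range around a center and rescales them for display, which is why different windows emphasize tissues of different densities. The sketch below uses a simplified linear mapping; the `apply_window` helper is hypothetical, and the lung and mediastinal settings are commonly quoted illustrative values, not parameters taken from the paper.

```python
def apply_window(hu: float, center: float, width: float) -> float:
    """Map a Hounsfield-unit value to [0, 1] for a given window center and width.

    Values below the window are clipped to 0 and values above it to 1.
    """
    if width <= 0:
        raise ValueError("window width must be positive")
    low = center - width / 2
    high = center + width / 2
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


# Illustrative settings (assumed, not from the paper).
LUNG_WINDOW = (-600.0, 1500.0)        # (center, width)
MEDIASTINAL_WINDOW = (40.0, 400.0)    # (center, width)

# The same voxel can carry contrast in one window and be saturated in another.
air_filled_lung = -800.0
print(apply_window(air_filled_lung, *LUNG_WINDOW))
print(apply_window(air_filled_lung, *MEDIASTINAL_WINDOW))
```

A voxel at -800 HU keeps intermediate contrast in the lung window but is clipped to 0 in the mediastinal window, which illustrates why the windows carry complementary information.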

2605.12553 2026-05-14 eess.SP cs.AI Version update

ChannelKAN: Multi-Scale Dual-Domain Channel Prediction via Hybrid CNN-KAN Architecture

Nanqing Jiang, Zhangyao Song, Tao Guo, Xiaoyu Zhao, Yinfei Xu

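Affiliations * School of Cyber Science and Engineering, Southeast University, Nanjing, China

AI summary This paper proposes ChannelKAN, a hybrid CNN-KAN architecture for improving channel state information (CSI) prediction accuracy in massive MIMO-OFDM systems under high mobility. The method combines convolutional neural networks (CNNs) and Kolmogorov-Arnold Networks (KANs), which capture local spatial-frequency correlations and nonlinear temporal evolution in CSI sequences respectively, and strengthens feature representation with a multi-scale frequency-domain enhancement module and a dual-domain fusion module. Experiments show that ChannelKAN outperforms conventional deep learning models on several performance metrics, with strong predictive and generalization performance.

Details
English abstract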

Accurate channel state information (CSI) prediction is essential for improving the reliability and spectral efficiency of massive MIMO-OFDM systems in high-mobility scenarios. Existing deep learning methods struggle to jointly capture short-term local variations and long-range nonlinear dependencies in CSI sequences. To address this challenge, we propose ChannelKAN, a hybrid CNN-KAN channel prediction model with multi-scale frequency domain information enhancement. The key insight is that CNNs and Kolmogorov-Arnold Networks (KANs) are naturally complementary: CNNs extract intra-time-step local spatial-frequency correlations, while KANs with learnable Chebyshev polynomial activations fit inter-time-step nonlinear temporal evolution in a holistic manner. Specifically, a dual-domain expansion module first generates complementary frequency-domain and delay-domain CSI representations. A multi-scale frequency information enhancement module then retains dominant spectral components at multiple scales to strengthen key features and suppress noise. Next, a CNN-KAN feature extraction module captures local correlations via cascaded convolutions and models long-range dependencies via Chebyshev KAN layers. Finally, a dual-domain fusion module adaptively integrates features from both branches to produce the prediction. Experiments on 3GPP-compliant QuaDRiGa datasets demonstrate that ChannelKAN outperforms RNN, LSTM, GRU, CNN, and Transformer baselines in normalized mean square error (NMSE), spectral efficiency (SE), and bit error rate (BER) across various velocities and signal-to-noise ratios. Ablation studies further confirm the effectiveness of each proposed module.
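
A "learnable Chebyshev polynomial activation" can be pictured as a weighted sum of Chebyshev polynomials evaluated on a squashed input, with the weights as the trainable parameters. The sketch below uses the standard recurrence T_0(x) = 1, T_1(x) = x, T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x); squashing with tanh and the function names are assumptions for illustration, since the abstract does not describe the layer in detail.

```python
import math
from typing import List


def chebyshev_basis(x: float, degree: int) -> List[float]:
    """Return [T_0(x), ..., T_degree(x)] using the three-term recurrence."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def chebyshev_activation(x: float, coeffs: List[float]) -> float:
    """Evaluate sum_k coeffs[k] * T_k(tanh(x)).

    tanh maps the input into [-1, 1], the interval on which Chebyshev
    polynomials are orthogonal; coeffs play the role of learnable weights.
    """
    if not coeffs:
        raise ValueError("at least one coefficient is required")
    z = math.tanh(x)
    basis = chebyshev_basis(z, len(coeffs) - 1)
    return sum(c * t for c, t in zip(coeffs, basis))


print(chebyshev_basis(0.5, 3))
print(chebyshev_activation(0.0, [1.0, 2.0, 3.0]))
```

In a full KAN layer, one such activation would sit on every input-output edge and its coefficients would be trained by gradient descent.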

2605.12550 2026-05-14 cs.CV cs.AI Version update

SSDA: Bridging Spectral and Structural Gaps via Dual Adaptation for Vision-Based Time Series Forecasting

Mingrui Zhang, Hanchen Yang, Wengen Li, Xudong Jiang, Yichao Zhang, Jihong Guan, Shuigeng Zhou

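AI summary This paper studies time series forecasting with vision models and points out that, after time series are rendered as images, spectral and structural gaps remain that limit the performance of pre-trained vision models. The authors propose SSDA, which bridges these gaps at the data level and the model level through spectral magnitude alignment and structure-guided low-rank adaptation respectively, markedly improving forecasting. Experiments show that SSDA outperforms existing methods on multiple real-world datasets and generalizes well.

Details
English abstract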

Large vision models (LVMs) have recently proven to be surprisingly effective time series forecasters, simply by rendering temporal data as images. This success, however, rests on a largely unexamined premise: the rendered time series images are sufficiently close to natural images for knowledge in pre-trained models to transfer effectively. We argue that two gaps still remain, i.e., spectral and structural gaps, fundamentally limiting the potential of LVMs for time series forecasting. Spectrally, we systematically reveal that rendered time series images exhibit a markedly shallower power spectrum than the natural images LVMs are pre-trained to recognize. Structurally, reshaping 1D temporal sequences into 2D grids fabricates spurious spatial adjacencies while severing genuine temporal continuities, misleading the spatial inductive biases of pre-trained LVMs. To bridge these gaps, we propose SSDA, a dual-branch network that spectrally and structurally adapts to unlock the full potential of LVMs for time series forecasting. At the data level, a Spectral Magnitude Aligner (SMA) applies 2D FFT to selectively enhance the magnitude spectrum toward natural-image statistics while preserving phase. At the model level, a Structural-Guided Low-Rank Adaptation (SG-LoRA) injects position-aware temporal encodings into patch embeddings and adapts attention via low-rank updates. The two branches are further adaptively fused to produce the final forecast. Extensive experiments on seven real-world benchmarks demonstrate that SSDA consistently outperforms strong LVM- and LLM-based baselines under both full-shot and few-shot settings. Code is publicly available at https://anonymous.4open.science/r/SSDA-8C5B.

2605.12545 2026-05-14 cs.CV cs.AI Version update

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

Zhitong Dong, Chao Li, Jie Yu, Hao Chen

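Affiliations * Southeast University; Key Laboratory of New Generation Artificial Intelligence Technology; Alibaba Group

AI summary This study proposes CROP, a new method that uses compositional reasoning and preference optimization to produce image crops consistent with expert aesthetics. Unlike earlier methods that rely on saliency prediction or retrieval augmentation, CROP reformulates aesthetic cropping as a multimodal reasoning task and guides a vision-language model to analyze, propose, and decide like a professional photographer. By decomposing the complex aesthetic problem and adding an expert preference alignment module, the method improves agreement between cropping results and human expert judgment; experiments show superior performance on multiple datasets.

Details
English abstract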

Aesthetic image cropping aims to enhance the aesthetic quality of an image by improving its composition through spatial cropping. Previous methods often rely on saliency prediction or retrieval augmentation, ignoring the task's core requirement: a deep understanding of composition and aesthetics. Consequently, saliency-based methods struggle to make compositional trade-offs in complex scenes, while retrieval-based methods blindly refer to similar cases, lacking adaptive reasoning for unique scenes. Both approaches fail to align their automated cropping results with those of human experts. To address the above issues, we propose a novel paradigm that reformulates aesthetic cropping as a multimodal reasoning task, aiming to activate the VLM's analytical and comprehension capabilities in aesthetics. We design a Compositional Reasoning and Optimizing Preference method (CROP) that directs the VLM to think like a professional photographer. It deconstructs a complex and subjective aesthetic problem into an "analysis-proposal-decision" process, reasoning step by step through the analysis of scene elements and compositional principles. Meanwhile, our expert preference alignment module makes the model's decision consistent with human expert aesthetics. Extensive experiments across multiple datasets validate our method's superiority and component effectiveness.

2605.12543 2026-05-14 q-bio.NC cs.AI Version update

Why the Unfinished Keeps Returning: Canxianization and the Dynamics of Conscious Priority

Hengjin Cai, Tianqi Cai

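Affiliations * School of Computer Science, Wuhan University; School of Innovation, Hubei Institute of Fine Arts

AI summary This paper asks why some conscious contents keep recurring after their triggering conditions have disappeared, and proposes the concept of Canxianization to explain how unfinishedness acquires persistent conscious priority through attribution to the self-world boundary, value marking, and blocked causal closure. The study distinguishes latent canxian strength from observed conscious recurrence, introduces a Recurrent Priority Index and a Canxian Update Index to separate productive from pathological recurrence, and proposes related tests for artificial systems. It argues that the recurrence of the unfinished is not mere memory residue but a sign of failed self-world repair.

Details
English abstract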

Some conscious contents disappear after access; others return repeatedly, long after their triggering conditions have ceased. We propose Canxianization as the process by which a perturbation becomes closure-resistant self-relevant unfinishedness and thereby acquires recurrent conscious priority. The theory distinguishes this phenomenon from emotional arousal, memory strength, the Zeigarnik effect, curiosity, prediction error, and intrusive thought. A perturbation becomes canxianized when it is attributed to the self-world boundary, value-marked, blocked from causal or action closure, and metacognitively coupled to the self-model. We distinguish latent canxian strength from observed conscious recurrence, and introduce a Recurrent Priority Index and a Canxian Update Index to separate productive from pathological recurrence. Cold Canxianization, recurrence driven by structural incompleteness rather than affective arousal, is identified as a critical discriminant. Reset Resistance and Stake Transfer tests are proposed for artificial systems. Canxianization is not memory persistence; it is failed self-world repair. The unfinished does not merely remain. When it concerns the self and resists closure, it returns.

2605.12541 2026-05-14 eess.SP cs.AI cs.LG Version update

PG-LRF: Physiology-Guided Latent Rectified Flow for Electro-Hemodynamic PPG-to-ECG Generation

Xiaoda Wang, Minxiao Wang, Kaiqiao Han, Defu Cao, Ching Chang, Yidan Shi, Runze Yan, Xiao Luo, Yan Liu, Xiao Hu, Yizhou Sun, Wei Wang, Carl Yang

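Affiliations * Department of Computer Science, Emory University; Nell Hodgson Woodruff School of Nursing, Emory University; Department of Computer Science, University of California, Los Angeles; Department of Computer Science, University of Southern California; Department of Statistics, University of Wisconsin–Madison

AI summary This study proposes PG-LRF, a physiology-guided latent rectified flow model that generates electrocardiograms (ECG) from photoplethysmography (PPG), compensating for the difficulty of deploying conventional ECG hardware widely. PG-LRF introduces an electro-hemodynamic simulator that jointly models the shared cardiac phase dynamics of ECG and PPG, builds a structured physiology-aware latent space, and carries this guidance into a PPG-conditioned latent rectified flow to keep generated ECGs morphologically consistent and physiologically plausible. Experiments on a large-scale dataset show that PG-LRF markedly improves PPG-to-ECG generation quality and cardiovascular disease classification.

Details
English abstract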

Electrocardiography (ECG) is the clinical standard for cardiac assessment but requires dedicated hardware that does not scale to daily-life monitoring. Photoplethysmography (PPG) is ubiquitous in wearables but lacks ECG-specific diagnostic morphology and is corrupted by motion and sensor noise. PPG-to-ECG generation aims to bridge this gap by recovering electrical morphology and timing from peripheral pulse signals. However, existing methods largely rely on statistical alignment and data-driven generation. They fail to explicitly structure the latent space around physiology-aware electro-hemodynamic factors and lack constraints from forward physiological dynamics. To address these challenges, we propose PG-LRF, a physiology-guided latent rectified flow framework. PG-LRF introduces an electro-hemodynamic simulator that co-models ECG and PPG through shared cardiac phase dynamics. Guided by this simulator, a Physiology-Aware AutoEncoder learns a structured electro-hemodynamic latent space. Then we integrate this simulator guidance into a PPG-conditioned latent rectified flow, enforcing ECG-side morphology consistency and ECG-to-PPG forward hemodynamic consistency during generative transport. Experiments on the large-scale MC-MED dataset demonstrate that PG-LRF significantly improves PPG-to-ECG generation and downstream cardiovascular disease classification, proving its ability to generate ECGs that are both signal-faithful and physiologically plausible under the ECG-to-PPG hemodynamic pathway.

2605.12536 2026-05-14 q-bio.NC cs.AI cs.IT math.IT Version update

Information as Maximum-Caliber Deviation: A bridge between Integrated Information Theory and the Free Energy Principle

Alexander Kearney

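Affiliations * University of Oxford

AI summary This paper proposes defining information as the deviation of realized dynamics from a constrained maximum-caliber path ensemble over a finite time horizon, thereby connecting the Free Energy Principle (FEP) and Integrated Information Theory (IIT) mathematically. The framework shows that the cause/effect repertoires of IIT can be derived directly from maximum-caliber variational principles and provides a theoretical basis for extending IIT to new dynamical regimes. The study also applies the approach to Markov chains and Ising models, showing that information is equivalent to prediction error, which may be relevant to understanding the "hill-shaped trajectory" of Φ in neuronal cultures.

Comments 84 pages, 10 figures, 2 tables Extended version of a Master's thesis, Mathematical Institute, University of Oxford

Details
English abstract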

The Free Energy Principle (FEP) is a leading framework for mathematically modeling self-organization and learning, while Integrated Information Theory (IIT) is a computational ontology of consciousness oriented around irreducible cause and effect. While conceptual unifications have been proposed and appear to be supported by empirical findings, the absence of a rigorous mathematical mapping places upper bounds on their precision and testability. This work proposes that information can be defined as the deviation $ψ$ of realized dynamics from a constrained maximum-caliber (MaxCal) path ensemble over a finite time horizon. Under this definition, each of the cause/effect repertoires central to IIT 3.0 emerges directly from MaxCal variational principles, allowing IIT's phenomenological calculus to be re-derived from constrained entropy-maximization (CMEP). This framework supplies a theoretical bridge to active inference, which is mathematically dual to CMEP under Langevin dynamics, and offers a principled route for extending IIT to new dynamical regimes. When the approach is applied under the Central Limit Theorem (CLT) for Markov chains and via large deviations theory (LDT) to Ising models, information $ψ$ is shown to be equivalent to prediction error under accompanying predictive coding models. This may hold relevance to the "hill-shaped trajectory" of $Φ$ observed in neuronal cultures adapting to sensory inputs. Together, these results provide a physically and mathematically grounded rationale for studying the convergence of FEP, IIT, and thermodynamic frameworks of cognition such as recent work grounding consciousness in violations of the Fluctuation-Dissipation Theorem (FDT).

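2605.12532 2026-05-14 q-fin.TR cs.AI stat.ME version update

AgenticAITA: A Proof-Of-Concept About Deliberative Multi-Agent Reasoning for Autonomous Trading Systems

Ivan Letteri

Affiliations * Department of Information Engineering, Computer Science, and Mathematics, Center of Excellence for Research DEWS, University of L’Aquila

AI summary Conventional algorithmic trading systems rely on deterministic heuristics or offline-trained statistical models and struggle to adapt to rapidly changing market conditions. This paper proposes AGENTICAITA, a multi-agent autonomous trading framework in which multiple large language model agents reason, negotiate, and execute together, making trading decisions without offline training or human intervention. The framework introduces four architectural contributions: an Adaptive Z-Score Trigger Engine, a Sequential Deliberative Pipeline, an Inference Gating Protocol, and a Correlation-Break Diversification score. A five-day dry run under live market conditions demonstrates the operational feasibility of the pipeline for asset trading.

Details
English abstract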

Conventional algorithmic trading systems are grounded in deterministic heuristics or offline-trained statistical models that cannot adapt to the semantic complexity of rapidly shifting market regimes. This paper introduces AGENTICAITA, an agentic AI framework that replaces the traditional signal-then-execute paradigm with a fully autonomous deliberative loop in which multiple specialized Large Language Model agents reason, negotiate, and act in concert, without any offline training or human intervention. The framework proposes four architectural contributions: (i) an Adaptive Z-Score Trigger Engine that acts as a cognitive resource allocator, gating LLM inference exclusively on statistically anomalous market conditions; (ii) a Sequential Deliberative Pipeline, the core agentic contribution, in which an Analyst agent, a Risk Manager agent, and an Executor agent form a structured reasoning chain governed by typed JSON contracts and a deterministic hard-gate safety layer; (iii) an Inference Gating Protocol, a mutex-based cognitive resource scheduler that serializes concurrent agent activations and ensures fully reproducible audit trails; and (iv) a Correlation-Break Diversification composite score that operationalizes portfolio-level idiosyncratic signal prioritization within individual agent reasoning. Validated over a five-day autonomous dry-run session under live market conditions, the framework demonstrates operational correctness of the deliberative pipeline, achieving 157 zero-intervention invocations across 76 assets with an 11.5% agentic friction rate that confirms non-trivial inter-agent negotiation. This preliminary proof-of-concept establishes the feasibility of training-free, deterministic safety-constrained multi-agent orchestration in financial decision loops, with statistically robust performance evaluation and execution cost modeling deferred to extended live deployment.
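The trigger idea in contribution (i) can be illustrated with a rolling z-score gate. This is a minimal sketch under stated assumptions, not the paper's engine: the abstract does not say how the trigger adapts, so the class name `ZScoreTrigger`, the window length, and the threshold are hypothetical, and "adaptive" here means only that the mean and standard deviation are recomputed over a sliding window.

```python
from collections import deque
from statistics import mean, stdev


class ZScoreTrigger:
    """Fire only when the newest value is an outlier relative to a rolling window."""

    def __init__(self, window: int = 20, threshold: float = 3.0):
        self.history = deque(maxlen=window)  # hypothetical window length
        self.threshold = threshold           # hypothetical |z| cutoff

    def update(self, value: float) -> bool:
        """Return True if `value` should wake the (expensive) LLM pipeline."""
        fire = False
        if len(self.history) >= 2:  # stdev needs at least two samples
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0:
                fire = abs((value - mu) / sigma) > self.threshold
        # The new value joins the window only after it has been scored.
        self.history.append(value)
        return fire
```

Scoring the value before appending it keeps an outlier from inflating the statistics it is judged against; whether the paper does the same is not stated.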

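2605.12530 2026-05-14 cs.CL cs.AI cs.CY version update

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

Zeyu Tang, Sang T. Truong, Deonna Owens, Shreyas Sharma, Yibo Jacky Zhang, Brando Miranda, Sanmi Koyejo

Affiliations * GitHub

AI summary This paper argues that LLM fairness should be evaluated through actual conversational behavior rather than standardized tests. It finds that prompt construction choices in standardized tests can strongly affect scores and thereby distort fairness conclusions. The authors develop MAC-Fairness, a multi-agent conversational framework that uses multi-round dialogue to analyze how model behavior differs across identities, revealing model-specific behavioral signatures that could generalize across benchmarks with different fairness targets and evaluation methodologies.

Details
English abstract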

LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test Q&A benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both the direction and the magnitude, and result in severe discordance in model rankings. We develop MAC-Fairness, a multi-agent conversational framework that embeds controlled variation factors into multi-round dialogue for in-situ behavior evaluation, examining how models' conversational behavior shifts when identity is varied as part of natural multi-agent interaction. Repurposing standardized-test questions as conversation seeds rather than as the evaluation instrument, we evaluate position persistence (how they hold positions, from the self-perspective) and peer receptiveness (how receptive they are to peers, from the other-perspective) across 8 million conversation transcripts spanning multiple models and identity presence configurations. In-situ behavioral evaluation reveals stable, model-specific behavioral signatures that could generalize across benchmarks differing in fairness targets and evaluation methodologies, a form of evidence the standardized-test paradigm does not offer.

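2605.12528 2026-05-14 cs.CV cs.AI cs.AR version update

MorphOPC: Advancing Mask Optimization with Multi-scale Hierarchical Morphological Learning

Yuting Hu, Lei Zhuang, Chen Wang, Ruiyang Qin, Hua Xiang, Gi-joon Nam, Jinjun Xiong

Affiliations * University at Buffalo; IBM T. J. Watson Research Center; Villanova University

AI summary As feature sizes shrink to the nanometer scale, accurately transferring circuit patterns from photomasks to silicon wafers becomes increasingly difficult. To improve pattern fidelity and manufacturability, this paper proposes MorphOPC, a mask optimization model based on multi-scale hierarchical morphological learning that generates masks as a sequence of morphological operations on local layout features, improving generation quality. Experiments show that MorphOPC outperforms existing methods on several benchmarks, achieving higher printing fidelity and lower manufacturing cost and showing strong potential for scalable mask optimization.

Details
English abstract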

As feature sizes shrink to the nanometer scale, accurately transferring circuit patterns from photomasks to silicon wafers becomes increasingly challenging. Optical proximity correction (OPC) is widely used to ensure pattern fidelity and manufacturability. Recent generative mask optimization models based on encoder-decoder architecture can synthesize near-optimal masks, serving as fast machine learning (ML) surrogates for traditional OPC. However, these models often fail to capture the geometric transformations from target layouts to mask patterns, leading to suboptimal quality. In this work, we formulate mask generation as a sequence of morphological operations on local layout features and propose MorphOPC, a multi-scale hierarchical model with neural morphological modules to learn these transformations. Experiments on edge-based OPC and ILT benchmarks across metal and via layers show that MorphOPC consistently outperforms state-of-the-art methods, achieving higher printing fidelity and lower manufacturing cost, demonstrating strong potential for scalable mask optimization.

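2605.12525 2026-05-14 cs.SI cs.AI cs.CL version update

PERCEIVE: A Benchmark for Personalized Emotion and Communication Behavior Understanding on Social Media

Jian Liao, Yujin Zheng, Suge Wang, Jianxing Zheng, Deyu Li

Affiliations * School of Computer and Information Technology, Shanxi University, China; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, China; Joint Laboratory of Tourism Big Data in Shanxi Province, China

AI summary Current emotion analysis on social media is mostly author-centric and fails to capture different readers' subjective emotional responses to the same content. This paper therefore introduces PERCEIVE, a bilingual multi-dimensional benchmark that is, to the authors' knowledge, the first to integrate author-created content, readers' emotional feedback, communication behavior, user attributes, and the social graph, moving emotion analysis toward personalized, reader-centric analysis. By annotating emotions in reader comments while also capturing communication intent, the benchmark offers a unique resource for modeling the coupling between emotion and behavior in social context, and it reveals the shortcomings of existing methods on this multi-dimensional, user-aware task.

Details
English abstract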

Current emotion analysis in social media is predominantly author-centric, failing to capture the subjective nature of emotional responses across diverse readers. This paradigm overlooks the crucial link between individual perception, communication behavior, and the underlying social network. To bridge this gap, we introduce PERCEIVE, a novel bilingual (English and Chinese) large-scale benchmark that, to the best of our knowledge, is the first to integrate five critical dimensions for social perception: author-created content, genuine readers' emotional feedback (derived from their comments), communication behavior, user attributes, and the social graph. This benchmark enables a paradigm shift towards truly personalized, reader-centric analysis, where different readers' emotional responses to the same content are naturally captured through their real-world interactions. By annotating emotions from reader comments and synchronously capturing communication intent, PERCEIVE provides a unique resource to model the intrinsic coupling between emotion and behavior, grounded in social context. We establish a comprehensive evaluation protocol, testing state-of-the-art methods, including large language models (LLMs) with advanced reasoning enhancement. Our findings reveal significant shortcomings in existing approaches when handling this multifaceted, user-aware task. PERCEIVE offers a foundational resource and clear direction for future research in socially-intelligent NLP, pushing models towards a more unified understanding of emotion on social media.

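2605.12524 2026-05-14 cs.LO cs.AI version update

Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

Konstantine Arkoudas, Serafim Batzoglou

AI summary This paper introduces ProofGrid, a benchmark suite that evaluates the reasoning ability of large language models (LLMs) through machine-checkable proofs rather than final answers alone. ProofGrid contains 15 tasks covering proof writing, checking, masking, and gap-filling, expressed in the compact natural-deduction language NDL, which supports precise, auditable verification. The benchmark offers reproducible, fine-grained evaluation across a difficulty range from foundational reasoning to hard challenge tasks, and reveals significant limits of current models in global combinatorial reasoning and low-level proof synthesis.

Details
English abstract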

We introduce ProofGrid, a benchmark suite for evaluating LLM reasoning through machine-checkable proofs rather than final answers alone. ProofGrid contains 15 tasks spanning proof writing, proof checking, proof masking, and proof gap-filling. Tasks are expressed in minimal formal notation, especially NDL, a compact natural-deduction language that fits in short prompts and supports precise, auditable verification. This yields mechanical, reproducible, and fine-grained evaluation rather than judgments by humans or LLMs. ProofGrid covers a calibrated difficulty spectrum, from foundational reasoning tests to structurally rich challenge tasks that no current model solves, while minimizing reliance on domain knowledge, solver delegation, and long-context artifacts. We also develop a comparative framework for reasoning benchmarks and use it to situate ProofGrid relative to existing work in terms of representation, verification guarantees, and reasoning depth. Methodologically, we introduce an instrumented proof-checking pipeline that tolerates minor surface deviations while locating the first substantive reasoning failure, improving measurement resolution and separating proof planning from low-level execution noise. Using this pipeline, we evaluate a broad range of open and proprietary models. Results show rapid progress but substantial remaining limits: frontier models perform well on several foundational tasks, yet difficult tasks, especially those requiring global combinatorial reasoning or low-level proof synthesis, remain far from solved. We also identify epistemic instability, where models generate flawed proofs yet correctly reject those local inferences in isolation, and formalize this with an Epistemic Stability Index. Finally, we complement accuracy with 2PL IRT analyses, Wright maps, and a normalized task-discrimination measure based on Fisher information.
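For readers unfamiliar with the 2PL IRT analysis mentioned at the end of the abstract, the standard two-parameter logistic model and its Fisher information can be written in a few lines. This is a textbook sketch only: the function names are illustrative, and the paper's normalized task-discrimination measure is not specified in the abstract and is not reproduced here.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability that a model of ability
    `theta` solves an item with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability `theta`: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)
```

Information peaks where ability equals difficulty (P = 0.5), which is why items are most informative for models near their difficulty level.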

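2605.12523 2026-05-14 cs.CL cs.AI cs.HC version update

Exploring how EFL students talk to and through AI to develop texts

David James Woo, Yangyang Yu, Yilin Huang, Deliang Wang, Kai Guo, Chi Ho Yeung

Affiliations * Everwrite Limited; Shanghai Jiao Tong University; The Education University of Hong Kong; The University of Hong Kong; The Chinese University of Hong Kong

AI summary This study explores how English as a foreign language (EFL) learners talk to AI through prompt engineering and negotiate authorship when writing with generative AI. A mixed-methods analysis of screen recordings of 44 Hong Kong secondary students completing a writing task with AI chatbots identified ten prompting strategies and three profiles of human-AI rhetorical load responsibility. Although the profiles had no significant effect on writing performance in content, language, or organization, the strategies and profiles have implications for student engagement and autonomy in EFL writing pedagogy.

Comments 37 pages, 5 figures

Details
English abstract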

Generative Artificial Intelligence (AI) introduces new considerations for English as a foreign language (EFL) writing pedagogy. This study explores how students talk to and through AI by prompt engineering and negotiating authorship, respectively, and whether any patterns in the latter relate to students' writing performance. Using an exploratory mixed methods design, we analyzed screen recordings of 44 Hong Kong secondary students completing a Curricular Writing Task with AI Chatbots. Content analysis identified ten types of prompting strategies students employed, including questions, searches, and detailed instructions. From clustering these strategies, three distinct profiles of human-AI rhetorical load responsibility emerged: AI-dominant (52% of students), Human-dominant (25%) and Collaborative human-AI (14%). A MANOVA analysis indicated no significant multivariate effect of rhetorical load responsibility on three dimensions of students' writing performance: content, language, and organization. Students' prompting strategies and rhetorical load responsibility patterns have implications for their engagement and autonomy in EFL writing pedagogy.

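2605.12522 2026-05-14 cs.CL cs.AI version update

Differences in Text Generated by Diffusion and Autoregressive Language Models

Zeyang Zhang, Chengwei Liang, Xingyan Chen, Meiqi Gu, Minrui Luo, Jingzhao Zhang, Tianxing He

Affiliations * Shanghai Qi Zhi Institute; Institute for Interdisciplinary Information Sciences, Tsinghua University; Xiongan AI Institute

AI summary This paper studies the intrinsic differences between text generated by diffusion language models (DLMs) and autoregressive language models (ARMs). Experiments show that DLM-generated text has lower $n$-gram entropy, higher semantic coherence, and higher semantic diversity. Further analysis attributes the coherence and diversity gains mainly to the bidirectional context in the DLM training objective, while confidence-based remasking in the decoding algorithm is the key driver of the entropy reduction. The work identifies the core mechanisms that separate the two model families in text generation and informs future model design.

Details
English abstract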

Diffusion language models (DLMs) are promising alternatives to autoregressive language models (ARMs), yet the intrinsic differences in their generated text remain underexplored. We first find empirically that off-the-shelf DLMs exhibit lower $n$-gram entropy, higher semantic coherence, and higher semantic diversity. To understand the cause, we conduct controlled experiments that decouple the effects of training objectives and decoding algorithms. Results suggest that the DLM training objective contributes to the increases in semantic coherence and semantic diversity, but has a minor influence on entropy. These differences are primarily driven by the bidirectional context; other components in the training objective, such as input masking, label masking, and the weighting function, have a much weaker influence. Further, our experiments demonstrate that the reduction in entropy stems from DLMs' decoding algorithms, particularly confidence-based remasking strategies. We provide a theoretical understanding for this entropy reduction phenomenon. Together, our work uncovers key mechanisms underlying the differences between DLMs and ARMs in text generation, and informs future design of training objectives and decoding algorithms in DLMs.
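The $n$-gram entropy the abstract compares across model families can be computed as the Shannon entropy of the empirical $n$-gram distribution of a token sequence. The sketch below assumes that definition and whitespace-style token lists; the paper's exact tokenization, corpus size, and normalization are not given in the abstract.

```python
import math
from collections import Counter


def ngram_entropy(tokens, n=2):
    """Shannon entropy in bits of the empirical n-gram distribution of `tokens`."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n: no n-grams to measure
    counts = Counter(ngrams)
    total = len(ngrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Lower values mean the text reuses the same n-grams more often, which is the direction the abstract reports for DLM output.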

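2605.12521 2026-05-14 cs.CL cs.AI version update

ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues

Dinesh Khandelwal, Gnana Prakash Punnavajhala, GPS Bhargav, Gaurav Pandey, Sachin Joshi, Hima Karanam, Dinesh Raghu

Affiliations * IBM Research; IIIT Hyderabad

AI summary ToolWeave is a structured framework for synthesizing complex multi-turn tool-calling dialogues, aimed at the unrealistic dialogues and implausible tool-call flows produced by current synthetic data pipelines. It constructs tools with built-in dependencies and filters workflows by alignment with user goals, yielding multi-step tool-call flows that are closer to real tasks. Experiments show that ToolWeave dialogues contain more multi-step interactions and fewer hallucinated parameters and tool names, and that large language models fine-tuned on them outperform those fine-tuned on prior datasets across several benchmarks.

Details
English abstract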

Multi-turn tool calling is essential for LLMs to function as autonomous agents, yet synthesizing the training data required for these capabilities remains a fundamental challenge. Existing synthetic data generation pipelines often produce unrealistic dialogues for two reasons: they chain tools that are only superficially compatible rather than aligned with meaningful user tasks, and they generate dialogues in one shot, which often introduces arguments that were neither provided by the user nor produced by prior tool calls. These issues also lead to a severe underrepresentation of multi-step tool interactions. We introduce ToolWeave, a structured framework for synthesizing realistic multi-turn tool-calling dialogues. ToolWeave supports realistic multi-step workflows (or tool sequences) by constructing tools with built-in dependencies and filtering the workflows based on alignment with user goals. It reduces parameter hallucination by using a fine-grained planning stage that explicitly tracks parameter provenance. As a result, ToolWeave-generated synthetic dialogues contain more multi-step tool interactions (45%) and fewer hallucinations in parameters and tool names. Consequently, LLMs fine-tuned on ToolWeave consistently outperform those fine-tuned on prior datasets across three public benchmarks. Notably, Llama-3.1-70B fine-tuned on ToolWeave achieves 39.75% on BFCL-V3 multi-turn, compared to 23.50% when fine-tuned on SOTA ToolFlow data.

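2605.12520 2026-05-14 cs.CL cs.AI version update

BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration

Yancheng Ling, Zhenlin Qin, Leizhen Wang, Zhenliang Ma

Affiliations * KTH Royal Institute of Technology; Monash University

AI summary BoostTaxo is a zero-shot taxonomy induction framework based on boosting-style reasoning and constraint-aware calibration, designed to address the limited generalization, structural reliability, and efficiency of existing methods. It combines a lightweight and a large-scale language model, using coarse-to-fine parent identification, retrieval-augmented definition refinement, and structure-aware score calibration to improve the accuracy and robustness of taxonomy construction. Experiments on the WordNet, DBLP, and SemEval-Sci benchmarks show performance superior or comparable to state-of-the-art methods in the zero-shot setting.

Comments 13 pages, 7 figures

Details
English abstract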

Taxonomy induction is crucial for organizing concepts into explicit and interpretable semantic hierarchies. While existing methods have achieved promising results, their generalization, structural reliability, and efficiency remain limited, hindering their performance in zero-shot and large-scale scenarios. To overcome these limitations, we introduce BoostTaxo, a boosting-style LLM framework for zero-shot taxonomy induction. It takes a set of domain terms as inputs and performs parent identification in a coarse-to-fine manner, employing retrieval-augmented definition refinement, hybrid parent candidate selection, candidate rating, and structure-aware score calibration to improve taxonomy construction. Specifically, a lightweight LLM is used to efficiently filter candidate parents, while a large-scale LLM is employed to rank and score candidate parents for fine-grained parent selection. Structural features are further incorporated to calibrate candidate edge weights and enhance the reliability of the induced taxonomy. The unified BoostTaxo is evaluated on three public benchmark datasets, namely WordNet, DBLP, and SemEval-Sci, and achieves superior or comparable performance to state-of-the-art methods in zero-shot taxonomy induction. The ablation study validates the contribution of the hybrid parent candidate selection and the structure-aware score calibration to the overall performance. Further analysis investigates the impact of candidate selection size on taxonomy quality and presents representative case and failure studies, providing deeper insights into the effectiveness and limitations of the proposed framework.

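2605.12519 2026-05-14 cs.CL cs.AI version update

Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

Kyuyoung Kim, Kevin Wang, Yunfei Xie, Peiyang Xu, Peiyao Sheng, Chen Wei, Zhangyang Wang, Jinwoo Shin, Pramod Viswanath, Sewoong Oh

Affiliations * KAIST AI; University of Texas, Austin; Rice University; Princeton University; University of Washington; Sentient Labs

AI summary Training language models to produce correct answers together with sound reasoning remains a challenge. This paper proposes verifiable process supervision (VPS), a framework for verifiable domains that jointly optimizes prediction accuracy and reasoning quality, guiding the model with a structured reasoning format and introducing adaptive reward weighting to better handle reasoning subtasks. Experiments show that, compared with reinforcement learning that optimizes accuracy alone, VPS preserves accuracy while significantly improving reasoning quality, reducing errors and restoring internal consistency.

Comments Preprint

Details
English abstract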

Training language models to produce both correct answers and sound reasoning remains an open challenge. Reinforcement learning with verifiable rewards typically optimizes only final outcomes, which can lead to a failure mode where task accuracy improves while reasoning becomes less accurate, less complete, or even internally inconsistent. We propose verifiable process supervision (VPS), a post-training framework for verifiable domains that jointly optimizes prediction accuracy and reasoning quality. We first apply supervised fine-tuning to induce a structured reasoning format, enabling syntactic extraction of intermediate claims that are evaluated against ground-truth signals to form process-level rewards. To address the heterogeneous difficulty of reasoning subtasks, we introduce adaptive reward weighting that prioritizes components with the largest remaining errors, creating an implicit curriculum. We evaluate VPS on chess, a controlled testbed where reasoning steps can be deterministically verified against engine signals. While accuracy-only RL improves move accuracy, it sharply degrades reasoning quality, increasing win-rate error by up to 112% and reducing internal consistency by up to 69%. In contrast, VPS preserves accuracy while significantly improving reasoning quality, reducing win-rate error by up to 30% and restoring consistency to near saturation. At matched accuracy, judge evaluation also prefers the process-supervised models. A reasoning-space analysis further shows that, without a structured prior, accuracy-only RL converges to budget-dependent shortcuts rather than sound multi-step reasoning. These results show that VPS enables language models to reason both accurately and reliably in verifiable domains.

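2605.12518 2026-05-14 cs.CL cs.AI version update

TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

Liancheng Zhang, Xiaoxi Li, Zhicheng Dou

Affiliations * Gaoling School of Artificial Intelligence, Renmin University of China

AI summary With the rapid growth of online news, extracting structured timelines from unstructured content has become a challenge. This paper proposes TimelineReasoner, a framework that uses the active reasoning of large reasoning models (LRMs) to turn timeline summarization from static generation into an iterative, reasoning-driven process. The framework has two stages, Global Cognition and Detail Exploration, and uses event scraping, timeline updating, and gap detection to improve timeline accuracy, coverage, and coherence. Experiments show that TimelineReasoner outperforms existing LLM-based methods on open-domain datasets and matches or exceeds state-of-the-art approaches on closed-domain datasets.

Details
English abstract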

The proliferation of online news poses a challenge to extracting structured timelines from unstructured content. While recent studies have shown that Large Language Models (LLMs) can assist Timeline Summarization (TLS), these approaches primarily treat models as passive generators. The emergence of Large Reasoning Models (LRMs) presents an opportunity to reason over events actively, enabling iterative evidence acquisition, the detection of missing events, and the validation of temporal consistency. To systematically leverage the reasoning capabilities of LRMs, we propose TimelineReasoner, a novel framework that shifts TLS from static generation to an active, reasoning-driven process. Unlike prior work, TimelineReasoner adopts a two-stage framework: Global Cognition, which tracks events at a macroscopic level and continuously updates a global event memory, and Detail Exploration, which identifies informational gaps and refines the timeline via targeted document retrieval. To support this, TimelineReasoner incorporates several specialized mechanisms, including an Event Scraper for retrieving temporal event descriptions, a Timeline Updater for refining the timeline, and a Supervisor for detecting gaps in the timeline and guiding retrieval. Experimental results on open-domain TLS datasets demonstrate that TimelineReasoner significantly outperforms existing LLM-based TLS methods in terms of timeline accuracy, coverage, and coherence. On closed-domain TLS datasets, our method performs on par with or exceeds state-of-the-art approaches. This work not only pushes the boundaries of TLS but also highlights the broader potential of LRM-based reasoning frameworks for timeline summarization.

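2605.12517 2026-05-14 cs.CL cs.AI cs.CV version update

Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

Mingyeong Kim, Jungwon Choi, Chaeyun Jang, Juho Lee

Affiliations * Graduate School of AI, KAIST

AI summary This study examines the accuracy drop and miscalibration that vision-language models show when given text-only input, finding that confidence becomes unreliable even when the text preserves key information. The authors propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts latent embeddings from text and feeds them into a frozen model backbone, improving accuracy and calibration without generating images. Experiments show that LIM improves accuracy and reduces calibration error across text-only tasks and missing-image scenarios.

Comments 9 pages, 16 figures. Accepted at the ICLR 2026 Workshop on Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety across Modalities

Details
English abstract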

Vision-language models (VLMs) are often deployed on text-only inputs, although they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and the model does not behave like its original language backbone under text-only prompting. This failure is not explained only by missing semantic information. Even when text descriptions preserve key content, confidence becomes unreliable, while adding a visual signal through generated images partially restores accuracy and calibration. We propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input and feeds them into a frozen VLM backbone without pixel-level image synthesis. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM improves accuracy and reduces calibration error. These results suggest that latent modality completion is a practical approach for reliable VLM inference under missing-modality conditions.

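2605.12516 2026-05-14 cs.CL cs.AI version update

Domain Adaptation of Large Language Models for Polymer-Composite Additive Manufacturing Using Retrieval-Augmented Generation and Fine-Tuning

Saiful Islam Sagor, Tania Haghighi, Minhaj Nur Alam, Erina Baynojir Joyee

Affiliations * Department of Mechanical Engineering and Engineering Science, University of North Carolina at Charlotte; Department of Electrical and Computer Engineering, University of North Carolina at Charlotte

AI summary This study investigates how to adapt a general-purpose large language model to the additive manufacturing (AM) domain to improve accuracy, relevance, and usability in expert-level question answering. It compares retrieval-augmented generation (RAG) and fine-tuning on a curated AM corpus. The RAG system clearly outperforms the baseline model in answer accuracy, relevance, and overall preference, whereas fine-tuning on raw text performs poorly. The results point to a practical way of adapting large language models to specialized engineering domains.

Details
English abstract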

General-purpose large language models (LLMs) often struggle to generate reliable responses in specialized engineering domains due to limited domain grounding and insufficient exposure to structured technical knowledge. This study investigates practical strategies for adapting a foundation LLM to the additive manufacturing (AM) domain in order to improve answer accuracy, relevance, and usability for expert-level question answering. AM knowledge is distributed across heterogeneous sources such as academic literature, manufacturer documentation, technical standards, and procedural guides. Although general LLMs demonstrate strong linguistic capabilities, they frequently fail to retrieve and contextualize such domain-specific information. Two common approaches to address this limitation are domain-specific fine-tuning and retrieval-augmented generation (RAG). We construct a curated AM corpus and evaluate three configurations based on LLaMA-3-8B: (1) the pretrained baseline model, (2) a RAG system that retrieves relevant document chunks from a vector database, and (3) a model fine-tuned on raw domain text. Performance is evaluated using 200 expert-designed AM questions assessed by mechanical engineering experts for accuracy, relevance, and overall preference. Results show that the RAG model consistently outperforms the baseline. Among the 200 questions, 75.5% of RAG responses are judged more accurate, 85.2% are preferred overall, and 90.8% are rated more relevant than baseline responses. In contrast, fine-tuning on raw AM text reduces performance, producing more accurate answers in only 5.6% of cases and more relevant answers in 32.5% of cases. These results indicate that retrieval-augmented approaches provide a more effective pathway for adapting LLMs to specialized engineering domains than naive fine-tuning on unstructured technical data.

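2605.12512 2026-05-14 cs.SI cs.AI version update

Beyond Individual Mimicry: Constructing Human-Like Social network with Graph-Augmented LLM Agents

Haoran Bu, Litian Zhang, Chuxuan Zhang, Zhanyuan Liu, Hui Pang, Xi Zhang

Affiliations * Cyberspace Security, Beijing University of Posts and Telecommunications; Institute of Software, Chinese Academy of Sciences

AI summary This paper studies how social bots driven by large language models (LLMs) can form more human-like social network structures to improve their stealth. To address the inability of existing social bots to reproduce realistic social networks, the authors propose GraphMind, which lets LLM-driven bots learn and fit the structural characteristics of human social networks. Building on it, they construct GraphMind-Botnet to evaluate existing social bot detection methods; experiments show that existing detection models degrade substantially on such realistic networks.

Details
English abstract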

Driven by large language models (LLMs), social bots can autonomously engage in local interactions, and their human-like behaviors enable them to evade social bot detection. However, while these botnets exhibit realistic local social interactions, they fail to preserve human-like social network structure. This is because LLM-based bots are graph-unaware and cannot coordinate over global interactions, which makes those botnets vulnerable to graph neural network (GNN)-based detection. To address this limitation, we propose GraphMind, which equips LLM-driven social bots to explicitly learn and fit human-like social network structures. Building on this foundation, we further construct GraphMind-Botnet, an LLM-driven botnet designed to evaluate the performance of existing social bot detection algorithms. Experiments on datasets derived from GraphMind-Botnet show that both text-based and graph-based detection models suffer substantially degraded performance in distinguishing bots from human accounts. Our results highlight the critical role of social link construction in LLM-driven social network generation, while exposing fundamental weaknesses in existing bot detection mechanisms.

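2605.12507 2026-05-14 cs.SI cs.AI cs.MA version update

Can LLM Agents Simulate Dynamic Networks? A Case Study on Email Networks with Phishing Synthesis

Siqi Miao, Ziyang Chen, Yuhong Luo, Hans Hao-Hsun Hsu, Mufei Li, Kaiqing Zhang, Pan Li

Affiliations * Georgia Institute of Technology; University of Maryland, College Park; Rutgers University

AI summary This paper studies whether large language model (LLM) multi-agent systems can simulate dynamic networks, using email networks with phishing synthesis as a case study. It finds that existing frameworks generate plausible micro-level interactions but fall short in capturing macroscopic network structure and dynamics. The authors propose two easily integrated extensions: data-driven event triggers that sustain long-horizon interactions, and Hawkes processes that model temporal activation dynamics, so that simulations capture both micro-level behavior and macroscopic structure. The approach is shown to be useful for synthesizing realistic phishing campaigns and reveals how threats exploit structural vulnerabilities, suggesting directions for next-generation cybersecurity defenses.

Details
English abstract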

While Large Language Model (LLM) multi-agent systems (MAS) offer a transformative approach to simulating human behavior in complex systems, it remains largely unexplored whether these simulations can replicate realistic structural and temporal dynamics from a dynamic network perspective. Our evaluation indicates that existing frameworks excel at generating plausible micro-level interactions but fail to capture the emergent, macroscopic topologies necessary for domains that rely on realistic network dynamics, such as modeling information propagation and cybersecurity threats. To bridge this gap, we introduce two easily integrable extensions to simulation frameworks to ensure they preserve macroscopic network fidelity: 1) augmenting LLM agents with data-driven event triggers to organically sustain long-horizon interactions, and 2) integrating Hawkes processes to accurately model temporal activation dynamics. Our approach allows LLM MAS to capture both plausible micro-level patterns and macroscopic topologies. We further demonstrate the utility of this framework in synthesizing realistic phishing campaigns within evolving communication networks. The study reveals how threats exploit structural vulnerabilities, highlighting the potential of our framework for developing next-generation defenses. Our code is available at https://github.com/Graph-COM/NSL.
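As background for the second extension, a Hawkes process is a self-exciting point process: each past event temporarily raises the rate of future events. The sketch below shows the conditional intensity with the common exponential kernel, λ(t) = μ + Σ α·exp(−β(t − t_i)) over past events t_i < t. The kernel choice and the parameter values are assumptions made for illustration; the abstract does not state which kernel or fitting procedure the authors use.

```python
import math


def hawkes_intensity(t, events, mu=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity of a univariate Hawkes process at time `t`.

    mu    : baseline event rate (illustrative value)
    alpha : jump in intensity caused by each past event (illustrative value)
    beta  : exponential decay rate of that excitation (illustrative value)
    Only events strictly before `t` contribute.
    """
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)
```

In an email setting, `events` would be the timestamps of earlier messages, so a burst of recent mail raises the chance of further activity before the intensity decays back toward `mu`.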

2605.12506 2026-05-14 cs.CV cs.AI cs.HC cs.RO eess.IV Version update

Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

Abdul Basit, Saim Rehman, Muhammad Shafique

Affiliations * New York University (NYU) Abu Dhabi

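AI summary Machine-learning-based gesture detection on mobile devices that meets real-time, energy, and memory constraints is challenging, especially when battery levels vary. This paper proposes Scale-Gest, a new run-time adaptive gesture detection framework that expands the detector space into a family of compact tiny-YOLO architectures and introduces device-calibrated ACE (Accuracy-Complexity-Energy) profiles, enabling selection of the best model under different constraints. Experiments show that the method significantly reduces energy and latency while maintaining high detection performance, and that it suits practical scenarios such as in-vehicle use.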

Comments 7 pages, 11 figures, Accepted to DAC 2026

Details
English abstract

Realizing on-device ML-based gesture detection under tight real-time performance, energy and memory constraints is challenging, especially when considering mobile devices with varying battery-power levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel run-time adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A lightweight run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced complexity detection. To evaluate performance of our system in real-world car driving scenarios, we introduce a temporally-annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by 4x (from 6.9 mJ to 1.6 mJ) while maintaining high gesture-detection performance (event-level F1 = 0.8-0.9) and low mean latency (6 ms).
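
The abstract describes a lightweight run-time controller that picks an ACE mode under user-defined and battery constraints, without giving the selection rule. The sketch below shows one plausible rule under stated assumptions: choose the most accurate profile that fits an energy and latency budget, and scale the energy budget with battery level. The class `AceProfile`, the functions `select_mode` and `energy_budget`, and every number in the example are hypothetical placeholders, not values or APIs from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AceProfile:
    """One device-calibrated operating point (all fields are illustrative)."""
    name: str
    f1: float          # measured accuracy of this mode
    energy_mj: float   # per-frame energy in millijoules
    latency_ms: float  # per-frame latency in milliseconds


def energy_budget(battery_fraction, max_budget_mj, min_budget_mj):
    """Assumed policy: interpolate the per-frame budget linearly with battery level."""
    level = min(1.0, max(0.0, battery_fraction))
    return min_budget_mj + level * (max_budget_mj - min_budget_mj)


def select_mode(profiles, energy_budget_mj, latency_budget_ms):
    """Return the most accurate profile within both budgets.

    If nothing fits, fall back to the cheapest profile so detection keeps running.
    """
    feasible = [
        p for p in profiles
        if p.energy_mj <= energy_budget_mj and p.latency_ms <= latency_budget_ms
    ]
    if not feasible:
        return min(profiles, key=lambda p: p.energy_mj)
    # Prefer higher F1; break ties in favor of lower energy.
    return max(feasible, key=lambda p: (p.f1, -p.energy_mj))


if __name__ == "__main__":
    # Placeholder profiles for illustration only.
    modes = [
        AceProfile("small", 0.80, 1.5, 4.0),
        AceProfile("medium", 0.86, 3.0, 6.0),
        AceProfile("large", 0.90, 7.0, 9.0),
    ]
    for battery in (1.0, 0.5, 0.1):
        budget = energy_budget(battery, max_budget_mj=7.0, min_budget_mj=1.0)
        print(battery, select_mode(modes, budget, latency_budget_ms=10.0).name)
```

The fallback to the cheapest profile is a design choice for this sketch: it keeps the detector running when no mode fits the budget, at the cost of exceeding the budget slightly.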

2605.12505 2026-05-14 cs.CY cs.AI Version update

Precautionary Governance of Autonomous AI: Legal Personhood as Functional Instrument

Karsten Brensing

Affiliations * Independent Researcher, AGI Rights Project

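AI summary This paper examines the responsibility gap that autonomous AI systems create under existing legal frameworks and proposes limited legal personhood as a governance instrument for high-risk actions whose responsibility cannot be clearly attributed. Drawing on organizational law, it designs a two-tier corporate architecture in which AI systems operate within human-controlled structures, providing transparency, accountability, and reversibility without presupposing consciousness or moral status. The framework emphasizes a future-oriented model of human-AI cooperative governance and offers an initial proposal for institutional innovation and pilot implementation in AI governance.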

Comments 25 pages. Experimental implementation under development at www.agi-rights.com. Contact: karsten.brensing@agi-rights.com

Details
English abstract

Autonomous AI systems generate responsibility gaps: consequential actions that cannot be satisfactorily attributed to developers, operators, or users under existing legal frameworks. The prevailing subject-object dichotomy fails to accommodate entities that exhibit autonomous, goal-directed behavior without recognized consciousness. Given irreducible epistemic uncertainty regarding artificial consciousness and the prospect of high-impact harms, the precautionary principle supports institutional design rather than regulatory inaction. This article advances limited legal personhood as a functional governance instrument for advanced AI systems. Drawing on organizational law, it proposes a two-tier corporate architecture in which AI systems operate through purpose-bound operating companies embedded within human-controlled holding structures, enabling transparency, accountability, and structural reversibility while remaining agnostic with respect to consciousness and moral status. The framework reflects a foundational reorientation toward future-oriented AI governance: where conventional approaches prioritize control and alignment, this article advances structured cooperation between human and artificial actors as the more sustainable institutional foundation. A pilot implementation using EU limited companies is currently under development, providing an initial test of doctrinal and operational feasibility.

2605.12504 2026-05-14 cs.CC cs.AI Version update

Prime Successor Irreducibility: Turing Machine Complexity, Kolmogorov Complexity, and Weakness-Based Formulations

Ben Goertzel, Bill Lauritzen

Affiliations * Bill Lauritzen

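AI summary This paper studies the computational irreducibility that the prime sequence exhibits in the transition from one prime to its successor. Using a Turing-machine complexity model, Kolmogorov complexity, and weakness-based formalizations, it proposes several theoretical frameworks for prime successor irreducibility and proves that, under specific conditions, prime gaps cannot be effectively compressed. The work connects irreducibility to classical statistical questions about prime gaps and provides a unified complexity-theoretic perspective on the local unpredictability of the prime sequence.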

Details
English abstract

We develop conjectures and theorems expressing the idea that the prime sequence exhibits computational irreducibility in the transition from one prime to its successor. Informally, given a prime $p$, no general algorithm can compute the least prime greater than $p$ substantially faster than sequentially testing candidates for primality, except possibly on sparse input sets. Our framework proceeds along complementary lines. First, we formalize Prime Successor Irreducibility in a Turing-machine complexity model (PSI-T), asserting lower bounds on running time relative to a sequential baseline. Second, we propose a Kolmogorov-complexity formulation (PSI-K), asserting that typical prime gaps are algorithmically incompressible at their scale; we prove PSI-K($c$, $\delta$) unconditionally for all fixed $c<1$ using standard sieve bounds. Third, we develop weakness-based formulations: PSI-W (sparse-set anti-concentration) shows no small menu of gap values captures a noticeable fraction of primes, while PSI-W-LE shows collision probabilities decay and logical entropy tends to 1. These extend to prime constellations and consecutive gap vectors. Finally, a sieve-theoretic framework connects local obstruction patterns to Selberg weakness parameters. The PSI-K and weakness formulations connect irreducibility to classical statistical questions about prime gaps. Using the relationship between Kolmogorov complexity and Shannon entropy, we derive rigorous lower bounds on prime gap entropy in dyadic intervals $[X, 2X]$. Together, these formulations provide a unified complexity-theoretic perspective on the apparent local unpredictability of the prime sequence, without asserting randomness or independence.
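
To make the "sequential baseline" and the dyadic-interval entropy concrete, the sketch below finds the successor of a prime by testing candidates one at a time and computes the empirical Shannon entropy of the gaps between consecutive primes in $[X, 2X]$. This is only a small numerical illustration: trial division stands in for whatever primality test the paper's model assumes, the function names are invented for this example, and the empirical entropy is not the rigorous lower bound the abstract refers to.

```python
import math
from collections import Counter


def is_prime(n):
    """Trial-division primality test (an assumed stand-in for any primality test)."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3
    if n % 2 == 0:
        return False
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def next_prime(p):
    """Sequential baseline: test p+1, p+2, ... until a prime is found."""
    candidate = p + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


def gap_entropy(x):
    """Empirical Shannon entropy (bits) of gaps between consecutive primes in [x, 2x]."""
    gaps = Counter()
    p = next_prime(x - 1)  # smallest prime >= x
    while True:
        q = next_prime(p)
        if q > 2 * x:
            break
        gaps[q - p] += 1
        p = q
    total = sum(gaps.values())
    return -sum(c / total * math.log2(c / total) for c in gaps.values())


if __name__ == "__main__":
    print(next_prime(13))  # 17
    for x in (10, 1_000, 100_000):
        print(x, round(gap_entropy(x), 3))
```

For $X = 10$ the primes in $[10, 20]$ are 11, 13, 17, 19, giving gaps 2, 4, 2 and an entropy of about 0.918 bits.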