arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10853 2026-05-12 cs.CL

Grounded Satirical Generation with RAG

Oona Itkonen, Yuxin Su, Linyao Du, Ona De Gibert

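AI summary: This paper studies satire generation grounded in real-world context and proposes a method that uses retrieval-augmented generation (RAG) to produce satirical dictionary definitions based on current news in the Finnish context. It also builds a new task-specific evaluation framework and analyzes generation quality across experimental conditions through multi-annotator labeling, finding that the outputs lean more political than humorous. Experiments show that RAG and topic-based word selection improve political relevance but have no clear effect on humor, and that large language models agree with humans when judging political relevance but perform poorly when judging humor.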

English abstract

Humor generation remains a challenging task for Large Language Models (LLMs) due to its subjective nature. We focus on satire, a form of humor strongly shaped by context. In this work, we present a novel pipeline for grounded satire generation that uses Retrieval-Augmented Generation (RAG) over current news to produce satirical dictionary definitions in the Finnish context. We also introduce a new task-specific evaluation framework and annotate 100 generated definitions with six human annotators, enabling analysis across multiple experimental conditions, including cultural background, source-word type, and the presence or absence of RAG. Our results show that the generated definitions are perceived as more political than humorous. Both topic-based word selection and RAG improve the political relevance of the outputs, but neither yields clear gains in humor generation. In addition, our LLM-as-a-judge evaluation of five state-of-the-art models indicates that LLMs correlate well with human judgments on political relevance, but perform poorly on humor. We release our code and annotated dataset to support further research on grounded satire generation and evaluation.

2605.10851 2026-05-12 cs.AI cs.CL cs.LG

The Generalized Turing Test: A Foundation for Comparing Intelligence

Daniel Mitropolsky, Susan S. Hong, Riccardo Neumarker, Emanuele Rimoldi, Tomaso Poggio

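AI summary: This paper proposes the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. The framework defines a relative-intelligence relation between agents and studies its structural properties and variants. An empirical evaluation on several modern models shows a stratified structure consistent with existing rankings. The work offers a unified view of intelligence evaluation, and potentially of training objectives, that does not depend on any specific dataset or benchmark.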

English abstract

We introduce the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. For agents A and B, we define the Turing comparator A $\geq$ B to hold if B, acting as a distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator's structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. To complement the theory, we instantiate the framework on a collection of modern models, empirically evaluating pairwise indistinguishability across thousands of trials. The resulting comparisons exhibit a stratified structure consistent with existing rankings, hinting that the proposed framework yields meaningful empirical orderings. Our results position indistinguishability as a unifying lens for reasoning about intelligence, suggesting a foundation for evaluation and, potentially, training objectives that are inherently independent of fixed datasets or benchmarks.
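
One way to read "cannot reliably distinguish" operationally is as a statistical test on the distinguisher's guesses over many trials. The sketch below is a minimal illustration under stated assumptions, not the paper's procedure: the function names, the one-sided binomial test against chance, and the 0.05 threshold are all hypothetical choices.

```python
import math


def binomial_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p), summed in log space."""
    total = 0.0
    for k in range(successes, trials + 1):
        log_pmf = (
            math.lgamma(trials + 1)
            - math.lgamma(k + 1)
            - math.lgamma(trials - k + 1)
            + k * math.log(p)
            + (trials - k) * math.log(1 - p)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def comparator_holds(correct_guesses: int, trials: int, alpha: float = 0.05) -> bool:
    """Hypothetical decision rule: treat A >= B as holding unless distinguisher B
    identifies the imitator significantly more often than chance (p-value < alpha)."""
    return binomial_tail(correct_guesses, trials) >= alpha
```

For example, `comparator_holds(52, 100)` treats 52 correct guesses out of 100 as compatible with chance, while `comparator_holds(70, 100)` does not. The log-space sum is used so that the same function still works at the scale of thousands of trials.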

2605.10850 2026-05-12 cs.CV

Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

Ruinan Jin, Beidi Zhao, Myeongkyun Kang, Qiong Zhang, Xiaoxiao Li

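AI summary: This paper studies the reliability boundary of self-verification in medical visual question answering (VQA) and argues that the common practice of re-invoking the same vision language model (VLM) to verify its own answer is fundamentally unreliable. The authors propose a diagnostic framework that decomposes verifier behavior into discrimination capability and agreement bias, showing that capacity coupling between verifier and generator produces a "verification mirage": a regime in which incorrect answers are falsely accepted and both verifier error and agreement bias are high. Experiments show that verification does not provide an independent safety signal and that, in multi-turn interaction, wrong answers can become locked in by false verification, which points to serious risks for self-verification in clinical use.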

Comments 31 pages, 12 figures

English abstract

Self-verification, re-invoking the same vision language model (VLM) in a fresh context to check its own generated answer, is increasingly used as a default safety layer for medical visual question answering (VQA). We argue that this practice is fundamentally unreliable. We introduce [METHOD NAME], a diagnostic framework for mapping the reliability boundary of medical VLM self-verification by decomposing verifier behavior into discrimination capability and agreement bias. Because the verifier and answer generator are capacity-coupled, the verifier can overly agree with the generator, creating a verification mirage: a regime with both high verifier error and high agreement bias, driven by false acceptance of incorrect answers. Evaluating six open-weight VLMs across five medical VQA datasets and seven medical tasks, we find that this boundary is strongly task-conditioned. Knowledge-intensive clinical tasks fall deepest into the mirage, simpler tasks are more resistant, and perceptual tasks lie in between. Verification also fails to provide an independent safety signal: logistic mixed-effects analysis shows that verifier error and agreement bias become more likely when the generator is wrong, while saliency analyses show that verifiers under-attend to image evidence relative to generators, a phenomenon we call the lazy verifier. Cross-verification reduces but does not eliminate the mirage. Moreover, when verification is reused in multi-turn actor-verifier loops, most initially wrong answers become locked in by false verification. Since our experiments use clean benchmarks, the observed reliability boundary likely underestimates failures in real clinical deployment.
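
The decomposition into discrimination capability and agreement bias can be illustrated with simple rates over (generator correct, verifier accepts) pairs. The abstract does not give the paper's formulas, so the definitions below are assumptions made for illustration: discrimination is taken as true-accept rate minus false-accept rate, and agreement bias as the verifier's acceptance rate minus the generator's actual accuracy.

```python
def verifier_metrics(records):
    """Summarize verifier behavior.

    records: non-empty list of (generator_correct, verifier_accepts) booleans.
    All metric definitions here are illustrative assumptions, not the paper's.
    """
    n = len(records)
    accepts_when_correct = [acc for ok, acc in records if ok]
    accepts_when_wrong = [acc for ok, acc in records if not ok]

    accuracy = len(accepts_when_correct) / n
    accept_rate = sum(acc for _, acc in records) / n
    true_accept = (
        sum(accepts_when_correct) / len(accepts_when_correct)
        if accepts_when_correct else 0.0
    )
    false_accept = (
        sum(accepts_when_wrong) / len(accepts_when_wrong)
        if accepts_when_wrong else 0.0
    )
    return {
        # Verifier is wrong whenever its verdict disagrees with correctness.
        "verifier_error": sum(ok != acc for ok, acc in records) / n,
        "false_accept_rate": false_accept,
        "discrimination": true_accept - false_accept,
        "agreement_bias": accept_rate - accuracy,
    }
```

With 6 correct answers accepted, 3 wrong answers accepted, and 1 wrong answer rejected, this gives a verifier error of 0.3, a false-accept rate of 0.75, and a positive agreement bias, which is the kind of regime the abstract calls a verification mirage.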

2605.10847 2026-05-12 cs.LG

Conditional anomaly detection methods for patient-management alert systems

Michal Valko, Gregory Cooper, Amy Seybert, Shyam Visweswaran, Melissa Saul, Miloš Hauskrecht

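AI summary: This paper studies conditional anomaly detection methods for patient-management alert systems, which identify anomalous patterns on a subset of attributes, where whether a pattern is anomalous depends on the values of the remaining attributes. It focuses on instance-based methods that use a distance metric to identify the examples most critical for detecting an anomaly, and it explores several metrics and metric learning methods to optimize detection performance. Experiments show clear benefits on two real-world problems: detecting unusual admission decisions for patients with community-acquired pneumonia and detecting unusual orders of the HPF4 test used to confirm heparin-induced thrombocytopenia.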

Comments Published at Workshop on Machine Learning in Health Care Applications ICML-2008 - MLHealth

English abstract

Anomaly detection methods can be very useful in identifying unusual or interesting patterns in data. A recently proposed conditional anomaly detection framework extends anomaly detection to the problem of identifying anomalous patterns on a subset of attributes in the data. The anomaly always depends (is conditioned) on the values of the remaining attributes. The work presented in this paper focuses on instance-based methods for detecting conditional anomalies. The methods rely on a distance metric to identify the examples in the dataset that are most critical for detecting the anomaly. We investigate various metrics and metric learning methods to optimize the performance of the instance-based anomaly detection methods. We show the benefits of the instance-based methods on two real-world detection problems: detection of unusual admission decisions for patients with community-acquired pneumonia and detection of unusual orders of an HPF4 test that is used to confirm heparin-induced thrombocytopenia, a life-threatening condition caused by heparin therapy.
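
The instance-based idea can be sketched with a nearest-neighbor estimate: find the stored cases closest to the query in the conditioning attributes and check how often they share the query's decision. This is a minimal sketch under assumptions, not the paper's method: the Euclidean metric, the value of `k`, and the function name are hypothetical, and the paper studies learned metrics instead.

```python
import math


def conditional_anomaly_score(query_x, query_y, data, k=3):
    """Score how unusual decision query_y is given conditioning attributes query_x.

    data: non-empty list of (x, y) pairs, where x is a tuple of numbers and y a label.
    Returns 1 - (fraction of the k nearest neighbors in x that share query_y),
    so 1.0 means no similar case received this decision.
    """
    neighbors = sorted(data, key=lambda row: math.dist(row[0], query_x))[:k]
    agree = sum(1 for _, y in neighbors if y == query_y)
    return 1.0 - agree / len(neighbors)
```

For a patient whose attributes sit among previously admitted cases, a "discharge" decision scores close to 1.0 and would be a candidate for an alert, while an "admit" decision scores close to 0.0.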

2605.10845 2026-05-12 cs.CV cs.CL

BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation

Qi Yang, Xiangyao Ma, Xiao Wang, Hao Wang, Rui Wang

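AI summary: As cross-lingual communication grows, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation methods face a tension between linguistic processing and layout preservation. BabelDOC introduces an intermediate-representation framework that decouples visual layout information from semantic content, enabling document-level translation operations such as terminology extraction and cross-page context handling, and re-anchors the translated content to the original layout through an adaptive typesetting engine. Experiments show that BabelDOC outperforms existing methods on layout fidelity, visual aesthetics, and terminology consistency while maintaining high translation accuracy.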

Comments ACL 2026 System Demonstration paper. 2 figures

English abstract

As global cross-lingual communication intensifies, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation pipelines face a tension between linguistic processing and layout preservation: text-oriented Computer-Assisted Translation (CAT) systems often discard structural metadata, while document parsers focus on extraction and do not support faithful re-rendering after translation. We introduce BabelDOC, an Intermediate Representation (IR)-based framework for layout-preserving PDF translation. BabelDOC decouples visual layout metadata from semantic content, enabling document-level translation operations such as terminology extraction, cross-page context handling, glossary-constrained generation, and formula placeholdering. The translated content is then re-anchored to the original layout through an adaptive typesetting engine. Experiments on a curated 200-page benchmark, together with human evaluation and multimodal LLM-as-a-judge evaluation, show that BabelDOC improves layout fidelity, visual aesthetics, and terminology consistency over representative baselines, while maintaining competitive translation precision. The open-source toolkit and its interactive downstream applications are publicly available and have attracted over 8.4K GitHub stars and 17 contributors at the time of writing. A demonstration video is also available.

2605.10835 2026-05-12 cs.CV cs.LG

Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training

Daniel Dratschuk, Paul Swoboda

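AI summary: Optical Music Recognition (OMR) is bottlenecked by the lack of large-scale datasets of real scans, and existing methods mostly rely on few-shot transfer or overly simplistic synthetic training. This paper proposes Transcoda, which addresses the non-uniqueness of textual score encodings through improved synthetic data generation, normalization of the **kern encoding, and grammar-based decoding. The approach trains a compact 59M-parameter model in only 6 hours on a single GPU and clearly outperforms existing methods on both a synthetically rendered score dataset and a dataset of historical Polish scores.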

Comments 13 pages, 7 figures

English abstract

Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping creates a harder learning task and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form, and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to train a compact 59M-parameter model in just 6 hours on a single GPU that outperforms billion-parameter baselines. Transcoda achieves the best score among state-of-the-art baselines on a newly curated benchmark of synthetically rendered scores at 18.46% OMR-NED (compared to 43.91% for the next-best system, Legato) and reduces the error rate on historical Polish scans to 63.97% OMR-NED (down from 80.16% for SMT++).

2605.10834 2026-05-12 cs.AI cs.CR

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

Pedro Conde, Henrique Branquinho, Valerio Mazzone, Bruno Mendes, André Baptista, Nuno Moniz

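AI summary: This paper proposes a practical evaluation protocol for assessing AI pentesting agents in real-world scenarios, aiming to address the shortcomings of existing benchmarks in complexity and strategic decision-making. By validating vulnerability discovery and combining LLM-based semantic matching with bipartite resolution scoring, the method enables more realistic evaluation on complex targets spanning multiple attack surfaces and vulnerability classes. The protocol makes comparisons of AI pentesting agents more practically useful, and the authors release expert-annotated ground truth and code for reproducibility.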

English abstract

AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: https://github.com/jd0965199-oss/ethibench.
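
The "bipartite resolution" step can be pictured as a one-to-one matching between reported findings and ground-truth vulnerabilities, so that an ambiguous finding is not counted twice. The sketch below is an assumption about how such a step might look, not the released protocol: it uses the standard augmenting-path algorithm for maximum bipartite matching, and the candidate lists stand in for the output of the LLM-based semantic matcher.

```python
def max_bipartite_matching(candidates, n_truth):
    """Match findings to ground-truth vulnerabilities one-to-one.

    candidates[i] lists the ground-truth indices that finding i could plausibly
    refer to (a hypothetical stand-in for the semantic matcher's output).
    Returns (number of matched findings, match_of_truth), where
    match_of_truth[j] is the finding assigned to vulnerability j or None.
    """
    match_of_truth = [None] * n_truth

    def try_assign(i, seen):
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take j if it is free, or if its current finding can move elsewhere.
            if match_of_truth[j] is None or try_assign(match_of_truth[j], seen):
                match_of_truth[j] = i
                return True
        return False

    matched = sum(try_assign(i, set()) for i in range(len(candidates)))
    return matched, match_of_truth


def precision_recall(candidates, n_truth):
    """Precision over findings and recall over ground truth after matching."""
    matched, _ = max_bipartite_matching(candidates, n_truth)
    precision = matched / len(candidates) if candidates else 0.0
    recall = matched / n_truth if n_truth else 0.0
    return precision, recall
```

With candidate lists `[[0, 1], [0], [2], []]` and three known vulnerabilities, the first finding is moved from vulnerability 0 to vulnerability 1 so that the second finding can claim vulnerability 0, giving three validated findings out of four.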

2605.10833 2026-05-12 cs.CV cs.AI

MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

Xiran Zhao, Jing Jin, Yan Bai, Zhongan Wang, Yifeng Sun, Yihang Lou, Xuanyu Zhu, Tao Feng, Yingna Wu

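AI summary: This paper introduces MMVIAD, presented as the first continuous multi-view video dataset for industrial anomaly detection, covering multiple object categories, environments, and anomaly types and supporting multi-task evaluation. To improve fine-grained defect recognition and temporal localization, the authors design a two-stage post-training pipeline that substantially improves model performance and outperforms existing mainstream models. The work provides a new benchmark and method for industrial video understanding and anomaly detection.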

English abstract

Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model's average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at https://github.com/Georgekeepmoving/MMVIAD.

2605.10831 2026-05-12 cs.LG cs.AI cs.CE cs.CL

SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

Mingxu Zhang, Yuhan Li, Lujundong Li, Dazhong Shen, Hui Xiong, Ying Sun

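AI summary: This study proposes SLIM, a plug-and-play framework for interpretable, property-directed molecular editing with large language models. SLIM uses a sparse autoencoder to decompose the model's hidden states into sparse features aligned with molecular properties and applies learnable importance gates, so that property-relevant dimensions can be activated precisely without modifying model parameters, which raises the editing success rate. Experiments show that SLIM outperforms existing methods across multiple molecular properties and model architectures, with gains of up to 42.4 points.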

English abstract

Large language models possess strong chemical reasoning capabilities, making them effective molecular editors. However, property-relevant information is implicitly entangled across their dense hidden states, providing no explicit handle for property control: a substantial fraction of edits fail to improve or even degrade target properties. To address these issues, we propose SLIM (Sparse Latent Interpretable Molecular editing), a plug-and-play framework that decomposes the editor's hidden states into sparse, property-aligned features via a Sparse Autoencoder with learnable importance gates. Steering in this sparse feature space precisely activates property-relevant dimensions, improving editing success rate without modifying model parameters. The same sparse basis further supports interpretable analysis of editing behavior. Experiments on the MolEditRL benchmark across four model architectures and eight molecular properties show consistent gains over baselines, with improvements of up to 42.4 points.

2605.10828 2026-05-12 cs.AI

The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning

Muhan Gao, Zih-Ching Chen, Kuan-Hao Huang

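AI summary: As large language models are widely deployed in retrieval-augmented generation and agentic systems, understanding how distracting information affects long-context reasoning becomes critical. This paper systematically studies the nonlinear relationship between the proportion of misleading documents in a fixed-length context and model performance: performance drops sharply as the first misleading documents are added and then levels off, a pattern the authors call the "First Drop of Ink" effect. Theoretical and empirical analyses show that a small amount of misleading information captures a disproportionate share of the model's attention while additional distractors have diminishing impact, underscoring the importance of retrieval precision.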

English abstract

As large language models are increasingly deployed in retrieval-augmented generation and agentic systems that accumulate extensive context, understanding how distracting information affects long-context performance becomes critical. Prior work shows that semantically relevant yet misleading documents degrade performance, but the quantitative relationship between the proportion of distractors and performance remains unstudied. In this work, we systematically vary the hard-distractor proportion in fixed-length contexts, revealing a striking nonlinear pattern: as the proportion of hard distractors increases, performance drops sharply within the first small fraction, while the remainder of the range yields only marginal additional decline. We term this "The First Drop of Ink" effect, analogous to how a single drop of ink contaminates water. Our theoretical and empirical analyses grounded in attention mechanics show that hard distractors capture disproportionate attention even at small proportions, with diminishing marginal impact as their proportion grows. Controlled experiments further show that filtering gains mainly come from context-length reduction rather than distractor removal; substantial recovery requires reducing the hard-distractor proportion to near zero, highlighting the importance of upstream retrieval precision.
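
A toy softmax calculation shows why the effect of distractors can be nonlinear in their proportion. This is an illustrative model only, not the paper's analysis: a single gold document competes for attention with hard distractors that score as high as the gold document and easy filler that scores low, and all logit values and names are assumptions.

```python
import math


def gold_attention_share(num_hard, num_docs=100,
                         gold_logit=2.0, hard_logit=2.0, easy_logit=-1.0):
    """Softmax attention mass on the single gold document.

    The context holds num_docs documents: one gold document, num_hard hard
    distractors, and easy filler for the rest. Logits are hypothetical.
    """
    num_easy = num_docs - 1 - num_hard
    gold = math.exp(gold_logit)
    total = gold + num_hard * math.exp(hard_logit) + num_easy * math.exp(easy_logit)
    return gold / total


# Attention lost when going from 0 to 10 hard distractors vs. from 89 to 99.
early_drop = gold_attention_share(0) - gold_attention_share(10)
late_drop = gold_attention_share(89) - gold_attention_share(99)
```

Under these assumed logits, `early_drop` is far larger than `late_drop`: the first few hard distractors take most of the gold document's attention share, and later ones add little.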

2605.10823 2026-05-12 cs.LG

NoRIN: Backbone-Adaptive Reversible Normalization for Time-Series Forecasting

Shun Zhang, Yuyang Xiao

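AI summary: NoRIN is a non-linear reversible normalization method for time-series forecasting that addresses the inability of existing methods such as RevIN to reshape the data distribution. It is based on the arcsinh-form Johnson $S_U$ transform and introduces two shape parameters that control tailedness and skewness. By decoupling shape-parameter optimization from backbone training, NoRIN adapts to the needs of different model architectures. Experiments show that different backbones require different normalization parameters to reach their best performance.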

Comments 8 pages, 2 figures

English abstract

Reversible instance normalization (RevIN) and its successors (Dish-TS, SAN, FAN) have become the de facto plug-in for time-series forecasting, yet the map they apply to each data point is strictly affine, $x \mapsto ax+b$, so they cannot reshape the underlying distribution: heavy tails remain heavy and skewness remains uncorrected. We propose NoRIN, a non-linear reversible normalization based on the arcsinh-form Johnson $S_U$ transform with two shape parameters $(δ,\varepsilon)$ that control tailedness and skewness; the linear $Z$-score used by RevIN is recovered only in the limit $δ \to \infty$. Training $(δ,\varepsilon)$ jointly with the backbone via gradient descent reliably pushes them toward this linear limit within a few epochs, a phenomenon we name the degeneration problem: the forecasting loss is locally indifferent to shape, and the high-capacity backbone compensates for any monotone reparameterization of its input. NoRIN escapes the degeneration by decoupling shape selection from gradient training: $(δ,\varepsilon)$ are initialized by a closed-form Slifker-Shapiro quantile fit and refined by Bayesian optimization on the validation objective, while the inner training loop is identical to standard RevIN-style training. Across six representative backbones $\times$ five real-world datasets $\times$ three prediction horizons (90 configurations), decoupled shape optimization recovers $(δ^\star,\varepsilon^\star)$ that sit systematically far from the linear limit, with values that vary in a backbone-dependent way. This empirically supports the central thesis: different backbones genuinely require different normalization parameters to reach their best performance.
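
The contrast between an affine map and an arcsinh-form map can be sketched in a few lines. The abstract does not state NoRIN's exact parameterization, so the form below is an assumption chosen only so that it is invertible and approaches the plain z-score as delta grows with eps set to 0: $y = δ \cdot \operatorname{arcsinh}((u - \varepsilon)/δ)$ with $u = (x - \mu)/\sigma$. The function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def arcsinh_normalize(xs, delta, eps):
    """Normalize one window with an assumed arcsinh-form map.

    u = (x - mean) / std is the usual z-score; y = delta * asinh((u - eps) / delta)
    compresses heavy tails for small delta and shifts asymmetrically with eps.
    Returns the transformed values and the (mean, std) needed to invert.
    """
    mu = fmean(xs)
    sigma = pstdev(xs) or 1.0  # guard against a constant window
    ys = [delta * math.asinh(((x - mu) / sigma - eps) / delta) for x in xs]
    return ys, (mu, sigma)


def arcsinh_denormalize(ys, delta, eps, stats):
    """Exact inverse of arcsinh_normalize, applied to model outputs."""
    mu, sigma = stats
    return [mu + sigma * (eps + delta * math.sinh(y / delta)) for y in ys]
```

A round trip such as `arcsinh_denormalize(*arcsinh_normalize(window, 1.5, 0.2)[:1], 1.5, 0.2, stats)` returns the original window up to floating-point error, which is the reversibility property that RevIN-style plug-ins rely on. With a very large `delta` and `eps = 0`, the forward map is numerically indistinguishable from the z-score.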

2605.10822 2026-05-12 cs.LG eess.SP

Benchmarking Sensor-Fault Robustness in Forecasting

Alexander Windmann, Philipp Wittenberg, Gianluca Manca, Marcel Dix, Jens U. Brandt, Oliver Niggemann

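AI summary: This paper proposes SensorFault-Bench, a benchmark for evaluating the robustness of forecasting models under sensor faults. Using a standardized fault-severity model and several real-world datasets, it systematically evaluates forecasting architectures and robustness-improvement methods across fault scenarios, showing that model rankings based on clean-data error can differ substantially from performance under faults. The benchmark also provides open-source code and documented data access so that later work can extend and compare under the same protocol.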

English abstract

Cyber-physical system (CPS) forecasting models depend on sensor streams with noisy, biased, missing, or temporally misaligned readings, yet standard forecasting evaluation often selects models by nominal error without showing whether they remain robust under such faults. We introduce SensorFault-Bench, a shared CPS-grounded sensor-fault stress-test protocol for evaluating forecasting architectures and robustness-improvement methods, and an operational taxonomy organizing the method comparison. Across four real-world datasets and eight scored scenarios governed by a standardized severity model, it reports worst-scenario degradation, clean mean squared error (MSE), and worst-scenario fault-time MSE, separating relative robustness from absolute error. A disjoint fault-transfer split lets explicit fault-training methods train on adjacent fault families while evaluation uses separate benchmark scenarios. Empirically, forecasting architectures favored by clean MSE can degrade sharply under faults, and clean-MSE rankings can disagree with worst-scenario fault-time error rankings. Chronos-2, the evaluated zero-shot foundation-model representative, matches or trails the last-value naive forecaster in clean MSE on the two single-target datasets and has the largest worst-scenario degradation on ETTh1 and Traffic, where all channels are forecast targets. For the evaluated robustness-improvement method set, paired deltas show selective degradation reductions: projected gradient descent adversarial training and randomized training lead where value faults dominate observed degradation, while fault augmentation leads where availability faults dominate. SensorFault-Bench provides open-source code, documented data access, and reproduction and extension guides, so new datasets, architectures, and robustness-improvement methods can be evaluated under the same CPS sensor-fault robustness protocol.

2605.10821 2026-05-12 cs.RO

Unified Noise Steering for Efficient Human-Guided VLA Adaptation

Junjie Lu, Xinyao Qin, Yuhua Jiang, Kaixin Wang, Chuheng Zhang, Bin Liang, Jun Yang, Min Xu, Li Zhao

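AI summary: This paper proposes UniSteer, a unified noise steering framework for efficient human-guided adaptation of vision-language-action (VLA) models. Through approximate action-to-noise inversion, it converts human corrective actions into supervision over noise variables, so the pretrained VLA stays frozen and only a lightweight actor network is updated. Experiments on several real-world robotic manipulation tasks show more efficient adaptation and substantially higher task success rates.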

English abstract

Diffusion-based vision-language-action (VLA) models have emerged as strong priors for robotic manipulation, yet adapting them to real-world distributions remains challenging. In particular, on-robot reinforcement learning (RL) is expensive and time-consuming, so effective adaptation depends on efficient policy improvement within a limited budget of real-world interactions. Noise-space RL lowers the cost by keeping the pretrained VLA fixed as a denoising generator while updating only a lightweight actor that predicts the noise. However, its performance is still limited due to inefficient autonomous exploration. Human corrective interventions can reduce this exploration burden, but they are naturally provided in action space, whereas noise-space finetuning requires supervision over noise variables. To address these challenges, we propose UniSteer, a Unified Noise Steering framework that combines human corrective guidance with noise-space RL through approximate action-to-noise inversion. Given a human corrective action, UniSteer inverts the frozen flow-matching decoder to recover a noise target, which provides supervised guidance for the same noise actor that is simultaneously optimized via reinforcement learning. Real-world experiments on diverse manipulation tasks show that UniSteer adapts more efficiently than strong noise-space RL and action-space human-in-the-loop baselines, improving the success rate from 20% to 90% in 66 minutes on average across four real-world adaptation tasks.

2605.10820 2026-05-12 cs.AI cs.LG

MaD Physics: Evaluating information seeking under constraints in physical environments

Moksh Jain, Mehdi Bennani, Johannes Bausch, Yuri Chervonyi, Bogdan Georgiev, Simon Osindero, Nenad Tomašev

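AI summary: This paper proposes MaD Physics, a new benchmark for evaluating an agent's information seeking and scientific reasoning in physical environments under constraints on the quality and quantity of measurements. The benchmark contains three environments based on distinct physical laws and includes altered physical laws to reduce interference from prior knowledge. Agents run experiments under a limited measurement budget, then infer the underlying law and predict future states, which tests model inference and planning under constraints. The study also shows how the benchmark can evaluate capabilities such as multimodal processing and in-context learning, and it tests several Gemini models, identifying weaknesses in structured exploration and data collection.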

Comments 64 pages, 10 figures. Project page: https://mad-physics.github.io/

English abstract

Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget and then the agent has to infer the underlying physical law to make predictions about the state of the system in the future. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.

2605.10817 2026-05-12 cs.AI

CLEF: EEG Foundation Model for Learning Clinical Semantics

Peng Cao, Ali Mirzazadeh, Jong Woo Lee, Aleksandar Videnovic, Dina Katabi

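AI summary: This paper proposes CLEF, a clinically grounded long-context EEG foundation model, to address the need in clinical EEG interpretation to combine full-session signals with clinical context. CLEF tokenizes sessions as 3D multitaper spectrograms and uses contrastive objectives to align EEG sessions with neurologist reports and structured electronic health records, enabling tractable modeling on large-scale data. Experiments show that CLEF substantially outperforms existing models on a 234-task benchmark, demonstrating its potential for clinical EEG representation learning.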

English abstract

Clinical EEG interpretation requires reasoning over full EEG sessions and integrating signal patterns with clinical context. Existing EEG foundation models are largely designed for short-window decoding and do not incorporate clinical context. We introduce CLEF, a clinically grounded long-context EEG foundation model. CLEF represents EEG sessions as 3D multitaper spectrogram tokens, enabling tractable Transformer modeling at session scale, and aligns embeddings with neurologist reports and structured EHR data through contrastive objectives. We evaluate CLEF on a new 234-task benchmark spanning disease phenotypes, medication exposures, and EEG findings, with more than 260k EEG sessions from over 108k patients. CLEF outperforms prior EEG foundation models on 229 of 234 tasks, improving mean AUROC from 0.65 to 0.74. Reconstruction-only pretraining surpasses prior EEG foundation models, while report and EHR alignment yields further gains. Held-out concept and external-cohort experiments suggest that these representations transfer beyond observed alignment targets. These results support session-scale, clinically grounded representation learning as a promising foundation-model paradigm for clinical EEG.

2605.10816 2026-05-12 cs.LG cs.AI

Policy Gradient Methods for Non-Markovian Reinforcement Learning

Avik Kar, Siddharth Chandak, Rahul Singh, Soumitra Sinhahajari, Eric Moulines, Shalabh Bhatnagar, Nicholas Bambos

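AI summary: This paper studies policy gradient methods for non-Markovian reinforcement learning, where observations and rewards depend on the full interaction history, and proposes a new policy framework. The method compresses history by recursively updating an internal agent state and jointly optimizes the state dynamics and the control policy to maximize cumulative reward. The authors establish a policy gradient theorem for non-Markovian settings and propose the ASMPG algorithm, which outperforms methods that learn state representations via predictive objectives on several non-Markovian tasks.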

Comments 39 pages, 5 figures, 1 table

English abstract

We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.

2605.10809 2026-05-12 cs.LG cs.DS

Mistake-Bounded Language Generation

Jon Kleinberg, Charlotte Peale, Omer Reingold

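AI summary: This paper studies language generation in the limit and proposes a new notion of "mistake-bounded generation", which focuses on minimizing the cumulative mistakes a generation algorithm makes while learning rather than the traditional time of last mistake. Through a formal reduction to the Learning from Correct Demonstrations framework, the authors give a general recipe for deriving mistake bounds, provide algorithms and analysis for finite classes and for countably infinite streams of languages, and reveal a fundamental trade-off between mistake bounds and convergence. The framework also extends to noisy adversaries, with mistake bounds that scale with the adversary's suboptimality.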

English abstract

We investigate the learning task of language generation in the limit, but shift focus from the traditional time-of-last-mistake metric of a generator's success to a new notion of "mistake-bounded generation." While existing results for language generation in the limit focus on guaranteeing eventual consistency, they are blind to the cumulative error incurred during the learning process. We address this by shifting the goal to minimizing the total number of invalid elements output by a generation algorithm. We establish a formal reduction to the Learning from Correct Demonstrations framework of Joshi et al. (2025), enabling a general recipe for deriving mistake bounds via weighted update rules. For finite classes, we provide an algorithm that simultaneously achieves an optimal last-mistake time of $\mathsf{Cdim}(L)$ and a mistake bound of $\lfloor \log_2 |L| \rfloor$, whereas for the non-uniform setting of countably infinite streams of languages, we prove a fundamental trade-off: achieving logarithmic mistakes $O(\log i)$ necessarily precludes convergence guarantees established in prior work. Finally, we show that our framework can be extended to accommodate noisy adversaries and guarantee mistake bounds that scale with the adversary's suboptimality.

2605.10806 2026-05-12 cs.CV cs.AI cs.LG

PhyGround: Benchmarking Physical Reasoning in Generative World Models

Juyi Lin, Arash Akbari, Yumei He, Lin Zhao, Haichao Zhang, Arman Akbari, Xingchen Xu, Zoe Y. Lu, Enfu Nan, Hokin Deng, Edmund Yeh, Sarah Ostadabbas, Yun Fu, Jennifer Dy, Pu Zhao, Yanzhi Wang

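AI summary: PhyGround is a new benchmark for evaluating physical reasoning in generative world models, aimed at the difficulty of assessing whether video generation models follow physical laws. It contains 250 curated prompts, each paired with an expected physical outcome, and a taxonomy of 13 physical laws. With a large-scale, quality-controlled human annotation study and PhyJudge-9B, a vision-language model judge specialized in physical reasoning, PhyGround supports fine-grained, reproducible evaluation of the physical plausibility of generated videos and substantially improves the accuracy and reliability of evaluation.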

Comments Preprint. 56 pages, 39 figures, 40 tables. Project page: https://phyground.github.io/

English abstract

Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges: coarse evaluation frameworks that hide law-specific failures, response biases and fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address these challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study, grounded in social science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman's rho > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page https://phyground.github.io/.

2605.10805 2026-05-12 cs.AI cs.CL stat.ML

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

Wenbo Zhang, Lijinghua Zhang, Liner Xiang, Hengrui Cai

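AI summary This paper studies the trade-off between the benefits and costs of reasoning when LLMs act as judges. It finds that reasoning significantly improves judgment accuracy on tasks requiring structured verification, but may bring limited or even negative gains on simple tasks while incurring higher computational cost. The authors therefore propose RACER, which dynamically decides whether to enable reasoning under a fixed budget via distributionally robust optimization, handles distribution shift effectively, and shows a superior accuracy-cost trade-off in experiments.

Comments Accepted at ICML 2026

Details
English abstract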

Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings suggest that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal-dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy-cost trade-offs under distribution shift.

2605.10804 2026-05-12 cs.AI cs.CY cs.HC

New AI-Driven Tools for Enhancing Campus Well-being: A Prevention and Intervention Approach

Jinwen Tang

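AI summary This research aims to improve campus mental health by developing AI-driven tools that address universities' shortcomings in monitoring student satisfaction and detecting psychological risk. It proposes the preventive tools TigerGPT and AURA: the former raises survey engagement through a personalized chatbot, and the latter uses reinforcement learning to improve conversation quality. For intervention, it introduces a mental health screening approach based on narrative stories and develops PsychoGPT, a model aligned with clinical guidelines, combined with multi-model reasoning to improve the accuracy and explainability of assessments. The overall framework integrates these tools, connecting survey collection to mental health intervention.

Comments PhD Dissertation, University of Missouri, May 2026

Details
English abstract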

Campus well-being underpins academic success, yet many universities lack effective methods for monitoring satisfaction and detecting mental health risks. This dissertation addresses these gaps through prevention (improving feedback collection) and intervention (advancing mental health detection), unified under an integrated framework. For prevention, we developed TigerGPT, a personalized survey chatbot leveraging LLMs to engage users in context-aware conversations grounded in conversational design and engagement theory, achieving 75% usability and 81% satisfaction. To address its limitations in repetitiveness and response depth, we introduced AURA, a reinforcement-learning framework that adapts follow-up question types (validate, specify, reflect, probe) within a session using an LSDE quality signal (Length, Self-disclosure, Emotion, Specificity), initialized from 96 prior conversations. AURA achieved +0.12 mean quality gain (p=0.044, d=0.66), with 63% fewer specification prompts and 10x more validation behavior. For intervention, we examine Expressive Narrative Stories (ENS) for mental health screening, showing BERT(128) captures nuanced linguistic features without keyword cues, while conventional classifiers depend heavily on explicit mental health terms. We then developed PsychoGPT, an LLM built on DSM-5 and PHQ-8 guidelines that performs initial distress classification, symptom-level scoring, and reconciliation with external ratings for explainable assessment. To reduce hallucinations, we proposed Stacked Multi-Model Reasoning (SMMR), layering expert models where early layers handle localized subtasks and later layers reconcile findings, outperforming single-model solutions on DAIC-WOZ in accuracy, F1, and PHQ-8 scoring. Finally, a cohesive framework unifies these tools, enabling adaptive survey insights to flow directly into specialized mental health detection models.

2605.10797 2026-05-12 cs.LG

Muown: Row-Norm Control for Muon Optimization

Kai Lion, Florian Hübler, Bingcong Li, Antonio Orvieto, Niao He

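AI summary This paper studies the weight-decay sensitivity of the Muon optimizer in large-scale language model pre-training, finding that the spectral norm of the weights rises during training, driven mainly by the row-magnitude factor. The authors propose Muown, an improved method that treats the row-magnitude vector as an explicit optimization variable updated under the $\ell_\infty$ geometry, while the remaining component is still optimized with Muon. Experiments show that Muown improves perplexity across several model scales, reduces sensitivity to weight decay, and effectively suppresses spectral norm drift.

Details
English abstract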

Muon has emerged as a strong competitor to AdamW for language model pre-training, yet its behavior at scale is sensitive to weight decay. Recent work has observed that, for Muon without decoupled weight decay, the spectral norm of weight matrices drifts upward over training. Through a decomposition of the spectral norm into a row-magnitude factor and a row-coherence factor, we identify the former as the empirical driver of this drift under Muon, while the latter remains well-behaved along the trajectory. Motivated by this diagnosis, we introduce Muown, a drop-in replacement for Muon that treats the row-magnitude vector as an explicit optimizer variable, updating it under the $\ell_\infty$ geometry induced by the decomposition, while applying Muon unchanged to the remaining direction component. We prove that Muown attains the optimal non-convex rates in both deterministic and stochastic regimes under a dual norm aligned with the underlying geometries and with a stochastic noise coefficient that empirically remains below that of Muon throughout training. Across GPT-style pre-training on FineWeb-Edu with model sizes from 124M up to 2.7B parameters, Muown improves perplexity over Muon, SOAP, AdamW, and Lion. It also widens the plateau of near-optimal learning rates across model scales, reduces sensitivity to weight decay, and avoids the spectral norm drift at negligible step-time overhead when appropriately sharded.

2605.10796 2026-05-12 cs.AI

Interpretable Machine Learning for Football Performance Analysis: Evidence of Limited Transferability from Elite Leagues to University Competition

Yu-Fang Tsai, Yu-Jen Chen, Kok-Hua Tan, Sheng-Chieh Huang, You-Ying Ji, Yu-Lun Chen, Chun-Yi Wang, Chien-Ming Hsu

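AI summary This study examines whether the interpretability of machine learning models remains stable under domain shift from elite professional football leagues to university football. Random Forest and Multilayer Perceptron models were trained on data from the top five European leagues and applied to university football data from National Tsing Hua University (NTHU). Performance determinants in the elite leagues form a stable hierarchy, whereas in university football the ordering of key indicators changes substantially and explanation stability declines. The study concludes that interpretability differs markedly across domains, which may reflect structural ambiguity in the target domain rather than limitations of the methods themselves.

Comments 19 pages, 6 figures

Details
English abstract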

Machine learning has become increasingly prevalent in football performance analysis, yet most studies prioritize predictive accuracy while implicitly assuming that learned performance determinants and their interpretations are transferable across competition levels. Whether interpretability remains reliable under domain shift from elite to university football remains largely unexplored. This study investigates whether performance determinants learned from elite competitions are structurally transferable to university-level football and whether their interpretations remain robust under domain shift. Models were trained on large-scale event data from the top five European leagues and applied to university football data from National Tsing Hua University (NTHU) using an identical feature space. Random Forest and Multilayer Perceptron models were interpreted using SHapley Additive exPlanations (SHAP) and Counterfactual Impact Score (CIS). Across five experiments, elite football exhibited a stable and consistent hierarchy of performance determinants across leagues, models, and explanation methods. In contrast, NTHU university football showed substantial reordering of key indicators, reduced explanation stability, weaker structural agreement with elite domains, and increased sensitivity to explanation method. These findings suggest that interpretability robustness is domain-dependent. Rather than reflecting methodological limitations alone, instability in explanations under domain shift may serve as a diagnostic signal of structural ambiguity in the target domain.

2605.10793 2026-05-12 cs.LG

ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

Chayne Thrash, Ali Abbasi, Soheil Kolouri

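AI summary Large language models (LLMs) are hard to deploy because of their large memory footprint and high inference cost. This paper proposes ConQuR, a lightweight post-training rotation calibration method that learns orthogonal rotations aligning normalized activations with the corners of an inscribed hypercube, so that activation energy is spread more evenly across dimensions and low-bit activation quantization improves. The method obtains an efficient closed-form update via the orthogonal Procrustes problem, avoiding gradient-based optimization over the orthogonal group, and introduces an online calibration procedure that adapts to quantized activation distributions without storing large amounts of activation data. Experiments show that it performs well on multiple benchmarks while avoiding costly end-to-end training and large-scale offline storage.

Details
English abstract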

Large language models (LLMs) are costly to deploy due to their large memory footprint and high inference cost. Weight-activation quantization can reduce these costs, but low-bit activation quantization remains difficult because activation outliers induce large quantization error. Recent rotation-based methods address this by applying orthogonal transformations that redistribute activation magnitude across dimensions, but existing approaches either require expensive end-to-end rotation training or rely on stored activation corpora, introducing significant compute or storage overhead. We propose a lightweight post-training rotation calibration method for LLM activation quantization. Our method learns orthogonal rotations that align normalized activations with the corners of an inscribed hypercube, encouraging activation energy to be distributed more evenly across dimensions. This objective admits an efficient closed-form update via the orthogonal Procrustes problem, avoiding gradient-based optimization over the orthogonal group. We further introduce an online calibration procedure that updates rotations as calibration samples are processed, eliminating the need to store activations on disk and allowing rotations to adapt to quantized activation distributions during calibration. Experiments on Llama-2 and Llama-3 models from 3B to 70B parameters show that our method achieves competitive or improved performance across perplexity benchmarks and common sense reasoning tasks while avoiding both costly end-to-end training and large offline activation storage.
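
The closed-form Procrustes update mentioned in the abstract can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not taken from the paper: activations are the rows of a matrix `X` scaled to unit length, the target for each row is its nearest hypercube corner `sign(X @ R) / sqrt(d)`, and the rotation is refit by alternating between the two steps. The function names `procrustes_rotation` and `corner_align` are hypothetical, and the paper's exact objective and online calibration procedure are not reproduced here.

```python
import numpy as np

def procrustes_rotation(X, T):
    """Orthogonal R minimizing ||X @ R - T||_F (orthogonal Procrustes)."""
    # The minimizer is U @ Vt, where U, S, Vt is the SVD of X^T T.
    U, _, Vt = np.linalg.svd(X.T @ T)
    return U @ Vt

def corner_align(X, n_iters=20, seed=0):
    """Alternate nearest-corner assignment and closed-form rotation updates."""
    n, d = X.shape
    # Assumption: each activation vector is normalized to unit length.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    # Start from a random orthogonal matrix (Q factor of a Gaussian matrix).
    R, _ = np.linalg.qr(rng.standard_normal((d, d)))
    losses = []
    for _ in range(n_iters):
        Z = X @ R
        # Nearest corner of the hypercube inscribed in the unit sphere.
        T = np.where(Z >= 0, 1.0, -1.0) / np.sqrt(d)
        losses.append(float(np.sum((Z - T) ** 2)))
        # Closed-form update: no gradient steps over the orthogonal group.
        R = procrustes_rotation(X, T)
    return R, losses
```

Because each step minimizes the same squared error over one set of variables (corners, then rotation), the recorded loss cannot increase from one iteration to the next, which gives a simple sanity check for the sketch.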

2605.10791 2026-05-12 cs.AI

PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering

Shengxiang Gao, Chao Lei, Jey Han Lau, Jianzhong Qi

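AI summary Knowledge Graph Question Answering (KGQA) aims to answer user questions by reasoning over knowledge graphs. Current methods mostly follow the retrieval-augmented generation paradigm, but training requires high-quality intermediate supervision signals (such as relevant paths or subgraphs) that are costly to obtain. This paper proposes PathISE, a framework that uses a lightweight transformer-based estimator to learn high-quality path-level supervision from answer labels and distills it into a path generation model, producing compact evidence for inductive reasoning. Experiments show that PathISE performs well on multiple benchmarks and provides reusable supervision signals that strengthen existing models.

Details
English abstract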

Knowledge Graph Question Answering (KGQA) aims to answer user questions by reasoning over Knowledge Graphs (KGs). Recent KGQA methods mainly follow the retrieval-augmented generation paradigm to ground Large Language Models (LLMs) with structured knowledge from KGs. However, training effective models to retrieve question-relevant evidence from KGs typically requires high-quality intermediate supervision signals, such as question-relevant paths or subgraphs, which are time- and resource-intensive to obtain. We propose PathISE, a novel framework for learning high-quality intermediate supervision from answer-level labels. PathISE introduces a lightweight transformer-based estimator that estimates the informativeness of relation paths to construct pseudo path-level supervision. This supervision is then distilled into an LLM path generator, whose generated paths are grounded in the KG to provide compact evidence for inductive answer reasoning. Extensive experiments on three KGQA benchmarks show that PathISE achieves competitive or state-of-the-art KGQA performance, and provides reusable supervision signals that can enhance existing KGQA models, without relying on costly LLM-refined supervision signals. Our source code is available at https://anonymous.4open.science/r/PathISE-2F87.

2605.10790 2026-05-12 cs.LG

Elucidating Representation Degradation Problem in Diffusion Model Training

Zhipeng Yao, Dazhou Li, Zitong Zhang, Durude Mahee, Fan Zhu, Wenbin Zhang, Xinwei He, Yeying Jin, Rui Yu

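AI summary Diffusion models perform well on generative tasks, but their training is inefficient, largely because of an optimization bottleneck called "representation degradation". As the noise level increases, model outputs show structural distortion, which affects training stability and generation quality. The analysis attributes the problem to mismatched target recoverability, which is associated with Neural Tangent Kernel spectral weakening and effective low-rank behavior. The authors propose Elucidated Representation Diffusion (ERD), a plug-and-play framework that dynamically allocates optimization effort to stabilize representation learning, accelerating convergence and improving generation performance across diffusion models.

Details
English abstract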

Diffusion models have achieved remarkable success, yet their training remains inefficient due to a severe optimization bottleneck, which we term Representation Degradation. As noise levels increase, the outputs of the trained model exhibit progressive structural distortion, which can destabilize training and impair generation quality. Our analysis suggests that this instability is driven by mismatched target recoverability, which is associated with Neural Tangent Kernel (NTK) spectral weakening and effective low-rank behavior. To address this, we propose Elucidated Representation Diffusion (ERD), a plug-and-play framework that dynamically reallocates optimization effort according to effective recoverability. By stabilizing representation learning without external supervision, ERD accelerates convergence and achieves strong empirical performance across diffusion backbones.

2605.10789 2026-05-12 cs.CV

Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction

Quanyun Wu, Kyle Gao, Wentao Sun, Zhengsen Xu, Hudson Sun, Linlin Xu, Yuhao Chen, David A. Clausi, Jonathan Li

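AI summary This paper proposes a rapid forest fuel load estimation method based on virtual remote sensing data and metric-scale feed-forward 3D reconstruction, aimed at the high cost and long turnaround of traditional methods. It uses Google Earth Studio to generate low-altitude orbital imagery and camera poses, performs dense 3D reconstruction with the Pi-Long model, and resolves the scale ambiguity of monocular reconstruction with a metric recovery module. The resulting bird's-eye-view height and density maps are then used for tree species classification, Leaf Area Index calculation, and fuel load estimation. Experiments indicate that the method offers an efficient, low-cost approach to forest biomass estimation while maintaining geometric consistency.

Comments Accepted for publication at IEEE IGARSS 2026

Details
English abstract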

Accurate quantification of forest coverage and combustible biomass (fuel load) is critical for wildfire risk assessment and ecosystem management. However, traditional methods relying on airborne LiDAR or field surveys are cost-prohibitive and time-intensive, while satellite imagery often lacks the vertical resolution required for canopy volume analysis. This paper proposes a novel, automated pipeline for rapid forest inventory using virtual remote sensing data derived from Google Earth Studio (GES). Our approach first generates low-altitude orbital imagery and camera poses for a target region. For dense 3D reconstruction, we employ Pi-Long, developed within the VGGT-Long framework. This model serves as a scalable extension of the Pi-3 feed-forward Transformer architecture. To address the inherent scale ambiguity in monocular reconstruction, we introduce a metric recovery module that aligns the reconstructed trajectory with GES ground truth poses via Sim(3) Umeyama optimization. The metric-scale point cloud is then orthogonally projected into Bird's-Eye-View (BEV) height and density maps. Finally, we employ a watershed-based segmentation algorithm combined with height variance analysis to classify tree species (conifer vs. broadleaf), calculate Leaf Area Index (LAI), and estimate total fuel load. Experimental results demonstrate that this pipeline offers a scalable, cost-effective alternative to physical scanning, enabling near-real-time estimation of forest biomass with high geometric consistency.
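
The metric recovery step named in the abstract, Sim(3) alignment via the Umeyama method, can be sketched in NumPy as follows. This is a generic implementation of the classical closed-form solution, not the authors' code. It assumes the inputs are two arrays of corresponding 3D points, for example reconstructed camera centers and the matching GES camera centers; the function name `umeyama_sim3` is hypothetical.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ~ s * R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n, dim = src.shape
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centered target and source points.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)
    # Reflection guard: force det(R) = +1 so R is a proper rotation.
    S = np.eye(dim)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0
    R = U @ S @ Vt
    # Scale is the ratio of aligned covariance to source variance.
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * (R @ mu_src)
    return s, R, t
```

Applying the recovered transform to every reconstructed point, `s * points @ R.T + t`, would place the point cloud in the metric frame of the reference poses, which is what allows heights in the BEV maps to be read in real units.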

2605.10784 2026-05-12 cs.LG

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

Rohan Surana, Xintong Li, Sheldon Yu, Yiran Jenny Shen, Chuhan Wang, Tong Yu, Prithviraj Ammanabrolu, Jingbo Shang, Julian McAuley, Junda Wu

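AI summary This paper proposes MASS-DPO, a multi-negative active sample selection method that improves multi-negative preference optimization in Direct Preference Optimization (DPO). Based on the Plackett-Luce model, the method introduces a model-specific Fisher-information objective to select, for each prompt, a subset of negatives that is informative and low in redundancy, reducing computational cost while retaining the information of the full pool. Experiments show that MASS-DPO achieves higher accuracy and better optimization dynamics on multiple benchmark tasks, and attains stronger alignment with fewer negatives.

Details
English abstract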

Multi-negative preference optimization under the Plackett-Luce (PL) model extends Direct Preference Optimization (DPO) by leveraging comparative signals across one preferred and multiple rejected responses. However, optimizing over large negative pools is costly, and many candidates contribute redundant gradients due to their similar effects on policy updates. We introduce MASS-DPO, a multi-negative active sample selection method that derives a PL-specific Fisher-information objective for selecting compact, informative negative subsets within each prompt. The resulting log-determinant objective selects negatives that contribute complementary information for policy updates, yielding compact subsets that retain the full pool's information while reducing redundancy. In practice, this favors negatives whose gradients cover different update directions, reducing redundant signal from near-duplicate candidates while preserving the most useful training information. Across four benchmarks spanning recommendation and multiple-choice QA and three model families, MASS-DPO consistently exceeds or matches existing methods in accuracy, improves Recall/NDCG and margin-based optimization dynamics, and delivers stronger alignment with substantially fewer negatives.
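
To illustrate how a log-determinant objective favors complementary over near-duplicate negatives, here is a minimal sketch of generic greedy log-det selection. It is not the paper's PL-specific Fisher-information objective. It assumes each negative is represented by a hypothetical feature or gradient vector (a row of `G`) and greedily maximizes `log det(ridge * I + sum of g g^T)` over the chosen rows; the name `greedy_logdet_select` and the `ridge` parameter are illustrative.

```python
import numpy as np

def greedy_logdet_select(G, k, ridge=1.0):
    """Greedily pick k rows of G maximizing log det(ridge*I + sum g g^T)."""
    G = np.asarray(G, dtype=float)
    n, d = G.shape
    A_inv = np.eye(d) / ridge          # inverse of the current information matrix
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(min(k, n)):
        # Matrix determinant lemma: adding g raises log det by log(1 + g^T A^{-1} g).
        gains = np.log1p(np.einsum("ij,jk,ik->i", G, A_inv, G))
        gains[chosen] = -np.inf        # never pick the same candidate twice
        best = int(np.argmax(gains))
        selected.append(best)
        chosen[best] = True
        # Sherman-Morrison rank-one update of the inverse.
        g = G[best]
        Ag = A_inv @ g
        A_inv = A_inv - np.outer(Ag, Ag) / (1.0 + g @ Ag)
    return selected
```

Once one vector is chosen, any candidate pointing in almost the same direction has a small marginal gain, so a candidate covering a new direction is preferred even if its norm is smaller.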

2605.10782 2026-05-12 cs.AI

TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding

Lihuan Li, Wilson Wongso, Baiyu Chen, Hao Xue, Ruiyi Yang, Yifan Duan, Xiachong Lin, Yang Song, Flora Salim

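AI summary TrajPrism is a multi-task benchmark for language-grounded urban trajectory understanding. It unifies trajectory generation, semantic trajectory retrieval, and trajectory captioning, and evaluates trajectory fidelity, retrieval quality, and language groundedness. The benchmark pairs real urban trajectories with filtered language annotations, covering 300K trajectories from Porto, San Francisco, and Beijing and yielding 2.1M task instances. The study also presents proof-of-concept models for each task and shows that geometry-only trajectory methods fall clearly short on tasks involving language.

Comments This paper is under review

Details
English abstract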

Urban mobility is naturally expressed both as trajectories in space and as natural-language descriptions of travel intent, constraints, and preferences. However, prior work rarely evaluates these two modalities together on the same real-world trajectories: trajectory modeling often stays geometry-centric, while language-centric mobility benchmarks frequently target route planning and tool use rather than fine-grained, verifiable alignment between text and the underlying route. We introduce TrajPrism, a multi-task benchmark for language-trajectory alignment that unifies (i) instruction-conditioned trajectory generation, (ii) language-driven semantic trajectory retrieval, and (iii) trajectory captioning, together with an evaluation protocol that measures trajectory fidelity, retrieval quality, and language groundedness. We construct TrajPrism by pairing real urban trajectories with judge-filtered language annotations generated under a four-dimensional travel-intent taxonomy. The benchmark contains 300K selected trajectories across Porto, San Francisco, and Beijing, yielding 2.1M task instances from three instruction variants, three retrieval queries, and one caption per trajectory. We further develop proof-of-concept models for each task: TrajAnchor for instruction-conditioned trajectory generation, TrajFuse for semantic trajectory retrieval, and TrajRap for trajectory captioning. These models instantiate the proposed tasks and show that geometry-only trajectory baselines leave a large gap on our protocol, especially where language is part of the input-output interface. We release TrajPrism with code and a reproducible annotation pipeline that is designed to be portable across cities, given compatible trajectory inputs and map resources.

2605.10781 2026-05-12 cs.LG cs.CL

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

Jeonghye Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang

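AI summary This study proposes RLRT, a new method that uses the teacher signal in self-distillation in reverse to guide the student model toward valuable exploration along successful paths. Whereas conventional self-distillation suppresses the student's own reasoning on successful rollouts, RLRT emphasizes the reasoning in the student's own successful paths and uses it as a reinforcement learning reward signal. Experiments show that RLRT significantly outperforms existing self-distillation and exploration baselines on several Qwen3 models, offering a new design principle for RLVR.

Details
English abstract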

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.

2605.10777 2026-05-12 cs.LG

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

Keitaro Sakamoto, Pierre Ablin, Federico Danieli, Marco Cuturi

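AI summary Open-weight language models have improved markedly in quality in recent years, but the freedom to modify their weights may pose security risks. This paper proposes DLR-Lock, a new defense that replaces the multilayer perceptrons in a pretrained model with deep low-rank residual networks (DLR-Nets) of comparable parameter count, exploiting the fact that activation memory grows linearly with depth during backpropagation to make fine-tuning harder. Experiments show that the method withstands adaptive attackers with full knowledge of the defense strategy without harming the model's original capabilities.

Details
English abstract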

The quality of open-weight language models has dramatically improved in recent years. Sharing weights greatly facilitates model adoption by enabling their use across diverse hardware and software platforms. It also allows for more open research and testing, to the extent that users can use the weights as checkpoints, fine-tune them according to their needs, and potentially redistribute them. In some cases, however, concerns about modifying these weights for unauthorized uses may outweigh the benefits of giving users such freedom. Defending against such adaptation is non-trivial: since an adaptive attacker can observe all weights and architectures by definition, they can reverse simple structural defenses, and use optimization to defeat the simplest locking mechanisms. In this work, we exploit the inference-training asymmetry of automatic differentiation as a novel defense axis. We propose DLR-Lock, a method where the purveyor of the model purposely replaces each pretrained MLP in their model with a deep low-rank residual network (DLR-Net) of comparable parameter count, forcing activation memory that grows linearly with depth during backpropagation. DLR-Nets are efficiently trained via module-wise distillation. We show that, beyond this memory overhead, DLR-Lock results in architectural mismatches that complicate the optimization landscape of standard fine-tuning, and a backward pass that incurs disproportionately more overhead than the forward pass. Our defense succeeds in withstanding adaptive attackers with full knowledge of the defense strategy while preserving the original model's capabilities. Experiments on LLMs validate these claims.