arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.14067 2026-05-15 cs.LG

Comparative Evaluation of Machine Learning Approaches for Minority-Class Financial Distress Prediction Under Class Imbalance Constraints

Karan Sehgal, Khawar Naveed Bhatti

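AI summary This paper studies how to effectively predict minority-class financially distressed firms under class imbalance, comparing classical statistical methods, ensemble learning, and neural network models. Using structured preprocessing and SMOTE oversampling, the experiments find that gradient-boosting methods are more sensitive to the minority class under severe imbalance. The study emphasises reproducibility, interpretability, and governance-oriented machine learning evaluation, offering a practical engineering solution for financial risk analysis.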

Comments 16 pages, 4 figures, preprint under review. Applied machine learning evaluation involving imbalance-aware financial distress prediction, ensemble learning, SMOTE, and SHAP explainability analysis

Abstract

Financial distress prediction remains a significant challenge in enterprise risk analysis due to the highly imbalanced nature of real-world financial datasets, where bankrupt or distressed firms typically constitute only a small minority of observations. This paper presents a comparative evaluation of classical statistical methods, ensemble learning approaches, and exploratory neural models for minority-class financial distress prediction under class imbalance constraints. The study incorporates structured preprocessing, imbalance mitigation using the Synthetic Minority Oversampling Technique (SMOTE), comparative evaluation across ensemble learning architectures including XGBoost, CatBoost, LightGBM, and Random Forest, and explainability analysis using SHAP-based feature attribution methods. Experimental evaluation demonstrates that gradient-boosting approaches achieved improved minority-class sensitivity relative to baseline statistical classifiers under severe imbalance conditions. The workflow additionally emphasises reproducibility, interpretability, auditability, and governance-oriented machine learning evaluation within enterprise financial risk environments. The work is positioned as an applied engineering evaluation intended to support reproducible and interpretable machine learning workflows for financial distress prediction under severe class imbalance constraints.
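As a rough illustration of the SMOTE step named in the abstract, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified, standard-library stand-in and not the authors' pipeline; the function name `smote_sketch`, the parameters `k` and `seed`, and the toy data are all assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation.

    Simplified illustration of the SMOTE idea; names and defaults are
    hypothetical and not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features per firm (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
new_points = smote_sketch(minority, n_new=4)
```

Because every synthetic point is a convex combination of two real minority points, it stays inside the region the minority class already occupies. In practice oversampling of this kind is normally applied to the training split only, so that evaluation data remains untouched.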

2605.14062 2026-05-15 cs.AI cs.CL

Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

Anjir Ahmed Chowdhury, Syed Zawad, Feng Yan

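AI summary This paper proposes MSIFR, a lightweight framework that detects and terminates low-quality generation trajectories during generation, reducing redundant computation in LLM synthetic data generation. The method uses staged generation and fast rule-based validation to identify arithmetic errors, hallucinations, and formatting problems early, so that invalid samples are rejected before completion. Experiments show that MSIFR substantially reduces token consumption during generation without additional training or architectural changes, while preserving or improving the quality of the generated data.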

Comments 17 pages, 4 figures, 7 tables

Abstract

While synthetic data generation with large language models (LLMs) is widely used in post-training pipelines, existing approaches typically generate full outputs before applying quality filters, leading to substantial token waste on samples that are ultimately discarded. To address this, we propose Multi-Stage In-Flight Rejection (MSIFR), a lightweight, training-free framework that detects and terminates low-quality generation trajectories at intermediate checkpoints before they reach full completion. MSIFR decomposes the generation process into sequential stages and applies fast rule-based validators to identify arithmetic inconsistencies, hallucination patterns, and formatting violations, enabling early rejection of faulty samples. We formalize in-flight rejection as a sequential decision process and show that any non-trivial discard policy reduces expected token consumption, with stage-wise savings increasing when rejection occurs earlier in the generation pipeline. We further demonstrate that conditional utility estimates form a martingale, ensuring that early, in-flight rejection does not bias the expected utility of retained samples. Across five instruction-tuned models and seven reasoning benchmarks, MSIFR reduces token consumption by 11%-77% as a standalone method, and up to 78.2% when combined with early-exit methods, while preserving or improving evaluation accuracy. These results confirm that MSIFR provides a practical mechanism for improving the efficiency of LLM-based synthetic data generation without additional training or architectural changes.
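The staged generate-then-validate loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the MSIFR implementation: the stage callables stand in for partial LLM generations, and `run_with_inflight_rejection` and `arithmetic_ok` are hypothetical names.

```python
def run_with_inflight_rejection(stages, validators):
    """Generate stage by stage and stop as soon as a validator rejects.

    stages: list of zero-argument callables, each returning a list of tokens
        (stand-ins for partial model generations).
    validators: list of callables taking the tokens so far, returning bool.
    Returns (accepted, tokens_generated_so_far).
    """
    tokens = []
    for generate_stage, is_valid in zip(stages, validators):
        tokens.extend(generate_stage())
        if not is_valid(tokens):
            # Reject in flight: later stages are never generated.
            return False, tokens
    return True, tokens


def arithmetic_ok(tokens):
    """Toy rule: every 'a+b=c' token must be arithmetically consistent."""
    for tok in tokens:
        if "+" in tok and "=" in tok:
            left, right = tok.split("=")
            a, b = left.split("+")
            if int(a) + int(b) != int(right):
                return False
    return True


# A faulty first stage is caught before the 5-token second stage runs.
bad_stages = [lambda: ["2+2=5"], lambda: ["so", "the", "answer", "is", "5"]]
accepted, used = run_with_inflight_rejection(
    bad_stages, [arithmetic_ok, arithmetic_ok]
)
```

In this toy run the sample is rejected after 1 token instead of 6, which mirrors the abstract's point that savings grow when rejection happens earlier in the pipeline.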

2605.14061 2026-05-15 cs.AI cs.LG

MathAtlas: A Benchmark for Autoformalization in the Wild

Nilay Patel, Noah Arias, Davit Babayan, Victoria Cochran, Timothy Libman, Hafsah Mahmood, Liam McCarty, Soli Munoz, Laurel Willey, Jeffrey Flanigan

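AI summary Current autoformalization benchmarks focus mainly on competition or undergraduate mathematics, while graduate and research-level mathematics still lacks such resources. This paper introduces MathAtlas, the first large-scale graduate-level autoformalization benchmark, containing about 52,000 theorems, definitions, exercises, examples, and proofs extracted from 103 textbooks, together with a mathematical dependency graph of about 178,000 relations. Experiments show that the benchmark is high quality but extremely challenging: state-of-the-art models perform poorly on both theorem and definition formalization, and performance drops markedly as dependency depth increases.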

Comments In submission at NeurIPS 2026

Abstract

Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored. In this paper, we introduce MathAtlas, the first large-scale autoformalization benchmark of in-the-wild graduate-level mathematics, containing ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks. MathAtlas is enriched with a mathematical dependency graph containing ~178k relations, and is the first autoformalization benchmark to include such relations, facilitating evaluation and development of dependency-aware autoformalization systems. Our extensive experiments show that MathAtlas is high quality but extremely challenging: strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions. Furthermore, we find that the performance of state-of-the-art models degrades substantially with dependency depth: on MA-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2.6% correctness. We release MathAtlas to the community as a benchmark set for large-scale autoformalization of graduate-level mathematics in the wild.

2605.14055 2026-05-15 cs.CL cs.AI

PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

Anjir Ahmed Chowdhury, Syed Zawad, Xiaolong Ma, Xu Dong, Feng Yan

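AI summary This paper proposes PEML, a parameter-efficient multi-task learning method that aims to improve the fine-tuning efficiency of large language models in multi-task settings by jointly optimizing continuous prompts and model weights. Compared with existing methods such as LoRA and Prefix Tuning, PEML combines neural architecture engineering to optimize the prompt structure with low-rank adaptation of model parameters, adapting more fully to multi-task requirements. Experiments show that PEML delivers performance gains on several benchmark datasets, with average accuracy improving by up to 6.67% and individual tasks gaining up to 10.75%.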

Comments 26 pages, 8 figures, 18 Tables

Abstract

Parameter-Efficient Fine-Tuning (PEFT) is widely used for adapting Large Language Models (LLMs) to various tasks. Recently, there has been increasing demand for fine-tuning a single LLM for multiple tasks, because the common features shared among tasks mean that less data is needed overall. More importantly, LLMs are resource demanding, and deploying a single model for multiple tasks facilitates resource consolidation and consumes significantly fewer resources than deploying an individual large model for each task. Existing PEFT methods like LoRA and Prefix Tuning are designed to adapt LLMs to a specific task. LoRA and its variants focus on aligning the model itself to tasks, overlooking the importance of prompt tuning in multi-task learning, while Prefix Tuning adopts only a simple architecture to optimize prompts, which limits its adaptation capabilities for multi-task settings. To enable efficient fine-tuning for multi-task learning, it is important to co-optimize prompt optimization and model adaptation. In this work, we propose Parameter-Efficient Multi-task Learning (PEML), which employs a neural architecture engineering method for optimizing the continuous prompts while also performing low-rank adaptation of model weights. We prototype PEML by creating an automated framework for optimizing the continuous prompts and adapting model weights. We evaluate PEML against the state-of-the-art multi-task learning methods MTL-LoRA, MultiLoRa, C-Poly, and MoE on the GLUE, SuperGLUE, Massive Multitask Language Understanding, and commonsense reasoning benchmarks. The evaluation results show an average accuracy improvement of up to 6.67%, with individual tasks showing peak gains of up to 10.75%.

2605.14053 2026-05-15 cs.CL cs.AI

Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation

Ignacio Sastre, Guillermo Moncecchi, Aiala Rosá

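AI summary This study addresses hallucinations and erroneous reasoning by large language models in question answering and proposes Derivation Prompting, a new prompting method based on logical derivation, to improve the generation step of the Retrieval-Augmented Generation (RAG) framework. The method systematically derives conclusions from initial hypotheses using predefined rules and builds an interpretable derivation tree, which adds control over the generation process. Experiments show that, in a specific case study, the method significantly reduces unacceptable answers compared with traditional RAG and long-context methods.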

Journal ref
Advances in Artificial Intelligence IBERAMIA 2024, LNCS 15277, pp. 412-423, Springer (2025)
Abstract

The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge-intensive, domain-specific tasks. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval-Augmented Generation framework. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules. It constructs a derivation tree that is interpretable and adds control over the generation process. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long-context window methods.

2605.14051 2026-05-15 cs.AI

SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

Yusuke Ozaki, Dhaval Patel

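AI summary This paper proposes SPIN, a planning wrapper that addresses the structurally invalid or redundant plans that large language models (LLMs) often produce in the planning stage of industrial tasks. SPIN combines validated directed acyclic graph (DAG) planning with prefix-based execution control: it produces executable plans through a strict DAG contract and repair prompting, then evaluates DAG prefixes incrementally so that execution can stop early once the current prefix suffices. Experiments show that SPIN reduces the number of executed tasks and tool calls across several benchmarks while improving task completion and related metrics.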

Comments 31 pages, 10 figures

Abstract

Industrial LLM agent systems often separate planning from execution, yet LLM planners frequently produce structurally invalid or unnecessarily long workflows, leading to brittle failures and avoidable tool and API cost. We propose SPIN, a planning wrapper that combines validated Directed Acyclic Graph (DAG) planning with prefix-based execution control. SPIN enforces a strict DAG contract through _validate_plan_text and repair prompting, producing executable plans before downstream execution, and then evaluates DAG prefixes incrementally to stop when the current prefix is sufficient to answer the query. On AssetOpsBench, across 261 scenarios, SPIN reduces executed tasks from 1061 to 623 and improves Accomplished from 0.638 to 0.706, while reducing tool calls from 11.81 to 6.82 per run. On MCP Bench, the same wrapper improves planning, grounding, and dependency-related scores for both GPT OSS1 and Llama 4 Maverick.
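To make the two ideas in the abstract concrete (a validated DAG plan, then prefix-wise execution with early stopping), here is a minimal sketch. It does not reproduce the paper's `_validate_plan_text`; the plan format (a dict from task id to its dependencies) and the names `validate_dag`, `run_prefix`, `execute`, and `is_sufficient` are hypothetical.

```python
from collections import deque


def validate_dag(plan):
    """Return a topological order of task ids, or raise ValueError.

    plan: dict mapping task id -> list of task ids it depends on.
    Hypothetical plan format, used only for illustration.
    """
    for task, deps in plan.items():
        for dep in deps:
            if dep not in plan:
                raise ValueError(f"{task!r} depends on unknown task {dep!r}")
    indegree = {task: len(deps) for task, deps in plan.items()}
    dependents = {task: [] for task in plan}
    for task, deps in plan.items():
        for dep in deps:
            dependents[dep].append(task)
    # Kahn's algorithm: repeatedly emit tasks with no unmet dependencies.
    ready = deque(task for task, n in indegree.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in dependents[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(plan):
        raise ValueError("plan contains a cycle")
    return order


def run_prefix(plan, execute, is_sufficient):
    """Execute tasks in dependency order; stop once results suffice."""
    results = {}
    for task in validate_dag(plan):
        results[task] = execute(task)
        if is_sufficient(results):
            break
    return results


plan = {"fetch": [], "filter": ["fetch"], "report": ["filter"]}
results = run_prefix(
    plan,
    execute=lambda task: f"{task}-done",
    is_sufficient=lambda res: "filter" in res,
)
```

Here the `report` task is never executed because the prefix ending at `filter` is judged sufficient. In a wrapper of this kind, a `ValueError` from validation would be the natural trigger for a repair prompt rather than a hard failure.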

2605.14049 2026-05-15 cs.AI cs.CL cs.CY

Bridging Legal Interpretation and Formal Logic: Faithfulness, Assumption, and the Future of AI Legal Reasoning

Olivia Peiyu Wang, Leilani H. Gilpin

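AI summary As large language models are increasingly adopted in legal practice, they bring both promise and risk. This paper examines the systematic assumption-laden reasoning of current AI systems in legal reasoning: models often reach conclusions based on assumptions that go beyond the text and lack logical rigor. It proposes a neuro-symbolic approach that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning more reliable and trustworthy and to reduce the burden of manual verification while meeting the accountability that legal practice demands.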

Comments 2 pages abstract accepted by Bloomberg LSLLAI 2026 Symposium

Abstract

The growing adoption of large language models in legal practice brings both significant promise and serious risk. Legal professionals stand to benefit from AI that can reason over contracts, draft documents, and analyze sources at scale, yet the high-stakes nature of legal work demands a level of rigor that current AI systems do not provide. The central problem is not simply that LLMs hallucinate facts and references; it is that they systematically draw inferences that go beyond what the source text actually supports, presenting assumption-laden conclusions as if they were logically grounded. This proposal presents a neuro-symbolic approach to legal AI that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning both capable and trustworthy, thus reducing the burden of manual verification without sacrificing the accountability that legal practice demands.

2605.14047 2026-05-15 cs.CV cs.AR

Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation

Kieran Carrigg, Sigur de Vries, Amirhossein Sadough, Marcel van Gerven

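AI summary This paper studies how to deploy Vision Transformers (ViTs) efficiently on edge devices. To address the computational complexity and global reduction bottleneck caused by layer normalization, it proposes a hardware-aware framework based on genetic programming. The method evolves layer-specific scalar functions to replace conventional normalization layers, adapting the model without retraining from scratch. Experiments show that the method preserves image classification accuracy while significantly reducing compute and memory overhead, offering an effective route to deploying ViTs on edge accelerators.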

Comments 18 pages, 7 figures

Abstract

Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements do not optimally fit every layer's behaviour and rely on expensive model retraining. In this work, we propose a highly efficient, hardware-aware framework that utilizes genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach entirely eliminates the need to retrain models from scratch. Our evolved expressions accurately approximate the target normalization behaviours, capturing 91.6% of the variance (R^2) compared to only 70.2% for homogeneous baselines, allowing our modified architecture to recover 84.25% Top-1 ImageNet-1K accuracy in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.

2605.14045 2026-05-15 cs.CV

PVRF: All-in-one Adverse Weather Removal via Prior-modulated and Velocity-constrained Rectified Flow

Wei Dong, Han Zhou, Terry Ji, Guanhua Zhao, Shahab Asoodeh, Yulun Zhang, Guangtao Zhai, Jun Chen, Xiaohong Liu

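AI summary This paper proposes PVRF, a unified framework for removing complex and varied adverse weather in real-world scenes. The method combines a soft weather perception module based on frozen vision-language models with velocity-constrained rectified-flow refinement: attribute-modulated normalization and weather-weighted adapters produce an initial restoration estimate, and a terminal-consistent residual rectified flow improves restoration quality and stability. Experiments show that PVRF outperforms existing methods in fidelity and perceptual quality and generalizes well across datasets.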

Comments 10 pages, 9 figures, and 4 tables

Abstract

Adverse weather removal (AWR) in real-world images remains challenging due to heterogeneous and unseen degradations, while distortion-driven training often yields overly smooth results. We propose PVRF, a unified framework that integrates zero-shot soft weather perceptions with velocity-constrained rectified-flow refinement. PVRF introduces an AWR-specific question answering module (AWR-QA) that uses frozen vision-language models (VLMs) to estimate soft probabilities of weather types and low-level attribute scores. These perceptions condition restoration networks via attribute-modulated normalization (AMN) and weather-weighted adapters (WWA), producing an anchor estimate for refinement. We then learn a terminal-consistent residual rectified flow with perception-adaptive source perturbation and a terminal-consistent velocity parameterization to stabilize learning near the terminal regime. Extensive experiments show that PVRF improves both fidelity and perceptual quality over state-of-the-art baselines, with strong cross-dataset generalization on single and combined degradations. Code will be released at https://github.com/dongw22/PVRF.

2605.14040 2026-05-15 cs.CL

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

Shan Yang

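AI summary This study examines potential problems in the multimodal physics reasoning evaluation pipeline, identifying three previously undetected issues (train-eval contamination, translation drift, and multiple-choice saturation) and proposing remedies. It builds rigorously audited multimodal corpora and a held-out evaluation set, and reports substantial gains on physics reasoning tasks for models trained on them. By releasing four artifacts, three datasets and a reference recipe based on Qwen3-VL, the work provides more reliable benchmarks and training resources for visual physics reasoning.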

Comments 10 pages, 3 tables. Project page: https://shanyang.me/physics-r1-page/

Abstract

We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
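The first audit stage mentioned in the abstract, a 5-gram Jaccard check, can be sketched as below, along with why it can report zero hits on paraphrased problems. The abstract does not say whether its n-grams are word-level or character-level; this sketch assumes lowercase whitespace-split word 5-grams, and `ngrams`, `jaccard`, and the example sentences are made up for illustration.

```python
def ngrams(text, n=5):
    """Set of word-level n-grams (assumed tokenization: lowercase + split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=5):
    """Jaccard similarity of the n-gram sets of two texts."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


original = "a block of mass m slides down a frictionless incline of angle theta"
paraphrase = "on a smooth ramp tilted at theta an object with mass m is sliding downward"

exact_score = jaccard(original, original)
paraphrase_score = jaccard(original, paraphrase)
```

The paraphrase shares no 5-gram with the original, so a surface-level audit scores it as unrelated even though it poses the same problem. That gap is consistent with the abstract's motivation for adding embedding-similarity and LLM-judge stages after the Jaccard pass.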

2605.14037 2026-05-15 cs.LG cs.CL

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

Gergely Szilvasy, Manuel Faysse, Maria Lomeli, Matthijs Douze, Pierre-Emmanuel Mazaré, Loïc Cabannes, Wen-tau Yih, Hervé Jégou

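AI summary As modern language models process ever-longer sequences, the memory footprint and bandwidth of the key-value (KV) cache become a bottleneck for efficient generation. This paper proposes Self-Pruned Key-Value Attention (SP-KV), which predicts the future utility of key-value pairs to decide dynamically which pairs are kept in the global cache, reducing cache size. The method substantially improves memory efficiency and decoding speed with little to no loss in model performance, and reveals layer- and head-level sparsity patterns that can inform the design of hybrid local-global attention architectures.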

Comments 28 pages, 8 figures, 8 tables

Abstract

Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, older pairs are written to the cache and used in global attention only if their predicted utility surpasses a given threshold. The LLM and the utility predictor are trained jointly end-to-end exclusively through next-token prediction loss, and are adapted from pretrained LLM checkpoints. Rather than enforcing a fixed compression ratio, SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of 3x to 10x, with longer sequences often being more compressible. This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss or performance on a broad set of downstream tasks. Beyond serving as an effective KV-cache reduction mechanism, our method reveals structured layer- and head-specific sparsity patterns that we can use to guide the design of hybrid local-global attention architectures.
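The write rule described in the abstract (recent pairs always visible through a local window, older pairs kept only if their predicted utility exceeds a threshold) can be illustrated with a small sketch. In the paper the utility scores come from a learned predictor trained jointly with the model; here they are simply given as numbers, and `visible_positions`, `window`, and `threshold` are hypothetical names.

```python
def visible_positions(utilities, window, threshold):
    """Positions the newest token could attend to under the sketch rule.

    utilities: one predicted-utility score per past position (made-up
        inputs here; the paper learns them).
    window: size of the local window that is always available.
    threshold: older positions are kept only if utility > threshold.
    """
    n = len(utilities)
    local_start = max(0, n - window)
    # Older positions survive only when their score clears the threshold.
    kept_global = [i for i in range(local_start) if utilities[i] > threshold]
    # The most recent `window` positions are always available.
    return kept_global + list(range(local_start, n))


utilities = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.7, 0.4]
kept = visible_positions(utilities, window=3, threshold=0.5)
compression = len(utilities) / len(kept)
```

With these toy scores, 5 of 8 positions remain visible. Because the number kept depends on the scores rather than on a fixed budget, the compression ratio varies with the input, which is the sense in which the abstract calls the sparsification dynamic.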

2605.14036 2026-05-15 cs.AI cs.CC cs.CL cs.LG

Enhanced and Efficient Reasoning in Large Learning Models

Leslie G. Valiant

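AI summary This paper proposes an efficient and principled reasoning method intended to improve the trustworthiness of content generated by large language models. A preprocessing stage recodes the data into a "Unary Relational Integracode" that describes relationships among objects more explicitly, followed by a standard machine learning process, so that more reliable reasoning is achieved while retaining much of the existing software and hardware base. The method applies beyond natural language, for example to vision and actions, and, building on Robust Logic, supports sound reasoning within a single call of the model as well as across multiple calls.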

Abstract

In current Large Language Models we can trust the production of smoothly flowing prose on the basis of the principles of machine learning. However, there is no comparably principled basis to justify trust in the content of the text produced. It appears to be conventional wisdom that addressing this issue by adding more principled reasoning is not computationally affordable. Here we propose a principled method of reasoning that is efficient enough to be practical for large language models. Further, the method allows the retention of much of the currently used software and hardware base. Our method for improving the functioning of large language models consists of a first stage of preprocessing that recodes the data to a Unary Relational Integracode that is more explicit about the relationships among the objects described in the text, followed as a second stage by a standard but possibly streamlined machine learning process that then also learns to predict these relationships. The method may be viewed as realizing a world model and applying beyond natural language, to vision and actions, for example, where the multiple properties of an object referred to in an input are brought together explicitly, rather than remaining distributed in the various references to it in the input. We articulate its advantages in terms of Robust Logic, a system for performing principled chaining on learned, and hence uncertain, information. We show that this recoding has the surprising and fortuitous property that, while succinct, it makes the task of learning a core subset of relational rules that hold in the world described in the training data polynomial time learnable in a defined sense, the polynomial depending on the complexity of the rule. This gives support for sound reasoning within each single call of the learned classifier as well as between multiple calls.

2605.14034 2026-05-15 cs.AI cs.CL cs.CY

From Descriptive to Prescriptive: Uncover the Social Value Alignment of LLM-based Agents

Jinxian Qu, Qingqing Gu, Teng Chen, Luo Ji

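AI summary This paper studies how LLM-based agents can be better aligned with human social values. It proposes a novel value-based framework that uses GraphRAG to convert ethical principles into value-based instructions, steering the agent to behave as expected in a given conversational context. Expected behaviors are defined using Maslow's Hierarchy of Needs and Plutchik's Wheel of Emotion. Experiments show that the method significantly outperforms prompt-based baselines on the DAILYDILEMMAS benchmark and provides a basis for the emergence of self-emotion in AI systems.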

Comments Accepted by CogSci 2026

Abstract

Wide applications of LLM-based agents require strong alignment with human social values. However, current works still exhibit deficiencies in self-cognition and dilemma decision, as well as self-emotions. To remedy this, we propose a novel value-based framework that employs GraphRAG to convert principles into value-based instructions and steer the agent to behave as expected by retrieving the suitable instruction upon a specific conversation context. To evaluate the ratio of expected behaviors, we define the expected behaviors from two famous theories, Maslow's Hierarchy of Needs and Plutchik's Wheel of Emotion. By experimenting with our method on the benchmark of DAILYDILEMMAS, our method exhibits significant performance gains compared to prompt-based baselines, including ECoT, Plan-and-Solve, and Metacognitive prompting. Our method provides a basis for the emergence of self-emotion in AI systems.

2605.14033 2026-05-15 cs.AI cs.LG

Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents

David N. Olivieri, Roque J. Hernández

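AI summary This paper studies how AI agents can detect, during scientific theory shift, whether an existing representational framework still applies in a new regime or needs to be extended. The authors propose a finite sheaf-theoretic framework that identifies theory-shift candidates through transport and obstruction, measuring incoherence via residual fit, overlap incompatibility, constraint violation, and related terms. The method is validated on a controlled benchmark, where it separates deformation of a theory from its extension, and gives AI agents a finite diagnostic for deciding whether extension is needed when representational transport fails.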

Abstract

Scientific theory shift in AI agents requires more than fitting equations to data. An artificial scientific agent must detect whether an existing representational framework remains transportable into a new regime, or whether its language has become locally-to-globally obstructed and must be extended. This paper develops a finite sheaf-theoretic framework for detecting theory-shift candidates through transport and obstruction. Contexts are organized as a local-to-global structure in which source, overlap, target, and validation charts are fitted, restricted, and tested for gluing. Obstruction measures failure of coherence through residual fit, overlap incompatibility, constraint violation, limiting-relation failure, and representational cost. We evaluate the framework on a controlled transition-card benchmark designed to separate deformation within a source language from extension of that language. The main result is direct obstruction ranking: the intended deformation or extension is usually the lowest-obstruction candidate, and transition type is separated in the benchmark. A constellation kernel over the same signatures is included only as a secondary representational-similarity probe. The aim is not to reconstruct historical paradigm shifts or solve open-ended autonomous theory invention, but to isolate a finite diagnostic subproblem for AI agents: detecting when representational transport fails and extension becomes the coherent next move.

2605.14031 2026-05-15 cs.SD cs.CV cs.LG

Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

Wuao Liu, Mustafa Chasmai, Subhransu Maji, Grant Van Horn

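AI summary This paper studies the effectiveness of masked autoencoders (MAE) for fine-grained bioacoustic species classification with limited data. Through systematic experiments on the iNatSounds dataset, it analyzes the effects of pretraining data scale, domain specificity, data curation, and transfer strategies. Models pretrained on diverse general audio data perform best on the bioacoustic task, whereas additional domain-specific pretraining and data filtering bring limited benefit at small data scales and can even degrade performance. The results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining data scale matters more to model performance than objective design.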

Comments Workshop on Fine-Grained Visual Categorization (FGVC) at CVPR 2026. 8 pages, 6 figures

Abstract

Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.

2605.14026 2026-05-15 cs.LG cs.AI

R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning

Sanghyeob Song, Donghyeok Lee, Jinsik Kim, Sungroh Yoon

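AI summary In data-scarce reinforcement learning settings such as real-world robotics, intensive data reuse improves efficiency but tends to cause overfitting. To address representation-level instability of Self-Predictive Learning (SPL) under high update-to-data (UTD) ratios, this paper proposes R2R2, a method that obtains robust representations through redundancy reduction. A theoretical analysis shows that standard zero-centering conflicts with SPL's spectral properties, motivating a non-centered regularization objective. Experiments show that R2R2 mitigates overfitting and improves performance on multiple continuous control tasks.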

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026). This is the camera-ready version

Abstract

For reinforcement learning in data-scarce domains like real-world robotics, intensive data reuse enhances efficiency but induces overfitting. While prior works focus on critic bias, representation-level instability in Self-Predictive Learning (SPL) under high Update-to-Data (UTD) regimes remains underexplored. To bridge this gap, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization method within SPL. We theoretically identify that standard zero-centering conflicts with SPL's spectral properties and design a non-centered objective accordingly. We verify R2R2 on SPL-native algorithms like TD7. Furthermore, to demonstrate its orthogonality to prior advancements, we extend the state-of-the-art SimbaV2, which originally lacks SPL, by integrating a tailored SPL module, termed SimbaV2-SPL. Experiments across 11 continuous control tasks confirm that R2R2 effectively mitigates overfitting; specifically, at a UTD ratio of 20, it improves TD7 by ~22% and provides additional gains on top of SimbaV2-SPL, which itself establishes a new state-of-the-art. The code can be found at: https://github.com/songsang7/R2R2

2605.14004 2026-05-15 cs.AI

Conditional Attribute Estimation with Autoregressive Sequence Models

Erica Stutz, Giacomo Marino, Daniella Meeker, Qiao Liu, Andrew J. Loza

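AI summary This paper proposes Conditional Attribute Transformers, a new method that jointly estimates the next-token probability and the value of an attribute conditional on each candidate next token. Within a single forward pass and without modifying the input sequence, it enables three capabilities: per-token credit assignment, counterfactual analysis, and steerable generation. The method performs strongly on sparse-reward tasks, improves next-token prediction at sufficient model sizes, estimates attribute probabilities far faster than sampling, and can guide generation on a range of language tasks.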

Abstract

Generative models are often trained with a next-token prediction objective, yet many downstream applications require the ability to estimate or control sequence-level properties. Next-token prediction can lead to overfitting of local patterns and underfitting of global structure during training, and it requires significant downstream modifications or expensive sampling to guide or predict the global attributes of generated samples at inference time. Here, we introduce Conditional Attribute Transformers, a novel method for jointly estimating the next-token probability and the value of an attribute conditional on each potential next-token selection. This framework enables three critical capabilities within a single forward pass, without modification of the input sequence: (1) per-token credit assignment across an entire sequence, by identifying how each token in a sequence is associated with an attribute's value; (2) counterfactual analysis, by quantifying attribute differences conditional on alternative next-token choices; (3) steerable generation, by decoding sequences based on a combination of next-token and attribute likelihoods. Our approach achieves state-of-the-art performance on sparse reward tasks, improves next-token prediction at sufficient model sizes, estimates attribute probabilities orders of magnitude faster than sampling, and can guide decoding of autoregressive sequence models on a range of language tasks.
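The abstract's third capability, decoding from a combination of next-token and attribute likelihoods, could look roughly like the sketch below. The abstract does not specify how the two are combined; the log-linear weighting used here is an assumption, as are the name `steer_next_token`, the `weight` parameter, and the toy probabilities.

```python
import math


def steer_next_token(token_probs, attr_probs, weight=1.0):
    """Pick the token maximizing log p(token) + weight * log p(attr | token).

    token_probs: dict token -> next-token probability (must be > 0).
    attr_probs: dict token -> probability that the desired attribute holds
        if that token is chosen (must be > 0).
    The log-linear combination is an assumed scoring rule for illustration.
    """
    scores = {
        tok: math.log(p) + weight * math.log(attr_probs[tok])
        for tok, p in token_probs.items()
    }
    return max(scores, key=scores.get)


# Toy distributions: "bad" is the likeliest token but rarely yields the
# desired attribute (for example, a positive-sentiment continuation).
token_probs = {"bad": 0.6, "good": 0.3, "fine": 0.1}
attr_probs = {"bad": 0.05, "good": 0.9, "fine": 0.8}

unsteered = steer_next_token(token_probs, attr_probs, weight=0.0)
steered = steer_next_token(token_probs, attr_probs, weight=1.0)
```

With `weight=0.0` the rule reduces to greedy next-token decoding and picks "bad"; with `weight=1.0` the attribute term shifts the choice to "good". Since both quantities are available per candidate token, no extra sampling of full continuations is needed to make this choice.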

2605.14002 2026-05-15 cs.AI

PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts

Yifei Zhu

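AI summary This paper introduces PolitNuggets, a multilingual benchmark for evaluating agents' ability to discover and synthesize long-tail political facts in open environments. The benchmark constructs biographies of 400 global political elites covering over 10,000 political facts, and standardizes evaluation along three dimensions (discovery, accuracy, and efficiency) using an optimized multi-agent system and the FactNet protocol. The study finds that current models struggle with fine-grained details and vary widely in efficiency, relates agent performance to underlying model capabilities, and highlights the importance of short-context extraction, multilingual robustness, and reliable tool use.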

Comments 24 pages, 7 figures, accepted at The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Abstract

Large Reasoning Models (LRMs) embedded in agentic frameworks have transformed information retrieval from static, long-context question answering into open-ended exploration. Yet real-world use requires models to discover and synthesize "long-tail" facts from dispersed sources, a capability that remains under-evaluated. We introduce PolitNuggets, a multilingual benchmark for agentic information synthesis via constructing political biographies for 400 global elites, covering over 10000 political facts. We standardize evaluation with an optimized multi-agent system and propose FactNet, an evidence-conditional protocol that scores discovery, fine-grained accuracy, and efficiency. Across models and settings, we find that current systems often struggle with fine-grained details and vary substantially in efficiency. Finally, using benchmark diagnostics, we relate agent performance to underlying model capabilities, highlighting the importance of short-context extraction, multilingual robustness, and reliable tool use.

2605.13999 2026-05-15 cs.LG

Support Before Frequency in Discrete Diffusion

Adrian Müller, Antoine Gonon, Zebang Shen, Ya-Ping Hsieh, Niao He

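AI summary This paper studies the learning mechanism of discrete diffusion models for language modeling and finds that their denoising objectives induce a hierarchy from coarse support information to finer frequency information. Through theoretical analysis and experiments, the authors prove that in the small-noise regime each single-token reverse edit decomposes into a leading scale (determined by whether the edit moves toward the data support, e.g., grammatically valid sentences) and a finer coefficient (determining relative probabilities within the same scale). The study indicates that models learn the support structure of the data first and the frequency distribution afterwards, and that this separation behaves differently in uniform and absorbing diffusion.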

English abstract

Discrete diffusion models are increasingly competitive for language modeling, yet it remains unclear how their denoising objectives organize learning. Although these objectives target the full data distribution, we show that the exact reverse process induces a hierarchy between coarse support information and finer frequency information. For uniform and absorbing (a.k.a. masking) diffusion, we prove that, in the small-noise regime of the final denoising steps, each single-token reverse edit decomposes into a leading scale, determined by whether it moves toward the data support (e.g., grammatically valid sentences), and a finer coefficient, determining relative probabilities within the same scale. Thus, recovering validity structure only requires learning the correct order of magnitude of reverse probabilities, whereas recovering data frequencies requires coefficient-level estimation. The separation is mechanism-dependent: uniform diffusion exhibits a trichotomy into validity-improving, validity-preserving, and validity-worsening edits, while absorbing diffusion places its leading-order mass on validity-improving moves. Experiments on a masked language diffusion model and synthetic regular-language tasks support these predictions: support-localization emerges earlier than within-support frequency ranking, and the contrast between uniform and absorbing diffusion matches the predicted rate separation. Together, our results suggest that discrete diffusion models learn data support before data frequencies.

2605.13997 2026-05-15 cs.LG cs.AI cs.CL

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

Tao Zhong, Dongzhe Zheng, Christine Allen-Blanchette

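AI summary This paper studies learning-free compression of sparse Mixture-of-Experts (MoE) layers and points out a structural blind spot in existing expert-merging methods: three pairwise-compatible experts may form a cycle that cannot be merged. To address this, the authors introduce the harmonic kernel of the simplicial Laplacian and propose HodgeCover, which selects experts by covering critical edges and triangles, with a hybrid variant that adds weight pruning to improve compression further. Experiments under aggressive expert reduction show that HodgeCover matches state-of-the-art learning-free baselines on expert reduction and leads at aggressive compression in the hybrid setting, striking a good balance between compression efficiency and quality.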

Comments 34 pages, 8 figures

English abstract

Sparse Mixture-of-Experts (MoE) layers route tokens through a handful of experts, and learning-free compression of these layers reduces inference cost without retraining. A subtle obstruction blocks every existing compressor in this family: three experts can each be pairwise compatible yet form an irreducible cycle when merged together, so any score that ranks experts on pairwise signals is structurally blind to which triples are jointly mergeable. We show the obstruction is a precise mathematical object, the harmonic kernel of the simplicial Laplacian on a 2-complex whose vertices are experts, whose edges carry KL merge barriers, and whose faces carry triplet barriers; Hodge-decomposing the edge-barrier signal isolates the kernel exactly. We turn the diagnostic into a selection objective: HodgeCover greedily covers the harmonic-critical edges and triplet-critical triangles, and a hybrid variant of HodgeCover pairs it with off-the-shelf weight pruning on survivors. On three open-weight Sparse MoE backbones under aggressive expert reduction, HodgeCover matches state-of-the-art learning-free baselines on the expert-reduction axis, leads on the aggressive-compression frontier of the hybrid axis, and uniquely balances retained mass across all four Hodge components. These results show that exposing the harmonic kernel of a learned MoE structure changes which compressor wins at the regime that matters most.
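
The obstruction described in this abstract can be made concrete with a toy Hodge decomposition of an edge signal. The sketch below uses a three-node complex and a circulating edge flow chosen purely for illustration; it does not use KL merge barriers or any construction from the paper, and `hodge_decompose` is a hypothetical helper name.

```python
import numpy as np

def hodge_decompose(flow, B1, B2):
    """Split an edge signal into gradient, curl, and harmonic parts.

    B1: node-by-edge incidence matrix; B2: edge-by-triangle incidence matrix.
    Requires B1 @ B2 == 0 so the three parts are mutually orthogonal.
    """
    # Projection onto im(B1^T): the part explained by per-node potentials.
    grad = B1.T @ np.linalg.lstsq(B1.T, flow, rcond=None)[0]
    # Projection onto im(B2): the part explained by filled triangles.
    if B2.shape[1] > 0:
        curl = B2 @ np.linalg.lstsq(B2, flow, rcond=None)[0]
    else:
        curl = np.zeros_like(flow)
    # Whatever remains lies in the harmonic kernel.
    harmonic = flow - grad - curl
    return grad, curl, harmonic

# Toy complex: 3 nodes, edges (0,1), (1,2), (0,2), each oriented low -> high.
B1 = np.array([[-1.0,  0.0, -1.0],
               [ 1.0, -1.0,  0.0],
               [ 0.0,  1.0,  1.0]])
flow = np.array([1.0, 1.0, -1.0])           # circulation 0 -> 1 -> 2 -> 0

hollow = np.zeros((3, 0))                   # triangle (0,1,2) absent
filled = np.array([[1.0], [1.0], [-1.0]])   # triangle (0,1,2) present

_, _, h_hollow = hodge_decompose(flow, B1, hollow)
_, _, h_filled = hodge_decompose(flow, B1, filled)
print(np.round(h_hollow, 6), np.round(h_filled, 6))
```

When the triangle is absent the circulation is entirely harmonic, which is the kind of cycle a purely pairwise score cannot see; once the triangle is filled the same signal becomes a curl component and the harmonic part vanishes.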

2605.13996 2026-05-15 cs.RO

Ergodic Imitation for Adaptive Exploration around Demonstrations

Ziyi Xu, Cem Bilaloglu, Yiming Li, Sylvain Calinon

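AI summary In robot imitation learning, mismatch between training and deployment conditions is a common challenge that can cause a robot to fail its task. This paper proposes a demonstration-grounded adaptive ergodic imitation approach that constructs a target distribution from retrieved demonstrations and generates trajectories that adaptively interpolate between tracking and exploration. The method extends ergodic control to adaptive imitation, offering a new approach to online exploration for robots in changing environments.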

Comments 4 pages, 3 figures

English abstract

In robotics, a common challenge in imitation learning is the mismatch between training and deployment conditions, caused, for example, by environmental changes or imperfect observation and control. When a robot follows a nominal trajectory under such mismatch, it may become stuck and fail to complete the task. This calls for adaptive online exploration strategies that remain grounded in demonstrations. To this end, we propose an adaptive ergodic imitation approach that constructs a target distribution from the geometry of the retrieved demonstrations and uses it to generate trajectories that adaptively interpolate between tracking and exploration. Our method extends ergodic control beyond its traditional role in area-coverage and search by incorporating demonstrations into a retrieval-based receding-horizon framework for adaptive imitation.

2605.13994 2026-05-15 cs.CV cs.AI

CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI

Xiaoyue Liu, Xiaohan Yuan, Mark Y Chan, Ching-Hui Sia, Lei Li

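AI summary This paper proposes CineMesh4D, an end-to-end 4D (3D+time) reconstruction method that generates personalized whole-heart meshes from sparse cine MRI. The method reconstructs the whole heart directly from multi-view 2D cine MRI via cross-domain mapping, introduces a differentiable rendering loss that uses multi-view sparse contours for supervision, and designs a dual-context temporal block that fuses global and local temporal information to improve reconstruction quality and motion consistency. Experiments show that CineMesh4D outperforms existing methods in reconstruction accuracy and motion coherence, offering a practical pathway to personalized real-time cardiac assessment.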

English abstract

Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty of this task arises from two coupled factors: inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.

2605.13988 2026-05-15 cs.LG quant-ph

Neural Fields for NV-Center Inverse Sensing

Zhixuan Zhao, Tao Zhong, Yixun Hu, Nathalie P. de Leon, Christine Allen-Blanchette

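AI summary This paper studies inverse problems in quantum sensing based on nitrogen-vacancy (NV) centers. To address the shortcomings of conventional methods when the forward model is nonlinear, spectrally coupled, and physically delicate, it proposes NeTMY, a new neural-field method. NeTMY couples a differentiable NV forward model with a coordinate neural field and uses positional encoding, multiscale optimization, and sparsity constraints to improve the localization of sparse sources and the reconstruction of their distribution; the paper also explains the mechanism by which the method mitigates center collapse. The work offers a new testbed for physics-faithful neural inverse problems.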

Comments 33 pages, 16 figures

English abstract

Inverse problems in scientific sensing are often solved with either hand-designed regularizers or supervised networks trained on simulated labels, yet both can fail when the forward model is nonlinear, spectrally coupled, and physically delicate. We study this issue for noise sensing based on nitrogen-vacancy (NV) centers in diamond, where a quantum sensor measures magnetic-noise spectra generated by sparse spin sources. We show that replacing a common scalar/coherent forward approximation with a tensor power-summed dipolar operator changes the inverse landscape and exposes a center-collapse failure mode in free-density optimization. We propose NeTMY, an amortization-free coordinate neural field coupled to the differentiable NV forward model, with annealed positional encoding, multiscale optimization, sparsity/gating, and spectrum-fidelity losses. Across sparse synthetic reconstructions generated by the corrected operator, NeTMY achieves the best localization and distributional metrics in the tested benchmark. Mechanism experiments show that NeTMY does not directly execute the raw density-space gradient; its parameterization smooths and redistributes updates, mitigating the center-collapse pathology. These results position NV quantum sensing as a useful testbed for physics-faithful neural inverse problems.

2605.13981 2026-05-15 cs.LG cs.AI

Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines

Katherine Lambert, Sasha Luccioni

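AI summary As deployment of large language models grows, demand for GPUs and datacenters has surged, raising concerns about electricity consumption and grid stress. This paper proposes a comprehensive energy accounting framework that quantifies the full computational cost of knowledge distillation pipelines by tracking GPU power consumption stage by stage, exposing teacher-side energy costs that conventional accounting often overlooks. The experiments compare the energy consumption and carbon emissions of two common distillation methods, construct energy-quality Pareto frontiers, and derive practical design rules for choosing distillation methods and hyperparameters under energy and budget constraints. The authors also release an open-source measurement tool and accounting protocol as a standardized basis for comparable, reproducible distillation research.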

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). 11 pages, 6 figures

English abstract

The rise in deployment of large language models has driven a surge in GPU demand and datacenter scaling, raising concerns about electricity use, grid stress, and the impacts of modern AI workloads. Distillation is often promoted as one of the most effective paths to obtain cheaper, more efficient models, yet these claims rarely account for the full end-to-end energy and resource costs, including crucial teacher-side workloads such as data generation, logit caching, and evaluation. We present a comprehensive energy accounting framework that measures the complete computational cost of distillation pipelines via detailed stage-wise tracking of GPU device power consumption. In our experiments, we separate and log empirical energy use across distinct phases and systematically measure the energy and emissions of two common distillation methods: the classic logit-based knowledge distillation and synthetic-data supervised fine-tuning, constructing energy-quality Pareto frontiers that expose the previously ignored costs. From these measurements and analyses, we derive practical design rules for selecting distillation methods and hyperparameters under energy and budget constraints, and release an open-source measurement harness and accounting protocol to provide a standardized foundation for comparable, reproducible distillation research, explicitly accountable for complete pipeline energy impact.
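
Stage-wise energy accounting of the kind described here amounts to integrating sampled device power over time for each pipeline stage and summing the results. The sketch below assumes power samples are already available as `(time_seconds, power_watts)` pairs; the stage names, sample values, and function names are hypothetical and are not taken from the authors' measurement harness.

```python
def stage_energy_wh(samples):
    """Integrate (time_seconds, power_watts) samples with the trapezoidal rule.

    Samples must be sorted by time. Returns energy in watt-hours.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3600.0  # 1 Wh = 3600 J

def pipeline_energy_wh(stages):
    """Return per-stage energy and the pipeline total, both in watt-hours."""
    per_stage = {name: stage_energy_wh(s) for name, s in stages.items()}
    return per_stage, sum(per_stage.values())

# Hypothetical power traces for teacher-side and student-side stages.
stages = {
    "teacher_data_generation": [(0, 100.0), (1800, 100.0), (3600, 100.0)],
    "student_training": [(0, 300.0), (1800, 300.0)],
}
per_stage, total = pipeline_energy_wh(stages)
print(per_stage, total)
```

Reporting the teacher-side stage separately, rather than only the student training figure, is what makes the end-to-end cost visible.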

2605.13974 2026-05-15 cs.CV cs.AI cs.MM

Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

Evelyn Turri, Davide Bucciarelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia

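AI summary This paper studies a phenomenon in Diffusion Transformers (DiTs) known as "massive activations": a small subset of hidden channels whose responses are much larger than the rest. The study finds that these few channels are functionally critical, dominating image generation quality; spatially organized, reflecting the main subject and salient regions of the image; and transferable, enabling semantic interpolation across prompts and subject-driven generation. These findings reveal a hidden, sparse mechanism of semantic control in DiT models and offer a new perspective for understanding and exploiting diffusion models.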

Comments Project page: https://aimagelab.github.io/MAs-DiT/

English abstract

Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases, text-conditioned and image-conditioned semantic transport, where massive-activation transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.

2605.13959 2026-05-15 cs.LG cs.AI cs.RO

WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

Sinjae Kang, Chanyoung Kim, Kaixin Wang, Li Zhao, Kimin Lee

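AI summary This paper proposes WarmPrior, which builds a temporally grounded prior from recent action history to replace the standard Gaussian source distribution, improving the success rate of diffusion- and flow-matching-based generative policies on robotic manipulation tasks. The gain is attributed to straighter probability paths. Beyond behavior cloning, the method also improves sample efficiency and final performance in prior-space reinforcement learning. The study identifies the design of the source distribution as an important factor in generative robot control and suggests new design directions for the field.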

English abstract

Generative policies based on diffusion and flow matching have become a dominant paradigm for visuomotor robotic control. We show that replacing the standard Gaussian source distribution with WarmPrior, a simple temporally grounded prior constructed from readily available recent action history, consistently improves success rates on robotic manipulation tasks. We trace this gain to markedly straighter probability paths, echoing the effect of optimal-transport couplings in Rectified Flow. Beyond standard behavior cloning, WarmPrior also reshapes the exploration distribution in prior-space reinforcement learning, improving both sample efficiency and final performance. Collectively, these results identify the source distribution as an important and underexplored design axis in generative robot control.

2605.13950 2026-05-15 cs.LG cs.AI hep-ex hep-ph

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, David Shih

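AI summary This paper introduces Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider using only public papers and open-source software. Each task requires the agent to build an executable simulation-and-selection pipeline and predict collision event yields in specified signal regions; evaluation uses continuous fidelity scores rather than a hand-written rubric. The study also reports the computational cost of each agent and uses an LLM judge to detect failure modes in the code. The results show that no agent yet reliably beats the physicist-in-the-loop solution.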

Comments 23 pages | 9 figures | 4 tables | Code: https://github.com/dfaroughy/Collider-Bench | Task Corpus: https://huggingface.co/datasets/Dariusfar/ColliderBench

English abstract

Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial and error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation-and-selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics that provide continuous fidelity scores without a hand-written rubric. We also report the computational cost incurred by each agent per task. Finally, we evaluate the codebase and full session trace using an LLM judge to catch qualitative failure modes such as fabrications, hallucinations, and duplications. We release an initial set of tasks drawn from LHC searches, together with a containerized sandbox and event simulation tools. We evaluate across a capability ladder of general-purpose coding agents. Our results show that on average no agent reliably beats the physicist-in-the-loop solution.

2605.13943 2026-05-15 cs.LG

A Unified Geometric Framework for Weighted Contrastive Learning

Raphael Vock, Edouard Duchesnay, Benoit Dufumier

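AI summary This paper proposes a unified geometric framework for analyzing the structure of representations in weighted contrastive learning, revealing how different weighting schemes affect embedding geometry. It interprets weighted InfoNCE objectives as Distance Geometry Problems, shows that the weighting scheme determines the target geometry, and gives exact characterizations of the optimal embeddings for several supervised and weakly supervised objectives. It also shows that under class imbalance or continuous labels, conventional contrastive methods can be geometrically inconsistent, whereas geometrically consistent weightings admit unique optimal embeddings, providing theoretical guidance for designing contrastive objectives.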

Comments Preprint

English abstract

Contrastive learning (CL) aims to preserve relational structure between samples by learning representations that reflect a similarity graph. Yet, the geometry of the resulting embeddings remains poorly understood. Here we show that weighted InfoNCE objectives can be interpreted as Distance Geometry Problems, where the weighting scheme specifies the target geometry to be realized by the representation. This viewpoint yields exact characterizations of the optimal embeddings for several supervised and weakly supervised objectives. In supervised classification, both SupCon and Soft SupCon (a dense relaxation of it where pairs from distinct classes have small non-zero similarity) collapse samples within each class to a single prototype. However, while balanced SupCon recovers the classical regular simplex geometry, class imbalance breaks this symmetry: SupCon induces non-uniform inter-class similarities depending on class sizes, whereas Soft SupCon preserves a regular simplex geometry regardless of class imbalance. In continuous-label settings, our framework reveals a different failure mode: y-Aware CL generally cannot attain its entropic optimum unless the labels lie on a hypersphere, exposing a mismatch between Euclidean label weights and spherical latent similarity. By contrast, geometrically consistent choices such as Euclidean-Euclidean weighting or X-CLR admit unique optimal embeddings. Our results show that the choice of weighting scheme determines whether contrastive learning is geometrically realizable, degenerate, or inconsistent, providing a principled framework for designing contrastive objectives.
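
For readers unfamiliar with the objective family discussed here, the sketch below shows one common way to write a weighted InfoNCE loss, with SupCon-style weights as a special case. The exact normalization, the temperature value, and the function names `weighted_infonce` and `supcon_weights` are assumptions made for illustration and may differ from the paper's definitions.

```python
import numpy as np

def weighted_infonce(z, w, temperature=0.1):
    """Weighted InfoNCE over a batch of embeddings.

    z: (n, d) embeddings with n >= 2; w: (n, n) non-negative pair weights.
    The diagonal of w is ignored (a sample is never its own positive).
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    n = z.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    logits = np.where(off_diag, z @ z.T / temperature, -np.inf)
    # Row-wise log-softmax over all other samples, computed stably.
    row_max = logits.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = np.where(off_diag, logits - log_norm, 0.0)
    return float(-(np.where(off_diag, w, 0.0) * log_prob).sum() / n)

def supcon_weights(labels):
    """SupCon-style weights: uniform over same-class samples, zero elsewhere."""
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    counts = np.maximum(same.sum(axis=1, keepdims=True), 1)  # guard singletons
    return same / counts

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 4))
loss = weighted_infonce(z, supcon_weights([0, 0, 0, 1, 1, 2]))
print(loss)
```

Changing only the weight matrix `w` changes which pairwise similarities the loss rewards, which is the sense in which the weighting scheme specifies a target geometry.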

2605.13942 2026-05-15 cs.LG cs.DC cs.NI

EMA: Efficient Model Adaptation for Learning-based Systems

Daiyang Yu, Xinyu Chen, Yihan Zhang, Yan Liang, Yaqi Qiao, Fan Lai

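AI summary This paper proposes EMA, an efficient model adaptation system that helps learning-based systems adapt quickly in heterogeneous, long-running, and dynamic environments. EMA takes a system-driven, data-centric approach: it introduces state transformers to reduce model training cost and optimizes data labeling to balance training and labeling costs. Evaluations on eight representative learning-based systems show that EMA reduces adaptation costs by 14.9-42.4% while improving system performance by 6.9-31.3%.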

Comments SIGCOMM (2026)

English abstract

Machine learning (ML) is increasingly applied to optimize system performance in tasks such as resource management and network simulation. Unlike traditional ML tasks (e.g., image classification), networked systems often operate in heterogeneous, long-running, and dynamic environment states, where input conditions (e.g., network loads) and operational objectives can shift over time and across settings. Existing learning-based systems offer little support for adaptation, resulting in costly model training, extensive data collection, degraded system performance, and slow responsiveness. This paper presents EMA, the first model adaptation system supporting learning-based systems to adapt to evolving environments with minimal operational overhead. EMA takes a system-driven, data-centric approach that accommodates diverse system and model designs while addressing two key deployment challenges. First, it reduces expensive model training by introducing state transformers that align the input state of a new environment with previously similar states, allowing models to warm-start adaptation. Second, it addresses the often-overlooked yet costly process of data labeling--collecting ground truth for exploring and training on various system decisions--by prioritizing labeling high-utility data while balancing the tradeoff between training and labeling cost. Evaluations on eight representative learning-based systems show that EMA reduces adaptation costs (e.g., GPU training time) by 14.9-42.4% while improving system performance (e.g., network throughput) by 6.9-31.3%.

2605.13941 2026-05-15 cs.LG cs.AI

EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

Jiaqi Liu, Xinyu Ye, Peng Xia, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

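AI summary This paper proposes EvolveMem, a self-evolving memory architecture for improving the long-term memory of LLM agents across multiple sessions. Through a closed-loop self-evolution process driven by a diagnosis module, the stored content and the retrieval mechanism co-evolve, so that the retrieval strategy is optimized automatically. Experiments show that EvolveMem outperforms the strongest baselines on the LoCoMo and MemBench benchmarks and that its evolved configurations transfer across benchmarks, suggesting that the process captures general retrieval principles.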

English abstract

Long-term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer-generation policies remain frozen at deployment. We argue that truly adaptive memory requires co-evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self-evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM-powered diagnosis module. In each evolution round, the module reads per-question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta-analyzer applies them with automatic revert-on-regression and explore-on-stagnation safeguards. This closed-loop self-evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self-evolution process captures universal retrieval principles rather than benchmark-specific heuristics. Code is available at https://github.com/aiming-lab/SimpleMem.