arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories 4056
2509.02510 2026-05-12 cs.CL cs.AI stat.ML

Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation

Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram

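AI summary This paper proposes a decoding method called Top-H that aims to balance creativity and coherence in open-ended text generation with large language models. The authors establish an entropy-constrained minimum-divergence framework, recast it as an entropy-constrained mass maximization problem, and design an efficient greedy algorithm to solve it. Experiments show that Top-H outperforms existing methods on creative writing tasks by up to 25.63% while maintaining coherence on question-answering tasks, which makes it practically useful.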

English abstract

Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-$p$ (nucleus) sampling, and min-$p$ sampling, aim to manage this trade-off. However, they exhibit limitations, particularly in the effective incorporation of the confidence of the model into the corresponding sampling strategy. For example, min-$p$ sampling relies on a single top token as a heuristic for confidence, eventually underutilizing the information of the probability distribution. Toward effective incorporation of the confidence of the model, in this paper, we present **top-H** decoding. We first establish the theoretical foundation of the interplay between creativity and coherence in truncated sampling by formulating an **entropy-constrained minimum divergence** problem. We then prove this minimization problem to be equivalent to an **entropy-constrained mass maximization** (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm to solve the ECMM problem. Extensive empirical evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA) alternative of min-$p$ sampling by up to **25.63%** on creative writing benchmarks, while maintaining robustness on question-answering datasets such as GPQA, GSM8K, and MT-Bench. Additionally, an *LLM-as-judge* evaluation confirms that top-H indeed produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances SoTA in open-ended text generation and can be *easily integrated* into creative writing applications. The code is available at https://github.com/ErfanBaghaei/Top-H-Decoding.
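
To make the idea of entropy-constrained truncation concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the `alpha` parameter, and the stopping rule (keep adding tokens in descending probability while the entropy of the renormalized kept set stays within `alpha` times the entropy of the full distribution) are all assumptions made for illustration. The repository linked in the abstract is the place to find the actual algorithm.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_h_filter(probs, alpha=0.4):
    """Greedy sketch of entropy-bounded truncation (assumed rule, see lead-in).

    Tokens are added in descending probability order. Growth stops as soon as
    the entropy of the renormalized kept set would exceed alpha * H(probs).
    Returns a dict mapping kept token index -> renormalized probability.
    """
    budget = alpha * entropy(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [order[0]]  # always keep the most probable token
    for idx in order[1:]:
        candidate = kept + [idx]
        mass = sum(probs[i] for i in candidate)
        if entropy([probs[i] / mass for i in candidate]) > budget:
            break
        kept = candidate
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Toy next-token distribution over five tokens (hypothetical values).
probs = [0.5, 0.3, 0.1, 0.05, 0.05]
kept = top_h_filter(probs, alpha=0.6)
```

In this sketch a larger `alpha` loosens the entropy budget and admits more tokens, while a smaller one falls back toward greedy decoding.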

2508.14685 2026-05-12 cs.CL

SSA: Improving Performance With a Better Scoring Function

Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher

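AI summary Although Transformer models excel at in-context learning, they often fail to generalize under simple distribution shifts. The paper's analysis identifies the Softmax scoring function in the attention mechanism as an important contributing factor. The authors propose a new attention scoring function, **Scaled Signed Averaging (SSA)**, which improves performance on in-context learning tasks and outperforms Softmax-based Transformer models on several NLP benchmarks and linguistic probing tasks.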

Comments ACL 2026 Main Conference

English abstract

While transformer models exhibit strong in-context learning (ICL) abilities, they often fail to generalize under simple distribution shifts. We analyze these failures and identify Softmax, the scoring function in the attention mechanism, as a contributing factor. We propose **Scaled Signed Averaging (SSA)**, a novel attention scoring function that mitigates these failures. SSA significantly improves performance on our ICL tasks and outperforms transformer models with Softmax on several NLP benchmarks and linguistic probing tasks, in both decoder-only and encoder-only architectures.

2508.13434 2026-05-12 cs.LG cs.AI

EventTSF: Event-Aware Non-Stationary Time Series Forecasting

Yunfeng Ge, Ming Jin, Yiji Zhao, Hongyan Li, Bo Du, Chang Xu, Shirui Pan

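AI summary This paper proposes EventTSF, an event-aware non-stationary time series forecasting method that aims to improve forecasting accuracy by incorporating textual event information. Built on an autoregressive diffusion framework, it integrates historical time series and textual events through step-wise diffusion and introduces an event-aware flow-matching timestep to address the imbalanced denoising difficulty that arises when conventional methods ignore event effects. Experiments show that EventTSF significantly outperforms existing baselines on multiple synthetic and real-world datasets, with average gains of 41.3% in probabilistic forecasting and 27.5% in deterministic forecasting.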

Comments Accepted by the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

English abstract

Time series forecasting is vital in diverse sectors such as energy and transportation, where non-stationary dynamics are deeply intertwined with external events in other modalities such as texts. However, incorporating natural language-based external events to improve non-stationary forecasting remains largely unexplored, as most approaches still rely on a single modality, resulting in limited contextual knowledge and model underperformance. Enabling fine-grained multimodal interactions between temporal and textual data is challenged by two fundamental issues: (1) the gap in modeling interactions among discrete external events and continuous time series in a unified framework; (2) classical uniform diffusion timestep ignores event-induced non-stationary variability, leading to imbalanced denoising difficulty across diffusion stages. In this work, we propose event-aware non-stationary time series forecasting (EventTSF), an autoregressive diffusion framework that integrates historical time series and textual events via step-wise diffusion. To mitigate the imbalanced denoising difficulty of uniform timestep sampling, EventTSF uses an event-aware flow-matching timestep conditioned on event semantics. Extensive experiments on 7 synthetic and real-world datasets show that EventTSF outperforms 12 non-stationary time series forecasting baselines, achieving average gains of 41.3% in probabilistic forecasting and 27.5% in deterministic forecasting across all evaluation metrics.

2508.11070 2026-05-12 cs.AI

Your Recourse, My Loss? Algorithmic Recourse under Shared Constraints

Zahra Khotanlou, Kate Larson, Amir-Hossein Karimi

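AI summary This paper studies algorithmic recourse when multiple stakeholders share resource constraints and proposes a many-to-many recourse framework that accounts for the trade-off between provider capacity limits and individual welfare. The authors model the problem as a capacitated weighted bipartite matching problem and design a layered optimization strategy, comprising capacitated matching, capacity redistribution, and cost-aware optimization, to improve overall social welfare while accounting for fairness. Experiments show that the method achieves near-optimal welfare allocation with minimal modification to system settings, offering a feasible path for extending algorithmic recourse from individual recommendations to system-level design.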

English abstract

Decision makers are increasingly relying on machine learning in sensitive situations. Algorithmic recourse aims to provide individuals with actionable and minimally costly steps to reverse unfavorable AI-driven decisions. While existing research focuses on single-individual (i.e., seeker) and single-model (i.e., provider) scenarios, real-world applications involve multiple stakeholders. Optimizing outcomes for seekers under an individual welfare approach overlooks the multi-agent nature of real-world systems, with competition for limited resources. Accordingly, we extend algorithmic recourse to a many-to-many setting with capacity constraints, where individually computed recourse recommendations no longer compose independently and stakeholder interactions affect recourse validity. We model this multi-agent algorithmic recourse as a capacitated weighted bipartite matching problem, based on recourse cost and provider capacity. Edge weights, reflecting recourse costs, are optimized for social welfare while quantifying the welfare gap between individual welfare and this collectively feasible outcome. We propose three optimization layers: capacitated matching, optimal capacity redistribution, and cost-aware optimization. We further model inequality-averse objectives through a concave social-welfare formulation that prioritizes the most disadvantaged seekers. Experiments demonstrate that our framework enables the many-to-many algorithmic recourse to achieve near-optimal welfare with minimum modification in system settings. Our results also show how recourse systems can be designed to balance aggregate welfare with distributive considerations. We extend algorithmic recourse from individual recommendations to system-level design, providing a tractable path toward higher social welfare while maintaining individual actionability.
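
The capacitated matching layer and the welfare gap can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the paper's method: it uses brute-force enumeration in place of a proper matching or min-cost-flow solver, assumes every seeker must be assigned to exactly one provider, and uses made-up costs and capacities. The names `capacitated_matching`, `cost`, and `capacity` are hypothetical.

```python
from itertools import permutations


def capacitated_matching(cost, capacity):
    """Brute-force minimum-cost assignment of seekers to capacity-limited providers.

    cost[s][p] is the (hypothetical) recourse cost of seeker s at provider p.
    capacity[p] is the number of seekers provider p can accept.
    Only suitable for tiny instances; a real system would use a flow solver.
    """
    # Expand each provider into one slot per unit of capacity.
    slots = [p for p, cap in enumerate(capacity) for _ in range(cap)]
    n = len(cost)
    if len(slots) < n:
        raise ValueError("total capacity is smaller than the number of seekers")
    best_total, best_assign = float("inf"), None
    for perm in permutations(range(len(slots)), n):
        total = sum(cost[s][slots[j]] for s, j in enumerate(perm))
        if total < best_total:
            best_total, best_assign = total, [slots[j] for j in perm]
    return best_total, best_assign


# Three seekers, two providers. Everyone individually prefers provider 0,
# but provider 0 can only accept one seeker.
cost = [[1, 4], [2, 6], [3, 5]]
capacity = [1, 2]

collective_total, assignment = capacitated_matching(cost, capacity)
individual_total = sum(min(row) for row in cost)   # ignores capacity
welfare_gap = collective_total - individual_total  # cost of sharing resources
```

Here the individually optimal recommendations total 6 but are jointly infeasible; the best feasible assignment costs 11, so the welfare gap in this toy instance is 5.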

2508.09042 2026-05-12 cs.CL

First, Do No Harm: AI Supervisor Scaffolds Novice Growth in Counselor Education

Chen Xu, Zhenyu Lyu, Tian Lan, Yi Yang, Yu Ji, Luyao Ji, Jian Shen, Zhihua Wang, Leyang Cui, Jieshuo Zhang, Qunxi Dong, Minqiang Yang, Juan Wang, Xiuling Liu, Bin Hu

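AI summary This paper studies how an AI supervisor can help novice counselors avoid ethical mistakes and build professional competence. The authors propose an AI supervision system that does not replace the novice; instead, it identifies ethical violations in the novice's dialogue and provides targeted feedback that guides self-improvement. To address the difficulty of labeling ethical violations and the inability of existing AI to teach, the study builds a controllable AI novice model to generate a labeled dataset and optimizes the supervisor with a reward oriented toward novice growth. Experiments show that the approach improves novices' clinical performance and self-efficacy.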

Comments 9 pages, 5 figures

English abstract

The most dangerous mistakes a novice counselor makes are not the obvious ones: they are utterances that sound caring while quietly violating professional ethics and leaving vulnerable clients less protected. We build an AI supervisor that does not replace novice counselors, but grows them, teaching them to internalize ethical violations they would otherwise never notice. What makes this supervisor non-trivial is not detection but teaching: it must locate the ethics-violating utterance, diagnose the ethical violation against APA principles, and deliver feedback that explains not just what went wrong, but why it is risky and how to respond differently. The core obstacle is that (1) ethical violations are by nature unlabeled in real clinical data, and (2) existing AI counselors trained only to match correct answers will never learn to teach. We resolve both at once: a controllable AI novice that intentionally enacts predefined mistake categories makes supervision labels a natural byproduct of generation, yielding ETHICSCAFF, a 9,915-instance human-in-the-loop dataset; and GRPO under a Novice Growth Reward (NGR) optimizes the supervisor not for answer correctness but for whether a weaker novice model actually improves after reading its explanation. Experiments show that a novice guided by our supervisor outperforms an unguided peer on clinical metrics, and that teaching-oriented optimization via NGR further sharpens the supervisor's own ethical detection. In a user study with novice counseling-psychology students, participants show significant self-efficacy gains across all eight assessed competencies after receiving AI supervisory feedback, demonstrating that the scaffold transfers from simulation to real-world practice.

2508.07697 2026-05-12 cs.LG cs.CE

Semantic-Enhanced Time-Series Forecasting via Large Language Models

Hao Liu, Xiaoxing Zhang, Chun Yang, Xiaobin Zhu

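AI summary This paper studies how to use large language models (LLMs) to improve time series forecasting, focusing on the semantic representation gap between linguistic knowledge structures and time series data patterns. It proposes a Semantic-Enhanced LLM (SE-LLM) that mines the periodicity and anomalous characteristics of time series to enrich the semantics of token embeddings, improving interpretability and temporal analysis capability. It also designs a plugin module embedded within self-attention that models both long-term dependencies and short-term anomalies. The approach reduces computational cost and outperforms existing state-of-the-art methods in experiments.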

Comments 22 pages, 6 figures

English abstract

Time series forecasting plays a significant role in finance, energy, meteorology, and IoT applications. Recent studies have leveraged the generalization capabilities of large language models (LLMs) to adapt to time series forecasting, achieving promising performance. However, existing studies focus on token-level modal alignment instead of bridging the intrinsic modality gap between linguistic knowledge structures and time series data patterns, which greatly limits the semantic representation. To address this issue, we propose a novel Semantic-Enhanced LLM (SE-LLM) that explores the inherent periodicity and anomalous characteristics of time series and embeds them into the semantic space to enhance the token embeddings. This process enhances the interpretability of tokens for LLMs, thereby activating the potential of LLMs for temporal sequence analysis. Moreover, existing Transformer-based LLMs excel at capturing long-range dependencies but are weak at modeling short-term anomalies in time-series data. Hence, we propose a plugin module embedded within self-attention that models long-term and short-term dependencies to effectively adapt LLMs to time-series analysis. Our approach freezes the LLM and reduces the sequence dimensionality of tokens, greatly reducing computational consumption. Experiments demonstrate the superior performance of our SE-LLM against state-of-the-art (SOTA) methods.

2508.04750 2026-05-12 cs.LG

PA-RNet: Perturbation-Aware Residual Network for Robust Multimodal Time Series Forecasting

Enqiang Zhu, Zhenbin Deng, Shengzhi Wang, Yi-Kun Tang, Chanjuan Liu

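AI summary In real-world applications, multimodal time series forecasting faces the challenge that textual information is useful but unreliable, and integrating text directly tends to introduce noisy signals. This paper proposes PA-RNet, a perturbation-aware residual network that refines textual and numerical features in a perturbation-aware manner before fusion, suppressing misleading information and improving forecasting robustness. Theoretical analysis shows that PA-RNet is Lipschitz continuous with respect to textual embeddings and can reduce prediction error under zero-mean textual perturbations. Experiments confirm that it outperforms existing methods across diverse scenarios and is robust to noise.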

English abstract

In real-world applications, multimodal time-series forecasting faces a key challenge: textual information is often useful but unreliable. Auxiliary texts may contain irrelevant, ambiguous, incomplete, or structurally corrupted content, making direct text integration prone to introducing noisy semantic signals and degrading forecasting performance. Therefore, robust multimodal forecasting requires a model that can exploit useful textual context while suppressing misleading perturbations. To address this challenge, we propose PA-RNet, a carefully designed perturbation-aware residual network for robust multimodal time-series forecasting. Rather than directly fusing textual and numerical representations, PA-RNet first refines multimodal features in a perturbation-aware manner, preserving stable contextual information while reducing unstable or misleading signals. The refined textual representations are then aligned with temporal dynamics, enabling more reliable forecasting under noisy multimodal conditions. Theoretically, we prove that PA-RNet is Lipschitz continuous with respect to textual embeddings and show that the proposed spectral residual correction can reduce the expected prediction error under zero-mean textual perturbations. We further conduct supplementary experiments with injected textual perturbations to examine the robustness of PA-RNet. The results across diverse domains demonstrate that PA-RNet consistently outperforms state-of-the-art baselines and maintains stable forecasting performance under both original and noise-perturbed textual conditions.

2507.11198 2026-05-12 cs.CL cs.AI

Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

Conrad Borchers, Bahar Shahrokhian, Francesco Balzan, Elham Tajik, Sreecharan Sankaranarayanan, Sebastian Simon

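AI summary This study examines how temperature and persona settings in multi-agent systems (MAS) affect consensus-building and coding accuracy when large language models (LLMs) perform qualitative coding. Experiments with six open-source LLMs under different persona and temperature settings show that temperature significantly affects when consensus is reached, and that multiple personas delay consensus in most models without significantly improving coding accuracy. The study also finds that a single LLM matches or outperforms MAS consensus in most conditions, although analyzing MAS collaboration may help improve codebook design and human-AI coding.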

Comments Accepted as full paper to the 19th International Conference on Educational Data Mining (EDM 2026)

English abstract

Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including annotation and qualitative coding of educational data. While LLM-based multi-agent systems (MAS) can emulate human coding workflows, their benefits over single LLM agents for coding remain poorly understood. To that end, we conducted an experimental study of how the persona and temperature of the component agents of a MAS shape consensus-building and coding accuracy for dialog segments. LLMs were prompted to code these segments deductively using a mature codebook with 8 codes and high inter-rater reliability derived from prior research. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions facilitated by educational software. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Qualitative analysis of MAS collaboration and coding disagreement may, however, improve codebook design and human-AI coding.

2507.01008 2026-05-12 cs.RO

DexWrist: A Robotic Wrist for Constrained and Dynamic Manipulation

Martin Peticco, Gabriella Ulloa, John Marangola, Nitish Dashora, Pulkit Agrawal

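AI summary This paper presents DexWrist, a new robotic wrist designed to improve manipulation in constrained and dynamic environments. The wrist combines quasi-direct drive actuation with a decoupled parallel mechanism to achieve precise, responsive force control in a compact design. Experiments show that DexWrist significantly raises task success rates and shortens task completion times in complex environments, providing effective hardware support for fine robotic manipulation.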

Comments 9 pages, 8 figures. Submitted to RA-L 2026

English abstract

Development of dexterous manipulation hardware has primarily focused on hands and grippers. However, these end-effectors are often paired with bulky and highly stiff wrists that limit performance in human environments. More designs have adopted backdrivable actuation, but are still difficult to model and control due to coupled kinematics or high mechanical inertia from heavy links. We present DexWrist, a robotic wrist that advances manipulation in highly constrained environments and enables dynamic, contact-rich tasks. We achieve this by combining quasi-direct drive actuation with a decoupled parallel kinematic mechanism in a compact design. It delivers 3.75 +/- 0.05 Nm rated torque, 0.33 +/- 0.06 Nm backdrive torque, 10.15 +/- 1.34 Hz torque bandwidth, +/- 40 degrees ROM in both DOFs, and a one-to-one motor-to-DOF mapping in a 0.97 kg package. In practice, these properties increase workspace in cluttered environments and stabilize contact without the need for finely tuned admittance control. We evaluate DexWrist as a drop-in wrist upgrade in simulation and on two robot arms performing representative constrained and contact-rich tasks. In learned policy evaluations, DexWrist achieved 50-76% relative improvements in success rate, and reduced autonomous task completion times by 3-5x. More details about DexWrist can be found at https://dexwrist.csail.mit.edu.

2506.14167 2026-05-12 cs.LG

Kolmogorov-Arnold Energy Models: Fast, Interpretable Generative Modeling

Prithvi Raj

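AI summary This paper proposes the Kolmogorov-Arnold Energy Model (KAEM), a generative model that aims to resolve the trade-off between efficiency and interpretability in conventional generative models. Based on an adaptation of the Kolmogorov-Arnold representation theorem, KAEM introduces a univariate latent structure that enables exact inference via the inverse transform method, and it combines importance sampling with an annealed-distribution strategy to improve the efficiency and stability of posterior inference. Experiments show that KAEM achieves better generation quality than VAEs and neural latent energy-based models on SVHN and CIFAR10 while retaining an interpretable latent space and efficient sampling.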

English abstract

Generative models typically rely on either simple latent priors (e.g., Variational Autoencoders, VAEs), which are efficient but limited, or highly expressive iterative samplers (e.g., Diffusion and Energy-based Models), which are costly and opaque. We introduce the Kolmogorov-Arnold Energy Model (KAEM) to bridge this trade-off and provide new opportunities for latent-space interpretability. Based on a novel adaptation of the Kolmogorov-Arnold Representation Theorem, KAEM imposes a univariate latent structure on the prior, enabling exact inference via the inverse transform method. With a low-dimensional latent space and appropriate inductive biases, importance sampling becomes a tractable, unbiased, and efficient posterior inference method. For settings where this fails, we propose a population-based strategy that decomposes the posterior into a sequence of annealed distributions, a new remedy for poor mixing in Energy-based Models. We compare KAEM against VAEs and the neural latent EBM architecture. KAEM attains the best Fréchet Inception Distance among latent-prior models on SVHN and CIFAR10, while sampling in a single forward pass and exposing an interpretable prior built from 1D densities.
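
For readers unfamiliar with the inverse transform method mentioned in the abstract, the sketch below shows the general technique on a single 1D density. It is a generic illustration and not KAEM itself: the grid discretization makes it an approximation rather than exact inference, and the function name `make_inverse_sampler`, the grid size, and the example density are all assumptions.

```python
import bisect
import math
import random


def make_inverse_sampler(density, lo, hi, grid_size=1000):
    """Build a sampler for an unnormalized 1D density on [lo, hi].

    The density is evaluated at grid-cell midpoints, normalized into a
    discrete CDF, and sampled by inverting that CDF at a uniform draw.
    """
    step = (hi - lo) / grid_size
    xs = [lo + (i + 0.5) * step for i in range(grid_size)]
    weights = [density(x) for x in xs]
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w / total
        cdf.append(running)

    def sample(rng):
        u = rng.random()  # uniform draw in [0, 1)
        # First grid cell whose cumulative mass reaches u (clamped for rounding).
        idx = min(bisect.bisect_left(cdf, u), grid_size - 1)
        return xs[idx]

    return sample


# Example: an unnormalized standard normal density, truncated to [-4, 4].
sampler = make_inverse_sampler(lambda x: math.exp(-0.5 * x * x), -4.0, 4.0)
rng = random.Random(0)
draws = [sampler(rng) for _ in range(2000)]
```

Each draw costs one uniform random number and one binary search, which is why a prior built from 1D densities can be sampled without an iterative sampler.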

2506.11790 2026-05-12 cs.LG cs.AI

Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation

Gregor Baer, Isel Grau, Chao Zhang, Pieter Van Gorp

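AI summary This paper studies "class-dependent evaluation effects" in time series feature attribution, where attribution methods show different performance across classes within the same dataset. Using synthetic time series with known ground-truth feature locations, the authors systematically analyze how feature types and class contrasts affect attribution evaluation and compare perturbation-based metrics with ground-truth-based metrics. They find that even in simple scenarios the two kinds of evaluation yield contradictory judgments of attribution quality, which suggests that current evaluation methods have limitations, that what they measure should be reconsidered, and that more comprehensive evaluation approaches are needed.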

Comments Accepted at TempXAI Workshop @ ECML-PKDD 2025 (Explainable AI for Time Series and Data Streams)

Journal ref
Koprinska, I., Mendes-Moreira, J., Branco, P. (eds) Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2025. Communications in Computer and Information Science, vol 2842. Springer, Cham
English abstract

Evaluating feature attribution methods represents a critical challenge in explainable AI (XAI), as researchers typically rely on perturbation-based metrics when ground truth is unavailable. However, recent work reveals that these evaluation metrics can show different performance across predicted classes within the same dataset. These "class-dependent evaluation effects" raise questions about whether perturbation analysis reliably measures attribution quality, with direct implications for XAI method development and evaluation trustworthiness. We investigate under which conditions these class-dependent effects arise by conducting controlled experiments with synthetic time series data where ground truth feature locations are known. We systematically vary feature types and class contrasts across binary classification tasks, then compare perturbation-based degradation scores with ground truth-based precision-recall metrics using multiple attribution methods. Our experiments demonstrate that class-dependent effects emerge with both evaluation approaches, even in simple scenarios with temporally localized features, triggered by basic variations in feature amplitude or temporal extent between classes. Most critically, we find that perturbation-based and ground truth metrics frequently yield contradictory assessments of attribution quality across classes, with weak correlations between evaluation approaches. These findings suggest that researchers should interpret perturbation-based metrics with care, as they may not always align with whether attributions correctly identify discriminating features. By showing this disconnect, our work points toward reconsidering what attribution evaluation actually measures and developing more rigorous evaluation methods that capture multiple dimensions of attribution quality.

2506.11419 2026-05-12 cs.AI cs.RO

FocalAD: Local Motion Planning for End-to-End Autonomous Driving

Bin Sun, Boao Zhang, Jiayi Lu, Xinjie Feng, Jiachen Shang, Rui Cao, Mengchao Zheng, Chuanye Wang, Shichun Yang, Yaoguang Cao, Ziying Song

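AI summary In end-to-end autonomous driving, motion prediction is crucial for ego-vehicle planning. Existing methods usually rely on global motion features and overlook the fact that planning decisions are influenced mainly by a small number of locally interacting agents. This paper proposes FocalAD, a framework that focuses on critical local neighbors and improves planning by enhancing local motion representations. It comprises two core modules: the graph-based Ego-Local-Agents Interactor (ELAI) and the Focal-Local-Agents Loss (FLA Loss). It performs strongly across multiple datasets and notably reduces the collision rate in robustness tests.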

Journal ref
FocalAD: Local Motion Planning for End-to-End Autonomous Driving. Automot. Innov. (2026)
English abstract

In end-to-end autonomous driving, motion prediction plays a pivotal role in ego-vehicle planning. However, existing methods often rely on globally aggregated motion features, ignoring the fact that planning decisions are primarily influenced by a small number of locally interacting agents. Failing to attend to these critical local interactions can obscure potential risks and undermine planning reliability. In this work, we propose FocalAD, a novel end-to-end autonomous driving framework that focuses on critical local neighbors and refines planning by enhancing local motion representations. Specifically, FocalAD comprises two core modules: the Ego-Local-Agents Interactor (ELAI) and the Focal-Local-Agents Loss (FLA Loss). ELAI conducts a graph-based ego-centric interaction representation that captures motion dynamics with local neighbors to enhance both ego planning and agent motion queries. FLA Loss increases the weights of decision-critical neighboring agents, guiding the model to prioritize those more relevant to planning. Extensive experiments show that FocalAD outperforms existing state-of-the-art methods on the open-loop nuScenes dataset and the closed-loop Bench2Drive benchmark. Notably, on the robustness-focused Adv-nuScenes dataset, FocalAD achieves even greater improvements, reducing the average collision rate by 41.9% compared to DiffusionDrive and by 15.6% compared to SparseDrive.

2506.05967 2026-05-12 cs.AI cs.LG stat.ML

Preference Learning for AI Alignment: a Causal Perspective

Katarzyna Kobalczyk, Mihaela van der Schaar

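AI summary This paper examines reward modelling from preference data through a causal lens, with the goal of better aligning large language models with human values. It identifies key challenges such as causal misidentification, preference heterogeneity, and confounding due to user-specific factors, and it draws on the causal inference literature to state the assumptions needed for reliable generalisation. By analysing failure modes of naive reward models, it shows how causally inspired approaches can improve robustness, and it outlines directions for future research and practice.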

Journal ref
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025
English abstract

Reward modelling from preference data is a crucial step in aligning large language models (LLMs) with human values, requiring robust generalisation to novel prompt-response pairs. In this work, we propose to frame this problem in a causal paradigm, providing the rich toolbox of causality to identify the persistent challenges, such as causal misidentification, preference heterogeneity, and confounding due to user-specific factors. Inheriting from the literature of causal inference, we identify key assumptions necessary for reliable generalisation and contrast them with common data collection practices. We illustrate failure modes of naive reward models and demonstrate how causally-inspired approaches can improve model robustness. Finally, we outline desiderata for future research and practices, advocating targeted interventions to address inherent limitations of observational data.

2505.23617 2026-05-12 cs.CV cs.AI cs.GR cs.LG

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

Chenhao Zheng, Jieyu Zhang, Mohammadreza Salehi, Ziqi Gao, Vishnu Iyengar, Norimasa Kobori, Quan Kong, Ranjay Krishna

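AI summary This paper proposes a video tokenization method based on panoptic sub-object trajectories to address the redundant tokens and computational inefficiency that space-time patch tokenization causes on long videos. The method organizes video content into object trajectories and produces semantic tokens, reducing the token count while maintaining temporal coherence. The proposed TrajViT model significantly outperforms existing methods on multiple video understanding tasks, with higher performance at lower computational cost.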

Comments ICCV 2025

English abstract

Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% top-5 recall on average on the video-text retrieval task with 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average performance improvement of 5.2% across 6 VideoQA benchmarks with 4x faster training and 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.

2505.22919 2026-05-12 cs.CL

ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room

Nikita Mehandru, Niloufar Golchini, Namrata Garg, Kathy T. LeSaint, Christopher J. Nash, Anu Ramachandran, Travis Zack, Liam G. McCoy, Adam Rodman, David Bamman, Melanie Molina, Ahmed Alaa

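AI summary This paper introduces ER-Reason, a benchmark dataset for evaluating the clinical reasoning of large language models in the emergency department. Built from real clinical notes, it covers every stage of the emergency workflow and uses stepwise reasoning questions scored against physician annotations to assess diagnostic reasoning as evidence accumulates. Compared with existing benchmarks, ER-Reason gives a more fine-grained view of how LLM reasoning fails on real patient cases.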

English abstract

Existing benchmarks for evaluating the clinical reasoning capabilities of large language models (LLMs) often lack a clear definition of "clinical reasoning" as a construct, fail to capture the full breadth of interdependent tasks within a clinical workflow, and rely on stylized vignettes rather than real-world clinical documentation. As a result, recent studies have found significant discrepancies between LLM performance on stylized benchmarks derived from medical licensing exams and their performance in real-world prospective studies. To address these limitations, we introduce ER-Reason, a benchmark designed to evaluate LLM reasoning as clinical evidence accumulates across decision-making tasks spanning the full workflow of emergency medicine. ER-Reason comprises 25,174 de-identified clinical notes from 3,437 patients, supporting evaluation across all stages of the emergency department workflow: triage intake, treatment selection, disposition planning, and final diagnosis. Crucially, evaluation in ER-Reason extends beyond diagnostic accuracy to include stepwise Script Concordance Test (SCT)-style questions grounded in real patient cases, which assess whether LLMs update their diagnostic beliefs in the correct direction and magnitude as clinical evidence accumulates, scored against 2,555 emergency physician annotations. We evaluate reasoning and non-reasoning LLMs on ER-Reason, and show that our tasks provide a more nuanced view of how LLM reasoning fails on real patient cases than existing benchmarks allow.

2505.19525 2026-05-12 cs.LG cs.AI

Rethinking Gating Mechanism in Sparse MoE: Handling Arbitrary Modality Inputs with Confidence-Guided Gate

Liangwei Nathan Zheng, Wei Emma Zhang, Mingyu Guo, Olaf Maennel, Weitong Chen

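AI summary Handling missing modalities effectively is a key challenge in real-world multimodal learning. This paper proposes ConfSMoE, which adds a two-stage imputation module and a confidence-guided gating mechanism to improve how Sparse Mixture-of-Experts (SMoE) architectures handle missing modalities and to relieve expert collapse. The method requires no additional load-balancing loss, and theoretical analysis and experiments on multiple real-world datasets indicate good robustness and generalization.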

English abstract

Effectively managing missing modalities is a fundamental challenge in real-world multimodal learning scenarios, where data incompleteness often results from systematic collection errors or sensor failures. Sparse Mixture-of-Experts (SMoE) architectures have the potential to naturally handle multimodal data, with individual experts specializing in different modalities. However, existing SMoE approaches often lack the ability to handle missing modalities, leading to performance degradation and poor generalization in real-world applications. We propose ConfSMoE, which introduces a two-stage imputation module that handles the missing-modality problem for the SMoE architecture by taking the opinion of experts, and we reveal insights into expert collapse through theoretical analysis supported by strong empirical evidence. Inspired by our theoretical analysis, ConfSMoE proposes a novel expert gating mechanism by detaching the softmax routing score to a task confidence score w.r.t. the ground-truth signal. This naturally relieves expert collapse without introducing an additional load-balancing loss function. We show that the insights into expert collapse align with other gating mechanisms such as the Gaussian and Laplacian gates. The proposed method is evaluated on four real-world datasets under three distinct experimental settings to comprehensively analyze ConfSMoE's resistance to missing modalities and the impact of the proposed gating mechanism.

2505.18269 2026-05-12 cs.LG math.OC math.PR stat.ML

Representative Action Selection for Large Action Space Bandit Families

Quan Zhou, Mark Kozdoba, Shie Mannor

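AI summary This paper studies how to select a representative subset of actions from multiple bandit problems that share an action space. In practice, although the action space is large, the rewards of different actions are highly correlated across environments, so not all actions need to be retained. The authors propose a simple algorithm that repeatedly samples a bandit instance at random, solves it, and collects its optimal action, which can substantially reduce the action space. The method requires no prior knowledge of the correlation structure, comes with theoretical performance guarantees, and outperforms several baselines in experiments.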

English abstract

We study the problem of selecting a subset from a large action space shared by a family of bandits. In many natural situations, while the nominal set of actions is large, actions are highly correlated: many yield similar rewards across environments, making it wasteful to maintain the full set. Our aim is to understand whether it is possible -- and how -- to select a smaller set of representative actions that performs nearly as well as the full action space. Our main contribution is a surprisingly simple algorithm: repeatedly sample a bandit instance at random, solve it, and collect the optimal action. This algorithm can significantly reduce the action space when such correlations are present, without the need to know the correlation structure a priori. We provide theoretical guarantees on the performance of the algorithm and demonstrate its practical effectiveness through empirical comparisons with Combinatorial Bandit, Meta Learning Bandit and Zooming baselines.
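
The sample-solve-collect loop described in the abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: "solving" an instance is taken to mean reading off the arg-max of known mean rewards, and the bandit family (`toy_instance`, with three hypothetical cluster centers) is invented for the example. The paper's setting and guarantees are more general.

```python
import random


def select_representatives(sample_instance, num_samples, rng):
    """Repeatedly sample a bandit instance, solve it, and collect its optimal action.

    sample_instance(rng) must return a list of mean rewards, one per action.
    Solving is assumed to be exact here: the optimal action is the arg-max.
    """
    chosen = set()
    for _ in range(num_samples):
        mean_rewards = sample_instance(rng)
        best = max(range(len(mean_rewards)), key=mean_rewards.__getitem__)
        chosen.add(best)
    return chosen


def toy_instance(rng, num_actions=100):
    """Hypothetical bandit family: rewards peak at one of three correlated centers."""
    center = rng.choice([10, 50, 90])
    return [-abs(action - center) for action in range(num_actions)]


reps = select_representatives(toy_instance, num_samples=50, rng=random.Random(0))
```

In this toy family only three of the 100 actions are ever optimal, so the collected set stays small no matter how many instances are sampled, and the correlation structure never has to be specified in advance.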

2505.18091 2026-05-12 cs.LG cs.AI cs.CL

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Xinran Gu, Kaifeng Lyu, Jiazheng Li, Jingzhao Zhang

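AI summary This study examines phase transitions that can arise in knowledge acquisition when large language models are trained on data mixtures. It finds that when the model size or the mixing ratio reaches a critical value, the model's grasp of knowledge-dense data changes abruptly, showing discontinuous, phase-transition-like behavior. Through theoretical analysis and experiments, the authors attribute this to a capacity allocation problem and propose an information-theoretic framework that explains and predicts the transitions, revealing a power-law relationship between the critical mixing ratio and model size. The findings matter for understanding data mixing strategies in model training.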

Comments NeurIPS'25 Spotlight

Details
English abstract

Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.

2505.11604 2026-05-12 cs.CL

Talk to Your Slides: High-Efficiency Slide Editing via Language-Driven Structured Data Manipulation

Kyudan Jung, Hojun Cho, Jooyeol Yun, Soyoung Yang, Jaehyeok Jang, Jaegul Choo

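AI summary This paper proposes Talk-to-Your-Slides, a high-efficiency slide editing agent that modifies slide content precisely through language-driven structured data manipulation, avoiding the high computational cost and latency that image-based approaches incur on text-heavy tasks. The method operates on the slides' underlying object model rather than screen pixels, which preserves style consistency and improves editing efficiency. Experiments show that on text-centric and formatting tasks it is 34% faster than GUI-based methods, improves instruction fidelity by 34%, and cuts cost by 87%. The paper also introduces TSBench, a benchmark of 379 human-verified instructions for evaluating system performance.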

Comments 30 pages, Accepted at ACL2026

Details
English abstract

Editing presentation slides is a frequent yet tedious task, ranging from creative layout design to repetitive text maintenance. While recent GUI-based agents powered by Multimodal LLMs (MLLMs) excel at tasks requiring visual perception, such as spatial layout adjustments, they often incur high computational costs and latency when handling structured, text-centric, or batch processing tasks. In this paper, we propose Talk-to-Your-Slides, a high-efficiency slide editing agent that operates via language-driven structured data manipulation rather than relying on the image modality. By leveraging the underlying object model instead of screen pixels, our approach ensures precise content modification while preserving style fidelity, addressing the limitations of OCR-based visual agents. Our system features a hierarchical architecture that effectively bridges high-level user instructions with low-level execution codes. Experiments demonstrate that for text-centric and formatting tasks, our method enables 34% faster processing, achieves 34% better instruction fidelity, and operates at an 87% lower cost compared to GUI-based baselines. Furthermore, we introduce TSBench, a human-verified benchmark dataset comprising 379 instructions, including a Hard subset designed to evaluate robustness against complex and visually dependent queries. Our code and benchmark are available at https://github.com/KyuDan1/Talk-to-Your-Slides.

2505.07027 2026-05-12 cs.AI cs.CL cs.LG cs.NE physics.chem-ph

LLM-Augmented Chemical Synthesis and Design Decision Programs

Haorui Wang, Jeff Guo, Lingkai Kong, Rampi Ramprasad, Philippe Schwaller, Yuanqi Du, Chao Zhang

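AI summary This paper studies how large language models (LLMs) can augment chemical synthesis route planning and molecular design decisions. To address the combinatorial search space that limits traditional machine learning methods in multi-step retrosynthesis planning, the authors propose an efficient scheme for encoding reaction pathways and a new route-level search strategy that moves beyond conventional step-by-step reactant prediction. Experiments show that the LLM-augmented approach performs well on retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.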

Comments ICML 2025

Details
English abstract

Retrosynthesis, the process of breaking down a target molecule into simpler precursors through a series of valid reactions, stands at the core of organic chemistry and drug development. Although recent machine learning (ML) research has advanced single-step retrosynthetic modeling and subsequent route searches, these solutions remain restricted by the extensive combinatorial space of possible pathways. Concurrently, large language models (LLMs) have exhibited remarkable chemical knowledge, hinting at their potential to tackle complex decision-making tasks in chemistry. In this work, we explore whether LLMs can successfully navigate the highly constrained, multi-step retrosynthesis planning problem. We introduce an efficient scheme for encoding reaction pathways and present a new route-level search strategy, moving beyond the conventional step-by-step reactant prediction. Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.

2505.06835 2026-05-12 cs.LG stat.CO stat.ME stat.ML

Streaming Sliced Optimal Transport

Khai Nguyen

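AI summary This paper proposes Streaming Sliced Wasserstein (Stream-SW), a method for estimating the sliced Wasserstein distance from streaming data, aimed at improving the computational efficiency and memory footprint of sliced optimal transport. The method builds on a streaming estimator of the one-dimensional Wasserstein distance combined with quantile approximation techniques, enabling efficient processing of sample streams. Experiments show that, compared with random subsampling, Stream-SW approximates the sliced Wasserstein distance more accurately with lower memory consumption, and performs favorably on point cloud classification, gradient flows, and streaming change point detection.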

Comments Accepted to ICML 2026, 22 pages, 8 figures, 7 tables

Details
English abstract

Sliced optimal transport (SOT), or sliced Wasserstein (SW) distance, is widely recognized for its statistical and computational scalability. In this work, we further enhance computational scalability by proposing the first method for estimating SW from sample streams, called streaming sliced Wasserstein (Stream-SW). To define Stream-SW, we first introduce a streaming estimator of the one-dimensional Wasserstein distance (1DW). Since the 1DW has a closed-form expression, given by the integral of the absolute difference between the quantile functions of the compared distributions, we leverage quantile approximation techniques for sample streams to define a streaming 1DW estimator. By applying the streaming 1DW to all projections, we obtain Stream-SW. The key advantage of Stream-SW is its low memory complexity while providing theoretical guarantees on the approximation error. We demonstrate that Stream-SW achieves a more accurate approximation of SW than random subsampling, with lower memory consumption, when comparing Gaussian distributions and mixtures of Gaussians from streaming samples. Additionally, we conduct experiments on point cloud classification, point cloud gradient flows, and streaming change point detection to further highlight the favorable performance of the proposed Stream-SW.
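The closed form mentioned in this abstract, the integral of the absolute difference between quantile functions, can be illustrated with a small batch sketch. Note the assumptions: this version sorts all samples in memory and uses exact empirical quantiles, so it is not the streaming estimator the paper proposes; a streaming variant would swap `empirical_quantile` for a quantile sketch. The function names, grid size, and projection count are illustrative choices.

```python
import math
import random


def empirical_quantile(sorted_xs, t):
    """Left-continuous empirical quantile F^{-1}(t) for t in (0, 1]."""
    n = len(sorted_xs)
    idx = min(n - 1, max(0, math.ceil(t * n) - 1))
    return sorted_xs[idx]


def wasserstein_1d(xs, ys, p=1, grid=1000):
    """Approximate W_p via a midpoint rule over |F^{-1}(t) - G^{-1}(t)|^p."""
    xs, ys = sorted(xs), sorted(ys)
    total = 0.0
    for k in range(grid):
        t = (k + 0.5) / grid
        total += abs(empirical_quantile(xs, t) - empirical_quantile(ys, t)) ** p
    return (total / grid) ** (1.0 / p)


def sliced_wasserstein(X, Y, num_projections=50, p=1, seed=0):
    """Monte Carlo sliced Wasserstein: average 1D distances over random directions."""
    rng = random.Random(seed)
    dim = len(X[0])
    acc = 0.0
    for _ in range(num_projections):
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in theta))
        theta = [v / norm for v in theta]          # uniform direction on the sphere
        proj_x = [sum(a * b for a, b in zip(x, theta)) for x in X]
        proj_y = [sum(a * b for a, b in zip(y, theta)) for y in Y]
        acc += wasserstein_1d(proj_x, proj_y, p) ** p
    return (acc / num_projections) ** (1.0 / p)
```

For example, `wasserstein_1d([0, 1, 2], [1, 2, 3])` measures a pure shift by one unit, and `sliced_wasserstein` of a point set with itself is zero.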

2505.05740 2026-05-12 cs.LG

Deep-ICE: the first globally optimal algorithm for minimizing 0-1 loss in two-layer ReLU and maxout networks

Xi He, Yi Miao, Max A. Little

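AI summary This paper proposes the first algorithm that globally optimally solves 0-1 loss minimization, i.e., minimizing the number of misclassifications, for two-layer ReLU and maxout networks. The algorithm has a worst-case time complexity of $O\left(N^{DK+1}\right)$ and generalizes to arbitrary computable loss functions without affecting this complexity. Experiments show that it provides provably exact solutions on small-scale datasets and, with a new coreset selection method, handles large-scale datasets efficiently, reducing misclassifications in both training and prediction by 20-30% compared with existing methods.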

Details
English abstract

This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks, i.e., minimizing the number of misclassifications. The algorithm has a worst-case time complexity of $O\left(N^{DK+1}\right)$, where $K$ denotes the number of hidden neurons and $D$ represents the number of features. It can be generalized to accommodate arbitrary computable loss functions without affecting its computational complexity. Our experiments demonstrate that the proposed algorithm provides provably exact solutions for small-scale datasets. To handle larger datasets, we introduce a novel coreset selection method that reduces the data size to a manageable scale, making it feasible for our algorithm. This extension enables efficient processing of large-scale datasets and achieves significantly improved performance, with a 20-30% reduction in misclassifications for both training and prediction, compared to state-of-the-art approaches (neural networks trained using gradient descent and support vector machines), when applied to the same models (two-layer networks with fixed hidden nodes and linear models). The artifacts of the Deep-ICE algorithm can be found at https://github.com/XiHegrt/DeepICE-algorithm-artifacts.

2505.05707 2026-05-12 cs.LG cs.CR

Crowding Out The Noise: Algorithmic Collective Action Under Differential Privacy

Rushabh Solanki, Meghana Bhange, Ulrich Aïvodji, Elliot Creager

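AI summary This paper studies the ability of groups of users to influence an AI system's learning process through algorithmic collective action under differential privacy. The core approach analyzes how differentially private stochastic gradient descent (DP-SGD) affects the effectiveness of collective action, establishing the trade-off between privacy protection and collective success through theoretical analysis and experiments. The main contribution is to show how the privacy parameters and the collective's size affect the effectiveness of algorithmic collective action, complemented by an economic analysis of how privacy costs influence the formation of collectives.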

Details
English abstract

The integration of AI into daily life has generated considerable attention and excitement, while also raising concerns about automating algorithmic harms and re-entrenching existing social inequities. While the responsible deployment of trustworthy AI systems is a worthy goal, there are many possible ways to realize it, from policy and regulation to improved algorithm design and evaluation. In fact, since AI trains on social data, there is even a possibility for everyday users, citizens, or workers to directly steer the AI system's behavior through Algorithmic Collective Action, by deliberately modifying the data they share with a platform to drive its learning process in their favor. This paper considers how these grassroots efforts to influence AI interact with methods used by AI firms and governments to improve model trustworthiness. In particular, we focus on the setting where the AI firm deploys a differentially private model, motivated by the growing regulatory focus on privacy and data protection. We investigate how the use of Differentially Private Stochastic Gradient Descent (DP-SGD) affects the collective's ability to influence the learning process. Our findings show that while differential privacy protects individual data, it introduces challenges for effective algorithmic collective action. We establish this trade-off formally by characterizing lower bounds on the success of algorithmic collective action under differential privacy as a function of the collective's size and the firm's privacy parameters. We then verify these trends experimentally by simulating collective action during the training of deep neural network classifiers across several datasets. Finally, we perform a stylized economic analysis of privacy costs to integrate additional incentives, analyzing how utility and participation costs influence the formation of collectives under private training regimes.
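For readers unfamiliar with DP-SGD, the following sketch shows one update step in its commonly described form: clip each per-example gradient to a fixed L2 norm, sum, add Gaussian noise scaled to that norm, and average. It is an illustration only; the function name and parameters are hypothetical, and it does not reproduce the paper's experimental setup or its bounds.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on a flat parameter list."""
    dim = len(params)
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(v * v for v in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += grad[i] * scale
    sigma = noise_multiplier * clip_norm   # noise std is tied to the clipping bound
    batch_size = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]
```

One informal way to read the trade-off in the abstract: clipping caps how much any single participant's modified data can move the model, and the added noise can further mask a small collective's signal. This is an intuition offered here, not the paper's formal argument.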

2505.05209 2026-05-12 cs.CV

EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution

Haizhen Xie, Kunpeng Du, Qiangyu Yan, Sen Lu, Jianhong Han, Hanting Chen, Hailin Hu, Jie Hu

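AI summary This paper proposes EAM, a blind super-resolution method based on Diffusion Transformers (DiT), aimed at improving image super-resolution performance. The method introduces a new $Ψ$-DiT block that uses a triple-flow architecture to exploit the priors of a pre-trained DiT, combined with a progressive Masked Image Modeling strategy and a subject-aware prompt generation strategy, improving generalization and training efficiency. Experiments show that EAM outperforms existing methods on multiple datasets in both quantitative metrics and visual quality.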

Comments Revision of Section 4.1

Details
English abstract

Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, $Ψ$-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.

2504.12334 2026-05-12 cs.CL

QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model

Zongxian Yang, Jiayu Qian, Kay Chen Tan, Hau-San Wong, Yulong Chen, Haoyu Zhang, Zhi-An Huang

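AI summary This paper proposes QM-ToT, a quantized medical Tree of Thought reasoning framework that addresses the performance drop of large language models on medical tasks after quantization. The method uses Tree of Thought (ToT) reasoning to decompose complex medical problems into subtasks, coupled with evaluator assessment layers, which substantially improves quantized models on medical datasets. Experiments show substantial gains on the MedQAUSMLE dataset, and the paper also proposes a ToT-based data distillation method that outperforms traditional distillation while using only a small amount of data.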

Comments Accepted by ICIC 2026 Poster

Details
English abstract

Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to performance reduction when quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework facilitates substantial performance improvements in INT4-quantized models on the challenging MedQAUSMLE dataset. Specifically, we demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. In addition, we propose an effective data distillation method based on ToT. Compared to the traditional distillation method, we achieve an improvement of 86.27% while using only 3.9% of the data. This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLMs in resource-limited medical settings.

2504.04991 2026-05-12 cs.RO

Wavelet Policy: Imitation Learning in the Scale Domain with World Prior Memory

Changchuan Yang, Yuhang Dong, Guanzhong Tian, Haizhou Ge, Hongrui Zhu

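AI summary This paper proposes Wavelet Policy, a lightweight imitation learning framework that combines World Prior Memory (WPM) with wavelet-domain multi-scale action modeling to improve scene awareness and action planning in long-horizon robot tasks. The method encodes persistent physical structure from static background images into compact memory tokens that are injected into the encoder during forward propagation, strengthening the policy's understanding of the environment. It also decomposes action tokens into multiple scales with a wavelet transform, models them with a single-encoder multiple-decoder architecture, and produces executable actions through an inverse wavelet transform. Experiments on multiple simulated and real-world robot tasks show that it outperforms strong existing methods, supporting its effectiveness and efficiency for long-horizon embodied manipulation.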

Details
English abstract

Conventional visuomotor imitation learning usually predicts future robot actions directly in the time domain. Such formulations often have limited physical scene awareness and weak long-horizon memory. In contrast, world-model-based perception and memory-augmented policies can improve world awareness, but with substantial computational overhead. In this work, we propose Wavelet Policy, a lightweight imitation learning framework that combines World Prior Memory (WPM) with wavelet-based multi-scale action modeling. Our key idea is to encode persistent physical scene structure from static background images into compact memory tokens, which are fused into world-prior tokens and injected into the encoder during forward propagation. Based on this memory-conditioned representation, we further perform wavelet-domain decomposition over horizon-aligned latent action tokens and adopt a Single-Encoder Multiple-Decoder (SE2MD) architecture to model latent components at different temporal scales. The resulting latent subbands are reconstructed through inverse wavelet transform and finally projected into executable action chunks. To facilitate efficient world prior learning, we introduce a world-prior adaptation loss, encouraging the background encoder to retain persistent scene knowledge while remaining lightweight and stable. Extensive experiments on four simulated and six real-world robotic manipulation tasks show that Wavelet Policy consistently outperforms strong baselines. These results demonstrate that combining scale-domain action modeling with world-prior memory provides an effective and efficient solution for long-horizon embodied manipulation. We release the source code, data, and model checkpoints for the simulation tasks at https://github.com/lurenjia384/Wavelet_Policy.
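The decompose-then-reconstruct idea can be illustrated on a plain scalar sequence. The sketch below uses the orthonormal Haar wavelet, which is an assumption: the abstract does not name the wavelet family, and the paper applies the transform to latent action tokens inside a learned model rather than to raw numbers. Function names are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)


def haar_forward(signal):
    """One Haar level: split an even-length sequence into approximation and detail."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert one Haar level by interleaving the recovered sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


def haar_decompose(signal, levels):
    """Multi-level decomposition; length must be divisible by 2**levels."""
    current, subbands = list(signal), []
    for _ in range(levels):
        current, detail = haar_forward(current)
        subbands.append(detail)            # details are stored fine to coarse
    return current, subbands


def haar_reconstruct(approx, subbands):
    """Rebuild the sequence from the coarsest approximation and all detail subbands."""
    current = list(approx)
    for detail in reversed(subbands):      # undo levels from coarse to fine
        current = haar_inverse(current, detail)
    return current
```

Each subband captures variation at one temporal scale, so a model could handle slow trends and fast corrections separately before the inverse transform merges them back into one sequence.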

2504.02450 2026-05-12 cs.RO cs.AI cs.LG

CHARMS: A Cognitive Hierarchical Agent for Reasoning and Motion Stylization in Autonomous Driving

Jingyi Wang, Duanfeng Chu, Zejian Deng, Liping Lu, Jinxiang Wang, Chen Sun

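AI summary To address insufficient interactivity and limited behavioral diversity in autonomous driving decision-making, this paper proposes CHARMS, a cognitive hierarchical agent. By combining Level-k game theory with a two-stage training pipeline (reinforcement learning pretraining followed by supervised fine-tuning), CHARMS emulates human reasoning patterns and exhibits diverse, human-like driving behaviors. CHARMS also uses Poisson cognitive hierarchy theory to build a scenario generation framework that produces vehicle distributions with different driving styles. Experiments show that CHARMS performs well both in driving decision-making and in generating scenarios of environment vehicles.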

Details
Journal ref ITSC 2025
English abstract

To address the challenge of insufficient interactivity and behavioral diversity in autonomous driving decision-making, this paper proposes a Cognitive Hierarchical Agent for Reasoning and Motion Stylization (CHARMS). By leveraging Level-k game theory, CHARMS captures human-like reasoning patterns through a two-stage training pipeline comprising reinforcement learning pretraining and supervised fine-tuning. This enables the resulting models to exhibit diverse and human-like behaviors, enhancing their decision-making capacity and interaction fidelity in complex traffic environments. Building upon this capability, we further develop a scenario generation framework that utilizes the Poisson cognitive hierarchy theory to control the distribution of vehicles with different driving styles through Poisson and binomial sampling. Experimental results demonstrate that CHARMS is capable of both making intelligent driving decisions as an ego vehicle and generating diverse, realistic driving scenarios as environment vehicles. The code for CHARMS is released at https://github.com/chuduanfeng/CHARMS.
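In Poisson cognitive hierarchy theory, the share of level-k reasoners follows a Poisson distribution with mean τ. The sketch below shows how such a distribution could assign reasoning levels to environment vehicles. It is a hedged illustration: the truncation at `max_level`, the function names, and the direct categorical sampling are assumptions, and the binomial sampling step mentioned in the abstract is not reproduced because the abstract does not describe it.

```python
import math
import random


def poisson_level_probs(tau, max_level):
    """Poisson weights over levels 0..max_level, renormalized after truncation."""
    weights = [math.exp(-tau) * tau ** k / math.factorial(k)
               for k in range(max_level + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_vehicle_levels(num_vehicles, tau, max_level, rng):
    """Draw one reasoning level per vehicle from the truncated Poisson."""
    probs = poisson_level_probs(tau, max_level)
    return rng.choices(range(max_level + 1), weights=probs, k=num_vehicles)


rng = random.Random(0)
levels = sample_vehicle_levels(num_vehicles=20, tau=1.0, max_level=2, rng=rng)
print(levels)
```

Raising τ shifts probability mass toward higher reasoning levels, which is one way a scenario generator could control the mix of driving styles.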

2503.14434 2026-05-12 cs.LG cs.AI cs.CL cs.NE

LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers

Nikhil Abhyankar, Parshin Shojaee, Chandan K. Reddy

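AI summary This paper proposes LLM-FE, an automated feature engineering framework based on large language models (LLMs) that improves predictive performance on tabular learning tasks. The method treats feature engineering as a program search problem, using the LLM's domain knowledge and reasoning ability to generate feature transformation programs and refining them iteratively with data-driven feedback. Experiments show that LLM-FE significantly outperforms existing methods on multiple classification and regression benchmarks, improving tabular prediction models.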

Comments Accepted in Transactions on Machine Learning Research (TMLR)

Details
English abstract

Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks. The code is available at: https://github.com/nikhilsab/LLMFE

2503.09336 2026-05-12 cs.CV

Stealthy Patch-Wise Backdoor Attack in 3D Point Cloud via Curvature Awareness

Yu Feng, Dingxin Zhang, Runkai Zhao, Yong Xia, Heng Huang, Weidong Cai

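AI summary This paper proposes SPBA, a stealthy patch-wise backdoor attack on 3D point cloud models. It partitions a point cloud into patches, uses local curvature variation to select the patches where changes are least perceptible, and injects the backdoor trigger there, implanting the backdoor without noticeably changing the point cloud's structure. Compared with conventional sample-wise triggers, the method greatly reduces computational overhead and improves stealthiness, with strong results on multiple benchmark datasets.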

Comments 12 pages, 6 figures, 11 tables

Details
English abstract

Backdoor attacks pose a severe threat to deep neural networks (DNNs) by implanting hidden backdoors that can be activated with predefined triggers to manipulate model behaviors maliciously. Recent studies have extended backdoor attacks to 3D point clouds, but most existing triggers are sample-wise and often cause visible geometric artifacts or high optimization cost. To address these limitations, we propose the Stealthy Patch-Wise Backdoor Attack (SPBA), a patch-wise backdoor attack framework for 3D point clouds. Specifically, SPBA decomposes a point cloud into local patches, where each patch is formed by a Farthest Point Sampling (FPS) center and its K-nearest neighbors (KNN). Candidate patches are ranked using a patch imperceptibility score derived from local curvature variation, and a unified spectral trigger is injected into the selected patches by perturbing only the coordinates of existing points while preserving the original point cardinality. Extensive experiments on ModelNet40 and ShapeNetPart further demonstrate that SPBA achieves state-of-the-art stealthiness among prior methods and reduces spectral-trigger computation by 98.43% relative to a sample-wise spectral baseline, while maintaining competitive attack performance. These results support localized spectral design as an effective and efficient approach to stealthy backdoor attacks in 3D point cloud models. Code is available at https://github.com/HazardFY/SPBA.
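The patch construction step, a Farthest Point Sampling center plus its K nearest neighbors, can be sketched as follows. This covers only the decomposition into patches; the curvature-based ranking and the spectral trigger are not shown. The function names, the choice of starting index, and the convention that a patch includes its own center are assumptions made for the example.

```python
import math


def farthest_point_sampling(points, num_centers, start=0):
    """Greedily pick indices that are farthest from all centers chosen so far."""
    centers = [start]
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(centers) < num_centers:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return centers


def knn_patch(points, center_idx, k):
    """Indices of the k points nearest to the center (the center itself included)."""
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(points[i], points[center_idx]))
    return order[:k]


def build_patches(points, num_patches, k):
    """One KNN patch around each FPS center."""
    centers = farthest_point_sampling(points, num_patches)
    return [knn_patch(points, c, k) for c in centers]
```

For example, on five collinear points clustered near x = 0 and x = 10, two patches of size two pick out one pair from each cluster.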

2503.06629 2026-05-12 cs.LG cs.AI eess.SP

Hardware-Accelerated Event-Graph Neural Networks for Low-Latency Time-Series Classification on SoC FPGA

Hiroshi Nakano, Krzysztof Blachut, Kamil Jeziorek, Piotr Wzorek, Manon Dampfhoffer, Thomas Mesquida, Hiroaki Nishi, Tomasz Kryjak, Thomas Dalgaty

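AI summary As the volume of data collected by edge sensors keeps growing, so does the demand for intelligent local processing. This paper presents a hardware-accelerated event-graph neural network for low-latency time-series classification on an SoC FPGA. The method uses an artificial cochlea model to convert time-series signals into sparse event data, greatly reducing computation. On the SHD dataset it comes within a few percentage points of state-of-the-art accuracy with fewer parameters and outperforms FPGA-based spiking neural network implementations in accuracy while using fewer computational resources and reducing latency.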

Comments Paper accepted for the 21st International Symposium on Applied Reconfigurable Computing ARC 2025, Sevilla, Spain, April 9-11, 2025

Details
English abstract

As the quantities of data recorded by embedded edge sensors grow, so too does the need for intelligent local processing. Such data often comes in the form of time-series signals, based on which real-time predictions can be made locally using an AI model. However, a hardware-software approach capable of making low-latency predictions with low power consumption is required. In this paper, we present a hardware implementation of an event-graph neural network for time-series classification. We leverage an artificial cochlea model to convert the input time-series signals into a sparse event-data format that allows the event-graph to drastically reduce the number of calculations relative to other AI methods. We implemented the design on a SoC FPGA and applied it to the real-time processing of the Spiking Heidelberg Digits (SHD) dataset to benchmark our approach against competitive solutions. Our method achieves a floating-point accuracy of 92.7% on the SHD dataset for the base model, which is only 2.4% and 2% less than the state-of-the-art models with over 10% and 67% fewer model parameters, respectively. It also outperforms FPGA-based spiking neural network implementations by 19.3% and 4.5%, achieving 92.3% accuracy for the quantised model while using fewer computational resources and reducing latency.