arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08723 2026-05-12 cs.CV cs.MM

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

Huilai Li, Xiaomeng Di, Ying Xing, Yonghao Dang, Yiming Wang, Jianqin Yin

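AI summary: This paper studies weakly supervised audio-visual video parsing (AVVP), which aims to recognize and localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Existing methods mostly focus on multi-modal fusion while neglecting to guide and preserve uni-modal semantics, which leads to noisy pseudo-labels and poor parsing performance. To address this, the paper proposes a new framework that enhances uni-modal representations: a similarity-based label migration method improves the pseudo-label generator's understanding of uni-modal events, and a soft-constrained approach refines uni-modal and multi-modal feature modeling in parallel, improving event localization. Experiments show that the method outperforms existing state-of-the-art methods in both pseudo-label generation and the AVVP task.

Abstract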

Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Faced with the challenging task settings, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, achieving accurate video parsing fundamentally relies on precise perception of uni-modal events. Yet these multi-modal focused strategies excessively emphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also employ a soft-constrained manner to refine modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting the localization performance for events. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.

2605.08722 2026-05-12 cs.RO cs.MA

HULK: Large-scale Hierarchical Coordination under Continual and Uncertain Temporal Tasks

Qingyuan Luo, Jie Li, Meng Guo

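AI summary: This paper studies how to achieve efficient collaboration and task allocation in large-scale multi-agent systems when tasks are generated continually and the number of subtasks within each task is uncertain. It proposes a hierarchical coordination framework, HULK, which assigns tasks to subteams on a rolling basis and coordinates dynamically within each subteam, yielding hierarchical coordination at different granularities and triggering conditions. The method is rigorously validated on large-scale heterogeneous systems and significantly improves computational efficiency and system robustness.

Comments Accepted to the IEEE International Conference on Robotics and Automation. 7 pages, 4 figures

Abstract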

Multi-agent systems can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. Coordination of such teams often involves two aspects: selecting appropriate subteams for different tasks in various areas, and coordinating agents in the subteams to execute the associated subtasks. Existing work often assumes that the tasks are static and known beforehand, where an integer program can be formulated and solved offline. However, in many applications, the team-wise tasks are generated online continually by external requests, and the amount of subtasks within each task is uncertain, e.g., the number of packages to deliver or victims to rescue. The aforementioned offline solution becomes inadequate as it would require constant re-computation for the whole team and global communication to broadcast the results. Thus, this work tackles the large-scale coordination problem under continual and uncertain temporal tasks, specified as temporal logic formulas over collaborative actions. The proposed hierarchical framework, HULK, consists of two interleaved layers: the rolling assignment of currently known tasks to subteams within a certain horizon, and the dynamic coordination within a subteam given the detected subtasks during online execution. Thus, coordination is performed hierarchically at different granularities and triggering conditions, improving computational efficiency and robustness. The method is validated rigorously over large-scale heterogeneous systems under various temporal tasks and environment uncertainties.

2605.08721 2026-05-12 cs.CL

Breaking the Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents

Minzheng Wang, Run Luo, Yanbo Wang, Zichen Liu, Yuqiao Tan, Tao Tan, Xu Nan, Yinhe Zheng, Wenji Mao

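AI summary: This paper addresses the evolution impasse that social language agents fall into on open-ended tasks because of the vast strategy space, and proposes Dual-scale Evolutionary Policy Training (DEPT). The method detects the impasse through a time-scaled perception mechanism and uses asymmetric advantage reshaping to dynamically adjust the optimization landscape, restoring gradient signals and sustaining strategic exploration. Experiments show that DEPT significantly outperforms existing methods on multiple social language games, avoiding policy degeneration and driving the continuous evolution of the agents.

Comments Accepted to the ACL 2026 Main Conference

Abstract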

While Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for closed-ended tasks, extending it to open-ended social language games via self-play reveals a critical issue: evolution impasse. Due to the vast strategy space, language agents frequently converge to homogenized behaviors, leading to deterministic match outcomes that eliminate the gradient signals necessary for policy evolution. To tackle this issue, we propose Dual-scale Evolutionary Policy Training (DEPT) for social language games. DEPT introduces a time-scaled evolutionary perception mechanism that detects impasse by quantifying dual-scale value baseline divergence alongside match entropy. Upon perceiving the collapse, it then activates asymmetric advantage reshaping to dynamically modulate the optimization landscape for intervention. Thus, our method effectively restores gradient signals and enforces sustained strategic exploration. Extensive experiments on multiple social language games demonstrate that DEPT outperforms strong baselines, avoiding policy degeneration and driving the continuous evolution of social language agents.
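
The abstract names match entropy as one of the signals used to detect an impasse. As a rough illustration only, the sketch below computes the Shannon entropy of a batch of match outcomes. The function name `match_entropy` and the outcome labels are hypothetical, and the abstract does not say how DEPT defines this quantity or combines it with the dual-scale baseline divergence.

```python
import math
from collections import Counter


def match_entropy(outcomes):
    """Shannon entropy (in bits) of the empirical distribution of match outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical outcome batches: a balanced one and a fully deterministic one.
diverse = ["win", "loss", "win", "loss"]
collapsed = ["win", "win", "win", "win"]

print(match_entropy(diverse))    # 1.0
print(match_entropy(collapsed))  # 0.0
```

When every match ends the same way the entropy is zero, which matches the abstract's description of deterministic outcomes that carry no gradient signal.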

2605.08716 2026-05-12 cs.AI cs.CL cs.LG

Bias by Necessity: Impossibility Theorems for Sequential Processing with Convergent AI and Human Validation

Jikun Wu, Dongxin Guo, Siu-Ming Yiu

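AI summary: This paper investigates whether certain cognitive biases are mathematically inevitable consequences of sequential information processing architectures, and proves that primacy effects, anchoring, and order-dependence are structurally unavoidable in autoregressive language models. Through three impossibility theorems, it characterizes the mechanisms that produce these biases and establishes computational-complexity limits on debiasing. The theoretical predictions are validated on multiple frontier large language models and in human experiments, indicating that these biases are rational responses to sequential processing under resource constraints.

Comments 6 pages, 3 figures, 5 tables. Accepted to CogSci 2026

Abstract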

Are certain cognitive biases mathematically inevitable consequences of sequential information processing? We prove that primacy effects, anchoring, and order-dependence are architecturally necessary in autoregressive language models due to causal masking constraints. Our three impossibility theorems establish: (1) primacy bias arises from asymmetric attention accumulation; (2) anchoring emerges from sequential conditioning with provable information bounds; and (3) exact debiasing by permutation marginalization requires factorial-time computation, with Monte Carlo approximation feasible at constant per-tolerance overhead. We validate these bounds across 12 frontier LLMs ($R^2 = 0.89$; $Δ$BIC $= 16.6$ vs. next-best alternative). We then derive quantitative predictions from the framework and test them in two pre-registered human experiments ($N = 464$ analyzed). Study 1 confirms anchor position modulates anchoring magnitude ($d = 0.52$, BF$_{10} = 847$). Study 2 shows working memory load amplifies primacy bias ($d = 0.41$, BF$_{10} = 156$), with WM capacity predicting bias reduction ($r = -.38$). These convergent findings reframe cognitive biases as resource-rational responses to sequential processing.
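
Result (3) contrasts exact permutation marginalization with a Monte Carlo approximation. The sketch below illustrates that contrast on a toy problem; it is not the paper's construction. The scorer `order_dependent_score` is a made-up stand-in for an order-sensitive model, and all function names are hypothetical.

```python
import itertools
import math
import random


def order_dependent_score(items):
    """Toy order-sensitive scorer: earlier positions get larger weights."""
    return sum(value / (position + 1) for position, value in enumerate(items))


def exact_marginal(items, score=order_dependent_score):
    """Average the score over all n! orderings (exact, factorial cost)."""
    perms = list(itertools.permutations(items))
    return sum(score(p) for p in perms) / len(perms)


def monte_carlo_marginal(items, num_samples, rng, score=order_dependent_score):
    """Average the score over randomly sampled orderings (approximate)."""
    total = 0.0
    for _ in range(num_samples):
        shuffled = list(items)
        rng.shuffle(shuffled)
        total += score(shuffled)
    return total / num_samples


items = [1.0, 2.0, 3.0, 4.0]
exact = exact_marginal(items)
approx = monte_carlo_marginal(items, num_samples=2000, rng=random.Random(0))

print("orderings enumerated:", math.factorial(len(items)))
print("exact marginal:", exact)
print("Monte Carlo estimate:", approx)
```

The exact average enumerates every ordering, so its cost grows factorially with the number of items, whereas the sampled estimate uses a fixed number of shuffles chosen for the desired tolerance.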

2605.08713 2026-05-12 cs.RO cs.AI

REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer

Changze Li, Zhe Chen, Shaoyu Chen, Lisen Mu, Yijian Li, Yuelong Yu, Qian Zhang, Qing Su, Ming Yang, Tong Qin

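AI summary: This paper proposes REAP, a reinforcement-learning-based end-to-end autonomous parking method aimed at the challenges of extreme parking scenarios. By introducing an asymmetric reinforcement learning framework and behavior cloning, REAP improves training efficiency and inference performance, and it uses a soft predictive collision penalty mechanism to reduce collision risk. To enable transfer from simulation to reality, the authors build a Real2Sim2Real simulator based on 3D Gaussian Splatting, so that the trained model can be deployed directly on real vehicles; it successfully parks autonomously in a variety of complex scenarios, including mechanical parking slots.

Abstract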

In recent years, autonomous parking has made significant advances, yet parking tasks still face challenges in extreme scenarios such as mechanical and dead-end parking slots, often resulting in failures. This is mainly due to traditional parking methods adopting a multistage approach, lacking the ability to optimize the parking problem as a whole. End-to-end methods enable joint optimization across perception and planning modules to eliminate the accumulation of errors, enhancing algorithm performance in extreme scenarios. Although several end-to-end parking methods use imitation or reinforcement learning, the former is limited by data cost and distribution coverage, while the latter suffers from inefficient exploration. To address these challenges, we propose a Reinforcement learning End-to-end Autonomous Parking method (REAP). REAP employs Soft Actor-Critic (SAC) within an asymmetric reinforcement learning framework to improve training efficiency and inference performance. To accelerate model convergence, we distill the capabilities of a rule-based planner into the end-to-end network through behavior cloning. We further introduce a soft predictive collision penalty mechanism to reduce collision rates by penalizing obstacle-approaching actions. To ensure that the trained reinforcement learning network can directly transfer to real-world scenarios, we have established a Real2Sim2Real simulator. In the Real2Sim step, we use 3D Gaussian Splatting (3DGS) to transform real-world scenes into digital scenes. In the Sim2Real step, we deploy the end-to-end model onto the vehicle to bridge the Sim2Real gap. Trained in the 3DGS simulator and deployed on physical vehicles, REAP successfully parks in various types of parking spaces, especially demonstrating the feasibility of end-to-end RL parking in extremely narrow mechanical slots.

2605.08712 2026-05-12 cs.CV

From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation

Bohan Li, Shuojue Yang, Baorui Peng, Xianda Guo, Erli Zhang, Youqi Tao, Junfeng Duan, Daguang Xu, Qi Dou, Xin Jin, Wenjun Zeng, Hao Zhao, Yueming Jin

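AI summary: This paper studies action-conditioned surgical video generation, whose core challenge is to precisely control complex image-space evolution with low-dimensional control vectors. The authors propose a framework that lifts articulated kinematics to visual control, converting the kinematic information of the robotic arm into five image-aligned control modalities, and design a hierarchically routed visual control scheme that dynamically selects the most relevant control modalities and motion scales, improving generation efficiency and control accuracy. They also construct a finely annotated surgical video dataset and show experimentally that the method is superior in action faithfulness, visual fidelity, and cross-domain generalization.

Abstract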

Action-conditioned surgical video generation is a critical yet highly challenging problem for robotic surgery. The core difficulty is that low-dimensional control vectors must precisely govern complex image-space evolution. In this work, we propose a kinematic-to-visual lifting paradigm that converts articulated kinematics into a unified set of five image-aligned control modalities. Building on this representation, we introduce a hierarchically routed visual control framework that selectively activates the most relevant control modalities and motion scales. Instead of uniformly applying all control signals, our model performs hierarchical routing to dynamically allocate conditioning capacity. We further design kinematic-prior-guided routing loss functions to ensure physically meaningful, temporally stable, and efficient expert utilization. To improve efficiency, we propose a budgeted training and inference scheme that leverages routing-induced sparsity. By selectively discarding low-significance control pathways during training and execution, our approach enables adaptive computation that is complementary to standard distillation. We additionally construct a new benchmark with curated articulated annotations, obtained through human-in-the-loop semantic labeling and differentiable pose tracking, providing realistic supervision for action-conditioned surgical video generation. Extensive experiments demonstrate that our method consistently improves action faithfulness, visual fidelity, and cross-domain generalization over diverse baselines. Moreover, our efficient variant achieves substantial reductions in latency while maintaining strong control accuracy.

2605.08710 2026-05-12 cs.AI

When Can Human-AI Teams Outperform Individuals? Tight Bounds with Impossibility Guarantees

Dongxin Guo, Jikun Wu, Siu-Ming Yiu

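AI summary: This study examines the conditions under which human-AI teams can outperform individuals and derives tight theoretical bounds for confidence-based aggregation rules. Combining signal detection theory with information-theoretic analysis, it obtains four key results: a complementarity theorem, the scaling of gains, an impossibility result, and a multi-class generalization; the predictions are validated on multiple datasets. The study explains why complementarity in human-AI teams is rare and offers actionable theoretical guidance for system design.

Comments 8 pages, 2 figures, 7 tables. Accepted at CogSci 2026

Abstract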

Human-AI teams fail to outperform their best member in 70% of studies, yet no theory specifies when complementarity is achievable. We derive tight bounds for the broad class of confidence-based aggregation rules by integrating signal detection theory with information-theoretic analysis, yielding four results: (1) a complementarity theorem (teams outperform individuals iff error correlation $ρ_{HM} < ρ^*$, with $ρ^* \approx a$ in the symmetric near-chance regime); (2) minimax bounds showing gains scale as $Θ(\sqrt{Δd})$ with metacognitive sensitivity difference; (3) an impossibility result proving no confidence-based aggregation rule achieves complementarity when $ρ_{HM} \geq ρ^*$; and (4) multi-class generalization $ρ^*_K \approx ρ^*/\sqrt{K-1}$. Predictions match observed team accuracy ($R = 0.94$ on ImageNet-16H, $R = 0.91$ on CIFAR-10H) and the multi-class threshold scaling holds on human data ($R = 0.93$, $K = 16$), with robustness under non-Gaussian distributions. The framework explains why complementarity is rare and provides actionable design formulas; results apply to aggregation, not to interactive deliberation that generates novel answers.

2605.08709 2026-05-12 cs.CV

UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning

Hongrui Li, Yichen Shi, Hongyang Wang, Yuhao Gao, Hui Ma, Jun Feng, Zitong Yu

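AI summary: This paper proposes UniShield, a knowledge-grounded multimodal reasoning framework for unified face attack detection that aims to recognize both physical spoofing and digital forgery attacks. The method constructs a Face Attack Knowledge Graph (FAKG), generates a large number of training samples for Attack-Graph Instruction Tuning (AGIT), and introduces Graph-Consistent Reasoning Optimization (GCRO) to improve the consistency of reasoning. Experiments show that UniShield performs strongly across multiple detection protocols, significantly improving detection accuracy and reasoning reliability.

Abstract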

Unified face attack detection (UAD) requires recognizing physical spoofing and digital forgery within a shared decision space, yet existing discriminative or prompt-based methods largely rely on appearance correlations and provide limited evidence-grounded reasoning. We propose UniShield, a knowledge-grounded multimodal reasoning framework for unified face attack defense. UniShield constructs a Face Attack Knowledge Graph (FAKG) that links attack categories to diagnostic visual cues and attack-conditioned relations, and uses it to synthesize 52,025 FAKG-QA examples for Attack-Graph Instruction Tuning (AGIT). To improve rationale consistency, we further introduce Graph-Consistent Reasoning Optimization (GCRO), a GRPO-based objective with a KG-consistency reward that encourages generated rationales to match graph-supported cues while penalizing incompatible claims. Experiments on our multimodal UAD benchmark show that UniShield achieves strong performance across binary, coarse-grained, and fine-grained protocols, with consistently high ACC and low HTER. These results suggest that structured attack knowledge can improve both detection accuracy and reasoning reliability over discriminative baselines and general-purpose MLLMs. Our code will be released at https://anonymous.4open.science/r/Unishield-A6A3/.

2605.08704 2026-05-12 cs.AI

AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization

Hyunmin Hwang, Jaemin Kim, Choonghan Kim, Hangeol Chang, Jong Chul Ye

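AI summary: This paper proposes AgentPSO, a framework that improves the reasoning ability of large language models through multi-agent particle swarm optimization. The method treats each agent as a particle with a natural-language skill and iteratively updates its skill state and semantic update direction, so that individual and collective reasoning improve together. Experiments show that AgentPSO outperforms both static single-agent methods and multi-agent methods used only at inference time, and that the evolved reasoning skills transfer across tasks and across models.

Abstract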

Multi-agent reasoning has shown promise for improving the problem-solving ability of large language models by allowing multiple agents to explore diverse reasoning paths. However, most existing multi-agent methods rely on inference-time debate or aggregation, which can be vulnerable to incorrect peer influence and biased consensus. Moreover, the agents themselves remain static, as their underlying reasoning skills do not evolve across tasks. In this paper, we introduce AgentPSO, a particle-swarm-inspired framework for evolving multi-agent reasoning skills. AgentPSO treats each agent as a particle-like reasoner whose state is a natural-language skill and whose velocity is a semantic update direction, iteratively moving agents toward stronger skill states to improve both individual and collective reasoning performance. Across training iterations, each agent updates its skill by combining its previous velocity, personal-best skill, global-best skill, and a self-reflective direction derived from peer reasoning trajectories. This enables agents to learn reusable reasoning behaviors from both their own experiences and the strongest skills discovered by the population, without updating the parameters of the backbone language model. Experiments on mathematical and general reasoning benchmarks show that AgentPSO improves over static single-agent skills and test-time-only multi-agent reasoning baselines. The evolved skills further transfer across benchmarks and to another backbone model, suggesting that AgentPSO captures reusable reasoning procedures rather than merely optimizing benchmark-specific prompts. Code is open-sourced at https://github.com/HYUNMIN-HWANG/AgentPSO/.
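
For readers unfamiliar with the analogy, the sketch below shows the classical numeric particle swarm update that the abstract alludes to: each velocity combines the previous velocity, a pull toward the particle's personal best, and a pull toward the global best. This is textbook PSO on a toy objective, not AgentPSO itself, which operates on natural-language skills and adds a self-reflective direction that is not modeled here. The function names and hyperparameter values are illustrative assumptions.

```python
import random


def pso_minimize(objective, dim, num_particles=20, iterations=100,
                 inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """Classical PSO: minimize `objective` over a `dim`-dimensional real vector."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-5, 5) for _ in range(dim)]
                 for _ in range(num_particles)]
    velocities = [[0.0] * dim for _ in range(num_particles)]
    personal_best = [p[:] for p in positions]
    personal_best_val = [objective(p) for p in positions]
    best_idx = min(range(num_particles), key=lambda i: personal_best_val[i])
    global_best = personal_best[best_idx][:]
    global_best_val = personal_best_val[best_idx]

    for _ in range(iterations):
        for i in range(num_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Previous velocity + pull to personal best + pull to global best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + cognitive * r1 * (personal_best[i][d] - positions[i][d])
                    + social * r2 * (global_best[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]
            value = objective(positions[i])
            if value < personal_best_val[i]:
                personal_best[i] = positions[i][:]
                personal_best_val[i] = value
                if value < global_best_val:
                    global_best = positions[i][:]
                    global_best_val = value
    return global_best, global_best_val


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


best, best_val = pso_minimize(sphere, dim=2)
print(best, best_val)
```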

2605.08703 2026-05-12 cs.AI cs.CL cs.CV cs.LG

RewardHarness: Self-Evolving Agentic Post-Training

Yuxuan Zhang, Penghui Du, Bo Li, Cong Wei, Junwen Miao, Huaisong Zhang, Songcheng Cai, Yubo Wang, Dongfu Jiang, Yuyu Zhang, Ping Nie, Wenhu Chen, Changqian Yu, Kelsey R. Allen

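AI summary: This study proposes RewardHarness, a self-evolving agentic reward framework that addresses the reliance of reward models on large amounts of human annotation when evaluating instruction-guided image edits. The method iteratively evolves a library of tools and skills from a small number of examples and aligns with human preferences without additional training, substantially improving data efficiency. Experiments show that, using only 0.05% of the annotated data, RewardHarness outperforms GPT-5 on image-editing evaluation benchmarks, demonstrating its efficiency and effectiveness for reward modeling.

Comments Project page: https://rewardharness.com

Abstract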

Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment. By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench. Project page: https://rewardharness.com.

2605.08702 2026-05-12 cs.CV cs.AI

Gate-and-Merge: Zero-shot Compositional Personalization of Vision Language Models

Guodong Ding, Angela Yao

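AI summary: This paper studies compositional personalization of vision-language models, in which multiple user-defined concepts must be recognized or described jointly at test time. It proposes Gate-and-Merge, a zero-shot framework that achieves compositional personalization without co-occurrence training. The method learns a lightweight LoRA adapter paired with a concept token independently for each concept, merges the relevant updates directly in weight space at inference, and uses a gating mechanism to suppress irrelevant activations, improving performance in both single-concept and compositional settings.

Abstract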

This paper tackles compositional personalization of vision-language models (VLMs). In this problem, multiple user-defined concepts must be recognized or described jointly at test time. We introduce Gate-and-Merge, a zero-shot framework that enables compositional personalization without the need for co-occurrence training. During personalization, each concept is learned independently as a lightweight LoRA adapter, paired with a concept token. The base model remains unchanged and concepts are kept disentangled. At inference, we enable composition by merging concept-specific LoRA updates directly in weight space. To suppress irrelevant activations and prevent interference, a gating mechanism is employed to estimate textual and visual cues and select only the modules that contribute to the prediction. We further stabilize composition by combining only the most meaningful and mutually consistent updates, helping preserve each concept's identity. Our quantitative and qualitative analyses show consistent gains in performance across multiple personalization tasks in both single-concept and compositional settings.
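
The idea of merging concept-specific LoRA updates in weight space behind a gate can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: each adapter is a low-rank pair (B, A) whose product is added to a frozen base weight, and a hard threshold on a per-concept gate score decides which adapters participate. The names `merge_lora`, `gates`, and `threshold` are hypothetical, and the abstract does not specify the actual gating or merging rule.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(base_weight, adapters, gates, threshold=0.5):
    """Return W0 plus the sum of B @ A over adapters whose gate passes the threshold."""
    merged = [row[:] for row in base_weight]  # the base weight itself is not modified
    for name, (b_mat, a_mat) in adapters.items():
        if gates.get(name, 0.0) < threshold:
            continue  # gate closed: this concept's update is left out
        delta = matmul(b_mat, a_mat)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += value
    return merged


# Hypothetical 2x2 base weight and two rank-1 concept adapters.
base = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "dog": ([[1.0], [0.0]], [[0.0, 2.0]]),
    "mug": ([[0.0], [1.0]], [[3.0, 0.0]]),
}

print(merge_lora(base, adapters, {"dog": 0.9, "mug": 0.1}))  # [[1.0, 2.0], [0.0, 1.0]]
print(merge_lora(base, adapters, {"dog": 0.9, "mug": 0.8}))  # [[1.0, 2.0], [3.0, 1.0]]
```

Because each adapter is stored separately and only summed at inference, adding or dropping a concept does not require retraining the others.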

2605.08701 2026-05-12 cs.LG physics.ao-ph

METBRA25Y: Brazil Surface Meteorology Archive with Harmonized Variables and Quality Control

Matheus Lima Castro, William Dantas Vichete, Leopoldo Lusquino Filho

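AI summary: This paper introduces the METBRA25Y dataset, a standardized archive that integrates surface meteorological observations from across Brazil and contains hourly records from 2000 to 2025. With harmonized variable names, quality control, and metadata annotation, the dataset supports research in environmental science, climate, hydrology, agriculture, and other fields, and is particularly suited to machine-learning applications that need standardized time series. The paper presents a two-stage quality-control strategy, comprising outlier handling and temporal and cross-variable consistency checks, and provides detailed station information and data validation results, giving related research a reliable data foundation.

Comments 12 pages, 5 figures. Dataset paper describing METBRA25Y, a harmonized archive of hourly Brazilian surface meteorological observations derived from INMET records. Dataset available at Zenodo: 10.5281/zenodo.19964979

Abstract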

This data paper describes METBRA25Y, a harmonized archive of hourly surface meteorological observations from Brazil derived from public historical records of the Instituto Nacional de Meteorologia (INMET). The dataset was designed to support reproducible environmental, climatological, hydrological, agricultural, urban-risk, and machine-learning studies that require station-level meteorological time series with standardized variable names and explicit quality-control metadata. The processing workflow ingests annual INMET archives, parses station metadata from raw file headers, normalizes heterogeneous Portuguese column names into a canonical schema, constructs hourly timestamps, consolidates observations by city and station, and exports compressed CSV files together with station manifests, per-station quality flags, daily precipitation aggregates, variable-level failure summaries, and missing-data audits. The quality-control protocol follows a two-stage strategy: first, physically implausible values are converted to missing values and flagged; second, temporal and cross-variable consistency checks generate diagnostic flags without necessarily overwriting the original measurements. The resulting package covers observations between 2000 and 2025, with station-specific temporal coverage, and includes key meteorological variables such as precipitation, air temperature, dew point, relative humidity, atmospheric pressure, wind speed, wind gust, wind direction, and global solar radiation. Based on the summary files included in the current release snapshot, the archive contains 616 unique station codes across variable summaries, of which 605 have coordinates within a broad Brazil plausibility envelope. This paper documents the dataset provenance, file organization, harmonized schema, quality-control rules, technical validation outputs, limitations, and recommended usage practices.
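
The two-stage quality-control strategy described above can be sketched on a single hourly record. This is an illustration only: the column names, the plausibility bounds in `PLAUSIBLE_RANGES`, and the single cross-variable check (dew point above air temperature) are assumptions, not the dataset's actual schema or rules.

```python
import math

# Hypothetical plausibility bounds; the dataset's real limits are not given in the abstract.
PLAUSIBLE_RANGES = {
    "air_temperature_c": (-40.0, 60.0),
    "dew_point_c": (-60.0, 50.0),
    "relative_humidity_pct": (0.0, 100.0),
}


def is_number(value):
    """True for a present, non-NaN measurement."""
    return value is not None and not math.isnan(value)


def quality_control(record):
    """Return (cleaned_record, flags) for one hourly observation."""
    cleaned = dict(record)
    flags = []

    # Stage 1: physically implausible values become missing and are flagged.
    for name, (low, high) in PLAUSIBLE_RANGES.items():
        value = cleaned.get(name)
        if is_number(value) and not (low <= value <= high):
            cleaned[name] = math.nan
            flags.append(f"{name}:implausible")

    # Stage 2: a cross-variable check that flags but keeps the original values.
    temp = cleaned.get("air_temperature_c")
    dew = cleaned.get("dew_point_c")
    if is_number(temp) and is_number(dew) and dew > temp:
        flags.append("dew_point_above_temperature")

    return cleaned, flags


record = {
    "air_temperature_c": 25.0,
    "dew_point_c": 27.0,
    "relative_humidity_pct": 140.0,
}
cleaned, flags = quality_control(record)
print(cleaned)
print(flags)
```

Keeping the second stage non-destructive means a downstream user can decide whether a flagged reading should be dropped, which mirrors the abstract's statement that consistency checks do not necessarily overwrite measurements.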

2605.08697 2026-05-12 cs.AI

MBP-KT: Learning Global Collaborative Information from Meta-Behavioral Pattern for Enhanced Knowledge Tracing

Yuhao Jia, Duantengchuan Li, Jinsong Chen, Zhongjie Mao, Mingwen Tong, Yue Li, Xiaoguang Wang

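AI summary: This study proposes MBP-KT, a framework for learning collaborative information from meta-behavioral patterns, aimed at improving the performance of knowledge tracing (KT) models. By constructing meta-behavioral sequences, MBP-KT captures learners' behavioral patterns more effectively and uses a parameter-free module to extract global collaborative information, strengthening the model's ability to predict learning states. The method also provides general injection strategies so that the extracted collaborative information can be applied to a wide range of downstream KT models; experiments show that it significantly improves model performance on multiple real-world datasets.

Abstract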

The emerging collaborative information-based knowledge tracing (KT) has been a promising way to enhance modeling of learners' knowledge states. The core idea is to extract the collaborative information from interaction sequences of other learners to assist the prediction on the target one. Despite their effectiveness, existing methods are built on the raw interaction sequences with tailored modules, which inevitably limits their capacity for deeply capturing learning behavioral patterns and for generalization. To this end, we propose a general meta-behavioral pattern-aware framework (MBP-KT) for KT. Specifically, MBP-KT introduces a novel meta-behavioral sequence construction to transform the raw interaction sequences into combinations of different meta-behavioral patterns. In this way, the learning behavioral patterns of learners can be effectively preserved. Then, MBP-KT develops a parameter-free module to extract the global collaborative representations from the constructed meta-behavioral sequences. Moreover, MBP-KT provides general injection strategies to introduce the extracted global collaborative information into various downstream KT models, ensuring the universality of the collaborative information. Extensive results on real-world datasets demonstrate that MBP-KT can consistently boost the performance of a wide range of KT models.

2605.08695 2026-05-12 cs.CV

EditSleuth: A Dataset of Grounded Reasoning Chains for Image-Edit Forensics

Van-Loc Nguyen, AprilPyone MaungMaung, Minh-Triet Tran, Isao Echizen

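AI summary: EditSleuth is a new dataset for image-edit forensics containing 257,725 image-edit triplets; each example includes the edited image, the source image, an edit mask, an edit-type label, a difficulty score, and a six-step reasoning chain. The dataset is constructed deterministically, and every step of a reasoning chain is grounded in computable visual evidence, with the aim of supporting visually grounded edit localization and semantic identification. Experiments show that the dataset can effectively guide models to learn edit reasoning and to produce explanatory forensic descriptions.

Abstract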

Forensic analysis of AI-edited images requires more than binary real-versus-fake prediction: a useful system should localize the edit, identify its semantic type, and ground its decisions in visual evidence. Existing image-forensics datasets typically emphasize detection or localization, while reasoning-supervised vision-language datasets rarely target image manipulation and often rely on LLM-generated rationales whose faithfulness is difficult to verify. We introduce EditSleuth, a dataset of 257,725 image-edit triplets constructed from existing image-editing corpora for grounded image-edit forensic reasoning. Each example includes an edited image, its source image, a binary edit mask, a 12-class edit taxonomy label, a difficulty score, and a six-step reasoning chain. EditSleuth chains are generated deterministically from triplet-grounded upstream artifacts, with each statement tied to a specific computable source of evidence. Our analysis reveals that a naive four-component difficulty formulation suffers from a rank-2 correlation collapse among magnitude features; a simplified three-component formulation substantially increases score dispersion on both Pico-Banana and MagicBrush. Difficulty also varies meaningfully within most edit categories, indicating that the score is not a proxy for edit type. As an initial learning study, we fine-tune Qwen2-VL-2B with LoRA and find that chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers, while additionally yielding grounded explanatory prose that label-only supervision cannot produce. We release the dataset, the deterministic construction pipeline, and pilot training scripts.

2605.08689 2026-05-12 cs.LG cs.AI cs.SI

Structure-Centric Graph Foundation Model via Geometric Bases

Xiaodong He, Haolan He, Ruiyi Fang, Ming Sun, Zhao Kang

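AI summary: This paper proposes a Structure-Centric Graph Foundation Model (SCGFM) to address structural heterogeneity across graph domains and incompatible node feature spaces. Treating graph topology as the core of transferable knowledge, the model introduces learnable geometric bases and aligns graph structures to them using Gromov-Wasserstein distances, producing unified structure-aligned latent representations. The model also uses a structure-aware feature re-encoding mechanism that unifies node representations without requiring a fixed feature dimensionality or dataset-specific preprocessing; experiments show strong in-domain and cross-domain generalization on graph-level and node-level tasks.

Comments Accepted by ICML 2026

Abstract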

Graph foundation models (GFMs) seek transferable representations across graph domains but are limited by structural heterogeneity and incompatible node feature spaces. We propose Structure-Centric Graph Foundation Models (SCGFM), which treat graph topology as the primary source of transferable knowledge. Modeling graphs as metric measure spaces, SCGFM introduces learnable geometric bases that define a shared structural coordinate system. Graphs are aligned to these bases via Gromov-Wasserstein distances, yielding structure-aligned latent representations that accommodate heterogeneous graph topologies. To address feature incompatibility, SCGFM employs a structure-aware feature re-encoding mechanism that unifies node representations without assuming a fixed feature dimensionality or requiring dataset-specific preprocessing. Experiments on graph- and node-level tasks demonstrate strong in-domain and cross-domain generalization, outperforming existing GFM approaches.
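
For reference, the standard discrete Gromov-Wasserstein discrepancy with squared loss is written out below. The notation is an assumption made for this note: $C^1$ and $C^2$ are the intra-graph distance (or structure) matrices of two graphs, $p$ and $q$ are probability weights on their nodes, and $T$ is a coupling between the two node sets. Whether SCGFM uses exactly this form, or a regularized or otherwise modified variant, is not stated in the abstract.

```latex
\mathrm{GW}\bigl(C^1, p;\, C^2, q\bigr)
  = \min_{T \in \Pi(p, q)} \sum_{i,k} \sum_{j,l}
    \bigl(C^1_{ik} - C^2_{jl}\bigr)^2 \, T_{ij} \, T_{kl},
\qquad
\Pi(p, q) = \bigl\{\, T \ge 0 \;:\; T \mathbf{1} = p,\; T^{\top} \mathbf{1} = q \,\bigr\}
```

The objective compares pairwise distances within each graph rather than node features across graphs, which is why it can align graphs whose feature spaces are incompatible.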

2605.08688 2026-05-12 cs.AI cs.DB cs.LO

Reconciling Consistency-Based Diagnosis with Actual-Causality-Based Explanations

Leopoldo Bertossi

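AI summary: From the perspective of explainable AI (XAI), this paper establishes connections between Consistency-Based Diagnosis (CBD) and actual causality and causal responsibility. The work aims to bridge the gap between these two areas and to provide new theoretical support and methods for XAI and explainable data management. Combining the two fields is expected to advance deeper causal explanation and diagnosis methods.

Comments under submission

Abstract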

We establish, from the point of view of Explainable AI (XAI), connections between Consistency-Based Diagnosis (CBD), on one side, and Actual Causality and Causal Responsibility, on the other. CBD has received little attention from the XAI community. Connections between these two areas could have a fruitful impact on XAI and Explainable Data Management.

2605.08686 2026-05-12 cs.AI

Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

Wenzhi Fang, Liangqi Yuan, Guangchen Lan, Dong-Jun Han, Christopher G. Brinton

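AI summary: This paper proposes an iterative critique-and-routing controller for multi-agent systems built from heterogeneous large language models (LLMs), addressing the limitation that existing controllers perform only one-shot model selection and cannot critique or iteratively refine intermediate results. The controller treats multi-agent collaboration as a finite-horizon Markov decision process: at each step it evaluates the current output, decides whether to continue refining, and selects a suitable model for the next improvement. Experiments show that the method significantly outperforms existing methods across multiple heterogeneous systems and reasoning benchmarks and substantially narrows the performance gap to the strongest model, while using that model for fewer than 25% of total calls.

Abstract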

Multi-agent large language model (LLM) systems often rely on a controller to coordinate a pool of heterogeneous models, yet existing controllers are typically limited to one-shot routing: they select a model once and return its output directly. Such routing-only designs provide no mechanism to critique intermediate drafts or support iterative refinement. To address this limitation, we propose a critique-and-routing controller that casts multi-agent coordination as a sequential decision problem. At each turn, the controller evaluates the current draft, decides whether to stop or continue, and, if needed, selects the next agent for further refinement. We formulate this process as a finite-horizon Markov Decision Process (MDP) with explicit agent-utilization constraints, design a composite reward for controller decisions across turns, and optimize the controller via policy gradients under a Lagrangian-relaxed objective. Extensive experiments across multiple heterogeneous multi-agent systems and seven reasoning benchmarks show that our method consistently outperforms state-of-the-art baselines and substantially narrows the gap to the strongest agent, while using it for fewer than 25% of total calls.
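
The turn-by-turn loop described above (evaluate the draft, decide whether to stop, otherwise pick the next agent) can be sketched with plain callables. Everything here is a hypothetical stand-in: `router`, `critic`, the toy agents, and the fixed `accept_score` threshold replace the learned controller, and the policy-gradient training and Lagrangian utilization constraint from the abstract are not modeled.

```python
def critique_and_route(task, agents, critic, router, max_turns=4, accept_score=0.8):
    """Run a finite-horizon refine loop and return (final_draft, history)."""
    draft = None
    history = []
    for _ in range(max_turns):
        agent_name = router(task, draft, history)  # choose who drafts or refines next
        draft = agents[agent_name](task, draft)    # produce or refine the draft
        score = critic(task, draft)                # critique the current draft
        history.append((agent_name, score))
        if score >= accept_score:                  # stop decision
            break
    return draft, history


# Toy stand-ins for heterogeneous agents: a cheap one and a strong one.
agents = {
    "small": lambda task, draft: "rough answer",
    "large": lambda task, draft: "final answer",
}


def critic(task, draft):
    """Toy critic: only the strong agent's output is considered acceptable."""
    return 1.0 if draft == "final answer" else 0.2


def router(task, draft, history):
    """Toy router: try the cheap agent first, escalate afterwards."""
    return "small" if not history else "large"


draft, history = critique_and_route("toy task", agents, critic, router)
print(draft)    # final answer
print(history)  # [('small', 0.2), ('large', 1.0)]
```

The horizon cap `max_turns` plays the role of the finite horizon, and the length of `history` per agent is the kind of utilization count a constraint could be placed on.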

2605.08685 2026-05-12 cs.LG cs.AI

Event Fields: Learning Latent Event Structure for Waveform Foundation Models

Li Na, Yuanyun Zhang, Shi Li

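AI summary: This paper proposes a new class of waveform foundation models that replaces conventional sequence-based representations by modeling physiological time series as realizations of latent event processes. The approach assumes that clinically meaningful structure arises from temporally extended, interacting events whose boundaries and dynamics are not directly observed. The authors introduce a self-supervised learning framework that enforces consistency across stochastic segmentations and time-frequency projections, learning representations that are robust to signal perturbations while preserving event structure, and report improved performance and robustness on a range of physiological tasks.

Abstract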

We propose a new class of waveform foundation models that departs from conventional sequence-based representations by modeling physiological time series as realizations of latent event processes. Rather than treating signals as collections of local tokens or patches, our approach assumes that clinically meaningful structure arises from temporally extended, interacting events whose boundaries and dynamics are not directly observed. To capture this structure, we introduce a self-supervised learning framework that enforces consistency across stochastic segmentations and time-frequency projections of the same waveform, encouraging representations that are invariant to signal-level perturbations while preserving event-level organization. The resulting model combines a segmentation-aware encoder with a latent interaction operator that captures dependencies among inferred events, and naturally extends to multimodal settings by aligning modalities through shared event representations. Across a range of physiological benchmarks, including arrhythmia classification, hemodynamic prediction, and waveform retrieval, the proposed method improves performance, robustness, and label efficiency relative to strong sequence-based baselines. These results suggest that shifting from signal-centric to event-centric representations provides a more appropriate inductive bias for modeling physiological dynamics and offers a complementary path to scaling foundation models in healthcare.

2605.08673 2026-05-12 cs.LG

PHIDA: Persistence-Guided Node-to-Cluster Mapping for Online Clustering

Naoki Masuyama, Yusuke Nojima, Stefan Wermter, Yuichiro Toda, Hisao Ishibuchi, Chu Kiong Loo

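AI Summary This paper proposes PHIDA, an online clustering method that addresses the unclear mapping from node state to clustering output in existing methods. It combines node learning from Inverse-Distance Adaptive Resonance Theory (IDA) with node-to-cluster mapping constrained by Persistent Homology (PH), keeping node learning flexible while making the clustering output more stable and robust. Experiments show that PHIDA performs well on multiple benchmark datasets and, in nonstationary settings in particular, outperforms other online clustering methods that adaptively update nodes.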

Comments This paper is currently under review

Abstract

Online clustering methods that adaptively create and update nodes as data arrive often make node learning explicit, whereas the mapping from the learned node state to output clusters often remains implicit or simplified. Implicit mappings make output clusters sensitive to weak graph bridges or local relations based on distance in the graph over learned nodes, leaving no explicit constraint on which node groups remain intact during mapping. This paper addresses this gap by proposing PHIDA, a persistence-guided node-to-cluster mapping method for online clustering with learned nodes. PHIDA implements this mapping within Adaptive Resonance Theory (ART)-based online clustering by combining Inverse-Distance ART (IDA) node learning with node-to-cluster mapping constrained by Persistent Homology (PH). Experiments on 24 benchmark datasets show that PHIDA achieves the best average ranks in stationary comparisons that include the recent stationary-only clustering methods, while also improving aggregate performance in the nonstationary setting over the evaluated online methods that adaptively create and update nodes. Ablations and comparisons with conventional node-to-cluster mappings indicate that the observed gains are associated with PH-constrained mapping that preserves raw PH components, together with the use of the PH component view during node learning. Source code is available at https://github.com/Masuyama-lab/PHIDA

2605.08671 2026-05-12 cs.CL cs.AI

Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups

Gautam Veldanda

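AI Summary This paper studies fairness in how large language models (LLMs) explain decisions across demographic groups and proposes a five-dimension Explanation Fairness Taxonomy (EFT) to quantify differences in explanation length, sentiment, expression of epistemic uncertainty, linkage to the decision, and lexical complexity. An empirical analysis across several decision scenarios and mainstream models finds significant differences in explanation fairness between models. Some mitigations reduce disparities in decision-linked explanation content but barely improve stylistic imbalance, which points to the important influence of pre-training data distributions on explanation fairness.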

Comments 10 pages, 4 figures, 9 tables

Abstract

Large language models (LLMs) are increasingly deployed not only to make decisions but to explain them. While AI decision fairness has been studied extensively, the fairness of AI explanations (whether LLMs justify decisions with equal quality, depth, tone, and linguistic sophistication across demographic groups) has received little attention. This paper introduces the Explanation Fairness Taxonomy (EFT), a framework comprising five formally defined, operationalizable dimensions: Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, and Lexical Complexity Disparity. The taxonomy is instantiated in a controlled empirical study across 80 prompt templates, four consequential decision domains (hiring, medical triage, credit assessment, legal judgment), and five LLMs: GPT-4.1, Claude Sonnet, LLaMA 3.3 70B, GPT-OSS 120B, and Qwen3 32B. Two novel black-box metrics are introduced: the Hedging Density Score (HDS) and the Explanation Faithfulness Proxy (EFP), a heuristic indicator of decision-linked explanation variation. Across up to 400 prompt pairs, all eight EFT metrics show statistically significant disparities (Cohen's d ranging from small to large, all p_BH < 10^(-62)). Model choice is strongly associated with disparity magnitude: Qwen3 32B exhibits verbosity disparities 5.9x larger than LLaMA 3.3 70B. Two prompting-based mitigations show significant reductions in EFP disparity (78-95%) but no significant effect on stylistic dimensions, consistent with the hypothesis that stylistic explanation inequalities are encoded in pre-training distributions and are not resolvable through deployment-level instruction alone. A reproducible measurement framework is offered for explanation-level fairness auditing, with implications for AI regulation and deployment practice.

2605.08670 2026-05-12 cs.AI cs.CL cs.MA

MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction

Yixuan Li, Mingshu Cai, Ziyang Xiao, Wanyuan Wang, Yanchen Deng, Bo An

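AI Summary This paper proposes MIND-Skill, a framework that automatically generates quality-assured, reusable skills from successful task trajectories to improve AI agents on complex tasks. An induction agent extracts general skills, and a deduction agent reconstructs task trajectories from those skills. A multi-objective optimization strategy combining a reconstruction loss, an outcome loss, and a rubric loss ensures the quality and applicability of the generated skills. Experiments show that MIND-Skill outperforms existing skill generation methods on multiple task benchmarks.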

Abstract

Large language model (LLM) powered AI agents have emerged as a promising paradigm for autonomous problem-solving, yet they continue to struggle with complex, multi-step real-world tasks that demand domain-specific procedural knowledge. Reusable agent skills, which encapsulate successful problem-solving strategies, offer a natural remedy by enabling agents to build on prior experience. However, curating such skills has largely remained a manual endeavor, requiring human experts to distill rich domain knowledge into actionable guidelines. In this work, we present Multi-agent INduction and Deduction for Skills (MIND-Skill), a framework that automatically induces generalizable skills from successful trajectories with robust quality guarantees. MIND-Skill consists of an induction agent which is tasked to abstract reusable skills from successful trajectories, and a deduction agent which aims to reconstruct trajectories by following the induced skills. To guarantee the quality of the generated skills, we introduce a reconstruction loss that compares input and reconstructed trajectories, an outcome loss that enforces the correctness of the reconstructed trajectories, and a rubric loss that assesses the documentation quality and regularizes the abstraction level of the generated skills according to predefined criteria. These textual losses are jointly optimized with TextGrad, and the resulting skills are evaluated on held-out tasks unseen during optimization. Experiments on AppWorld and BFCL-v3 show that MIND-Skill consistently outperforms concurrent skill generation methods.

2605.08666 2026-05-12 cs.LG

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

Tianhao Cheng, Zeyu Huang, Zihan Qiu, Yu Cheng, Edoardo Ponti, Yinghui Xu, Ivan Titov, Zenglin Xu

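AI Summary This paper studies the mechanism of critic-free reinforcement learning (critic-free RL) in large language models and, at the token level, reveals a new phenomenon: positive and negative rollouts show remarkably similar proportions of tokens whose probabilities rise or fall, and gradient coupling between tokens is especially pronounced between identical tokens predicted with low confidence. Building on this, the authors propose the cancellation hypothesis: for tokens shared by positive and negative rollouts, the positive and negative signals cancel out, while tokens more specific to successful rollouts receive stronger reinforcement, which yields implicit token-level credit assignment. The paper also proposes two simple but effective training strategies that strengthen the cancellation effect and improve training across multiple model scales.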

Abstract

A commonly accepted explanation of critic-free RL for LLMs, based on sequence-level rewards, is that it reinforces successful rollouts with a positive advantage while penalizing failed ones. In contrast, we study critic-free RL from a token-level perspective, revealing the token-flipping phenomenon: positive and negative rollouts exhibit remarkably similar proportions of tokens whose probabilities are boosted or suppressed during RL training. To explain this phenomenon, we further show that a token's change in probability is not fully determined by its own advantage; coupled gradient interactions with other tokens also play a non-negligible role. Specifically, these token coupling effects occur primarily between identical tokens that are both predicted with low confidence. Building upon this analysis, we propose the cancellation hypothesis: as a result of coupling, opposing signals cancel out for tokens shared by positive and negative rollouts, while tokens more specific to successful rollouts receive stronger reinforcement, thereby inducing hidden token-level credit assignment from rollout-level rewards. We support this hypothesis with complementary empirical evidence. (1) Compared with training on only positive rollouts, critic-free RL shifts updates from template and formatting tokens toward reasoning tokens; (2) Tokens boosted by critic-free RL consistently demonstrate higher value than suppressed tokens, regardless of whether they originate from positive or negative rollouts. Guided by this view, we implement two batching interventions to encourage or preserve cancellation in critic-free RL training: query-preserved mini-batching and reward-balanced batching. Despite their simplicity, these interventions improve RLVR training across multiple model scales, supporting cancellation as both an explanatory principle and a practical design criterion for critic-free RL training.

2605.08664 2026-05-12 cs.CV

IPAD-CLIP: Teaching CLIP to Detect Image Local Perceptual Artifacts

Juan Wang, Xinyu Sun, Ke Zhang, Jin Wang, Bing Li, Weiming Hu, Liang Wang

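AI Summary Current image quality assessment methods focus mainly on global distortions (such as noise and blur) and neglect the detection of local perceptual artifacts (such as ghosting, lens flare, and moiré effects). To address this, the paper proposes the Image Perceptual Artifact Detection (IPAD) task and builds a benchmark dataset of 3,520 annotated images. Built on CLIP, the IPAD-CLIP framework learns artifact-related semantic embeddings to strengthen recognition of subtle local artifacts. Experiments show that it outperforms existing advanced methods in both resource efficiency and detection performance.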

Comments 14 pages, 6 figures

Abstract

Current image quality assessment methods are heavily biased towards global distortions (e.g., noise, blur), neglecting local perceptual artifacts such as ghosting, lens flare, and moiré effects. Although significant progress has been made in artifact removal, the fundamental problem of automatic artifact detection remains largely unexplored. In this paper, we formalize the Image Perceptual Artifact Detection (IPAD) task to address this gap. We contribute a benchmark dataset comprising 3,520 artifact images, including 520 real-captured and 3,000 synthetic samples, each paired with pixel-level masks across three representative artifact categories. The core challenge of IPAD lies in the localized, subtle, and semantically weak nature of these artifacts, which makes them prone to missed detection. To overcome this, we introduce IPAD-CLIP, a novel framework built upon CLIP that enhances artifact discrimination in both textual and visual spaces while preserving generalization capabilities. Our key insight is that local artifacts often exhibit strong correlations with specific semantic contexts. Accordingly, we learn artifact-aware text embeddings to explicitly model the object-artifact relationships, resulting in enhanced representations that clearly differentiate between clean and artifact prompts. These text embeddings are then used as anchors to shift the visual encoder's attention from high-level semantics to subtle, low-level artifacts. Extensive experiments demonstrate that IPAD-CLIP offers a resource-efficient adaptation of CLIP for detection, significantly outperforming advanced image anomaly detection and manipulation detection methods on our benchmark. To the best of our knowledge, this is the first study addressing multi-class local perceptual artifact detection in terms of both dataset and model.

2605.08663 2026-05-12 cs.CV

CAST: Channel-Aware Spatial Transfer Learning with Pseudo-Image Radar for Sign Language Recognition

Md. Shakhoyat Rahman Shujon, Sheikh Md. Galib Mahim, Md. Milon Islam, Md Rezwanul Haque, Md Rabiul Islam, Hamdi Altaheri, Fakhri Karray

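AI Summary This paper proposes CAST, a dual-stream architecture for isolated sign language recognition from magnitude-only 60 GHz radar returns. The method combines three physics-aware modules with pretrained vision networks and uses channel-aware spatial transfer learning to improve the representation of radar signals. Its core components are an inversion of the log-compressed signal, a cross-antenna spatial attention mechanism, and cross-attention fusion of heterogeneous networks. Experiments show a Top-1 accuracy of 80.5% under five-fold cross-validation, better than the best single-model baseline.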

Comments Accepted for the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), MSLR Workshop @ CVPR 2026 in Denver (Colorado, USA)

Abstract

We propose CAST, a dual-stream architecture that utilizes channel-aware spatial transfer learning for isolated sign language recognition addressing the challenges of magnitude-only 60 GHz radar Range-Time Maps (RTM). The proposed framework combines three physics-aware architectures with pretrained vision backbones, which operate under radar-only constraints across clinical and alphabetical gestures. First, an explicit decibel-to-linear inversion is combined with a windowed fast Fourier transform that extracts Cadence Velocity Diagrams (CVD) while avoiding the harmonic artifacts that arise from the spectral analysis of log-compressed signals. Second, a cross-antenna spatial attention module applies attention to raw antenna channels before the convolution, preserving inter-receiver amplitude covariance. Third, an asymmetric cross-attention mechanism fuses representations from parallel ConvNeXt-Tiny (CVD) and EfficientNetV2-S (RTM) backbones. Extensive experiments reveal that the architecture achieves a Top-1 accuracy of 80.5% under 5-fold cross-validation, establishing a 3.3% improvement over the best single-model baseline (77.2%). The findings suggest that physics-aware signal representations form a promising direction for radar-only sign language recognition under constrained sensor modalities. The source code is available at: https://github.com/Shakhoyat/CAST-at-SignEval2026.
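The first step of the abstract above (decibel-to-linear inversion followed by a windowed FFT) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `rtm_db_to_cvd`, the amplitude convention (dB / 20), the mean removal, and the Hann window are all assumptions made for illustration.

```python
import numpy as np

def rtm_db_to_cvd(rtm_db: np.ndarray) -> np.ndarray:
    """Turn a dB-scaled range-time map (range_bins x time_steps) into a
    cadence-style magnitude spectrum via a windowed FFT along time."""
    # Assumption: values are amplitude in dB, so amplitude = 10^(dB / 20).
    rtm_linear = 10.0 ** (rtm_db / 20.0)
    # Remove the per-range-bin mean so the DC bin does not dominate.
    rtm_linear = rtm_linear - rtm_linear.mean(axis=1, keepdims=True)
    # Hann window along the time axis to limit spectral leakage.
    window = np.hanning(rtm_linear.shape[1])
    spectrum = np.fft.rfft(rtm_linear * window, axis=1)
    return np.abs(spectrum)

# Synthetic input: amplitude oscillating at 8 cycles over 128 time steps,
# stored in dB the way a log-compressed map would be.
t = np.arange(128)
amplitude = 1.0 + 0.5 * np.sin(2 * np.pi * 8 * t / 128)
rtm_db = 20.0 * np.log10(np.tile(amplitude, (4, 1)))

cvd = rtm_db_to_cvd(rtm_db)
peak_bin = int(np.argmax(cvd[0]))
```

Because the inversion restores the linear amplitude before the transform, the oscillation appears as a single peak at its own frequency bin rather than being spread over the harmonics that the logarithm would introduce.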

2605.08660 2026-05-12 cs.LG

Optimised Support Vector Regression for California Housing Price Prediction: The Critical Role of Feature Engineering and Hyperparameter Tuning

Emmanuel Adutwum

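AI Summary This paper addresses the poor reported performance of Support Vector Regression (SVR) on California housing price prediction and substantially improves it through feature engineering, hyperparameter tuning, and standardized preprocessing. The study constructs 10 domain-motivated derived features, runs a cross-validated hyperparameter search, and uses an ablation study to verify the contribution of each stage. The optimized SVR reaches an R² of 0.723 on the test set, about 20% higher than the previously reported result, and ranks fourth among ten models, showing that SVR is effective when properly configured.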

Comments 25 pages, 13 figures, 10 tables

Abstract

In the recent literature, Support Vector Regression (SVR) has been cited as one of the weakest performers on the California Housing benchmark dataset, with Preethi et al. (2025) specifically ranking it last among the algorithms they tested, reporting an R² of only 0.60. This paper examines whether the previously reported performance reflects experimental configuration choices rather than an inherent algorithmic limitation. A structured experimental workflow is applied: ten domain-motivated derived features are constructed from the eight raw inputs, an exploratory ensemble feature importance analysis identifies the most predictive candidates, and a randomised search over hyperparameter combinations with three-fold cross-validation selects the optimal SVR configuration within a leakage-safe scikit-learn Pipeline. A formal four-stage ablation study isolates the contribution of each component: scaling alone accounts for +0.744 in R² (from -0.054 to 0.690), feature engineering adds +0.026 (to 0.716), and hyperparameter tuning contributes +0.008 (to 0.723). The resulting tuned SVR achieves a test R² of 0.723, a 0.123-point absolute improvement over the previously reported SVR result (from 0.60 to 0.723, approximately 20% relative gain). In the ten-model comparison, the tuned SVR ranks fourth with R² = 0.723, below XGBoost (0.832), Random Forest (0.814) and Gradient Boosting (0.783), while substantially outperforming simpler baselines. Ten-fold cross-validation yields a mean R² of 0.703 (95% CI: [0.630, 0.775]), confirming robust generalisation. The observed improvement from R² = 0.60 to R² = 0.723 is associated primarily with proper feature scaling within a unified preprocessing pipeline, with domain-motivated feature engineering and systematic hyperparameter tuning providing further incremental gains.
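A leakage-safe pipeline of the kind the abstract describes might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's configuration: the data is a synthetic stand-in from `make_regression` (so the sketch runs offline), the search ranges and `n_iter` are arbitrary, and the ten derived features are omitted.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in with eight inputs (assumption; not the housing data).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler sits inside the pipeline, so every cross-validation fold fits
# it on that fold's training split only and never sees validation rows.
pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])

# Randomised search with three-fold CV; the ranges are illustrative guesses.
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "svr__C": loguniform(1e-1, 1e3),
        "svr__gamma": loguniform(1e-3, 1e0),
        "svr__epsilon": loguniform(1e-2, 1e0),
    },
    n_iter=10,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X_train, y_train)
test_r2 = search.score(X_test, y_test)
```

Keeping the scaler inside the `Pipeline` is the part that makes the search leakage-safe: scaling the full dataset before splitting would let validation statistics influence training.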

2605.08658 2026-05-12 cs.LG cs.AI cs.SE

Sketch-and-Verify: Structured Inference-Time Scaling via Program Sketching

Shan Jiang, Zijian Yi, Chenguang Zhu

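AI Summary This paper proposes Sketch-and-Verify, a method for scaling model performance efficiently at inference time through program sketches. The large language model generates program sketches for several algorithmic strategies and fills each with different implementations, producing structurally diverse candidate solutions that are then filtered by execution-based verification and fingerprint clustering. Experiments show that, within a fixed model tier, the method clearly outperforms conventional sampling on code generation, with higher efficiency and effectiveness especially under resource constraints.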

Abstract

SKETCHVERIFY is a within-tier cost-performance policy, not a universal accuracy improvement. The operational question is how a practitioner stuck with a small, cheap code model (here, Gemini 3.1 Flash Lite) for latency, deployment, or budget reasons should spend a small amount of extra test-time compute. SKETCHVERIFY factorizes the search space: the LLM enumerates K distinct algorithmic strategies, writes a program sketch for each (a partial program with ?? holes), and fills each sketch M times, producing K x M structurally diverse candidates that are verified by execution and selected by fingerprint clustering. Each extra sketch is guaranteed to explore a different algorithm; each extra flat sample likely duplicates an existing one. Our central evidence is a cost-quality Pareto plot on HumanEval+ across three Gemini tiers (Lite, Flash, Pro), and a reanalysis of the 19 problems where Lite greedy fails. Two findings: (1) Within-tier, sketching dominates flat sampling at matched candidate count. On the hard subset, Lite Sketch K=2, M=5 recovers 11/19 (58%) vs. flat N=10 at 5/19 (26%, +32pp); Lite Sketch K=10, M=10 recovers 15/19 (79%) vs. flat N=100 at 10/19 (53%, +26pp). Flat cannot close the gap even at ~3x the budget: flat N=50 still loses to Sketch K=2, M=5 by 11pp. (2) Cross-tier, sketching does not replace upgrading. Pro greedy (89%) dominates Lite Sketch K=10, M=10 (79%) on both pass@1 and dollar cost. Practitioner rule: if a stronger tier is available, use greedy on it; otherwise sketching is the cost-effective way to spend extra compute. We characterize the K-vs-M trade-off via a Flash Lite scaling sweep, report HumanEval+ saturation on Flash and Pro, and show the method composes cleanly with execution-based selection from the concurrent Semantic Voting line of work.
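The selection step (execute the candidates, then cluster by fingerprint) can be sketched as follows. The abstract does not spell out the fingerprint or the selection rule, so this sketch assumes a fingerprint is the tuple of outputs on a few probe inputs and that the largest error-free cluster wins. The candidates, probe inputs, and all names are hypothetical stand-ins for LLM-generated fills.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Candidate = Callable[[List[int]], List[int]]

def fingerprint(fn: Candidate, probes: List[List[int]]) -> Tuple[str, ...]:
    """Execute a candidate on each probe; a crash is recorded, not raised."""
    results = []
    for probe in probes:
        try:
            results.append(repr(fn(list(probe))))
        except Exception as exc:
            results.append(f"error:{type(exc).__name__}")
    return tuple(results)

def select_by_fingerprint(
    candidates: Dict[str, Candidate], probes: List[List[int]]
) -> Tuple[Optional[str], Dict[Tuple[str, ...], List[str]]]:
    """Group candidates with identical behaviour and pick one from the
    largest cluster whose fingerprint contains no errors (assumed rule)."""
    clusters: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for name, fn in candidates.items():
        clusters[fingerprint(fn, probes)].append(name)
    valid = [
        names for fp, names in clusters.items()
        if not any(r.startswith("error:") for r in fp)
    ]
    chosen = max(valid, key=len)[0] if valid else None
    return chosen, dict(clusters)

def insertion_sort(xs: List[int]) -> List[int]:
    out: List[int] = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

# Toy stand-in for K=2 strategies x M=2 fills on "sort a list".
candidates = {
    "builtin_a": lambda xs: sorted(xs),
    "builtin_b": lambda xs: list(sorted(xs)),
    "insertion_a": insertion_sort,
    "insertion_b": lambda xs: xs,  # buggy fill: returns the input unsorted
}
probes = [[3, 1, 2], [], [5, 5, 1]]
chosen, clusters = select_by_fingerprint(candidates, probes)
```

In this toy run the three correct fills share one fingerprint and the buggy fill sits alone, so agreement across structurally different strategies is what drives the choice.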

2605.08657 2026-05-12 cs.LG cs.AI

Fitting Multilinear Polynomials for Logic Gate Networks

Youngsung Kim

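AI Summary This paper studies learnable logic gate networks that stack layers of 2-input Boolean gates to build combinational circuits. Each Boolean gate corresponds to a multilinear polynomial in a 4-dimensional space, which turns training into a vector-quantization problem. The author proposes an improved method based on the covariance Jacobian that resolves the vanishing gradients conventional methods suffer as depth increases, and it performs better on multiple datasets.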

Abstract

We study learnable logic gate networks that stack layers of 2-input Boolean gates to build combinational circuits. Every 2-input gate has a unique multilinear polynomial with 4 coefficients, so the 16 Boolean gates form a codebook of prototypes in a 4-dimensional space, reducing training to a vector-quantization problem. The baseline method, Soft-Mix, learns a 16-dimensional softmax over gate identities, but the codebook has rank 4: 11 of 15 simplex directions carry nullspace gradient, and at uniform initialization the backward signal vanishes exactly. We prove that no affine product reparameterization fixes the resulting interaction-coefficient starvation under STE, and show that the covariance Jacobian of soft-VQ selection bypasses it by coupling the starved coefficient to the always-active constant channel. Working in the 4-dimensional polynomial space reduces each neuron from 16 to 4 parameters. On seven datasets, at least one 4-parameter method matches or exceeds Soft-Mix on every dataset; the CovJac advantage over STE grows monotonically with interaction demand across all seven datasets. At depth, Soft-Mix collapses (-37.3 pp on CIFAR-10 at 12 layers) while CovJac holds (-0.5 pp on CIFAR-10, stable on MNIST).
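The codebook claim can be checked numerically. The sketch below writes each gate as f(a, b) = c0 + c1*a + c2*b + c3*a*b, builds the 16 x 4 coefficient matrix, and counts the directions on the probability simplex that leave the mixed polynomial unchanged. The coefficient ordering and the reading of "nullspace directions" as zero-sum perturbations orthogonal to the codebook columns are assumptions of this illustration, not taken from the paper.

```python
import itertools
import numpy as np

def gate_coefficients(truth_table):
    """Coefficients (c0, c1, c2, c3) of f(a, b) = c0 + c1*a + c2*b + c3*a*b
    for a gate with outputs (f00, f01, f10, f11), indexed as f[a][b]."""
    f00, f01, f10, f11 = truth_table
    return (f00, f10 - f00, f01 - f00, f11 - f10 - f01 + f00)

# All 16 two-input Boolean gates, one truth table each.
truth_tables = list(itertools.product([0, 1], repeat=4))
codebook = np.array([gate_coefficients(tt) for tt in truth_tables], dtype=float)

# Sanity check: each polynomial reproduces its gate on the Boolean inputs.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
for tt, (c0, c1, c2, c3) in zip(truth_tables, codebook):
    for (a, b), expected in zip(inputs, tt):
        assert c0 + c1 * a + c2 * b + c3 * a * b == expected

rank = int(np.linalg.matrix_rank(codebook))

# A perturbation d of the 16 mixture weights stays on the simplex when
# sum(d) = 0 and leaves the mixed coefficients unchanged when
# codebook.T @ d = 0; both hold on the null space of [codebook | 1].T.
augmented = np.hstack([codebook, np.ones((16, 1))])
null_directions = 16 - int(np.linalg.matrix_rank(augmented))
```

Under this reading the count comes out as 15 simplex directions minus 4 effective ones, which matches the 11 quoted in the abstract.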

2605.08653 2026-05-12 cs.AI

C2L-Net: A Data-Driven Model for State-of-Charge Estimation of Lithium-Ion Batteries During Discharge

Khoa Tran, T. Nguyen-Thoi, Vin Nguyen-Thai, Duong Tran Anh, Hung-Cuong Trinh, Tri Le

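AI Summary This paper proposes C2L-Net, a data-driven model for accurately estimating the state of charge (SOC) of lithium-ion batteries during discharge. The model performs real-time estimation using only a short 20-second history window, overcoming the high computational cost and positional bias of conventional methods that depend on long historical sequences. C2L-Net separates context from the latest measurement and combines chunk-level feature extraction, causal encoding, and a recursive decoding mechanism, markedly improving computational efficiency and adaptation to dynamics. Experiments under multiple fixed-temperature conditions show superior accuracy and efficiency.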

Abstract

Accurate state-of-charge (SOC) estimation is critical for the safe and efficient operation of lithium-ion batteries in battery management systems (BMS). Although data-driven approaches can effectively capture nonlinear battery dynamics, many existing methods rely on long historical input sequences, resulting in high computational cost and introducing padding-induced positional bias at the beginning of drive cycles. To address these limitations, we propose C2L-Net, a novel context-to-latest data-driven framework for realistic online SOC estimation using only a short historical window (20 s). Unlike existing short-receptive-field or long-history models, the proposed framework explicitly separates contextual encoding from latest-measurement updating, enabling both efficient temporal modeling and rapid adaptation to dynamic battery states. The proposed model incorporates a chunk-based feature extraction mechanism that combines Theta Attention Pooling with a Fourier-based Seasonality Basis to capture local temporal patterns while reducing sequence length. A causal context encoder, integrating a gated recurrent unit (GRU) with Causal Cosine Attention, models temporal dependencies without information leakage. Furthermore, a latest-measurement decoder, inspired by recursive filtering, updates the contextual state using the most recent measurement, enhancing responsiveness to dynamic operating conditions. Extensive experiments on a public lithium-ion battery drive-cycle dataset under multiple fixed-temperature conditions demonstrate that the proposed method achieves state-of-the-art or competitive accuracy while significantly improving computational efficiency. In particular, C2L-Net achieves up to 60 times faster inference and requires fewer parameters than recent data-driven baselines, while maintaining robust performance across unseen driving profiles.

2605.08651 2026-05-12 cs.CV cs.AI cs.LG

Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection

Lei Wang, Wenxiang Diao, Andrew Busch, Jun Zhou, Yongsheng Gao

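AI Summary This paper studies privacy-aware video anomaly detection and proposes a new privacy-preserving method based on orthogonal subspace projection. Its core components are the Orthogonal Projection Layer (OPL) and the Guided Orthogonal Projection Layer (G-OPL), which remove task-irrelevant feature variation and suppress facial-attribute information while preserving non-identifying features such as motion and pose. The method protects privacy while maintaining detection performance and introduces a privacy-aware evaluation framework. Experiments show that it filters sensitive information effectively while improving detection accuracy.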

Comments Accepted as a Spotlight paper at the Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

Video anomaly detection (VAD) systems often prioritize accuracy while overlooking privacy concerns, limiting their suitability for real-world deployment. We propose the Orthogonal Projection Layer (OPL), a lightweight module that removes task-irrelevant variations to produce representations focused on anomaly-relevant cues. To address privacy risks in human-centered scenarios, we introduce Guided OPL (G-OPL), which suppresses facial attributes using weak supervision from face-presence signals while preserving non-identifying features such as pose and motion. A cosine alignment objective enforces consistent capture and removal of facial information without identity labels or adversarial training. We further present a privacy-aware evaluation framework that jointly assesses detection performance and privacy preservation, and enables analysis of how sensitive information is filtered. Experiments show that embedding privacy constraints into model design reduces sensitive information while maintaining or improving detection accuracy, supporting projection-based architectures as a principled approach for privacy-aware VAD.
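The abstract does not describe how OPL is built, but the underlying linear-algebra idea, removing a subspace from a feature vector by orthogonal projection, can be sketched with NumPy. Everything here is an assumption for illustration: the nuisance directions are fixed random vectors rather than learned ones, and the function name is hypothetical.

```python
import numpy as np

def orthogonal_complement_projector(directions: np.ndarray) -> np.ndarray:
    """Return P = I - Q Q^T, where the columns of Q are an orthonormal basis
    for the span of `directions` (shape: feature_dim x num_directions)."""
    q, _ = np.linalg.qr(directions)  # reduced QR gives an orthonormal basis
    return np.eye(directions.shape[0]) - q @ q.T

rng = np.random.default_rng(0)
nuisance = rng.normal(size=(16, 3))  # hypothetical directions to suppress
features = rng.normal(size=(5, 16))  # five 16-dimensional feature vectors

projector = orthogonal_complement_projector(nuisance)
cleaned = features @ projector  # the projector is symmetric, so no transpose

# After projection, no feature has any component along the removed directions.
leakage = float(np.abs(cleaned @ nuisance).max())
```

A projection of this kind removes the chosen subspace exactly and leaves the orthogonal remainder of each feature untouched, which is the property a projection-based layer relies on when it discards some variation while keeping the rest.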

2605.08648 2026-05-12 cs.LG q-bio.NC

FLUX: Geometry-Aware Longitudinal Flow Matching with Mixture of Experts

Josue Ortega Caro, Yongxu Zhang, Hannah M Batchelor, Sizhuang He, Jessica Cardin, Shreya Saxena

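AI Summary Many biological systems evolve through continuous local dynamics while switching between latent regimes defined by learning, stimulus context, internal state, or developmental stage. Such processes can usually be observed only as unpaired longitudinal snapshots, which complicates trajectory modeling and regime identification. This paper proposes FLUX, a geometry-aware longitudinal flow-matching framework based on a mixture of experts that jointly models transport and performs unsupervised regime discovery. FLUX learns a data-dependent metric to construct geometry-aware conditional paths and decomposes the velocity field into sparse expert vector fields selected by a Straight-Through Gumbel-Softmax router, successfully reconstructing longitudinal transport and recovering interpretable regime structure across multiple biological systems.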

详情
英文摘要

Many biological systems evolve through continuous local dynamics while switching between latent regimes defined by learning, stimulus context, internal state, or developmental stage. These processes are often observed only as unpaired longitudinal snapshots: the same cells, neurons, or animals are not tracked as matched trajectories, even though population states are sampled across successive stages. This creates two coupled challenges. First, trajectories must respect curved low-dimensional manifolds embedded in high-dimensional biological measurements. Second, the model must identify when the transport mechanism itself changes. We introduce FLUX (FLow matching for Unpaired longitudinal data with miXture-of-experts), a geometry-aware longitudinal flow-matching framework for joint transport modeling and unsupervised regime discovery. FLUX learns a data-dependent metric from pooled labeled and unlabeled observations, uses that metric to construct geometry-aware conditional paths between adjacent marginals, and decomposes the resulting velocity field into sparse expert vector fields selected by a Straight-Through Gumbel-Softmax router. Across manifold controls, a regime-switching Lorenz system, widefield cortical calcium imaging during associative learning, and embryoid body single-cell differentiation, FLUX reconstructs longitudinal transport while recovering interpretable regime structure. Ablations show that mixture-of-experts routing alone is insufficient: FLUX without geometric learning can fit local transport but fails or weakens regime discovery when regimes are encoded in local dynamics. These results suggest that geometry-aware velocity decomposition provides a general strategy for discovering latent biological state transitions from unpaired longitudinal snapshots.