Music Audio-Visual Question Answering Requires Specialized Multimodal Designs
Comments Accepted to Annual Meeting of the Association for Computational Linguistics (ACL 2026). The first two authors contributed equally
Wenhao You, Xingjian Diao, Wenjun Huang, Chunhui Zhang, Keyi Kong, Weiyi Wu, Chiyu Ma, Zhongyu Ouyang, Tingxuan Wu, Ming Cheng, Soroush Vosoughi, Jiang Gui
Comments Accepted to Annual Meeting of the Association for Computational Linguistics (ACL 2026). The first two authors contributed equally
While recent Multimodal Large Language Models exhibit impressive capabilities for general multimodal tasks, specialized domains like music necessitate tailored approaches. Music Audio-Visual Question Answering (Music AVQA) particularly underscores this, presenting unique challenges with its continuous, densely layered audio-visual content, intricate temporal dynamics, and the critical need for domain-specific knowledge. Through a systematic analysis of Music AVQA datasets and methods, this paper identifies that specialized input processing, architectures incorporating dedicated spatial-temporal designs, and music-specific modeling strategies are critical for success in this domain. Our study provides valuable insights for researchers by highlighting effective design patterns empirically linked to strong performance, proposing concrete future directions for incorporating musical priors, and aiming to establish a robust foundation for advancing multimodal musical understanding. We aim to encourage further research in this area and provide a GitHub repository of relevant works: https://github.com/WenhaoYou1/Survey4MusicAVQA.
Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
Comments NeurIPS 2025 (Spotlight)
Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/.
Junhao Liu, Haonan Yu, Zhenyu Yan, Xin Zhang
Comments Accepted to ACL 2026 Main Conference
Post-hoc explanations provide transparency and are essential for guiding model optimization, such as prompt engineering and data sanitation. However, applying model-agnostic techniques to Large Language Models (LLMs) is hindered by prohibitive computational costs, rendering these tools dormant for real-world applications. To revitalize model-agnostic interpretability, we propose a budget-friendly proxy framework that leverages efficient models to approximate the decision boundaries of expensive LLMs. We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment. Our empirical evaluation confirms that proxy explanations achieve over 90% fidelity with only 11% of the oracle's cost. Building on this foundation, we demonstrate the actionable utility of our framework in prompt compression and poisoned example removal. Results show that reliable proxy explanations effectively guide optimization, transforming interpretability from a passive observation tool into a scalable primitive for LLM development. Additionally, we open-source code and datasets to facilitate future research.
Feng Yu, Jia Hu, Geyong Min
Federated Parameter-Efficient Fine-Tuning (Fed-PEFT) enables lightweight adaptation of large pre-trained models in federated learning settings by updating only a small subset of parameters. However, Fed-PEFT methods typically assume a fixed label space and static downstream tasks, which is restrictive in realistic application scenarios where clients continuously encounter new classes over time. This leads to an emerging problem, known as \emph{Federated Continual Fine-Tuning} (FCFT). In FCFT, clients collaboratively fine-tune a pre-trained model over a sequence of tasks, where each client observes disjoint sets of new classes over time, and task identity is unavailable at inference time. FCFT is challenging because it simultaneously suffers from severe forgetting under non-IID client data distributions, parameter growth and task-specific inference caused by task-wise modules, and aggregation inconsistency across heterogeneous clients. To address these challenges, we propose Federated Task-agnostic Low-rank Residual Adaptation (Fed-TaLoRA), a novel approach for efficient FCFT built on task-agnostic adaptation, post-aggregation model calibration, and strategic low-rank adaptation placement. Fed-TaLoRA continuously fine-tunes a single shared module across sequential tasks to avoid task-wise parameter growth, and further introduces a theoretically grounded residual weight update mechanism to calibrate the aggregated global model and improve aggregation fidelity. We provide a theoretical analysis of the convergence and aggregation behavior of Fed-TaLoRA. Extensive experiments on four benchmark datasets demonstrate that Fed-TaLoRA consistently outperforms strong baselines while reducing communication and computation costs significantly.
Akira Tamamori
Comments 17 pages, 7 figures; accepted to NOLTA, IEICE
Kernel-based learning methods such as Kernel Logistic Regression (KLR) can substantially increase the storage capacity of Hopfield networks, but the principles governing their performance and stability remain largely uncharacterized. This paper presents a comprehensive quantitative analysis of the attractor landscape in KLR-trained networks to establish a solid foundation for their design and application. Through extensive, statistically validated simulations, we address critical questions of generality, scalability, and robustness. Our comparative analysis shows that KLR and Kernel Ridge Regression (KRR) exhibit similarly high storage capacities and clean attractor landscapes under typical operating conditions, suggesting that this behavior is a general property of kernel regression methods, although KRR is computationally much faster. We identify a non-trivial, scale-dependent law for the kernel width $γ$, demonstrating that optimal capacity requires $γ$ to be scaled such that $γN$ increases with network size $N$. This finding implies that larger networks require more localized kernels, in which each pattern's influence is more spatially confined, to mitigate inter-pattern interference. Under this optimized scaling, we provide clear evidence that storage capacity scales linearly with network size~($P \propto N$). Furthermore, our sensitivity analysis shows that performance is remarkably robust with respect to the choice of the regularization parameter $λ$. Collectively, these findings provide a concise set of empirical principles for designing high-capacity and robust associative memories and clarify the mechanisms that enable kernel methods to overcome the classical limitations of Hopfield-type models.
Pawel Pukowski, Venet Osmani
Comments Submitted to Springer ML
Class-bias, that is class-wise performance disparities, is typically attributed to data imbalance and addressed through frequency-based resampling. However, we demonstrate that substantial bias persists even in perfectly balanced datasets, proving that class frequency alone cannot explain unequal model performance. We investigate these disparities through the lens of class-level learning difficulty and propose Hardness-Based Resampling (HBR), a strategy that leverages hardness estimates to guide data selection. To better capture these effects, we introduce an evaluation protocol that complements global metrics with gap- and dispersion-based measures. Our experiments show that HBR significantly reduces recall gaps, by up to 32% on CIFAR-10 and 16% on CIFAR-100, outperforming standard frequency-based resampling. We further show that we can improve fairness outcomes by selectively using the hardest samples from a state-of-the-art diffusion model, rather than randomly selecting them. These findings demonstrate that data balance alone is insufficient to mitigate class bias, necessitating a shift toward hardness-aware approaches.
Zhenwei Shao, Mingyang Wang, Weijun Zhang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Jun Yu
Comments An extended version of our ICCV paper at ICCV2025/html/Shao_Growing_a_Twig_to_Accelerate_Large_Vision-Language_Models_ICCV_2025_paper.html" target="_blank" rel="noopener">https://openaccess.thecvf.com/content/ICCV2025/html/Shao_Growing_a_Twig_to_Accelerate_Large_Vision-Language_Models_ICCV_2025_paper.html
Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM's early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM -- a simple and general architecture by growing a lightweight module, named twig, upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of the visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods. Moreover, we extend TwigVLM to an improved TwigVLM++ variant by introducing a novel multi-head twig architecture with a specialized pruning head. TwigVLM++ improves pruning quality via a two-stage training paradigm combining a distillation learning stage and a pruning-oriented reinforcement learning stage, and further accelerates inference via a tree-based SSD strategy.
Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Zeyu Han, Haibao Yu, Chuang Zhang, Lei He, Shaobing Xu, Jianqiang Wang
Comments Accepted by AAAI 2026
While cooperative perception can overcome the limitations of single-vehicle systems, the practical implementation of vehicle-to-vehicle and vehicle-to-infrastructure systems is often impeded by significant economic barriers. Aerial-ground cooperation (AGC), which pairs ground vehicles with drones, presents a more economically viable and rapidly deployable alternative. However, this emerging field has been held back by a critical lack of high-quality public datasets and benchmarks. To bridge this gap, we present \textit{Griffin}, a comprehensive AGC 3D perception dataset, featuring over 250 dynamic scenes (37k+ frames). It incorporates varied drone altitudes (20-60m), diverse weather conditions, realistic drone dynamics via CARLA-AirSim co-simulation, and critical occlusion-aware 3D annotations. Accompanying the dataset is a unified benchmarking framework for cooperative detection and tracking, with protocols to evaluate communication efficiency, altitude adaptability, and robustness to communication latency, data loss and localization noise. By experiments through different cooperative paradigms, we demonstrate the effectiveness and limitations of current methods and provide crucial insights for future research. The dataset and codes are available at https://github.com/wang-jh18-SVM/Griffin.
Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy
Air quality prediction is key to mitigating health impacts and guiding decisions, yet existing models tend to focus on temporal trends while overlooking spatial generalization. We propose AQ-Net, a spatiotemporal reanalysis model for both observed and unobserved stations in the near future. AQ-Net utilizes the LSTM and multi-head attention for the temporal regression. We also propose a cyclic encoding technique to ensure continuous time representation. To learn fine-grained spatial air quality estimation, we incorporate AQ-Net with the neural kNN to explore feature-based interpolation, such that we can fill the spatial gaps given coarse observation stations. To demonstrate the efficiency of our model for spatiotemporal reanalysis, we use data from 2013-2017 collected in northern China for PM2.5 analysis. Extensive experiments show that AQ-Net excels in air quality reanalysis, highlighting the potential of hybrid spatio-temporal models to better capture environmental dynamics, especially in urban areas where both spatial and temporal variability are critical.
Josua Faller, Jörg Martin
Comments for associated code, see https://github.com/josh3142/LowRankLaplaceApproximation
Subspace inference for neural networks assumes that a subspace of their parameter space suffices to produce a reliable uncertainty quantification. In this work, we underpin the validity of this assumption by using low rank techniques. We derive an expression for a subspace model to a Bayesian inference scenario based on the Laplace approximation that is, in a certain sense, optimal given a specific dataset. We empirically show that a Laplace approximation constructed with a dimensionally reduced covariance matrix closely matches the full Laplace approximation obtained using the exact covariance matrix. Where feasible, this subspace model can serve as a baseline for benchmarking the performance of subspace models. In addition, we provide a scalable approximation of this subspace construction that is usable in practice and compare it to existing subspace models from the literature. In general, our approximation scheme outperforms previous work. Furthermore, we present a metric to qualitatively compare the approximation quality of different subspace models even if the exact Laplace approximation is unknown.
Xin He, Yili Wang, Wenqi Fan, Xu Shen, Xin Juan, Rui Miao, Xin Wang
Comments Accepted by The Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025)
Graph Neural Networks (GNNs) have shown great success in various graph-based learning tasks. However, it often faces the issue of over-smoothing as the model depth increases, which causes all node representations to converge to a single value and become indistinguishable. This issue stems from the inherent limitations of GNNs, which struggle to distinguish the importance of information from different neighborhoods. In this paper, we introduce MbaGCN, a novel graph convolutional architecture that draws inspiration from the Mamba paradigm-originally designed for sequence modeling. MbaGCN presents a new backbone for GNNs, consisting of three key components: the Message Aggregation Layer, the Selective State Space Transition Layer, and the Node State Prediction Layer. These components work in tandem to adaptively aggregate neighborhood information, providing greater flexibility and scalability for deep GNN models. While MbaGCN may not consistently outperform all existing methods on each dataset, it provides a foundational framework that demonstrates the effective integration of the Mamba paradigm into graph representation learning. Through extensive experiments on benchmark datasets, we demonstrate that MbaGCN paves the way for future advancements in graph neural network research.
Angel Romero, Ashwin Shenai, Ismail Geles, Elie Aljalbout, Davide Scaramuzza
Comments 8 pages, 6 Figures, accepted to IEEE ICRA 2026
Autonomous drone racing has risen as a challenging robotic benchmark for testing the limits of learning, perception, planning, and control. Expert human pilots are able to fly a drone through a race track by mapping pixels from a single camera directly to control commands. Recent works in autonomous drone racing attempting direct pixel-to-commands control policies have relied on either intermediate representations that simplify the observation space or performed extensive bootstrapping using Imitation Learning (IL). This paper leverages DreamerV3 to train visuomotor policies capable of agile flight through a racetrack using only pixels as observations. In contrast to model-free methods like PPO or SAC, which are sample-inefficient and struggle in this setting, our approach acquires drone racing skills from pixels. Notably, a perception-aware behaviour of actively steering the camera toward texture-rich gate regions emerges without the need of handcrafted reward terms for the viewing direction. Our experiments show in both, simulation and real-world flight using a hardware-in-the-loop setup with rendered image observations, how the proposed approach can be deployed on real quadrotors at speeds of up to 9 m/s. These results advance the state of pixel-based autonomous flight and demonstrate that MBRL offers a promising path for real-world robotics research.
Xin He, Wenqi Fan, Yili Wang, Chengyi Liu, Rui Miao, Xin Juan, Xin Wang
Comments Accepted by The 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD 2026)
Graph Neural Networks (GNNs) are highly vulnerable to adversarial attacks, which can greatly degrade their performance. Existing graph purification methods attempt to address this issue by filtering attacked graphs. However, they struggle to defend effectively against multiple types of adversarial attacks (e.g., targeted attacks and non-targeted attacks) simultaneously due to limited flexibility. Additionally, these methods lack comprehensive modeling of graph data, relying heavily on heuristic prior knowledge. To overcome these challenges, we introduce the Graph Defense Diffusion Model (GDDM), a flexible purification method that leverages the denoising and modeling capabilities of diffusion models. The iterative nature of diffusion models aligns well with the stepwise process of adversarial attacks, making them particularly suitable for defense. By iteratively adding and removing noises (edges), GDDM effectively purifies attacked graphs, restoring their original structures and features. Our GDDM consists of two key components: (1) Graph Structure-Driven Refiner, which preserves the basic fidelity of the graph during the denoising process, and ensures that the generated graph remains consistent with the original scope; and (2) Node Feature-Constrained Regularizer, which removes residual impurities from the denoised graph, further enhancing the purification effect. By designing tailored denoising strategies to handle different types of adversarial attacks, we improve the GDDM's adaptability to various attack scenarios. Furthermore, GDDM demonstrates strong scalability, leveraging its structural properties to seamlessly transfer across similar datasets without retraining. Extensive experiments on three real-world datasets demonstrate that GDDM outperforms state-of-the-art methods in defending against various adversarial attacks, showcasing its robustness and effectiveness.
Eugenio Cuniato, Mike Allenspach, Thomas Stastny, Helen Oleynikova, Roland Siegwart, Michael Pantic
Tilt-rotor aerial robots are more dynamic and versatile than fixed-rotor platforms, since the thrust vector and body orientation are decoupled. However, the coordination of servos and propellers (the allocation problem) is not trivial, especially accounting for overactuation and actuator dynamics. We incrementally build and present three novel allocation methods for tilt-rotor aerial robots, comparing them to state-of-the-art methods on a real system performing dynamic maneuvers. We extend the state-of-the-art geometric allocation into a differential allocation, which uses the platform's redundancy and does not suffer from singularities. We expand it by incorporating actuator dynamics and propeller power dynamics. These allow us to model dynamic propeller acceleration limits, bringing two main advantages: balancing propeller speed without the need for nullspace goals and allowing the platform to selectively turn off propellers during flight, opening the door to new manipulation possibilities. We also use actuator dynamics and limits to normalize the allocation problem, making it easier to tune and allowing it to track 70% faster trajectories than a geometric allocation.
Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Libo Qin, Yichong Huang, Lei Huang, Weitao Ma, Qichen Hong, Zhirui Zhang, Yunfei Lu, Xiaohui Yan, Duyu Tang, Dandan Tu, Bing Qin
Comments IEEE Transactions on Audio, Speech and Language Processing
Current large language models (LLMs) often exhibit imbalances in multilingual capabilities and cultural adaptability, largely attributed to their English-centric pre-training data. In this paper, we introduce and investigate cross-lingual latent transplantation (XTransplant), a probing framework which aims to further exploit the model's internalized multilingual knowledge during inference and examine its effects on the multilingual capability and cultural adaptability of LLMs. XTransplant framework enables models to harness the complementary strengths of both English and non-English resources by transplanting latent activations across languages. Through extensive analysis, we empirically demonstrate that XTransplant, a form of cross-lingual interaction, has mutually beneficial effects on the multilingual capability and cultural adaptability of LLMs, particularly for low-resource languages and cultures. We further reveal that attention modules play a pivotal role in supporting multilingual understanding, while feed-forward modules are more adept at capturing culture-specific knowledge. In addition, we conduct in-depth analysis of XTransplant's stability, effectiveness, and generalizability. By probing the upper bound performance of XTransplant, we expose the considerable underutilization of current LLMs' multilingual potential-a challenge that remains open. We hope our analysis offers a new lens for advancing cross-lingual interactions and better leveraging models' internalized multilingual knowledge.
Yangyang Li, Daqing Liu, Wu Liu, Allen He, Xinchen Liu, Yongdong Zhang, Guoqing Jin
Comments WebPage available at https://tale17.github.io/omni/
Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes. However, existing methods are typically constrained to single-aspect concept generation or are easily disrupted by irrelevant concepts in multi-aspect concept scenarios, leading to concept confusion and hindering creative generation. To address this, we propose OmniPrism, a visual concept disentangling approach for creative image generation. Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts. We utilize the rich semantic space of a multimodal extractor to achieve concept disentanglement from given images and concept guidance. To disentangle concepts with different semantics, we construct a paired concept disentangled dataset (PCD-200K), where each pair shares the same concept such as content, style, and composition. We learn disentangled concept representations through our contrastive orthogonal disentangled (COD) training pipeline, which are then injected into additional diffusion cross-attention layers for generation. A set of block embeddings is designed to adapt each block's concept domain in the diffusion models. Extensive experiments demonstrate that our method can generate high-quality, concept-disentangled results with high fidelity to text prompts and desired concepts.
Shubhajit Roy, Hrriday Ruparel, Kishan Ved, Anirban Dasgupta
Comments Published in Transactions on Machine Learning Research (TMLR), 2026. Available at https://openreview.net/forum?id=g7r7y2I7Sz
Scalability of Graph Neural Networks (GNNs) remains a significant challenge. To tackle this, methods like coarsening, condensation, and computation trees are used to train on a smaller graph, resulting in faster computation. Nonetheless, prior research has not adequately addressed the computational costs during the inference phase. This paper presents a novel approach to improve the scalability of GNNs by reducing computational burden during the inference phase using graph coarsening. We demonstrate two different methods -- Extra Nodes and Cluster Nodes. Our study extends the application of graph coarsening for graph-level tasks, including graph classification and graph regression. We conduct extensive experiments on multiple benchmark datasets to evaluate the performance of our approach. Our results show that the proposed method achieves orders of magnitude improvements in single-node inference time compared to traditional approaches. Furthermore, it significantly reduces memory consumption for node and graph classification and regression tasks, enabling efficient training and inference on low-resource devices where conventional methods are impractical. Notably, these computational advantages are achieved while maintaining competitive performance relative to baseline models.
Tiago da Silva, Eliezer de Souza da Silva, Diego Mesquita
Comments Accepted at NeurIPS 2024, https://openreview.net/forum?id=N5H4z0Pzvn
Generative Flow Networks (GFlowNets) are amortized inference models designed to sample from unnormalized distributions over composable objects, with applications in generative modeling for tasks in fields such as causal discovery, NLP, and drug discovery. Traditionally, the training procedure for GFlowNets seeks to minimize the expected log-squared difference between a proposal (forward policy) and a target (backward policy) distribution, which enforces certain flow-matching conditions. While this training procedure is closely related to variational inference (VI), directly attempting standard Kullback-Leibler (KL) divergence minimization can lead to proven biased and potentially high-variance estimators. Therefore, we first review four divergence measures, namely, Renyi-$α$'s, Tsallis-$α$'s, reverse and forward KL's, and design statistically efficient estimators for their stochastic gradients in the context of training GFlowNets. Then, we verify that properly minimizing these divergences yields a provably correct and empirically effective training scheme, often leading to significantly faster convergence than previously proposed optimization. To achieve this, we design control variates based on the REINFORCE leave-one-out and score-matching estimators to reduce the variance of the learning objectives' gradients. Our work contributes by narrowing the gap between GFlowNets training and generalized variational approximations, paving the way for algorithmic ideas informed by the divergence minimization viewpoint.
Wen Ye, Wei Yang, Defu Cao, Yizhou Zhang, Lumingyuan Tang, Jie Cai, Yan Liu
Time series analysis is crucial in real-world applications, yet traditional methods focus on isolated tasks only, and recent studies on time series reasoning remain limited to either single-step inference or are constrained to natural language answers. In this work, we introduce TS-Reasoner, a domain-specialized agent designed for multi-step time series inference. By integrating large language model (LLM) reasoning with domain-specific computational tools and an error feedback loop, TS-Reasoner enables domain-informed, constraint-aware analytical workflows that combine symbolic reasoning with precise numerical analysis. We assess the system's capabilities along two axes: (1) fundamental time series understanding assessed by TimeSeriesExam and (2) complex, multi-step inference evaluated by a newly proposed dataset designed to test both compositional reasoning and computational precision in time series analysis. Experiments show that our approach outperforms standalone general-purpose LLMs in both basic time series concept understanding as well as the multi-step time series inference task, highlighting the promise of domain-specialized agents for automating real-world time series reasoning and analysis.
Haotian Tan, Sakriani Sakti
Comments Accepted to Interspeech 2024 main conference
Recent advances in simultaneous speech translation (SST) focus on the decision policies that enable the use of offline-trained ST models for simultaneous inference. These decision policies not only control the quality-latency trade-off in SST but also mitigate the impact of unstable predictions on translation quality by delaying translation for more context or discarding these predictions through stable hypothesis detection. However, these policies often overlook the potential benefits of utilizing unstable predictions. We introduce the contrastive feedback mechanism (CFM) for SST, a novel method that leverages these unstable predictions as feedback to improve translation quality. CFM guides the system to eliminate undesired model behaviors from these predictions through a contrastive objective. The experiments on 3 state-of-the-art decision policies across 8 languages in the MuST-C v1.0 dataset show that CFM effectively improves the performance of SST.
Wei Duan, Jie Lu, Junyu Xuan
Comments Accepted by IJCAI 2024. Update Discussion
Cooperative Multi-Agent Reinforcement Learning (MARL) necessitates seamless collaboration among agents, often represented by an underlying relation graph. Existing methods for learning this graph primarily focus on agent-pair relations, neglecting higher-order relationships. While several approaches attempt to extend cooperation modelling to encompass behaviour similarities within groups, they commonly fall short in concurrently learning the latent graph, thereby constraining the information exchange among partially observed agents. To overcome these limitations, we present a novel approach to infer the Group-Aware Coordination Graph (GACG), which is designed to capture both the cooperation between agent pairs based on current observations and group-level dependencies from behaviour patterns observed across trajectories. This graph is further used in graph convolution for information exchange between agents during decision-making. To further ensure behavioural consistency among agents within the same group, we introduce a group distance loss, which promotes group cohesion and encourages specialization between groups. Our evaluations, conducted on StarCraft II micromanagement tasks, demonstrate GACG's superior performance. An ablation study further provides experimental evidence of the effectiveness of each component of our method.
Wei Duan, Jie Lu, Junyu Xuan
Comments Accepted by IEEE TNNLS on 17-Nov-2024. Update Discussion
Effective agent coordination is crucial in cooperative Multi-Agent Reinforcement Learning (MARL). While agent cooperation can be represented by graph structures, prevailing graph learning methods in MARL are limited. They rely solely on one-step observations, neglecting crucial historical experiences, leading to deficient graphs that foster redundant or detrimental information exchanges. Additionally, high computational demands for action-pair calculations in dense graphs impede scalability. To address these challenges, we propose inferring a Latent Temporal Sparse Coordination Graph (LTS-CG) for MARL. The LTS-CG leverages agents' historical observations to calculate an agent-pair probability matrix, where a sparse graph is sampled from and used for knowledge exchange between agents, thereby simultaneously capturing agent dependencies and relation uncertainty. The computational complexity of this procedure is only related to the number of agents. This graph learning process is further augmented by two innovative characteristics: Predict-Future, which enables agents to foresee upcoming observations, and Infer-Present, ensuring a thorough grasp of the environmental context from limited data. These features allow LTS-CG to construct temporal graphs from historical and real-time information, promoting knowledge exchange during policy learning and effective collaboration. Graph learning and agent training occur simultaneously in an end-to-end manner. Our demonstrated results on the StarCraft II benchmark underscore LTS-CG's superior performance.
Dmytro Zakharov, Oleksandr Kuznetsov, Emanuele Frontoni, Natalia Kryvinska
Biometric authentication systems are crucial for security, but developing them involves various complexities, including privacy, security, and achieving high accuracy without directly storing pure biometric data in storage. We introduce an innovative image distortion technique that makes facial images unrecognizable to the eye but still identifiable by any custom embedding neural network model. Using the proposed approach, we test the reliability of biometric recognition networks by determining the maximum image distortion that does not change the predicted identity. Through experiments on MNIST and LFW datasets, we assess its effectiveness and compare it based on the traditional comparison metrics.
Jung-Hoon Cho, Sirui Li, Jeongyun Kim, Cathy Wu
Comments 18 pages, 12 figures
The recent development of connected and automated vehicle (CAV) technologies has spurred investigations to optimize dense urban traffic to maximize vehicle speed and throughput. This paper explores advisory autonomy, in which real-time driving advisories are issued to the human drivers, thus achieving near-term performance of automated vehicles. Due to the complexity of traffic systems, recent studies of coordinating CAVs have resorted to leveraging deep reinforcement learning (RL). Coarse-grained advisory is formalized as zero-order holds, and we consider a range of hold duration from 0.1 to 40 seconds. However, despite the similarity of the higher frequency tasks on CAVs, a direct application of deep RL fails to be generalized to advisory autonomy tasks. To overcome this, we utilize zero-shot transfer, training policies on a set of source tasks--specific traffic scenarios with designated hold durations--and then evaluating the efficacy of these policies on different target tasks. We introduce Temporal Transfer Learning (TTL) algorithms to select source tasks for zero-shot transfer, systematically leveraging the temporal structure to solve the full range of tasks. TTL selects the most suitable source tasks to maximize the performance of the range of tasks. We validate our algorithms on diverse mixed-traffic scenarios, demonstrating that TTL more reliably solves the tasks than baselines. This paper underscores the potential of coarse-grained advisory autonomy with TTL in traffic flow optimization.
Zixuan Hu, Yongxian Wei, Li Shen, Zhenyi Wang, Baoyuan Wu, Chun Yuan, Dacheng Tao
Data-Free Meta-Learning (DFML) aims to enable efficient learning of unseen few-shot tasks, by meta-learning from multiple pre-trained models without accessing their original training data. While existing DFML methods typically generate synthetic data from these models to perform meta-learning, a comprehensive analysis of DFML's robustness-particularly its failure modes and vulnerability to potential attacks-remains notably absent. Such an analysis is crucial as algorithms often operate in complex and uncertain real-world environments. This paper fills this significant gap by systematically investigating the robustness of DFML, identifying two critical but previously overlooked vulnerabilities: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC). TDS refers to the sequential shifts in the evolving task distribution, leading to the catastrophic forgetting of previously learned meta-knowledge. TDC exposes a security flaw of DFML, revealing its susceptibility to attacks when the pre-trained model pool includes untrustworthy models that deceptively claim to be beneficial but are actually harmful. To mitigate these vulnerabilities, we propose a trustworthy DFML framework comprising three components: synthetic task reconstruction, meta-learning with task memory interpolation, and automatic model selection. Specifically, utilizing model inversion techniques, we reconstruct synthetic tasks from multiple pre-trained models to perform meta-learning. To prevent forgetting, we introduce a strategy to replay interpolated historical tasks to efficiently recall previous meta-knowledge. Furthermore, our framework seamlessly incorporates an automatic model selection mechanism to automatically filter out untrustworthy models during the meta-learning process. Code is available at https://github.com/Egg-Hu/Trustworthy-DFML.
Anthony T. Nixon
Comments 34 pages, 13 figures. Code: https://github.com/alch3mistdev/semantic-rate-distortion
When two agents of different computational capacities interact with the same environment, they need not compress a common semantic alphabet differently; they can induce different semantic alphabets altogether. We show that the quotient POMDP $Q_{m,T}(M)$ - the unique coarsest abstraction consistent with an agent's capacity - serves as a capacity-derived semantic space for any bounded agent, and that communication between heterogeneous agents exhibits a sharp structural phase transition. Below a critical rate $R_{\text{crit}}$ determined by the quotient mismatch, intent-preserving communication is structurally impossible. In the supported one-way memoryless regime, classical side-information coding then yields exponential decay above the induced benchmark. Classical coding theorems tell you the rate once the source alphabet is fixed; our contribution is to derive that alphabet from bounded interaction itself. Concretely, we prove: (1) a fixed-$\varepsilon$ structural phase-transition theorem whose lower bound is fully general on the common-history quotient comparison; (2) a one-way Wyner-Ziv benchmark identification on quotient alphabets, with exact converse, exact operational equality for memoryless quotient sources, and an ergodic long-run bridge via explicit mixing bounds; (3) an asymptotic one-way converse in the shrinking-distortion regime $\varepsilon = O(1/T)$, proved from the message stream and decoder side information; and (4) alignment traversal bounds enabling compositional communication through intermediate capacity levels. Experiments on eight POMDP environments (including RockSample(4,4)) illustrate the phase transition, a structured-policy benchmark shows the one-way rate can drop by up to $19\times$ relative to the counting bound, and a shrinking-distortion sweep matches the regime of the asymptotic converse.
Israt Jahan Mouri, Muhammad Ridowan, Muhammad Abdullah Adnan
Comments 21 pages, 9 figures, 7 tables
Model poisoning attacks pose a significant security threat to Federated Learning (FL). Most existing model poisoning attacks rely on collusion, requiring adversarial clients to coordinate by exchanging local benign models and synchronizing the generation of their poisoned updates. However, sustaining such coordination is increasingly impractical in real-world FL deployments, as it effectively requires botnet-like control over many devices. This approach is costly to maintain and highly vulnerable to detection. This context raises a fundamental question: Can model poisoning attacks remain effective without any communication between attackers? To address this challenge, we introduce and formalize the \textbf{non-collusive attack model}, in which all compromised clients share a common adversarial objective but operate independently. Under this model, each attacker generates its malicious update without communicating with other adversaries, accessing other clients' updates, or relying on any knowledge of server-side defenses. To demonstrate the feasibility of this threat model, we propose \textbf{XFED}, the first aggregation-agnostic, non-collusive model poisoning attack. Our empirical evaluation across six benchmark datasets shows that XFED bypasses eight state-of-the-art defenses and outperforms six existing model poisoning attacks. These findings indicate that FL systems are substantially less secure than previously believed and underscore the urgent need for more robust and practical defense mechanisms.
Muazzem Hussain Khan, Tasdid Hasnain, Md. Jamil khan, Ruhul Amin, Md. Shamim Reza, Md. Al Mehedi Hasan, Md Ashad Alam
Comments 25 [ages. 9 Figures
In this study, we proposed a deep Swin-Vision Transformer-based transfer learning architecture for robust multi-cancer histopathological image classification. The proposed framework integrates a hierarchical Swin Transformer with ResNet50-based convolution features extraction, enabling the model to capture both long-range contextual dependencies and fine-grained local morphological patterns within histopathological images. To validate the efficiency of the proposed architecture, an extensive experiment was executed on a comprehensive multi-cancer dataset including Breast Cancer, Oral Cancer, Lung and Colon Cancer, Kidney Cancer, and Acute Lymphocytic Leukemia (ALL), including both original and segmented images were analyzed to assess model robustness across heterogeneous clinical imaging conditions. Our approach is benchmarked alongside several state-of-the-art CNN and transfer models, including DenseNet121, DenseNet201, InceptionV3, ResNet50, EfficientNetB3, multiple ViT variants, and Swin Transformer models. However, all models were trained and validated using a unified pipeline, incorporating balanced data preprocessing, transfer learning, and fine-tuning strategies. The experimental results demonstrated that our proposed architecture consistently gained superior performance, reaching 100% test accuracy for lung-colon cancer, segmented leukemia datasets, and up to 99.23% accuracy for breast cancer classification. The model also achieved near-perfect precision, f1 score, and recall, indicating highly stable scores across divers cancer types. Overall, the proposed model establishes a highly accurate, interpretable, and also robust multi-cancer classification system, demonstrating strong benchmark for future research and provides a unified comparative assessment useful for designing reliable AI-assisted histopathological diagnosis and clinical decision-making.
Sepideh Nikookar, Edward Tian, Harrison Hoffman, Matthew Parks, J. Lucas McKay, Yashar Kiarashi, Tommy T. Thomas, Alex Hall, David W. Wright, Gari D. Clifford
The accelerometer has become an almost ubiquitous device, providing enormous opportunities in healthcare monitoring beyond step counting or other average energy estimates in 15-60 second epochs. Objective: To develop an open data set with associated open-source code for processing 50 Hz tri-axial accelerometry-based to classify patient activity levels and natural types of movement. Approach: Data were collected from 23 healthy subjects (16 males and seven females) aged between 23 and 62 years using an ambulatory device, which included a triaxial accelerometer and synchronous lead II equivalent ECG for an average of 26 minutes each. Participants followed a standardized activity routine involving five distinct activities: lying, sitting, standing, walking, and jogging. Two classifiers were constructed: a signal processing technique to distinguish between high and low activity levels and a convolutional neural network (CNN)-based approach to classify each of the five activities. Main results: The binary (high/low) activity classifier exhibited an F1 score of 0.79. The multi-class CNN-based classifier provided an F1 score of 0.83. The code for this analysis has been made available under an open-source license together with the data on which the classifiers were trained and tested. Significance: The classification of behavioral activity, as demonstrated in this study, offers valuable context for interpreting traditional health metrics and may provide contextual information to support the future development of clinical decision-making tools for patient monitoring, predictive analytics, and personalized health interventions.
Mohammad Ali Vahedifar, Mojtaba Nazari, Qi Zhang
The Tactile Internet demands sub-millisecond latency and ultra-high reliability, as high latency or packet loss could lead to haptic control instability. To address this, we propose the Mode-Domain Architecture (MDA), a bilateral predictive neural network architecture designed to restore missing signals on both the human and robot sides. Unlike conventional models that extract features implicitly from raw data, MDA utilizes a novel Continuous-Orthogonal Mode Decomposition framework. By integrating an orthogonality constraint, we overcome the pervasive issue of "mode overlapping" found in state-of-the-art decomposition methods. Experimental results demonstrate that this structured feature extraction achieves high prediction accuracies of 98.6% (human) and 97.3% (robot). Furthermore, the model achieves ultra-low inference latency of 0.065 ms, significantly outperforming existing benchmarks and meeting the stringent real-time requirements of haptic teleoperation.
扫码添加微信好友,提出您的宝贵建议 👇
💡 备注请填写:网站反馈