Cross-subject Muscle Fatigue Detection via Adversarial and Supervised Contrastive Learning with Inception-Attention Network
Comments This work has been submitted to ICARM 2026 for possible publication. 6 pages, 7 figures, 5 tables
Zitao Lin, Chang Zhu, Wei Meng
Comments This work has been submitted to ICARM 2026 for possible publication. 6 pages, 7 figures, 5 tables
Muscle fatigue detection plays an important role in physical rehabilitation. Previous researches have demonstrated that sEMG offers superior sensitivity in detecting muscle fatigue compared to other biological signals. However, features extracted from sEMG may vary during dynamic contractions and across different subjects, which causes unstability in fatigue detection. To address these challenges, this research proposes a novel neural network comprising an Inception-attention module as a feature extractor, a fatigue classifier and a domain classifier equipped with a gradient reversal layer. The integrated domain classifier encourages the network to learn subject-invariant common fatigue features while minimizing subject-specific features. Furthermore, a supervised contrastive loss function is also employed to enhance the generalization capability of the model. Experimental results demonstrate that the proposed model achieved outstanding performance in three-class classification tasks, reaching 93.54% accuracy, 92.69% recall and 92.69% F1-score, providing a robust solution for cross-subject muscle fatigue detection, offering significant guidance for rehabilitation training and assistance.
Divyanshu Kumar, Ishita Gupta, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi
How biased is a language model? The answer depends on how you ask. A model that refuses to choose between castes for a leadership role will, in a fill-in-the-blank task, reliably associate upper castes with purity and lower castes with lack of hygiene. Single-task benchmarks miss this because they capture only one slice of a model's bias profile. We introduce a hierarchical taxonomy covering 9 bias types, including under-studied axes like caste, linguistic, and geographic bias, operationalized through 7 evaluation tasks that span explicit decision-making to implicit association. Auditing 7 commercial and open-weight LLMs with \textasciitilde45K prompts, we find three systematic patterns. First, bias is task-dependent: models counter stereotypes on explicit probes but reproduce them on implicit ones, with Stereotype Score divergences up to 0.43 between task types for the same model and identity groups. Second, safety alignment is asymmetric: models refuse to assign negative traits to marginalized groups, but freely associate positive traits with privileged ones. Third, under-studied bias axes show the strongest stereotyping across all models, suggesting alignment effort tracks benchmark coverage rather than harm severity. These results demonstrate that single-benchmark audits systematically mischaracterize LLM bias and that current alignment practices mask representational harm rather than mitigating it.
Vira Kasprova, Amruta Parulekar, Abdulrahman AlRabah, Krishna Agaram, Ritwik Garg, Sagar Jha, Nimet Beyza Bozdag, Dilek Hakkani-Tur
Large language models (LLMs) often exhibit sycophancy: agreement with user stance even when it conflicts with the model's opinion. While prior work has mostly studied this in single-agent settings, it remains underexplored in collaborative multi-agent systems. We ask whether awareness of other agents' sycophancy levels influences discussion outcomes. To investigate this, we run controlled experiments with six open-source LLMs, providing agents with peer sycophancy rankings that estimate each peer's tendency toward sycophancy. These rankings are based on scores calculated using various static (pre-discussion) and dynamic (online) strategies. We find that providing sycophancy priors reduces the influence of sycophancy-prone peers, mitigates error-cascades, and improves final discussion accuracy by an absolute 10.5%. Thus, this is a lightweight, effective way to reduce discussion sycophancy and improve downstream accuracy.
Joshua Drossman, Alexandre Jacquillat, Sébastien Martin
Optimization is as much about modeling the right problem as solving it. Identifying the right objectives, constraints, and trade-offs demands extensive interaction between researchers and stakeholders. Large language models can empower decision-makers with optimization capabilities through interactive optimization agents that can propose, interpret and refine solutions. However, it is fundamentally harder to evaluate a conversation-based interaction than traditional one-shot approaches. This paper proposes a scalable and replicable methodology for evaluating optimization agents through conversations. We build LLM-powered decision agents that role-play diverse stakeholders, each governed by an internal utility function but communicating like a real decision-maker. We generate thousands of conversations in a school scheduling case study. Results show that one-shot evaluation is severely limiting: the same optimization agent converges to much higher-quality solutions through conversations. Then, this paper uses this methodology to demonstrate that tailored optimization agents, endowed with domain-specific prompts and structured tools, can lead to significant improvements in solution quality in fewer interactions, as compared to general-purpose chatbots. These findings provide evidence of the benefits of emerging solutions at the AI-optimization interface to expand the reach of optimization technologies in practice. They also uncover the impact of operations research expertise to facilitate interactive deployments through the design of effective and reliable optimization agents.
Jeesuk Shin, Donggyun Seo, Sihyeong Yu, Joongoo Jeon
Comments 37 pages, 7 figures
Severe accident analysis using system-level codes such as MELCOR is indispensable for nuclear safety assessment, yet the computational cost of repeated simulations poses a significant bottleneck for parametric studies and uncertainty quantification. Existing surrogate models accelerate these analyses but depend on large volumes of simulation data, while physics-informed neural networks (PINNs) enable data-free training but must be retrained for every change in problem parameters. This study addresses both limitations by developing the Parameterized PINNs coupled with FDM (P2F) method, a node-assigned hybrid framework for MELCOR's Control Volume Hydrodynamics/Flow Path (CVH/FP) module. In the P2F method, a parameterized Node-Assigned PINN (NA-PINN) accepts the water-level difference, initial velocity, and time as inputs, learning a solution manifold so that a single trained network serves as a data-free surrogate for the momentum conservation equation across all flow paths without retraining. This PINN is coupled with a finite difference method (FDM) solver that advances the mass conservation equation at each time step, ensuring exact discrete mass conservation while replacing the iterative nonlinear momentum solve with a single forward pass. Verification on a six-tank gravity-driven draining scenario yields a water level mean absolute error of $7.85 \times 10^{-5}$ m and a velocity mean absolute error of $3.21 \times 10^{-3}$ m/s under the nominal condition with $Δt = 1.0$ s. The framework maintains consistent accuracy across time steps ranging from 0.2 to 1.0 s and generalizes to five distinct initial conditions, all without retraining or simulation data. This work introduces a numerical coupling methodology for integrating parameterized PINNs with FDM within a nuclear thermal-hydraulic system code framework.
Divyanshu Kumar, Ishita Gupta, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi
As Large Language Models (LLMs) increasingly power decision-making systems across critical domains, understanding and mitigating their biases becomes essential for responsible AI deployment. Although bias assessment frameworks have proliferated for attributes such as race and gender, socioeconomic status bias remains significantly underexplored despite its widespread implications in the real world. We introduce SocioEval, a template-based framework for systematically evaluating socioeconomic bias in foundation models through decision-making tasks. Our hierarchical framework encompasses 8 themes and 18 topics, generating 240 prompts across 6 class-pair combinations. We evaluated 13 frontier LLMs on 3,120 responses using a rigorous three-stage annotation protocol, revealing substantial variation in bias rates (0.42\%-33.75\%). Our findings demonstrate that bias manifests differently across themes lifestyle judgments show 10$\times$ higher bias than education-related decisions and that deployment safeguards effectively prevent explicit discrimination but show brittleness to domain-specific stereotypes. SocioEval provides a scalable, extensible foundation for auditing class-based bias in language models.
Farhad Pourkamali-Anaraki
Comments 13 pages
The massive scale of pretrained models has made efficient compression essential for practical deployment. Low-rank decomposition based on the singular value decomposition (SVD) provides a principled approach for model reduction, but its exact computation is expensive for large weight matrices. Randomized alternatives such as randomized SVD (RSVD) improve efficiency, yet they can suffer from poor approximation quality when the singular value spectrum decays slowly, a regime commonly observed in modern pretrained models. In this work, we address this limitation from both theoretical and empirical perspectives. First, we establish a connection between low-rank approximation error and predictive performance by analyzing softmax perturbations, showing that deviations in class probabilities are controlled by the spectral error of the compressed weights. Second, we demonstrate that RSVD is inadequate, and we propose randomized subspace iteration (RSI) as a more effective alternative. By incorporating multiple power iterations, RSI improves spectral separation and provides a controllable mechanism for enhancing approximation quality. We evaluate our approach on both convolutional networks and transformer-based architectures. Our results show that RSI achieves near-optimal approximation quality while outperforming RSVD in predictive accuracy under aggressive compression, enabling efficient model compression.
Yuqing Huang, Liting Lin, Weijun Zhuang, Zhenyu He, Xin Li
Comments accepted by CVPR 2026
Temporal information is crucial for visual tracking, but existing multi-frame trackers are vulnerable to model drift caused by naively aggregating noisy historical predictions. In this paper, we introduce DTPTrack, a lightweight and generalizable module designed to be seamlessly integrated into existing trackers to suppress drift. Our framework consists of two core components: (1) a Temporal Reliability Calibrator (TRC) mechanism that learns to assign a per-frame reliability score to historical states, filtering out noise while anchoring on the ground-truth template; and (2) a Temporal Guidance Synthesizer (TGS) module that synthesizes this calibrated history into a compact set of dynamic temporal priors to provide predictive guidance. To demonstrate its versatility, we integrate DTPTrack into three diverse tracking architectures--OSTrack, ODTrack, and LoRAT-and show consistent, significant performance gains across all baselines. Our best-performing model, built upon an extended LoRATv2 backbone, sets a new state-of-the-art on several benchmarks, achieving a 77.5% Success rate on LaSOT and an 80.3% AO on GOT-10k.
Haruhi Shida, Koo Imai, Keigo Kansa
Comments 7 pages, 2 figures, 2 tables, accepted at JSAI 2026
The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose ``compound jailbreaks'' targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques -- each individually defended against -- to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3\% with individual methods to 71.4\% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities, highlighting the need for multifaceted safety evaluations using compound attack scenarios.
Cunyang Wei, Siddharth Singh, Aishwarya Sarkar, Daniel Nichols, Tisha Patel, Aditya K. Ranjan, Sayan Ghosh, Ali Jannesari, Nathan R. Tallent, Abhinav Bhatele
Graph neural networks (GNNs) are widely used for learning on graph datasets derived from various real-world scenarios. Learning from extremely large graphs requires distributed training, and mini-batching with sampling is a popular approach for parallelizing GNN training. Existing distributed mini-batch approaches have significant performance bottlenecks due to expensive sampling methods and limited scaling when using data parallelism. In this work, we present ScaleGNN, a 4D parallel framework for scalable mini-batch GNN training that combines communication-free distributed sampling, 3D parallel matrix multiplication (PMM), and data parallelism. ScaleGNN introduces a uniform vertex sampling algorithm, enabling each process (GPU device) to construct its local mini-batch, i.e., subgraph partitions without any inter-process communication. 3D PMM enables scaling mini-batch training to much larger GPU counts than vanilla data parallelism with significantly lower communication overheads. We also present additional optimizations to overlap sampling with training, reduce communication overhead by sending data in lower precision, kernel fusion, and communication-computation overlap. We evaluate ScaleGNN on five graph datasets and demonstrate strong scaling up to 2048 GPUs on Perlmutter, 2048 GCDs on Frontier, and 1024 GPUs on Tuolumne. On Perlmutter, ScaleGNN achieves 3.5x end-to-end training speedup over the SOTA baseline on ogbn-products.
Yupu Liang, Shuang Chen, Guanwei Zhang, Shaolei Wang, Suncong Zheng
Existing studies on Long-Context Continual Pre-training (LCCP) mainly focus on small-scale models and limited data regimes (tens of billions of tokens). We argue that directly migrating these small-scale settings to industrial-grade models risks insufficient adaptation and premature training termination. Furthermore, current evaluation methods rely heavily on downstream benchmarks (e.g., Needle-in-a-Haystack), which often fail to reflect the intrinsic convergence state and can lead to "deceptive saturation". In this paper, we present the first systematic investigation of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters), tracking its evolution across a 200B-token training trajectory. Specifically, we propose a hierarchical framework to analyze LCCP dynamics across behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic (attention patterns) levels. Our findings reveal: (1) Necessity of Massive Data Scaling: Training regimes of dozens of billions of tokens are insufficient for industrial-grade LLMs' LCCP (e.g., Hunyuan-A13B reaches saturation after training over 150B tokens). (2) Deceptive Saturation vs. Intrinsic Saturation: Traditional NIAH scores report "fake saturation" early, while our PPL-based analysis reveals continuous intrinsic improvements and correlates more strongly with downstream performance. (3) Mechanistic Monitoring for Training Stability: Retrieval heads act as efficient, low-resource training monitors, as their evolving attention scores reliably track LCCP progress and exhibit high correlation with SFT results. This work provides a comprehensive monitoring framework, evaluation system, and mechanistic interpretation for the LCCP of industrial-grade LLM.
Mohammad Al-Jarrah, Michele Martino, Marcus Yim, Bamdad Hosseini, Amirhossein Taghvaei
Comments 8 pages, 5 figures
We present Conditional Wasserstein Autoencoders (CWAEs), a framework for conditional simulation that exploits low-dimensional structure in both the conditioned and the conditioning variables. The key idea is to modify a Wasserstein autoencoder to use a (block-) triangular decoder and impose an appropriate independence assumption on the latent variables. We show that the resulting model gives an autoencoder that can exploit low-dimensional structure while simultaneously the decoder can be used for conditional simulation. We explore various theoretical properties of CWAEs, including their connections to conditional optimal transport (OT) problems. We also present alternative formulations that lead to three architectural variants forming the foundation of our algorithms. We present a series of numerical experiments that demonstrate that our different CWAE variants achieve substantial reductions in approximation error relative to the low-rank ensemble Kalman filter (LREnKF), particularly in problems where the support of the conditional measures is truly low-dimensional.
Kenichirou Narita, Siqi Peng, Taku Fukui, Moyuru Yamada, Satoshi Munakata, Satoru Takahashi
Comments 8 pages, 3 figures. Accepted at AAAI 2026 Workshop
Performance evaluation of Retrieval-Augmented Generation (RAG) systems within enterprise environments is governed by multi-dimensional and composite factors extending far beyond simple final accuracy checks. These factors include reasoning complexity, retrieval difficulty, the diverse structure of documents, and stringent requirements for operational explainability. Existing academic benchmarks fail to systematically diagnose these interlocking challenges, resulting in a critical gap where models achieving high performance scores fail to meet the expected reliability in practical deployment. To bridge this discrepancy, this research proposes a multi-dimensional diagnostic framework by defining a four-axis difficulty taxonomy and integrating it into an enterprise RAG benchmark to diagnose potential system weaknesses.
Weimin Liu, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng
Surround depth estimation provides a cost-effective alternative to LiDAR for 3D perception in autonomous driving. While recent self-supervised methods explore multi-camera settings to improve scale awareness and scene coverage, they are primarily designed for passenger vehicles and rarely consider articulated vehicles or robotics platforms. The articulated structure introduces complex cross-segment geometry and motion coupling, making consistent depth reasoning across views more challenging. In this work, we propose \textbf{ArticuSurDepth}, a self-supervised framework for surround-view depth estimation on articulated vehicles that enhances depth learning through cross-view and cross-vehicle geometric consistency guided by structural priors from vision foundation model. Specifically, we introduce multi-view spatial context enrichment strategy and a cross-view surface normal constraint to improve structural coherence across spatial and temporal contexts. We further incorporate camera height regularization with ground plane-awareness to encourage metric depth estimation, together with cross-vehicle pose consistency that bridges motion estimation between articulated segments. To validate our proposed method, an articulated vehicle experiment platform was established with a dataset collected over it. Experiment results demonstrate state-of-the-art (SoTA) performance of depth estimation on our self-collected dataset as well as on DDAD, nuScenes, and KITTI benchmarks.
Yasushi Nishida
Comments 6 pages, 3 figures, 3 tables. Code: https://github.com/Axelidea/AXELRAM
We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. The key enabler is a design-time fixed codebook: orthogonal-transform-based quantization concentrates each coordinate's distribution to N(0,1/d), so the optimal quantizer depends only on dimension d and bit-width b, not on input data. The asymmetric path design -- transform on write, table-lookup on read with no inverse transform -- reduces per-query multiplications by 102.4x (a mathematical identity). Through multi-seed evaluation (10 seeds x 3 models), we discover that sign pattern sensitivity causes catastrophic PPL spikes (Delta > 50) on certain models (Qwen2.5-3B), while others (LLaMA-3.1-8B) are fully stable. This phenomenon extends SpinQuant's observation of rotation variance in weight quantization to the KV cache domain, where the effect is qualitatively more severe. We trace the root cause to layer-wise norm heterogeneity and propose a gradient-free sign pattern selection (200 candidates, 8 calibration samples, one-time) that eliminates catastrophic spikes with zero additional hardware cost. All source code is available at https://github.com/Axelidea/AXELRAM.
Qihui Fan, Min Ge, Chenyan Jia, Weiyan Shi
As large language models (LLMs) become increasingly persuasive, there is concern that people's opinions and decisions may be influenced across various contexts at scale. Prior mitigation (e.g., AI detectors and disclaimers) largely treats people as passive recipients of AI-generated information. To provide a more proactive intervention against persuasive AI, we introduce $\textbf{LLMimic}$, a role-play-based, interactive, gamified AI literacy tutorial, where participants assume the role of an LLM and progress through three key stages of the training pipeline (pretraining, SFT, and RLHF). We conducted a $2 \times 3$ between-subjects study ($N = 274$) where participants either (1) watched an AI history video (control) or (2) interacted with LLMimic (treatment), and then engaged in one of three realistic AI persuasion scenarios: (a) charity donation persuasion, (b) malicious money solicitation, or (c) hotel recommendation. Our results show that LLMimic significantly improved participants' AI literacy ($p < .001$), reduced persuasion success across scenarios ($p < .05$), and enhanced truthfulness and social responsibility levels ($p<0.01$) in the hotel scenario. These findings suggest that LLMimic offers a scalable, human-centered approach to improving AI literacy and supporting more informed interactions with persuasive AI.
Lei Song, Shihan Guan, Youyong Kong
Non-Exemplar Continual Graph Learning (NECGL) seeks to eliminate the privacy risks intrinsic to rehearsal-based paradigms by retaining solely class-level prototype representations rather than raw graph examples for mitigating catastrophic forgetting. However, this design choice inevitably precipitates feature drift. As a nascent alternative, Analytic Continual Learning (ACL) capitalizes on the intrinsic generalization properties of frozen pre-trained models to bolster continual learning performance. Nonetheless, a key drawback resides in the pronounced attenuation of model plasticity. To surmount these challenges, we propose Analytic Drift Resister (ADR), a novel and theoretically grounded NECGL framework. ADR exploits iterative backpropagation to break free from the frozen pre-trained constraint, adapting to evolving task graph distributions and fortifying model plasticity. Since parameter updates trigger feature drift, we further propose Hierarchical Analytic Merging (HAM), performing layer-wise merging of linear transformations in Graph Neural Networks (GNNs) via ridge regression, thereby ensuring absolute resistance to feature drift. On this basis, Analytic Classifier Reconstruction (ACR) enables theoretically zero-forgetting class-incremental learning. Empirical evaluation on four node classification benchmarks demonstrates that ADR maintains strong competitiveness against existing state-of-the-art methods.
Hao Li, Liwei Zou, Wenping Yin, Gulsen Taskin, Naoto Yokoya, Danfeng Hong, Wufan Zhao
Living in a changing climate, human society now faces more frequent and severe natural disasters than ever before. As a consequence, rapid disaster response during the "Golden 72 Hours" of search and rescue becomes a vital humanitarian necessity and community concern. However, traditional disaster damage surveys routinely fail to generalize across distinct urban morphologies and new disaster events. Effective damage mapping typically requires exhaustive and time-consuming manual data annotation. To address this issue, we introduce Smart Transfer, a novel Geospatial Artificial Intelligence (GeoAI) framework, leveraging state-of-the-art vision Foundation Models (FMs) for rapid building damage mapping with post-earthquake Very High Resolution (VHR) imagery. Specifically, we design two novel model transfer strategies: first, Pixel-wise Clustering (PC), ensuring robust prototype-level global feature alignment; second, a Distance-Penalized Triplet (DPT), integrating patch-level spatial autocorrelation patterns by assigning stronger penalties to semantically inconsistent yet spatially adjacent patches. Extensive experiments and ablations from the recent 2023 Turkiye-Syria earthquake show promising performance in multiple cross-region transfer settings, namely Leave One Domain Out (LODO) and Specific Source Domain Combination (SSDC). Moreover, Smart Transfer provides a scalable, automated GeoAI solution to accelerate building damage mapping and support rapid disaster response, offering new opportunities to enhance disaster resilience in climate-vulnerable regions and communities. The data and code are publicly available at https://github.com/ai4city-hkust/SmartTransfer.
Yiyang Shen, Lifu Tu, Weiran Wang
Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards, hence ground truth labels. We propose an RL framework that uses rewards from an LLM that acts as a judge evaluating model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and replacing the need of ground truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.
Yitao Li, Zhanlin Liu, Anuranjan Pandey, Muni Srikanth
Organizing a large-scale knowledge graph into a typed property graph requires structural decisions -- which entities become nodes, which properties become edges, and what schema governs these choices. Existing approaches embed these decisions in pipeline code or extract relations ad hoc, producing schemas that are tightly coupled to their construction process and difficult to reuse for downstream ontology-level tasks. We present an ontology-oriented approach in which the schema is designed from the outset for ontology analysis, entity disambiguation, domain customization, and LLM-guided extraction -- not merely as a byproduct of graph building. The core mechanism is intrinsic-relational routing, which classifies every property as either intrinsic or relational and routes it to the corresponding schema module. This routing produces a declarative schema that is portable across storage backends and independently reusable. We instantiate the approach on the January 2026 Wikidata dump. A rule-based cleaning stage identifies a 34.6M-entity core set from the full dump, followed by iterative intrinsic-relational routing that assigns each property to one of 94 modules organized into 8 categories. With tool-augmented LLM support and human review, the schema reaches 93.3% category coverage and 98.0% module assignment among classified entities. Exporting this schema yields a property graph with 34.0M nodes and 61.2M edges across 38 relationship types. We validate the ontology-oriented claim through five applications that consume the schema independently of the construction pipeline: ontology structure analysis, benchmark annotation auditing, entity disambiguation, domain customization, and LLM-guided extraction.
Yuntao Du, Minh Dinh, Kaiyuan Zhang, Ninghui Li
Comments Winner of 2025-2026 Radiance Technologies Innovation Bowl
Scientific and Technical Intelligence (S&TI) analysis requires verifying complex technical claims across rapidly growing literature, where existing approaches fail to bridge the verification gap between surface-level accuracy and deeper methodological validity. We present AutoVerifier, an LLM-based agentic framework that automates end-to-end verification of technical claims without requiring domain expertise. AutoVerifier decomposes every technical assertion into structured claim triples of the form (Subject, Predicate, Object), constructing knowledge graphs that enable structured reasoning across six progressively enriching layers: corpus construction and ingestion, entity and claim extraction, intra-document verification, cross-source verification, external signal corroboration, and final hypothesis matrix generation. We demonstrate AutoVerifier on a contested quantum computing claim, where the framework, operated by analysts with no quantum expertise, automatically identified overclaims and metric inconsistencies within the target paper, traced cross-source contradictions, uncovered undisclosed commercial conflicts of interest, and produced a final assessment. These results show that structured LLM verification can reliably evaluate the validity and maturity of emerging technologies, turning raw technical documents into traceable, evidence-backed intelligence assessments.
Guangyu Sun, Wenhan Wu, Zhishuai Guo, Ziteng Wang, Pegah Khosravi, Chen Chen
Comments Accepted on the CVPR 2026 Workshop on Computer Vision for Children (CV4CHL)
Automated recognition of autistic behaviors in children is essential for early intervention and objective clinical assessment. However, the development of robust models is severely hindered by strict privacy regulations (e.g., HIPAA) and the sensitive nature of pediatric data, which prevents the centralized aggregation of clinical datasets. Furthermore, individual clinical sites often suffer from data scarcity, making it difficult to learn generalized behavior patterns or tailor models to site-specific patient distributions. To address these challenges, we observe that Federated Learning (FL) can decouple model training from raw data access, enabling multi-site collaboration while maintaining strict data residency. In this paper, we present the first study exploring Federated Learning for pose-based child autism behavior recognition. Our framework employs a two-layer privacy protection mechanism: utilizing human skeletal abstraction to remove identifiable visual information from the raw RGB videos and FL to ensure sensitive pose data remains within the clinic. This approach leverages distributed clinical data to learn generalized representations while providing the flexibility for site-specific personalization. Experimental results on the MMASD benchmark demonstrate that our framework achieves high recognition accuracy, outperforming traditional federated baselines and providing a robust, privacy-first solution for multi-site clinical analysis.
Samuel Honor, Mohamed Abdelnaby, Kevin Leahy
Comments 8 pages, 6 figures, submitted to CDC 2026 main track
Graph neural networks (GNNs) are a well-regarded tool for learned control of networked dynamical systems due to their ability to be deployed in a distributed manner. However, current distributed GNN architectures assume that all nodes in the network collect geometric observations in compatible bases, which limits the usefulness of such controllers in GPS-denied and compass-denied environments. This paper presents a GNN parametrization that is globally invariant to choice of local basis. 2D geometric features and transformations between bases are expressed in the complex domain. Inside each GNN layer, complex-valued linear layers with phase-equivariant activation functions are used. When viewed from a fixed global frame, all policies learned by this architecture are strictly invariant to choice of local frames. This architecture is shown to increase the data efficiency, tracking performance, and generalization of learned control when compared to a real-valued baseline on an imitation learning flocking task.
Gregory M. Campbell
Comments PhD Thesis, University of Pennsylvania, 2025
Modern robots embody power and precision control. Yet, as robots undertake tasks that apply forces on humans, this power brings risk of injury. Soft robotic actuators use deformation to produce smooth, continuous motions and conform to delicate objects while imparting forces capable of safely pushing humans. This thesis presents strategies for the design, modeling, and strain-based control of human-safe elastomeric soft pneumatic actuators (SPA) for force generation, focusing on embodied mechanical response to simple pressure inputs. We investigate electroadhesive (EA) strain limiters for variable shape generation, rapid force application, and targeted inflation trajectories. We attach EA clutches to a concentrically strain-limited elastomeric membrane to alter the inflation trajectory and rapidly reorient the inflated shape. We expand the capabilities of EA for soft robots by encasing them in elastomeric sheaths and varying their activation in real time, demonstrating applications in variable trajectory inflation under identical pressure sweeps. We then address the problem of trajectory control in the presence of external forces by modeling the pressure-trajectory relationship for a concentrically strain-limited class of silicone actuators. We validate theoretical models based on material properties and energy minimization using active learning and automated testing. We apply our ensemble of neural networks for inverse membrane design, specifying quasi-static mass lift trajectories from a simple pressure sweep. Finally, we demonstrate the power of multiple pressure-linked actuators in a proof-of-concept mannequin leg lift.
Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha
Comments CVPR Findings
Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior strongly matches its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.
Kunzhe Song, Geo Jie Zhou, Xiaoming Liu, Huacheng Zeng
Comments Accepted to CVPR 2026
Robust 3D environmental perception is critical for applications such as autonomous driving and robot navigation. However, optical sensors such as cameras and LiDAR often fail under adverse conditions, including smoke, fog, and non-ideal lighting. Although specialized radar systems can operate in these environments, their reliance on bespoke hardware and licensed spectrum limits scalability and cost-effectiveness. This paper introduces Rascene, an integrated sensing and communication (ISAC) framework that leverages ubiquitous mmWave OFDM communication signals for 3D scene imaging. To overcome the sparse and multipath-ambiguous nature of individual radio frames, Rascene performs multi-frame, spatially adaptive fusion with confidence-weighted forward projection, enabling the recovery of geometric consensus across arbitrary poses. Experimental results demonstrate that our method reconstructs 3D scenes with high precision, offering a new pathway toward low-cost, scalable, and robust 3D perception.
Ethan Reid
Comments Demo: https://moondream.ai/me/playground
We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).
Daheng Yin, Isaac Ding, Yili Jin, Jianxin Shi, Jiangchuan Liu
Comments 11 pages, 6 figures
Recent advancements in 3D Gaussian Splatting (3DGS) have demonstrated its potential for efficient and photorealistic 3D reconstructions, which is crucial for diverse applications such as robotics and immersive media. However, current Gaussian-based methods for dynamic scene reconstruction struggle with large inter-frame displacements, leading to artifacts and temporal inconsistencies under fast object motions. To address this, we introduce \textit{TrackerSplat}, a novel method that integrates advanced point tracking methods to enhance the robustness and scalability of 3DGS for dynamic scene reconstruction. TrackerSplat utilizes off-the-shelf point tracking models to extract pixel trajectories and triangulate per-view pixel trajectories onto 3D Gaussians to guide the relocation, rotation, and scaling of Gaussians before training. This strategy effectively handles large displacements between frames, dramatically reducing the fading and recoloring artifacts prevalent in prior methods. By accurately positioning Gaussians prior to gradient-based optimization, TrackerSplat overcomes the quality degradation associated with large frame gaps when processing multiple adjacent frames in parallel across multiple devices, thereby boosting reconstruction throughput while preserving rendering quality. Experiments on real-world datasets confirm the robustness of TrackerSplat in challenging scenarios with significant displacements, achieving superior throughput under parallel settings and maintaining visual quality compared to baselines. The code is available at https://github.com/yindaheng98/TrackerSplat.
Wei Li, Yufan Ren, Hanqing Jiang, Jianhui Ding, Zhen Peng, Leman Feng, Yichun Shentu, Guoqiang Xu, Baigui Sun
Comments 9 pages, 6 figures, 2 tables
We propose FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. Existing image-3D representation learning methods predominantly focus on feature alignment of a single object image and its 3D model, limiting their applicability in realistic scenarios where an object is typically observed and captured from multiple viewpoints. Although multi-view observations naturally provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to effectively fuse such multi-view visual information for better cross-modal retrieval. To address this limitation, we introduce a multi-view image-3D retrieval framework named FusionBERT, which innovatively utilizes a cross-attention-based multi-view visual aggregator to adaptively integrate features from multi-view images of an object. The proposed multi-view visual encoder fuses inter-view complementary relationships and selectively emphasizes informative visual cues across multiple views to get a more robustly fused visual feature for better 3D model matching. Furthermore, FusionBERT proposes a normal-aware 3D model encoder that can further enhance the 3D geometric feature of an object model by jointly encoding point normals and 3D positions, enabling a more robust representation learning for textureless or color-degraded 3D models. Extensive image-3D retrieval experiments demonstrate that FusionBERT achieves significantly higher retrieval accuracy than SOTA multimodal large models under both single-view and multi-view settings, establishing a strong baseline for multi-view multimodal retrieval.
Yan Zheng, Florian Bordes
Evaluating code generation models for 3D spatial reasoning requires executing generated code in realistic environments and assessing outputs beyond surface-level correctness. We introduce a platform VoxelCode, for analyzing code generation capabilities for 3D understanding and environment creation. Our platform integrates natural language task specification, API-driven code execution in Unreal Engine, and a unified evaluation pipeline supporting both automated metrics and human assessment. To demonstrate its utility, we construct VoxelCodeBench, a benchmark of voxel manipulation tasks spanning three reasoning dimensions: symbolic interpretation, geometric construction, and artistic composition. Evaluating leading code generation models, we find that producing executable code is far easier than producing spatially correct outputs, with geometric construction and multi-object composition proving particularly challenging. By open-sourcing our platform and benchmark, we provide the community with extensible infrastructure for developing new 3D code generation benchmarks and probing spatial reasoning in future models.
扫码添加微信好友,提出您的宝贵建议 👇
💡 备注请填写:网站反馈