Liu, Hai Chuan and Khairuddin, Anis Salwa Mohd and Chuah, Joon Huang and Zhao, Xian Min and Wang, Xiao Dan and Fang, Li Ming (2024) HCMT: A Novel Hierarchical Cross-Modal Transformer for Recognition of Abnormal Behavior. IEEE Access, 12. pp. 161296-161311. ISSN 2169-3536, DOI https://doi.org/10.1109/ACCESS.2024.3483896.
Full text not available from this repository.

Abstract
Enhancing video recognition systems with advanced abnormal behavior recognition technologies is crucial for school safety and campus security. Traditional methods rely primarily on visual data and often fail to recognize complex behaviors against intricate backgrounds. Similarly, traditional audio processing techniques struggle to capture transient anomalies because of their limited capacity to handle complex sounds. This study overcomes these limitations by integrating audio and visual data, addressing the shortcomings of visual-only modalities in recognizing subtle behaviors. It introduces a novel Hierarchical Cross-Modal Transformer (HCMT) that combines hierarchical visual and audio branches. This hierarchical integration of audio and visual modalities enables HCMT to capture low-level features that single late-stage fusion methods often overlook, and thus to learn global features more effectively. The audio branch uses the newly developed Audio Temporal Spectrogram Transformer (ATST), which applies a global sparse uniform sampling technique to capture the transient nature of audio-based abnormalities, thereby improving the robustness of behavior recognition. The HCMT model achieved a Top-1 accuracy of 79.45% and a Top-5 accuracy of 98.44% on the challenging Campus Abnormal Behavior Recognition Hard (CABRH8) dataset, which consists of eight hard-to-distinguish human abnormal behaviors. The ATST improved Top-1 accuracy by 7.45% over visual-only baselines. Furthermore, HCMT recorded Top-1 and Top-5 accuracies of 84.93% and 97.63% on the CABR50 dataset, outperforming prior models that relied solely on visual data and underscoring the adaptability of the approach. The model requires 992 GFLOPs and runs at 28 frames per second (FPS). Its generalizability was also confirmed on additional datasets, including UCF-101, where it achieved advanced results.
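The abstract describes the ATST's global sparse uniform sampling only at a high level. As an illustration, the following is a minimal sketch, not the authors' implementation, of how sparse uniform sampling over the time axis of a log-mel spectrogram might look: the clip is split into equal segments and one frame is drawn from each, so a short transient anywhere in the clip has a chance of being sampled. The function name, segment count, and spectrogram shape are assumptions for illustration.

```python
import numpy as np

def sparse_uniform_sample(spectrogram: np.ndarray, num_segments: int = 8) -> np.ndarray:
    """Hypothetical helper: split the time axis (axis 0) into `num_segments`
    equal segments and keep the centre frame of each segment."""
    num_frames = spectrogram.shape[0]
    # Segment boundaries over the full clip, then the centre index of each segment.
    boundaries = np.linspace(0, num_frames, num_segments + 1)
    centres = ((boundaries[:-1] + boundaries[1:]) / 2).astype(int)
    centres = np.clip(centres, 0, num_frames - 1)
    return spectrogram[centres]

# Example: a clip represented as a (1000 time frames x 128 mel bins) log-mel spectrogram.
mel = np.random.randn(1000, 128)
tokens = sparse_uniform_sample(mel, num_segments=16)  # shape (16, 128)
```

In this sketch the sampled frames would then be tokenized and fed to the audio transformer branch; the actual ATST tokenization and the hierarchical fusion with the visual branch are described in the paper itself.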
| Item Type: | Article |
|---|---|
| Funders: | Chongqing Key Laboratory of Public Big Data Security Technology; Universiti Malaya (ST018-2023); Science and Technology Research Program of Chongqing Municipal Education Commission (KJQN202304017) |
| Uncontrolled Keywords: | Visualization; Feature extraction; Transformers; Accuracy; Spatiotemporal phenomena; Adaptation models; Transient analysis; Computational modeling; Computational complexity; Behavioral sciences; Video recognition systems; campus abnormal behaviors; visual and audio branches; hierarchical cross-modal transformer; audio temporal spectrogram transformer |
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science; T Technology > TK Electrical engineering. Electronics. Nuclear engineering |
| Divisions: | Faculty of Engineering > Department of Electrical Engineering |
| Depositing User: | Ms. Juhaida Abd Rahim |
| Date Deposited: | 20 Jan 2025 02:39 |
| Last Modified: | 20 Jan 2025 02:39 |
| URI: | http://eprints.um.edu.my/id/eprint/47644 |