JOIG 2026 Vol.14(1):96-107
doi: 10.18178/joig.14.1.96-107

MMTFL: Multi-Timescale Multi-Modal Feature Learning for Weakly-Supervised Anomaly Detection

Erkut Akdag *, Henk Corporaal, Peter H. N. de With, and Egor Bondarev
Electrical Engineering Department, Eindhoven University of Technology, Eindhoven, The Netherlands
Email: e.akdag@tue.nl (E.A.); h.corporaal@tue.nl (H.C.); p.h.n.de.with@tue.nl (P.H.N.D.W.); e.bondarev@tue.nl (E.B.)
*Corresponding author

Manuscript received May 19, 2025; revised July 18, 2025; accepted September 1, 2025; published February 27, 2026.

Abstract—Detection of anomalous events is critical for public safety and requires capturing fine-grained motion patterns and contextual information across multiple timescales. To this end, we propose a Multi-Timescale Feature Learning (MTFL) method to enhance the representation of anomaly features. Short, medium, and long temporal tubelets are employed to extract spatio-temporal video features using a Video Swin Transformer. Experimental results demonstrate that MTFL achieves an anomaly detection performance of 87.16% Area Under the Curve (AUC) on the University of Central Florida (UCF)-Crime dataset and 84.57% Average Precision (AP) on the Xidian University (XD)-Violence dataset. However, because MTFL relies solely on spatio-temporal features extracted from a single modality (RGB video), it encounters challenges such as occlusions, ambiguous actions, and limited contextual understanding. To overcome these limitations, we also propose Multi-Modal Multi-Timescale Feature Learning (MMTFL), which integrates spatio-temporal, depth, and text-based features in conjunction with multi-timescale tubelet analysis, rather than focusing only on RGB inputs. Although adding modalities increases feature extraction cost, it remains feasible for real-world deployment. Experimental results demonstrate that MMTFL outperforms single-modality approaches, achieving 88.29% AUC on the UCF-Crime dataset and 84.96% AP on the XD-Violence dataset. By leveraging complementary information from multiple modalities, the proposed approach achieves more robust and accurate detection of complex and diverse anomalies compared to single-modal methods.
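The multi-timescale tubelet partitioning described in the abstract can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the tubelet lengths (8, 16, and 32 frames for short, medium, and long) and all function names are assumptions chosen for demonstration, and the actual configuration in the paper may differ.

```python
# Illustrative sketch (assumed names and lengths, not the paper's code):
# partition a clip's frame indices into consecutive, non-overlapping
# temporal tubelets at several timescales, as a backbone such as a
# Video Swin Transformer would then consume per tubelet.

def split_into_tubelets(num_frames, tubelet_len):
    """Partition frame indices [0, num_frames) into consecutive tubelets."""
    return [list(range(start, min(start + tubelet_len, num_frames)))
            for start in range(0, num_frames, tubelet_len)]

def multi_timescale_tubelets(num_frames, lengths=(8, 16, 32)):
    """Build short/medium/long tubelet sets over the same clip."""
    return {length: split_into_tubelets(num_frames, length)
            for length in lengths}

# Example: a 64-frame clip yields 8 short, 4 medium, and 2 long tubelets.
tubelets = multi_timescale_tubelets(64)
print({length: len(parts) for length, parts in tubelets.items()})
# → {8: 8, 16: 4, 32: 2}
```

Features extracted per tubelet at each timescale would then be aggregated (e.g., concatenated or attended over) to form the multi-timescale representation; the fusion strategy itself is beyond this sketch.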

Keywords—anomaly detection, surveillance videos, video understanding, multi-modality, feature fusion, attention

Cite: Erkut Akdag, Henk Corporaal, Peter H. N. de With, and Egor Bondarev, "MMTFL: Multi-Timescale Multi-Modal Feature Learning for Weakly-Supervised Anomaly Detection," Journal of Image and Graphics, Vol. 14, No. 1, pp. 96-107, 2026.

Copyright © 2026 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC-BY-4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.
