JOIG 2026 Vol.14(1):96-107
doi: 10.18178/joig.14.1.96-107

MMTFL: Multi-Timescale Multi-Modal Feature Learning for Weakly-Supervised Anomaly Detection

Erkut Akdag *, Henk Corporaal, Peter H. N. D. With, and Egor Bondarev
Electrical Engineering Department, Eindhoven University of Technology, Eindhoven, The Netherlands
Email: e.akdag@tue.nl (E.A.); h.corporaal@tue.nl (H.C.); p.h.n.de.with@tue.nl (P.H.N.D.W.); e.bondarev@tue.nl (E.B.)
*Corresponding author

Manuscript received May 19, 2025; revised July 18, 2025; accepted September 1, 2025; published February 27, 2026.

Abstract—Detection of anomalous events is critical for public safety and requires capturing fine-grained motion patterns and contextual information across multiple timescales. To this end, we propose a Multi-Timescale Feature Learning (MTFL) method to enhance the representation of anomaly features. Short, medium, and long temporal tubelets are employed to extract spatio-temporal video features using a Video Swin Transformer. Experimental results demonstrate that MTFL achieves an anomaly detection performance of 87.16% Area Under the Curve (AUC) on the University of Central Florida (UCF)-Crime dataset and 84.57% Average Precision (AP) on the Xi Dian University (XD)-Violence dataset. However, because MTFL relies solely on spatio-temporal features extracted from a single modality (RGB video), it encounters challenges such as occlusions, ambiguous actions, and limited contextual understanding. To overcome these limitations, we also propose Multi-Modal Multi-Timescale Feature Learning (MMTFL), which integrates spatio-temporal, depth, and text-based features in conjunction with multi-timescale tubelet analysis, rather than focusing only on RGB inputs. Although adding modalities increases the feature extraction cost, the approach remains feasible for real-world purposes. Experimental results demonstrate that MMTFL outperforms single-modality approaches, achieving 88.29% AUC on the UCF-Crime dataset and 84.96% AP on the XD-Violence dataset. By leveraging complementary information from multiple modalities, the proposed approach achieves more robust and accurate detection of complex and diverse anomalies than single-modal methods.
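
As a rough illustration of the multi-timescale tubelet idea mentioned in the abstract, the following minimal sketch splits the frame indices of a video into consecutive short, medium, and long tubelets. It is not the authors' implementation: the tubelet lengths (8, 32, and 64 frames), the non-overlapping split, and the names `TUBELET_LENGTHS`, `split_into_tubelets`, and `multi_timescale_tubelets` are assumptions made only for this example, since the abstract does not specify them. In a full pipeline, each tubelet would presumably be passed to a feature extractor such as the Video Swin Transformer.

```python
from typing import Dict, List

# Hypothetical tubelet lengths in frames; the abstract does not give the
# values used by the authors, so these are placeholders for illustration.
TUBELET_LENGTHS: Dict[str, int] = {"short": 8, "medium": 32, "long": 64}


def split_into_tubelets(num_frames: int, length: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into consecutive, non-overlapping
    tubelets of `length` frames. A trailing partial tubelet is dropped."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [
        range(start, start + length)
        for start in range(0, num_frames - length + 1, length)
    ]


def multi_timescale_tubelets(num_frames: int) -> Dict[str, List[range]]:
    """Return the tubelet index ranges for every timescale in TUBELET_LENGTHS."""
    return {
        name: split_into_tubelets(num_frames, length)
        for name, length in TUBELET_LENGTHS.items()
    }
```

Under these assumed lengths, calling `multi_timescale_tubelets(128)` on a 128-frame clip yields 16 short, 4 medium, and 2 long tubelets, so the same frames are covered at three temporal resolutions.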

Keywords—anomaly detection, surveillance videos, video understanding, multi-modality, feature fusion, attention

Cite: Erkut Akdag, Henk Corporaal, Peter H. N. D. With, and Egor Bondarev, "MMTFL: Multi-Timescale Multi-Modal Feature Learning for Weakly-Supervised Anomaly Detection," Journal of Image and Graphics, Vol. 14, No. 1, pp. 96-107, 2026.

Copyright © 2026 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC-BY-4.0).
