Manuscript received August 28, 2025; revised September 18, 2025; accepted October 30, 2025; published February 27, 2026.
Abstract: Machine learning is advancing rapidly across many fields and is driving a paradigm shift in image and video manipulation. Deepfakes, media synthetically manipulated with deep learning algorithms, are one of the challenges emerging from this development. Criminals have used deepfakes as a weapon to spread false information, and the distribution of deepfake videos or images poses significant public risks, including misinformation, privacy violations, and misuse in political and social contexts. Countering these threats requires a reliable deepfake detection method. One promising approach is the Vision Transformer (ViT), a deep learning architecture that uses self-attention mechanisms to model complex relationships between regions (patches) of an image. Despite its potential, ViT requires substantial computational resources and a large dataset, which makes development challenging. In this research, we present a rigorous evaluation of the ViT model using the balanced FaceForensics++ dataset and a 5-fold cross-validation strategy to obtain a more reliable result. The model achieves an average accuracy of 85.39%, indicating robust and stable performance. It also shows an excellent balance between precision (85.40%) and recall (85.39%), which suggests that it detects deepfakes reliably and without significant bias. These findings indicate that a properly trained ViT, particularly with a balanced dataset, can serve as an effective and powerful tool to combat the threats posed by deepfakes.

Keywords: vision transformer, deep learning, deepfake, machine learning, video manipulation

Cite: Orvis L. Siagian, Reinhard Ebenhaizer, Pandu Wicaksono, and Zahra N. Izdihar, "Improving Vision Transformer for Deepfake Detection," Journal of Image and Graphics, Vol. 14, No. 1, pp. 76-83, 2026.
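
The evaluation protocol named in the abstract (5-fold cross-validation on a balanced dataset, reported as average accuracy, precision, and recall) can be illustrated with a minimal sketch. It is not the authors' code. The function names, the `train_and_predict` callback, the label encoding (0 = real, 1 = fake), and the choice of macro averaging are all assumptions made for illustration; the abstract does not state how the folds were built or how precision and recall were averaged.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Split indices into k folds that each keep roughly the same class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def macro_metrics(y_true, y_pred, classes=(0, 1)):
    """Return accuracy, macro-averaged precision, and macro-averaged recall."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        # Define the score as 0.0 when a class is never predicted or absent.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    n = len(classes)
    return correct / len(y_true), sum(precisions) / n, sum(recalls) / n


def cross_validate(labels, train_and_predict, k=5, seed=0):
    """Average per-fold metrics over k folds.

    train_and_predict(train_idx, test_idx) is a hypothetical callback that
    trains a model (for example a ViT) on train_idx and returns one predicted
    label per entry of test_idx.
    """
    folds = stratified_k_fold(labels, k, seed)
    per_fold = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        y_pred = train_and_predict(train_idx, test_idx)
        y_true = [labels[j] for j in test_idx]
        per_fold.append(macro_metrics(y_true, y_pred))
    return tuple(sum(m[d] for m in per_fold) / k for d in range(3))
```

Under these assumptions, macro-averaged recall coincides with accuracy whenever every test fold is perfectly balanced, which is one possible reading of the matching 85.39% figures; the abstract itself does not confirm the averaging scheme.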
Copyright © 2026 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC-BY-4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.