Manuscript received October 13, 2025; revised November 20, 2025; accepted March 2, 2026; published June 17, 2026.
Abstract—Glaucoma, often called the 'silent thief of sight', is a leading cause of irreversible blindness that affects around 80 million people worldwide, so early and reliable detection is critical for preventing permanent vision loss. Although existing techniques for automated glaucoma screening employ Vision Transformer (ViT)-based deep learning models, they rely primarily on conventional supervised training and single-dataset evaluation, which results in limited feature discrimination, suboptimal generalization across heterogeneous images, and poor clinical interpretability. To address these limitations, this study proposes a novel contrastive learning-optimized ViT framework that integrates supervised contrastive pre-training with systematic hyperparameter optimization to learn more discriminative retinal features, followed by fine-tuning for glaucoma classification. In addition, a unified preprocessing and patch-based representation strategy is introduced to mitigate domain shifts across multiple imaging devices and acquisition protocols. Unlike prior studies that use a single benchmark, the framework is validated in a comprehensive multi-dataset setting combining six public fundus datasets (including G1020, ORIGA, REFUGE, and PAPILA) to assess real-world generalization. Experimental results demonstrate consistent and statistically significant improvements over Convolutional Neural Network (CNN) and baseline ViT models, with up to 87.91% accuracy and performance gains of 3-16% across the accuracy, precision, recall, and F1-score metrics. Further, Layer-wise Relevance Propagation (LRP) is employed to generate clinically interpretable heatmaps, which confirm that the model focuses on anatomically meaningful regions such as the optic disc and optic nerve head.
These findings demonstrate that the proposed framework provides a robust, explainable, and generalizable solution for automated glaucoma screening and highlight its potential for clinical deployment.

Keywords—Glaucoma detection, vision transformers, contrastive learning, medical imaging, explainable AI, layer-wise relevance propagation, deep learning, convolutional neural network

Cite: R. Roopalakshmi, Ayush Amarnath Bhagat, and Sambhav Nath Jain, "Contrastive Vision Transformer Combined with Hyperparameter Fine-Tuning and Interpretable AI for Glaucoma Assessment," Journal of Image and Graphics, Vol. 14, No. 3, pp. 493-505, 2026.

Copyright © 2026 by the authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).