2025-06-04
2025-04-30
Manuscript received January 17, 2025; revised April 7, 2025; accepted May 26, 2025; published September 17, 2025.
Abstract—Wheat is a staple crop cultivated widely across the world, making effective management of wheat fields a critical task. A key component of this management is accurately identifying and counting wheat heads, which provides essential data for assessing growth conditions, estimating crop yields, and optimizing agricultural practices. This study introduces a novel approach to automatic wheat head detection, dubbed CenterFormer, which treats each wheat head as a single point to avoid the ambiguous annotation of dense objects and leverages the long-range dependency modeling capabilities of the Transformer architecture to learn multi-scale features for head prediction. Specifically, we employ a hierarchical Transformer architecture that exploits self-attention in both the spatial and channel domains as the backbone to extract multi-scale features across its hierarchical stages. To maintain the linear complexity of the Transformer block, we implement window-based self-attention in the spatial domain and group-wise self-attention in the channel domain. In addition, to leverage multi-scale features carrying both detailed spatial information and abstract semantic context, we design a simple yet effective fusion block that integrates these features for enhanced wheat head prediction. The prediction block estimates a heat map denoting the probability that each point lies at the center of a wheat head, and regresses other object properties, such as size and sub-pixel offsets, for each center location. Extensive experiments on the Global Wheat Head Detection (GWHD) dataset demonstrate that the proposed method achieves substantial performance improvements over state-of-the-art object detection models.
Keywords—wheat head detection, transformer, self-attention, multi-scale feature fusion, hierarchical architecture, center point, CenterNet
Cite: Ekei Harimoto and Xian-Hua Han, "CenterFormer: Coupling CenterNet and Vision Transformer for Accurate Wheat Head Detection," Journal of Image and Graphics, Vol. 13, No. 5, pp. 476-488, 2025.
Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC-BY-4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.
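To make the center-point formulation summarized in the abstract more concrete, the following is a minimal sketch of how a predicted heat map, together with regressed size and sub-pixel offset maps, might be decoded into bounding boxes. It is an illustration under stated assumptions, not the authors' implementation: the function name `decode_centers`, the 3x3 local-maximum rule, the 0.5 threshold, and the plain nested-list inputs are all hypothetical, and coordinates are left in heat-map grid units (scaling by the network's output stride is omitted).

```python
from typing import List, Tuple

Grid = List[List[float]]  # a 2D map stored as rows of floats


def decode_centers(
    heatmap: Grid,
    offset_x: Grid,
    offset_y: Grid,
    width: Grid,
    height: Grid,
    threshold: float = 0.5,  # assumed score cut-off, not taken from the paper
) -> List[Tuple[float, float, float, float, float]]:
    """Return (x1, y1, x2, y2, score) boxes in heat-map grid coordinates."""
    rows, cols = len(heatmap), len(heatmap[0])
    boxes = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Keep only local maxima of the 3x3 neighbourhood, a simple
            # stand-in for non-maximum suppression (ties are all kept).
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            if score < max(neighbours):
                continue
            # Refine the integer grid position with the sub-pixel offset.
            cx = c + offset_x[r][c]
            cy = r + offset_y[r][c]
            w, h = width[r][c], height[r][c]
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return boxes
```

Counting wheat heads in this sketch reduces to `len(decode_centers(...))`, which is why a point-based representation suits dense scenes: no overlapping box annotations need to be disentangled before counting.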