Manuscript received January 17, 2025; revised April 7, 2025; accepted May 26, 2025; published September 17, 2025.
Abstract—Wheat is a staple crop cultivated widely across the world, making effective management of wheat fields a critical task. A key component of this management is accurately identifying and counting wheat heads, which provides essential data for assessing growth conditions, estimating crop yields, and optimizing agricultural practices. This study introduces a novel approach to automatic wheat head detection, dubbed CenterFormer, which treats each wheat head as a single point to avoid the ambiguous annotation of dense objects and leverages the long-range dependency modeling capability of the Transformer architecture to learn multi-scale features for head prediction. Specifically, we employ a hierarchical Transformer backbone that exploits self-attention in both the spatial and channel domains to extract multi-scale features across its hierarchical stages. To maintain the linear complexity of the Transformer block, we implement window-based self-attention in the spatial domain and group-wise self-attention along the channel dimension. In addition, to leverage multi-scale features that carry both detailed spatial information and abstract semantic context, we design a simple yet effective fusion block that integrates these features for enhanced wheat head prediction. The prediction block estimates a heat map denoting the probability that each point lies at the center of a wheat head, and regresses other object properties, such as size and sub-pixel offset, for each center location. Extensive experiments on the Global Wheat Head Detection (GWHD) dataset demonstrate that the proposed method achieves substantial performance improvements over state-of-the-art object detection models.
Keywords—wheat head detection, transformer, self-attention, multi-scale feature fusion, hierarchical architecture, center point, CenterNet
Cite: Ekei Harimoto and Xian-Hua Han, "CenterFormer: Coupling CenterNet and Vision Transformer for Accurate Wheat Head Detection," Journal of Image and Graphics, Vol. 13, No. 5, pp. 476-488, 2025.
Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC-BY-4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
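The center-point prediction described in the abstract (a heat map of center probabilities plus per-location size and sub-pixel offset) can be illustrated with a minimal decoding sketch in the style of CenterNet. This is not the authors' implementation: the function name `decode_centers`, the tensor layouts, the 3x3 local-maximum rule, the score threshold, the output stride of 4, and the convention that sizes are expressed in output-resolution units are all assumptions made for illustration.

```python
import numpy as np


def decode_centers(heatmap, sizes, offsets, threshold=0.3, stride=4):
    """Turn center-point predictions into boxes (illustrative sketch).

    heatmap: (H, W) float array of center probabilities.
    sizes:   (2, H, W) array holding (width, height) per location,
             assumed to be in output-resolution units.
    offsets: (2, H, W) array holding sub-pixel (dx, dy) per location.
    Returns a list of (x1, y1, x2, y2, score) in input-image pixels.
    """
    h, w = heatmap.shape
    # Pad with -inf so border pixels can still be local maxima.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views to get the 3x3 neighborhood maximum.
    neighborhood = np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    )
    local_max = neighborhood.max(axis=0)
    # Keep a point only if it is a local peak and confident enough.
    keep = (heatmap == local_max) & (heatmap >= threshold)
    ys, xs = np.nonzero(keep)

    detections = []
    for y, x in zip(ys, xs):
        # Refine the integer grid position with the sub-pixel offset,
        # then map back to input-image coordinates via the stride.
        cx = (x + offsets[0, y, x]) * stride
        cy = (y + offsets[1, y, x]) * stride
        bw = sizes[0, y, x] * stride
        bh = sizes[1, y, x] * stride
        detections.append(
            (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2,
             float(heatmap[y, x]))
        )
    return detections
```

Because each head is represented by one peak, counting wheat heads under these assumptions reduces to `len(decode_centers(...))`, and the local-maximum test plays the role that non-maximum suppression plays in box-based detectors.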