Tsinghua Takes Over, YOLOv10 Debuts: Significant Performance Boost, Tops GitHub Trending

Thu, May 30 2024 07:42 AM EST

Synced Report

Synced Editorial Team

The YOLO series, a benchmark family of object detection systems, has once again received a major upgrade. Since the release of YOLOv9 in February this year, the baton of the YOLO (You Only Look Once) series has been passed to researchers at Tsinghua University.

Last weekend, the news of YOLOv10's launch sparked attention in the AI community. It is considered a groundbreaking framework in the field of computer vision, renowned for its real-time end-to-end object detection capabilities. By offering a powerful solution that combines efficiency and accuracy, YOLOv10 continues the tradition of the YOLO series.

Paper link: https://arxiv.org/pdf/2405.14458

Project link: https://github.com/THU-MIG/yolov10

Since the new version was released, many people have already run deployment tests, and the results are promising.

YOLO has been a dominant paradigm in real-time object detection thanks to its strong performance and low computational cost. It is widely used in practical applications such as autonomous driving, surveillance, and logistics. Its efficient and accurate detection makes it well suited to real-time tasks such as recognizing pedestrians and vehicles, and in logistics it supports inventory management and package tracking.

Over the years, researchers have made significant progress in exploring YOLO's architecture design, optimization objectives, and data augmentation strategies. However, the reliance on post-processing like non-maximum suppression (NMS) has hindered YOLO's end-to-end deployment and negatively impacted inference latency. Additionally, the lack of thorough examination in the design of YOLO's components has led to evident computational redundancy and limited the model's capabilities.
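To make the post-processing step concrete, here is a minimal sketch of greedy NMS in plain Python. It is an illustration only, not the implementation used by any YOLO release; the function names, the box format (x1, y1, x2, y2), and the 0.5 threshold are assumptions chosen for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return the indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object plus one separate object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

Every candidate is compared against the boxes already kept, so the cost of this step grows with the number of raw predictions. That extra work after the network has finished is the latency that an NMS-free detector avoids.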

The breakthrough of YOLOv10 lies in pushing YOLO's performance-efficiency boundary further on two fronts: post-processing and model architecture.

To this end, the research team first proposed consistent dual assignments for NMS-free training of YOLO, which improve both performance and inference latency.

The research team also proposed a holistic efficiency-accuracy driven model design strategy for YOLO, optimizing its components comprehensively from both efficiency and accuracy perspectives, which significantly reduces computational cost and enhances model capability.

Extensive experiments demonstrate that YOLOv10 achieves state-of-the-art performance and efficiency across various model scales. For instance, YOLOv10-S is 1.8 times faster than RT-DETR-R18 at a similar AP on COCO, with a substantial reduction in parameter count and FLOPs. Compared to YOLOv9-C, YOLOv10-B shows a 46% decrease in latency and a 25% reduction in parameters while maintaining the same performance level.

Introduction to Methods

To achieve a model design driven jointly by efficiency and accuracy, the research team proposed improvements from each of the two perspectives.

To enhance efficiency, the study introduces a lightweight classification head, spatial-channel decoupled downsampling, and rank-guided block design to reduce evident computational redundancy and achieve a more efficient architecture.
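As a rough illustration of why decoupling helps, the sketch below counts multiply-accumulate operations and weights for two ways of halving the resolution while doubling the channels. It assumes, following the paper's description rather than anything stated in this article, that the decoupled version uses a 1x1 pointwise convolution to change the channel count followed by a 3x3 stride-2 depthwise convolution to downsample. The function names and the 80x80x64 feature map are hypothetical, and biases and normalization layers are ignored.

```python
from typing import Tuple


def standard_downsample_cost(h: int, w: int, c: int) -> Tuple[int, int]:
    """Cost of one 3x3 stride-2 convolution mapping c to 2c channels."""
    out_positions = (h // 2) * (w // 2)
    macs = out_positions * (2 * c) * (9 * c)  # each output value reads 9*c inputs
    params = 9 * c * (2 * c)
    return macs, params


def decoupled_downsample_cost(h: int, w: int, c: int) -> Tuple[int, int]:
    """Cost of a 1x1 pointwise conv (c to 2c) plus a 3x3 stride-2 depthwise conv."""
    # Pointwise convolution runs at full resolution and only mixes channels.
    pw_macs = h * w * c * (2 * c)
    pw_params = c * (2 * c)
    # Depthwise convolution downsamples each of the 2c channels independently.
    dw_macs = (h // 2) * (w // 2) * (2 * c) * 9
    dw_params = 9 * (2 * c)
    return pw_macs + dw_macs, pw_params + dw_params


# Hypothetical feature map: 80 x 80 spatial size with 64 channels.
std_macs, std_params = standard_downsample_cost(80, 80, 64)
dec_macs, dec_params = decoupled_downsample_cost(80, 80, 64)
print(f"standard:  {std_macs:,} MACs, {std_params:,} weights")
print(f"decoupled: {dec_macs:,} MACs, {dec_params:,} weights")
```

For this example the decoupled layer needs less than half the multiply-accumulates and roughly an eighth of the weights. These are operation counts, not measured latency, which also depends on hardware and memory access.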

For improved accuracy, the research team explores large kernel convolutions and introduces effective Partial Self-Attention (PSA) modules to enhance model capabilities, uncovering performance improvements at a low cost. Based on these methods, the team successfully implements a series of real-time end-to-end detectors of various scales, namely YOLOv10-N/S/M/B/L/X.

Consistent Dual Label Assignment for NMS-Free Training

During training, YOLO typically uses Task Alignment Learning (TAL) to assign multiple positive samples to each instance. This one-to-many assignment generates rich supervision signals, facilitating optimization and enabling the model to achieve outstanding performance.

However, one-to-many assignment makes YOLO rely on NMS post-processing during deployment, which leads to suboptimal inference efficiency. Prior research has explored one-to-one matching to suppress redundant predictions, but these approaches often introduce additional inference overhead.

In contrast to one-to-many assignment, one-to-one matching assigns only one prediction to each ground truth, avoiding NMS post-processing. However, this results in weaker supervision, leading to less than ideal accuracy and convergence speed. Fortunately, this deficiency can be offset by also training with one-to-many assignment.
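The difference between the two assignment schemes can be sketched in a few lines of Python. This is a simplified illustration, not the authors' code: it assumes a matching metric of the form cls_score^alpha * IoU^beta, similar in spirit to the one described in the paper, and it leaves out the spatial prior. The function names, the exponent values, and the sample numbers are all hypothetical.

```python
from typing import List


def matching_metric(cls_score: float, iou: float,
                    alpha: float = 0.5, beta: float = 6.0) -> float:
    """Score how well one prediction matches one ground-truth box.

    Assumed form: cls_score**alpha * iou**beta. The exponents are
    illustrative defaults, not values stated in this article.
    """
    return (cls_score ** alpha) * (iou ** beta)


def one_to_many(metrics: List[float], k: int = 3) -> List[int]:
    """Indices of the k best predictions: positives for the one-to-many head."""
    order = sorted(range(len(metrics)), key=lambda i: metrics[i], reverse=True)
    return order[:k]


def one_to_one(metrics: List[float]) -> int:
    """Index of the single best prediction: the positive for the one-to-one head."""
    return max(range(len(metrics)), key=lambda i: metrics[i])


# Four candidate predictions for a single ground-truth object.
cls_scores = [0.9, 0.6, 0.8, 0.3]
ious = [0.7, 0.9, 0.8, 0.5]
metrics = [matching_metric(s, u) for s, u in zip(cls_scores, ious)]

many = one_to_many(metrics, k=3)  # several positives, richer supervision
single = one_to_one(metrics)      # one positive, no duplicates to suppress
```

Because both functions rank predictions with the same metric, the single positive chosen by one-to-one matching is also the top-ranked positive under one-to-many assignment. Keeping the two rankings aligned is the intuition behind using a consistent matching metric for both.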

The proposed "Dual Label Assignment" in this study combines the advantages of the above two strategies. As shown in the diagram below, the research introduces another one-to-one head for YOLO. It retains the same structure as the original one-to-many branch and adopts the same optimization objective, but uses one-to-one matching for label assignment. During training, both heads are jointly optimized to provide rich supervision; during inference, YOLOv10 discards the one-to-many head and utilizes the one-to-one head for predictions. This enables YOLO to be deployed end-to-end without incurring any additional inference costs.

Efficiency-Driven Model Design for YOLO

Apart from post-processing, the model architecture of YOLO also poses a significant challenge in balancing efficiency and accuracy. While previous research has explored various design strategies, there remains a lack of comprehensive examination of the components within YOLO. As a result, the model architecture exhibits noticeable computational redundancy and limited capabilities.

The components in YOLO include the stem, downsampling layers, stages with basic building blocks, and the head. The authors primarily focus on efficiency-driven model design for the following three parts:

Lightweight classification head
Spatial-channel decoupled downsampling
Rank-guided block design

To achieve accuracy-driven model design, the research team further explored large kernel convolutions and self-attention mechanisms, aiming to enhance model performance at minimal cost.

Experiment

As shown in Table 1, the YOLOv10 developed by the Tsinghua team achieved state-of-the-art performance and end-to-end latency across various model scales.

The study also conducted ablation experiments on YOLOv10-S and YOLOv10-M, with the experimental results shown in the table below.

As shown in the table below, dual label assignment achieves the best trade-off between AP and latency, and using a consistent matching metric yields the best performance.

As shown in the table below, each design component, including the lightweight classification head, spatial-channel decoupled downsampling, and rank-guided block design, contributes to reducing the number of parameters, FLOPs, and latency. Importantly, these reductions are achieved while maintaining excellent performance.

Analysis of model design driven by accuracy. The researchers present the results of incrementally integrating the accuracy-driven design elements into YOLOv10-S/M.

As shown in Table 10, employing large kernel convolutions and PSA modules improved the performance of YOLOv10-S by 0.4% AP and 1.4% AP, respectively, with minimal latency increases of 0.03ms and 0.15ms.
