# Benchmark

## Backends

- CPU: ncnn, ONNX Runtime, OpenVINO
- GPU: ncnn, TensorRT, PPLNN
## Latency benchmark

### Platform

- Ubuntu 18.04
- ncnn 20211208
- CUDA 11.3
- TensorRT 7.2.3.4
- Docker 20.10.8
- NVIDIA Tesla T4 GPU (with Tensor Cores) for TensorRT
### Other settings

- Static graph
- Batch size: 1
- Devices are synchronized after each inference.
- We report the average inference latency over 100 images of the dataset.
- Warm-up: for ncnn, we warm up 30 iterations for all codebases. For the other backends, we warm up 1010 iterations for classification and 10 iterations for the other codebases.
- Input resolution varies across the datasets of the different codebases. All inputs are real images, except for mmediting, whose dataset is not large enough.
- All latencies below are reported in milliseconds (ms).

Users can test the speed directly through model profiling. Here are the results in our environment.
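For reference, the settings above correspond roughly to the measurement loop below. This is a minimal sketch, not MMDeploy's actual profiler: `model` is a placeholder for any CUDA backend wrapper that can be called on a single preprocessed image, and `images` for the list of preprocessed inputs.

```python
import time

import torch


def measure_latency(model, images, num_warmup=10, num_iters=100):
    """Average single-image latency in milliseconds (batch size 1)."""
    with torch.no_grad():
        # Warm up so one-time initialization and autotuning are not timed.
        for img in images[:num_warmup]:
            model(img)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for img in images[:num_iters]:
            model(img)
            # Synchronize the device after each inference, as in the setup above.
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / num_iters * 1000
```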
### mmcls

| model        | spatial | TensorRT T4 fp32 | TensorRT T4 fp16 | TensorRT T4 int8 | TensorRT JetsonNano2GB fp32 | TensorRT JetsonNano2GB fp16 | TensorRT Jetson TX2 fp32 | PPLNN T4 fp16 | ncnn SnapDragon888 fp32 | ncnn Adreno660 fp32 | Ascend310 fp32 |
| :----------- | :-----: | :--------------: | :--------------: | :--------------: | :-------------------------: | :-------------------------: | :----------------------: | :-----------: | :---------------------: | :-----------------: | :------------: |
| ResNet       | 224x224 | 2.97             | 1.26             | 1.21             | 59.32                       | 30.54                       | 24.13                    | 1.30          | 33.91                   | 25.93               | 2.49           |
| ResNeXt      | 224x224 | 4.31             | 1.42             | 1.37             | 88.10                       | 49.18                       | 37.45                    | 1.36          | 133.44                  | 69.38               | -              |
| SE-ResNet    | 224x224 | 3.41             | 1.66             | 1.51             | 74.59                       | 48.78                       | 29.62                    | 1.91          | 107.84                  | 80.85               | -              |
| ShuffleNetV2 | 224x224 | 1.37             | 1.19             | 1.13             | 15.26                       | 10.23                       | 7.37                     | 4.69          | 9.55                    | 10.66               | -              |
### mmdet

| model        | spatial  | TensorRT T4 fp32 | TensorRT T4 fp16 | TensorRT T4 int8 | TensorRT Jetson TX2 fp32 | PPLNN T4 fp16 |
| :----------- | :------: | :--------------: | :--------------: | :--------------: | :----------------------: | :-----------: |
| YOLOv3       | 320x320  | 14.76            | 24.92            | 24.92            | -                        | 18.07         |
| SSD-Lite     | 320x320  | 8.84             | 9.21             | 8.04             | 1.28                     | 19.72         |
| RetinaNet    | 800x1344 | 97.09            | 25.79            | 16.88            | 780.48                   | 38.34         |
| FCOS         | 800x1344 | 84.06            | 23.15            | 17.68            | -                        | -             |
| FSAF         | 800x1344 | 82.96            | 21.02            | 13.50            | -                        | 30.41         |
| Faster R-CNN | 800x1344 | 88.08            | 26.52            | 19.14            | 733.81                   | 65.40         |
| Mask R-CNN   | 800x1344 | 104.83           | 58.27            | -                | -                        | 86.80         |

The following mmdet models are benchmarked with ncnn on mobile hardware:

| model              | spatial | ncnn SnapDragon888 fp32 | ncnn Adreno660 fp32 |
| :----------------- | :-----: | :---------------------: | :-----------------: |
| MobileNetv2-YOLOv3 | 320x320 | 48.57                   | 66.55               |
| SSD-Lite           | 320x320 | 44.91                   | 66.19               |
| YOLOX              | 416x416 | 111.60                  | 134.50              |
### mmedit

| model  | spatial | TensorRT T4 fp32 | TensorRT T4 fp16 | TensorRT T4 int8 | TensorRT Jetson TX2 fp32 | PPLNN T4 fp16 |
| :----- | :-----: | :--------------: | :--------------: | :--------------: | :----------------------: | :-----------: |
| ESRGAN | 32x32   | 12.64            | 12.42            | 12.45            | -                        | 7.67          |
| SRCNN  | 32x32   | 0.70             | 0.35             | 0.26             | 58.86                    | 0.56          |
### mmocr

| model | spatial | TensorRT T4 fp32 | TensorRT T4 fp16 | TensorRT T4 int8 | PPLNN T4 fp16 | ncnn SnapDragon888 fp32 | ncnn Adreno660 fp32 |
| :---- | :-----: | :--------------: | :--------------: | :--------------: | :-----------: | :---------------------: | :-----------------: |
| DBNet | 640x640 | 10.70            | 5.62             | 5.00             | 34.84         | -                       | -                   |
| CRNN  | 32x32   | 1.93             | 1.40             | 1.36             | -             | 10.57                   | 20.00               |
### mmseg

| model      | spatial  | TensorRT T4 fp32 | TensorRT T4 fp16 | TensorRT T4 int8 | TensorRT Jetson TX2 fp32 | PPLNN T4 fp16 |
| :--------- | :------: | :--------------: | :--------------: | :--------------: | :----------------------: | :-----------: |
| FCN        | 512x1024 | 128.42           | 23.97            | 18.13            | 1682.54                  | 27.00         |
| PSPNet     | 512x1024 | 119.77           | 24.10            | 16.33            | 1586.19                  | 27.26         |
| DeepLabV3  | 512x1024 | 226.75           | 31.80            | 19.85            | -                        | 36.01         |
| DeepLabV3+ | 512x1024 | 151.25           | 47.03            | 50.38            | 2534.96                  | 34.80         |
## Performance benchmark

Users can test the accuracy directly by following `how_to_evaluate_a_model.md`. Here are the results in our environment.
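As a concrete example of the classification metric reported below, here is a minimal, self-contained sketch of top-k accuracy (the top-1/top-5 numbers in the mmcls table). `scores` and `labels` are hypothetical stand-ins for the model's per-class predictions and the ground-truth class indices.

```python
import numpy as np


def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Percentage of samples whose ground truth is among the k highest scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]     # top-k class indices per sample
    hits = (topk == labels[:, None]).any(axis=1)  # is the ground truth among them?
    return float(hits.mean()) * 100


# Toy usage with random predictions over 1000 ImageNet-style classes.
rng = np.random.default_rng(0)
scores = rng.random((100, 1000))
labels = rng.integers(0, 1000, size=100)
print(topk_accuracy(scores, labels, k=1), topk_accuracy(scores, labels, k=5))
```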
### mmcls

| model              | metric | PyTorch fp32 | TorchScript fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | TensorRT int8 | PPLNN fp16 | Ascend fp32 |
| :----------------- | :----: | :----------: | :--------------: | :---------------: | :-----------: | :-----------: | :-----------: | :--------: | :---------: |
| ResNet-18          | top-1  | 69.90        | 69.90            | 69.88             | 69.88         | 69.86         | 69.86         | 69.86      | 69.91       |
| ResNet-18          | top-5  | 89.43        | 89.43            | 89.34             | 89.34         | 89.33         | 89.38         | 89.34      | 89.43       |
| ResNeXt-50         | top-1  | 77.90        | 77.90            | 77.90             | 77.90         | -             | 77.78         | 77.89      | -           |
| ResNeXt-50         | top-5  | 93.66        | 93.66            | 93.66             | 93.66         | -             | 93.64         | 93.65      | -           |
| SE-ResNet-50       | top-1  | 77.74        | 77.74            | 77.74             | 77.74         | 77.75         | 77.63         | 77.73      | -           |
| SE-ResNet-50       | top-5  | 93.84        | 93.84            | 93.84             | 93.84         | 93.83         | 93.72         | 93.84      | -           |
| ShuffleNetV1 1.0x  | top-1  | 68.13        | 68.13            | 68.13             | 68.13         | 68.13         | 67.71         | 68.11      | -           |
| ShuffleNetV1 1.0x  | top-5  | 87.81        | 87.81            | 87.81             | 87.81         | 87.81         | 87.58         | 87.80      | -           |
| ShuffleNetV2 1.0x  | top-1  | 69.55        | 69.55            | 69.55             | 69.55         | 69.54         | 69.10         | 69.54      | -           |
| ShuffleNetV2 1.0x  | top-5  | 88.92        | 88.92            | 88.92             | 88.92         | 88.91         | 88.58         | 88.92      | -           |
| MobileNet V2       | top-1  | 71.86        | 71.86            | 71.86             | 71.86         | 71.87         | 70.91         | 71.84      | 71.87       |
| MobileNet V2       | top-5  | 90.42        | 90.42            | 90.42             | 90.42         | 90.40         | 89.85         | 90.41      | 90.42       |
| Vision Transformer | top-1  | 85.43        | 85.43            | -                 | 85.43         | 85.42         | -             | -          | 85.43       |
| Vision Transformer | top-5  | 97.77        | 97.77            | -                 | 97.77         | 97.76         | -             | -          | 97.77       |
| Swin Transformer   | top-1  | 81.18        | 81.18            | 81.18             | 81.18         | 81.18         | -             | -          | -           |
| Swin Transformer   | top-5  | 95.61        | 95.61            | 95.61             | 95.61         | 95.61         | -             | -          | -           |
### mmdet

| model            | task                  | dataset  | metric  | PyTorch fp32 | TorchScript fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | TensorRT int8 | PPLNN fp16 | Ascend fp32 | OpenVINO fp32 |
| :--------------- | :-------------------- | :------- | :-----: | :----------: | :--------------: | :---------------: | :-----------: | :-----------: | :-----------: | :--------: | :---------: | :-----------: |
| YOLOv3           | Object Detection      | COCO2017 | box AP  | 33.7         | 33.7             | -                 | 33.5          | 33.5          | 33.5          | -          | -           | -             |
| SSD              | Object Detection      | COCO2017 | box AP  | 25.5         | 25.5             | -                 | 25.5          | 25.5          | -             | -          | -           | -             |
| RetinaNet        | Object Detection      | COCO2017 | box AP  | 36.5         | 36.4             | -                 | 36.4          | 36.4          | 36.3          | 36.5       | 36.4        | -             |
| FCOS             | Object Detection      | COCO2017 | box AP  | 36.6         | -                | -                 | 36.6          | 36.5          | -             | -          | -           | -             |
| FSAF             | Object Detection      | COCO2017 | box AP  | 37.4         | 37.4             | -                 | 37.4          | 37.4          | 37.2          | 37.4       | -           | -             |
| CenterNet        | Object Detection      | COCO2017 | box AP  | 25.9         | 26.0             | 26.0              | 26.0          | 25.8          | -             | -          | -           | -             |
| YOLOX            | Object Detection      | COCO2017 | box AP  | 40.5         | 40.3             | -                 | 40.3          | 40.3          | 29.3          | -          | -           | -             |
| Faster R-CNN     | Object Detection      | COCO2017 | box AP  | 37.4         | 37.3             | -                 | 37.3          | 37.3          | 37.1          | 37.3       | 37.2        | -             |
| ATSS             | Object Detection      | COCO2017 | box AP  | 39.4         | -                | -                 | 39.4          | 39.4          | -             | -          | -           | -             |
| Cascade R-CNN    | Object Detection      | COCO2017 | box AP  | 40.4         | -                | -                 | 40.4          | 40.4          | -             | 40.4       | -           | -             |
| GFL              | Object Detection      | COCO2017 | box AP  | 40.2         | -                | 40.2              | 40.2          | 40.0          | -             | -          | -           | -             |
| RepPoints        | Object Detection      | COCO2017 | box AP  | 37.0         | -                | -                 | 36.9          | -             | -             | -          | -           | -             |
| DETR             | Object Detection      | COCO2017 | box AP  | 40.1         | 40.1             | -                 | 40.1          | 40.1          | -             | -          | -           | -             |
| Mask R-CNN       | Instance Segmentation | COCO2017 | box AP  | 38.2         | 38.1             | -                 | 38.1          | 38.1          | -             | 38.0       | -           | -             |
| Mask R-CNN       | Instance Segmentation | COCO2017 | mask AP | 34.7         | 34.7             | -                 | 33.7          | 33.7          | -             | -          | -           | -             |
| Swin-Transformer | Instance Segmentation | COCO2017 | box AP  | 42.7         | -                | 42.7              | 42.5          | 37.7          | -             | -          | -           | -             |
| Swin-Transformer | Instance Segmentation | COCO2017 | mask AP | 39.3         | -                | 39.3              | 39.3          | 35.4          | -             | -          | -           | -             |
| SOLO             | Instance Segmentation | COCO2017 | mask AP | 33.1         | -                | -                 | -             | -             | -             | -          | -           | 32.7          |
### mmedit

| model       | task             | dataset | metric | PyTorch fp32 | TorchScript fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | TensorRT int8 | PPLNN fp16 |
| :---------- | :--------------- | :------ | :----: | :----------: | :--------------: | :---------------: | :-----------: | :-----------: | :-----------: | :--------: |
| SRCNN       | Super Resolution | Set5    | PSNR   | 28.4316      | 28.4120          | 28.4323           | 28.4323       | 28.4286       | 28.1995       | 28.4311    |
| SRCNN       | Super Resolution | Set5    | SSIM   | 0.8099       | 0.8106           | 0.8097            | 0.8097        | 0.8096        | 0.7934        | 0.8096     |
| ESRGAN      | Super Resolution | Set5    | PSNR   | 28.2700      | 28.2619          | 28.2592           | 28.2592       | -             | -             | 28.2624    |
| ESRGAN      | Super Resolution | Set5    | SSIM   | 0.7778       | 0.7784           | 0.7764            | 0.7774        | -             | -             | 0.7765     |
| ESRGAN-PSNR | Super Resolution | Set5    | PSNR   | 30.6428      | 30.6306          | 30.6444           | 30.6430       | -             | -             | 27.0426    |
| ESRGAN-PSNR | Super Resolution | Set5    | SSIM   | 0.8559       | 0.8565           | 0.8558            | 0.8558        | -             | -             | 0.8557     |
| SRGAN       | Super Resolution | Set5    | PSNR   | 27.9499      | 27.9252          | 27.9408           | 27.9408       | -             | -             | 27.9388    |
| SRGAN       | Super Resolution | Set5    | SSIM   | 0.7846       | 0.7851           | 0.7839            | 0.7839        | -             | -             | 0.7839     |
| SRResNet    | Super Resolution | Set5    | PSNR   | 30.2252      | 30.2069          | 30.2300           | 30.2300       | -             | -             | 30.2294    |
| SRResNet    | Super Resolution | Set5    | SSIM   | 0.8491       | 0.8497           | 0.8488            | 0.8488        | -             | -             | 0.8488     |
| Real-ESRNet | Super Resolution | Set5    | PSNR   | 28.0297      | -                | 27.7016           | 27.7016       | -             | -             | 27.7049    |
| Real-ESRNet | Super Resolution | Set5    | SSIM   | 0.8236       | -                | 0.8122            | 0.8122        | -             | -             | 0.8123     |
| EDSR        | Super Resolution | Set5    | PSNR   | 30.2223      | 30.2192          | 30.2214           | 30.2214       | 30.2211       | 30.1383       | -          |
| EDSR        | Super Resolution | Set5    | SSIM   | 0.8500       | 0.8507           | 0.8497            | 0.8497        | 0.8497        | 0.8469        | -          |
### mmocr

| model  | task            | dataset   | metric    | PyTorch fp32 | TorchScript fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | TensorRT int8 | PPLNN fp16 | OpenVINO fp32 |
| :----- | :-------------- | :-------- | :-------: | :----------: | :--------------: | :---------------: | :-----------: | :-----------: | :-----------: | :--------: | :-----------: |
| DBNet* | TextDetection   | ICDAR2015 | recall    | 0.7310       | 0.7308           | 0.7304            | 0.7198        | 0.7179        | 0.7111        | 0.7304     | 0.7309        |
| DBNet* | TextDetection   | ICDAR2015 | precision | 0.8714       | 0.8718           | 0.8714            | 0.8677        | 0.8674        | 0.8688        | 0.8718     | 0.8714        |
| DBNet* | TextDetection   | ICDAR2015 | hmean     | 0.7950       | 0.7949           | 0.7950            | 0.7868        | 0.7856        | 0.7821        | 0.7949     | 0.7950        |
| PSENet | TextDetection   | ICDAR2015 | recall    | 0.7526       | 0.7526           | 0.7526            | 0.7526        | 0.7520        | 0.7496        | -          | 0.7526        |
| PSENet | TextDetection   | ICDAR2015 | precision | 0.8669       | 0.8669           | 0.8669            | 0.8669        | 0.8668        | 0.8550        | -          | 0.8669        |
| PSENet | TextDetection   | ICDAR2015 | hmean     | 0.8057       | 0.8057           | 0.8057            | 0.8057        | 0.8054        | 0.7989        | -          | 0.8057        |
| PANet  | TextDetection   | ICDAR2015 | recall    | 0.7401       | 0.7401           | 0.7401            | 0.7357        | 0.7366        | -             | -          | 0.7401        |
| PANet  | TextDetection   | ICDAR2015 | precision | 0.8601       | 0.8601           | 0.8601            | 0.8570        | 0.8586        | -             | -          | 0.8601        |
| PANet  | TextDetection   | ICDAR2015 | hmean     | 0.7955       | 0.7955           | 0.7955            | 0.7917        | 0.7930        | -             | -          | 0.7955        |
| CRNN   | TextRecognition | IIIT5K    | acc       | 0.8067       | 0.8067           | 0.8067            | 0.8067        | 0.8063        | 0.8067        | 0.8067     | -             |
| SAR    | TextRecognition | IIIT5K    | acc       | 0.9517       | -                | 0.9287            | -             | -             | -             | -          | -             |
| SATRN  | TextRecognition | IIIT5K    | acc       | 0.9470       | 0.9487           | 0.9487            | 0.9487        | 0.9483        | 0.9483        | -          | -             |
### mmseg

| model        | dataset    | metric | PyTorch fp32 | TorchScript fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | TensorRT int8 | PPLNN fp16 | Ascend fp32 |
| :----------- | :--------- | :----: | :----------: | :--------------: | :---------------: | :-----------: | :-----------: | :-----------: | :--------: | :---------: |
| FCN          | Cityscapes | mIoU   | 72.25        | 72.36            | -                 | 72.36         | 72.35         | 74.19         | 72.35      | 72.35       |
| PSPNet       | Cityscapes | mIoU   | 78.55        | 78.66            | -                 | 78.26         | 78.24         | 77.97         | 78.09      | 78.67       |
| deeplabv3    | Cityscapes | mIoU   | 79.09        | 79.12            | -                 | 79.12         | 79.12         | 78.96         | 79.12      | 79.06       |
| deeplabv3+   | Cityscapes | mIoU   | 79.61        | 79.60            | -                 | 79.60         | 79.60         | 79.43         | 79.60      | 79.51       |
| Fast-SCNN    | Cityscapes | mIoU   | 70.96        | 70.96            | -                 | 70.93         | 70.92         | 66.00         | 70.92      | -           |
| UNet         | Cityscapes | mIoU   | 69.10        | -                | -                 | 69.10         | 69.10         | 68.95         | -          | -           |
| ANN          | Cityscapes | mIoU   | 77.40        | -                | -                 | 77.32         | 77.32         | -             | -          | -           |
| APCNet       | Cityscapes | mIoU   | 77.40        | -                | -                 | 77.32         | 77.32         | -             | -          | -           |
| BiSeNetV1    | Cityscapes | mIoU   | 74.44        | -                | -                 | 74.44         | 74.43         | -             | -          | -           |
| BiSeNetV2    | Cityscapes | mIoU   | 73.21        | -                | -                 | 73.21         | 73.21         | -             | -          | -           |
| CGNet        | Cityscapes | mIoU   | 68.25        | -                | -                 | 68.27         | 68.27         | -             | -          | -           |
| EMANet       | Cityscapes | mIoU   | 77.59        | -                | -                 | 77.59         | 77.6          | -             | -          | -           |
| EncNet       | Cityscapes | mIoU   | 75.67        | -                | -                 | 75.66         | 75.66         | -             | -          | -           |
| ERFNet       | Cityscapes | mIoU   | 71.08        | -                | -                 | 71.08         | 71.07         | -             | -          | -           |
| FastFCN      | Cityscapes | mIoU   | 79.12        | -                | -                 | 79.12         | 79.12         | -             | -          | -           |
| GCNet        | Cityscapes | mIoU   | 77.69        | -                | -                 | 77.69         | 77.69         | -             | -          | -           |
| ICNet        | Cityscapes | mIoU   | 76.29        | -                | -                 | 76.36         | 76.36         | -             | -          | -           |
| ISANet       | Cityscapes | mIoU   | 78.49        | -                | -                 | 78.49         | 78.49         | -             | -          | -           |
| OCRNet       | Cityscapes | mIoU   | 74.30        | -                | -                 | 73.66         | 73.67         | -             | -          | -           |
| PointRend    | Cityscapes | mIoU   | 76.47        | 76.47            | -                 | 76.41         | 76.42         | -             | -          | -           |
| Semantic FPN | Cityscapes | mIoU   | 74.52        | -                | -                 | 74.52         | 74.52         | -             | -          | -           |
| STDC         | Cityscapes | mIoU   | 75.10        | -                | -                 | 75.10         | 75.10         | -             | -          | -           |
| STDC         | Cityscapes | mIoU   | 77.17        | -                | -                 | 77.17         | 77.17         | -             | -          | -           |
| UPerNet      | Cityscapes | mIoU   | 77.10        | -                | -                 | 77.19         | 77.18         | -             | -          | -           |
| Segmenter    | ADE20K     | mIoU   | 44.32        | 44.29            | 44.29             | 44.29         | 43.34         | 43.35         | -          | -           |
### mmpose

| model     | task           | dataset | metric | PyTorch fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | PPLNN fp16 | OpenVINO fp32 |
| :-------- | :------------- | :------ | :----: | :----------: | :---------------: | :-----------: | :-----------: | :--------: | :-----------: |
| HRNet     | Pose Detection | COCO    | AP     | 0.748        | 0.748             | 0.748         | 0.748         | -          | 0.748         |
| HRNet     | Pose Detection | COCO    | AR     | 0.802        | 0.802             | 0.802         | 0.802         | -          | 0.802         |
| LiteHRNet | Pose Detection | COCO    | AP     | 0.663        | 0.663             | 0.663         | -             | -          | 0.663         |
| LiteHRNet | Pose Detection | COCO    | AR     | 0.728        | 0.728             | 0.728         | -             | -          | 0.728         |
| MSPN      | Pose Detection | COCO    | AP     | 0.762        | 0.762             | 0.762         | 0.762         | -          | 0.762         |
| MSPN      | Pose Detection | COCO    | AR     | 0.825        | 0.825             | 0.825         | 0.825         | -          | 0.825         |
| Hourglass | Pose Detection | COCO    | AP     | 0.717        | 0.717             | 0.717         | 0.717         | -          | 0.717         |
| Hourglass | Pose Detection | COCO    | AR     | 0.774        | 0.774             | 0.774         | 0.774         | -          | 0.774         |
| SimCC     | Pose Detection | COCO    | AP     | 0.607        | -                 | 0.608         | -             | -          | -             |
| SimCC     | Pose Detection | COCO    | AR     | 0.668        | -                 | 0.672         | -             | -          | -             |
### mmrotate

| model             | task              | dataset   | metric | PyTorch fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | PPLNN fp16 | OpenVINO fp32 |
| :---------------- | :---------------- | :-------- | :----: | :----------: | :---------------: | :-----------: | :-----------: | :--------: | :-----------: |
| RotatedRetinaNet  | Rotated Detection | DOTA-v1.0 | mAP    | 0.698        | 0.698             | 0.698         | 0.697         | -          | -             |
| Oriented RCNN     | Rotated Detection | DOTA-v1.0 | mAP    | 0.756        | 0.756             | 0.758         | 0.730         | -          | -             |
| GlidingVertex     | Rotated Detection | DOTA-v1.0 | mAP    | 0.732        | -                 | 0.733         | 0.731         | -          | -             |
| RoI Transformer   | Rotated Detection | DOTA-v1.0 | mAP    | 0.761        | -                 | 0.758         | -             | -          | -             |
### mmaction2

| model    | task        | dataset      | metric | PyTorch fp32 | ONNX Runtime fp32 | TensorRT fp32 | TensorRT fp16 | PPLNN fp16 | OpenVINO fp32 |
| :------- | :---------- | :----------- | :----: | :----------: | :---------------: | :-----------: | :-----------: | :--------: | :-----------: |
| TSN      | Recognition | Kinetics-400 | top-1  | 69.71        | -                 | 69.71         | -             | -          | -             |
| TSN      | Recognition | Kinetics-400 | top-5  | 88.75        | -                 | 88.75         | -             | -          | -             |
| SlowFast | Recognition | Kinetics-400 | top-1  | 74.45        | -                 | 75.62         | -             | -          | -             |
| SlowFast | Recognition | Kinetics-400 | top-5  | 91.55        | -                 | 92.10         | -             | -          | -             |
## Notes

- Since some datasets contain images of various resolutions in codebases such as MMDet, the speed benchmark is obtained with static configs in MMDeploy, while the performance benchmark is obtained with dynamic ones.
- Some int8 performance benchmarks of TensorRT require NVIDIA GPUs with Tensor Cores; otherwise the performance drops heavily.
- DBNet uses the interpolation mode `nearest` in the neck of the model, for which TensorRT-7 applies a strategy quite different from PyTorch's. To make the repository compatible with TensorRT-7, we rewrote the neck to use the interpolation mode `bilinear`, which improves final detection performance. To obtain performance matching PyTorch, TensorRT-8+ is recommended, where the interpolation methods behave the same as in PyTorch.
- Mask AP of Mask R-CNN drops by 1% for the backends. The main reason is that in PyTorch the predicted masks are interpolated directly to the original image, while in the other backends they are first interpolated to the preprocessed input of the model and only then to the original image.
- MMPose models are tested with `flip_test` explicitly set to `False` in the model configs.
- Some models might show low accuracy in fp16 mode. Please adjust the model to avoid value overflow; see the config sketch after this list for where fp16 is toggled.
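For TensorRT, fp16 is toggled in the deploy config. The sketch below is modeled on the structure of the configs shipped with MMDeploy; field names such as `fp16_mode` and `max_workspace_size` reflect those configs at the time of writing, so verify them against your MMDeploy version. Setting `fp16_mode=False` rebuilds the engine in fp32 when fp16 accuracy is unacceptable.

```python
# Sketch of a TensorRT deploy config with fp16 enabled, modeled on the
# backend configs in MMDeploy. The shapes below assume a 224x224
# classification model; adapt them to your own input.
backend_config = dict(
    type='tensorrt',
    common_config=dict(
        fp16_mode=True,               # set False to fall back to fp32
        max_workspace_size=1 << 30),  # 1 GiB workspace for engine building
    model_inputs=[
        dict(
            input_shapes=dict(
                input=dict(
                    min_shape=[1, 3, 224, 224],
                    opt_shape=[1, 3, 224, 224],
                    max_shape=[1, 3, 224, 224])))
    ])
```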