[Feature] YOLOv8 supports using mask annotation to optimize bbox (#484)

* add cfg * add copypaste * add todo * 在mosaic和mixup中处理gt_masks,改config * fix cat bug * add finetune box in affine * add repr * del albu config in l * add doc * add config * format code * fix loadmask * addconfig,fix mask * fix loadann * fix tra * update LoadAnnotations * update * support mask * fix error * fix error * fix config and no maskrefine bug * fix * fix * update config * format code * beauty config * add yolov5 config and readme * beauty yolov5 config * add ut * fix ut. bitmap 2 poly * fix ut and add mix transform ut. * fix bool * fix loadann * rollback yolov5 * rollback yolov5 * format * 提高速度 * update --------- Co-authored-by: huanghaian <huanghaian@sensetime.com>
2023-02-20 11:11:13 +08:00 · 2023-02-20 11:11:13 +08:00 · 75fc8fc2a3
parent cbadd3abe4
commit 75fc8fc2a3
15 changed files with 1087 additions and 191 deletions
--- a/configs/yolov8/README.md
+++ b/configs/yolov8/README.md
@ -20,19 +20,25 @@ YOLOv8-P5 model structure

 ### COCO

-| Backbone | Arch | size | SyncBN | AMP | Mem (GB) | box AP |                                                     Config                                                     |                                                                                                                                                           Download                                                                                                                                                           |
-| :------: | :--: | :--: | :----: | :-: | :------: | :----: | :------------------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
-| YOLOv8-n |  P5  | 640  |  Yes   | Yes |   2.8    |  37.2  | [config](https://github.com/open-mmlab/mmyolo/blob/dev/configs/yolov8/yolov8_n_syncbn_fast_8xb16-500e_coco.py) | [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_n_syncbn_fast_8xb16-500e_coco/yolov8_n_syncbn_fast_8xb16-500e_coco_20230114_131804-88c11cdb.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_n_syncbn_fast_8xb16-500e_coco/yolov8_n_syncbn_fast_8xb16-500e_coco_20230114_131804.log.json) |
-| YOLOv8-s |  P5  | 640  |  Yes   | Yes |   4.0    |  44.2  | [config](https://github.com/open-mmlab/mmyolo/blob/dev/configs/yolov8/yolov8_s_syncbn_fast_8xb16-500e_coco.py) | [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_s_syncbn_fast_8xb16-500e_coco/yolov8_s_syncbn_fast_8xb16-500e_coco_20230117_180101-5aa5f0f1.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_s_syncbn_fast_8xb16-500e_coco/yolov8_s_syncbn_fast_8xb16-500e_coco_20230117_180101.log.json) |
-| YOLOv8-m |  P5  | 640  |  Yes   | Yes |   7.2    |  49.8  | [config](https://github.com/open-mmlab/mmyolo/blob/dev/configs/yolov8/yolov8_m_syncbn_fast_8xb16-500e_coco.py) | [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_m_syncbn_fast_8xb16-500e_coco/yolov8_m_syncbn_fast_8xb16-500e_coco_20230115_192200-c22e560a.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_m_syncbn_fast_8xb16-500e_coco/yolov8_m_syncbn_fast_8xb16-500e_coco_20230115_192200.log.json) |
+| Backbone | Arch | size | Mask Refine | SyncBN | AMP | Mem (GB) |   box AP    |                                 Config                                  |                                                                                                                                                                                   Download                                                                                                                                                                                   |
+| :------: | :--: | :--: | :---------: | :----: | :-: | :------: | :---------: | :---------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| YOLOv8-n |  P5  | 640  |     No      |  Yes   | Yes |   2.8    |    37.2     |       [config](../yolov8/yolov8_n_syncbn_fast_8xb16-500e_coco.py)       |                         [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_n_syncbn_fast_8xb16-500e_coco/yolov8_n_syncbn_fast_8xb16-500e_coco_20230114_131804-88c11cdb.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_n_syncbn_fast_8xb16-500e_coco/yolov8_n_syncbn_fast_8xb16-500e_coco_20230114_131804.log.json)                         |
+| YOLOv8-n |  P5  | 640  |     Yes     |  Yes   | Yes |   2.5    | 37.4 (+0.2) | [config](../yolov8/yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco.py) | [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco_20230216_101206-b975b1cd.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco_20230216_101206.log.json) |
+| YOLOv8-s |  P5  | 640  |     No      |  Yes   | Yes |   4.0    |    44.2     |       [config](../yolov8/yolov8_s_syncbn_fast_8xb16-500e_coco.py)       |                         [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_s_syncbn_fast_8xb16-500e_coco/yolov8_s_syncbn_fast_8xb16-500e_coco_20230117_180101-5aa5f0f1.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_s_syncbn_fast_8xb16-500e_coco/yolov8_s_syncbn_fast_8xb16-500e_coco_20230117_180101.log.json)                         |
+| YOLOv8-s |  P5  | 640  |     Yes     |  Yes   | Yes |   4.0    | 45.1 (+0.9) | [config](../yolov8/yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco.py) | [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco_20230216_095938-ce3c1b3f.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco_20230216_095938.log.json) |
+| YOLOv8-m |  P5  | 640  |     No      |  Yes   | Yes |   7.2    |    49.8     |       [config](../yolov8/yolov8_m_syncbn_fast_8xb16-500e_coco.py)       |                         [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_m_syncbn_fast_8xb16-500e_coco/yolov8_m_syncbn_fast_8xb16-500e_coco_20230115_192200-c22e560a.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_m_syncbn_fast_8xb16-500e_coco/yolov8_m_syncbn_fast_8xb16-500e_coco_20230115_192200.log.json)                         |
+| YOLOv8-m |  P5  | 640  |     Yes     |  Yes   | Yes |   7.0    | 50.6 (+0.8) | [config](../yolov8/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco.py) | [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco_20230216_223400-f40abfcd.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco_20230216_223400.log.json) |
+| YOLOv8-l |  P5  | 640  |     No      |  Yes   | Yes |   9.8    |    52.1     |       [config](../yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco.py)       |                         [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco/yolov8_l_syncbn_fast_8xb16-500e_coco_20230217_182526-189611b6.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco/yolov8_l_syncbn_fast_8xb16-500e_coco_20230217_182526.log.json)                         |
+| YOLOv8-l |  P5  | 640  |     Yes     |  Yes   | Yes |   9.1    | 53.0 (+0.9) | [config](../yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py) | [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco_20230217_120100-5881dec4.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco_20230217_120100.log.json) |
+| YOLOv8-x |  P5  | 640  |     No      |  Yes   | Yes |   12.2   |    52.7     |       [config](../yolov8/yolov8_x_syncbn_fast_8xb16-500e_coco.py)       |                         [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_x_syncbn_fast_8xb16-500e_coco/yolov8_x_syncbn_fast_8xb16-500e_coco_20230218_023338-5674673c.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_x_syncbn_fast_8xb16-500e_coco/yolov8_x_syncbn_fast_8xb16-500e_coco_20230218_023338.log.json)                         |
+| YOLOv8-x |  P5  | 640  |     Yes     |  Yes   | Yes |   12.4   | 54.0 (+1.3) | [config](../yolov8/yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco.py) | [model](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco_20230217_120411-079ca8d1.pth) \| [log](https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco_20230217_120411.log.json) |

 **Note**

-In the official YOLOv8 code, the [bbox annotation](https://github.com/ultralytics/ultralytics/blob/0cb87f7dd340a2611148fbf2a0af59b544bd7b1b/ultralytics/yolo/data/dataloaders/v5loader.py#L1011), [`random_perspective`](https://github.com/ultralytics/ultralytics/blob/0cb87f7dd3/ultralytics/yolo/data/dataloaders/v5augmentations.py#L208) and [`copy_paste`](https://github.com/ultralytics/ultralytics/blob/0cb87f7dd3/ultralytics/yolo/data/dataloaders/v5augmentations.py#L208) data augmentation in COCO object detection task training uses mask annotation information, which leads to higher performance. Object detection should not use mask annotation, so only box annotation information is used in `MMYOLO`. We trained the official YOLOv8s code with `8xb16` configuration and its best performance is also 44.2. We will support mask annotations in object detection tasks in the next version.
-
 1. We use 8x A100 for training, and the single-GPU batch size is 16. This is different from the official code, but has no effect on performance.
 2. The performance is unstable and may fluctuate by about 0.3 mAP and the highest performance weight in `COCO` training in `YOLOv8` may not be the last epoch. The performance shown above is the best model.
 3. We provide [scripts](https://github.com/open-mmlab/mmyolo/tree/dev/tools/model_converters/yolov8_to_mmyolo.py) to convert official weights to MMYOLO.
-4. `SyncBN` means use SyncBN, `AMP` indicates training with mixed precision.
+4. `SyncBN` means using SyncBN, `AMP` indicates training with mixed precision.
+5. The performance of `Mask Refine` training is for the weight performance officially released by YOLOv8. `Mask Refine` means refining bbox by mask while loading annotations and transforming after `YOLOv5RandomAffine`, and the L and X models use `Copy Paste`.

 ## Citation
--- a/configs/yolov8/metafile.yml
+++ b/configs/yolov8/metafile.yml
@ -54,3 +54,87 @@ Models:
        Metrics:
          box AP: 49.8
    Weights: https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_m_syncbn_fast_8xb16-500e_coco/yolov8_m_syncbn_fast_8xb16-500e_coco_20230115_192200-c22e560a.pth
+  - Name: yolov8_l_syncbn_fast_8xb16-500e_coco
+    In Collection: YOLOv8
+    Config: configs/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco.py
+    Metadata:
+      Training Memory (GB): 9.8
+      Epochs: 500
+    Results:
+      - Task: Object Detection
+        Dataset: COCO
+        Metrics:
+          box AP: 52.1
+    Weights: https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco/yolov8_l_syncbn_fast_8xb16-500e_coco_20230217_182526-189611b6.pth
+  - Name: yolov8_x_syncbn_fast_8xb16-500e_coco
+    In Collection: YOLOv8
+    Config: configs/yolov8/yolov8_x_syncbn_fast_8xb16-500e_coco.py
+    Metadata:
+      Training Memory (GB): 12.2
+      Epochs: 500
+    Results:
+      - Task: Object Detection
+        Dataset: COCO
+        Metrics:
+          box AP: 52.7
+    Weights: https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_x_syncbn_fast_8xb16-500e_coco/yolov8_x_syncbn_fast_8xb16-500e_coco_20230218_023338-5674673c.pth
+  - Name: yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco
+    In Collection: YOLOv8
+    Config: configs/yolov8/yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco.py
+    Metadata:
+      Training Memory (GB): 2.5
+      Epochs: 500
+    Results:
+      - Task: Object Detection
+        Dataset: COCO
+        Metrics:
+          box AP: 37.4
+    Weights: https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco_20230216_101206-b975b1cd.pth
+  - Name: yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco
+    In Collection: YOLOv8
+    Config: configs/yolov8/yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco.py
+    Metadata:
+      Training Memory (GB): 4.0
+      Epochs: 500
+    Results:
+      - Task: Object Detection
+        Dataset: COCO
+        Metrics:
+          box AP: 45.1
+    Weights: https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco_20230216_095938-ce3c1b3f.pth
+  - Name: yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco
+    In Collection: YOLOv8
+    Config: configs/yolov8/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco.py
+    Metadata:
+      Training Memory (GB): 7.0
+      Epochs: 500
+    Results:
+      - Task: Object Detection
+        Dataset: COCO
+        Metrics:
+          box AP: 50.6
+    Weights: https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco_20230216_223400-f40abfcd.pth
+  - Name: yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco
+    In Collection: YOLOv8
+    Config: configs/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py
+    Metadata:
+      Training Memory (GB): 9.1
+      Epochs: 500
+    Results:
+      - Task: Object Detection
+        Dataset: COCO
+        Metrics:
+          box AP: 53.0
+    Weights: https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco_20230217_120100-5881dec4.pth
+  - Name: yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco
+    In Collection: YOLOv8
+    Config: configs/yolov8/yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco.py
+    Metadata:
+      Training Memory (GB): 12.4
+      Epochs: 500
+    Results:
+      - Task: Object Detection
+        Dataset: COCO
+        Metrics:
+          box AP: 54.0
+    Weights: https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco_20230217_120411-079ca8d1.pth
--- a/configs/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py
+++ b/configs/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py
@ -0,0 +1,65 @@
+_base_ = './yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco.py'
+
+# This config use refining bbox and `YOLOv5CopyPaste`.
+# Refining bbox means refining bbox by mask while loading annotations and
+# transforming after `YOLOv5RandomAffine`
+
+# ========================modified parameters======================
+deepen_factor = 1.00
+widen_factor = 1.00
+last_stage_out_channels = 512
+
+mixup_prob = 0.15
+copypaste_prob = 0.3
+
+# =======================Unmodified in most cases==================
+img_scale = _base_.img_scale
+pre_transform = _base_.pre_transform
+last_transform = _base_.last_transform
+affine_scale = _base_.affine_scale
+
+model = dict(
+    backbone=dict(
+        last_stage_out_channels=last_stage_out_channels,
+        deepen_factor=deepen_factor,
+        widen_factor=widen_factor),
+    neck=dict(
+        deepen_factor=deepen_factor,
+        widen_factor=widen_factor,
+        in_channels=[256, 512, last_stage_out_channels],
+        out_channels=[256, 512, last_stage_out_channels]),
+    bbox_head=dict(
+        head_module=dict(
+            widen_factor=widen_factor,
+            in_channels=[256, 512, last_stage_out_channels])))
+
+mosaic_affine_transform = [
+    dict(
+        type='Mosaic',
+        img_scale=img_scale,
+        pad_val=114.0,
+        pre_transform=pre_transform),
+    dict(type='YOLOv5CopyPaste', prob=copypaste_prob),
+    dict(
+        type='YOLOv5RandomAffine',
+        max_rotate_degree=0.0,
+        max_shear_degree=0.0,
+        max_aspect_ratio=100.,
+        scaling_ratio_range=(1 - affine_scale, 1 + affine_scale),
+        # img_scale is (width, height)
+        border=(-img_scale[0] // 2, -img_scale[1] // 2),
+        border_val=(114, 114, 114),
+        min_area_ratio=_base_.min_area_ratio,
+        use_mask_refine=_base_.use_mask2refine)
+]
+
+train_pipeline = [
+    *pre_transform, *mosaic_affine_transform,
+    dict(
+        type='YOLOv5MixUp',
+        prob=mixup_prob,
+        pre_transform=[*pre_transform, *mosaic_affine_transform]),
+    *last_transform
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
--- a/configs/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco.py
+++ b/configs/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco.py
@ -8,6 +8,10 @@ last_stage_out_channels = 512
 mixup_prob = 0.15

 # =======================Unmodified in most cases==================
+pre_transform = _base_.pre_transform
+mosaic_affine_transform = _base_.mosaic_affine_transform
+last_transform = _base_.last_transform
+
 model = dict(
    backbone=dict(
        last_stage_out_channels=last_stage_out_channels,
@ -23,17 +27,12 @@ model = dict(
            widen_factor=widen_factor,
            in_channels=[256, 512, last_stage_out_channels])))

-pre_transform = _base_.pre_transform
-albu_train_transform = _base_.albu_train_transform
-mosaic_affine_pipeline = _base_.mosaic_affine_pipeline
-last_transform = _base_.last_transform
-
 train_pipeline = [
-    *pre_transform, *mosaic_affine_pipeline,
+    *pre_transform, *mosaic_affine_transform,
    dict(
        type='YOLOv5MixUp',
        prob=mixup_prob,
-        pre_transform=[*pre_transform, *mosaic_affine_pipeline]),
+        pre_transform=[*pre_transform, *mosaic_affine_transform]),
    *last_transform
 ]

--- a/configs/yolov8/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco.py
+++ b/configs/yolov8/yolov8_m_mask-refine_syncbn_fast_8xb16-500e_coco.py
@ -0,0 +1,85 @@
+_base_ = './yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco.py'
+
+# This config use refining bbox and `YOLOv5CopyPaste`.
+# Refining bbox means refining bbox by mask while loading annotations and
+# transforming after `YOLOv5RandomAffine`
+
+# ========================modified parameters======================
+deepen_factor = 0.67
+widen_factor = 0.75
+last_stage_out_channels = 768
+
+affine_scale = 0.9
+mixup_prob = 0.1
+copypaste_prob = 0.1
+
+# ===============================Unmodified in most cases====================
+img_scale = _base_.img_scale
+pre_transform = _base_.pre_transform
+last_transform = _base_.last_transform
+
+model = dict(
+    backbone=dict(
+        last_stage_out_channels=last_stage_out_channels,
+        deepen_factor=deepen_factor,
+        widen_factor=widen_factor),
+    neck=dict(
+        deepen_factor=deepen_factor,
+        widen_factor=widen_factor,
+        in_channels=[256, 512, last_stage_out_channels],
+        out_channels=[256, 512, last_stage_out_channels]),
+    bbox_head=dict(
+        head_module=dict(
+            widen_factor=widen_factor,
+            in_channels=[256, 512, last_stage_out_channels])))
+
+mosaic_affine_transform = [
+    dict(
+        type='Mosaic',
+        img_scale=img_scale,
+        pad_val=114.0,
+        pre_transform=pre_transform),
+    dict(type='YOLOv5CopyPaste', prob=copypaste_prob),
+    dict(
+        type='YOLOv5RandomAffine',
+        max_rotate_degree=0.0,
+        max_shear_degree=0.0,
+        max_aspect_ratio=100.,
+        scaling_ratio_range=(1 - affine_scale, 1 + affine_scale),
+        # img_scale is (width, height)
+        border=(-img_scale[0] // 2, -img_scale[1] // 2),
+        border_val=(114, 114, 114),
+        min_area_ratio=_base_.min_area_ratio,
+        use_mask_refine=_base_.use_mask2refine)
+]
+
+train_pipeline = [
+    *pre_transform, *mosaic_affine_transform,
+    dict(
+        type='YOLOv5MixUp',
+        prob=mixup_prob,
+        pre_transform=[*pre_transform, *mosaic_affine_transform]),
+    *last_transform
+]
+
+train_pipeline_stage2 = [
+    *pre_transform,
+    dict(type='YOLOv5KeepRatioResize', scale=img_scale),
+    dict(
+        type='LetterResize',
+        scale=img_scale,
+        allow_scale_up=True,
+        pad_val=dict(img=114.0)),
+    dict(
+        type='YOLOv5RandomAffine',
+        max_rotate_degree=0.0,
+        max_shear_degree=0.0,
+        scaling_ratio_range=(1 - affine_scale, 1 + affine_scale),
+        max_aspect_ratio=_base_.max_aspect_ratio,
+        border_val=(114, 114, 114),
+        min_area_ratio=_base_.min_area_ratio,
+        use_mask_refine=_base_.use_mask2refine), *last_transform
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+_base_.custom_hooks[1].switch_pipeline = train_pipeline_stage2
--- a/configs/yolov8/yolov8_m_syncbn_fast_8xb16-500e_coco.py
+++ b/configs/yolov8/yolov8_m_syncbn_fast_8xb16-500e_coco.py
@ -9,9 +9,9 @@ affine_scale = 0.9
 mixup_prob = 0.1

 # =======================Unmodified in most cases==================
-num_classes = _base_.num_classes
-num_det_layers = _base_.num_det_layers
 img_scale = _base_.img_scale
+pre_transform = _base_.pre_transform
+last_transform = _base_.last_transform

 model = dict(
    backbone=dict(
@ -28,11 +28,7 @@ model = dict(
            widen_factor=widen_factor,
            in_channels=[256, 512, last_stage_out_channels])))

-pre_transform = _base_.pre_transform
-albu_train_transform = _base_.albu_train_transform
-last_transform = _base_.last_transform
-
-mosaic_affine_pipeline = [
+mosaic_affine_transform = [
    dict(
        type='Mosaic',
        img_scale=img_scale,
@ -51,16 +47,14 @@ mosaic_affine_pipeline = [

 # enable mixup
 train_pipeline = [
-    *pre_transform, *mosaic_affine_pipeline,
+    *pre_transform, *mosaic_affine_transform,
    dict(
        type='YOLOv5MixUp',
        prob=mixup_prob,
-        pre_transform=[*pre_transform, *mosaic_affine_pipeline]),
+        pre_transform=[*pre_transform, *mosaic_affine_transform]),
    *last_transform
 ]

-train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
-
 train_pipeline_stage2 = [
    *pre_transform,
    dict(type='YOLOv5KeepRatioResize', scale=img_scale),
@ -78,16 +72,5 @@ train_pipeline_stage2 = [
        border_val=(114, 114, 114)), *last_transform
 ]

-custom_hooks = [
-    dict(
-        type='EMAHook',
-        ema_type='ExpMomentumEMA',
-        momentum=0.0001,
-        update_buffers=True,
-        strict_load=False,
-        priority=49),
-    dict(
-        type='mmdet.PipelineSwitchHook',
-        switch_epoch=_base_.max_epochs - _base_.close_mosaic_epochs,
-        switch_pipeline=train_pipeline_stage2)
-]
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+_base_.custom_hooks[1].switch_pipeline = train_pipeline_stage2
--- a/configs/yolov8/yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco.py
+++ b/configs/yolov8/yolov8_n_mask-refine_syncbn_fast_8xb16-500e_coco.py
@ -0,0 +1,12 @@
+_base_ = './yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco.py'
+
+# This config will refine bbox by mask while loading annotations and
+# transforming after `YOLOv5RandomAffine`
+
+deepen_factor = 0.33
+widen_factor = 0.25
+
+model = dict(
+    backbone=dict(deepen_factor=deepen_factor, widen_factor=widen_factor),
+    neck=dict(deepen_factor=deepen_factor, widen_factor=widen_factor),
+    bbox_head=dict(head_module=dict(widen_factor=widen_factor)))
--- a/configs/yolov8/yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco.py
+++ b/configs/yolov8/yolov8_s_mask-refine_syncbn_fast_8xb16-500e_coco.py
@ -0,0 +1,83 @@
+_base_ = './yolov8_s_syncbn_fast_8xb16-500e_coco.py'
+
+# This config will refine bbox by mask while loading annotations and
+# transforming after `YOLOv5RandomAffine`
+
+# ========================modified parameters======================
+use_mask2refine = True
+min_area_ratio = 0.01  # YOLOv5RandomAffine
+
+# ===============================Unmodified in most cases====================
+pre_transform = [
+    dict(type='LoadImageFromFile', file_client_args=_base_.file_client_args),
+    dict(
+        type='LoadAnnotations',
+        with_bbox=True,
+        with_mask=True,
+        mask2bbox=use_mask2refine)
+]
+
+last_transform = [
+    # Delete gt_masks to avoid more computation
+    dict(type='RemoveDataElement', keys=['gt_masks']),
+    dict(
+        type='mmdet.Albu',
+        transforms=_base_.albu_train_transforms,
+        bbox_params=dict(
+            type='BboxParams',
+            format='pascal_voc',
+            label_fields=['gt_bboxes_labels', 'gt_ignore_flags']),
+        keymap={
+            'img': 'image',
+            'gt_bboxes': 'bboxes'
+        }),
+    dict(type='YOLOv5HSVRandomAug'),
+    dict(type='mmdet.RandomFlip', prob=0.5),
+    dict(
+        type='mmdet.PackDetInputs',
+        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'flip',
+                   'flip_direction'))
+]
+
+train_pipeline = [
+    *pre_transform,
+    dict(
+        type='Mosaic',
+        img_scale=_base_.img_scale,
+        pad_val=114.0,
+        pre_transform=pre_transform),
+    dict(
+        type='YOLOv5RandomAffine',
+        max_rotate_degree=0.0,
+        max_shear_degree=0.0,
+        scaling_ratio_range=(1 - _base_.affine_scale, 1 + _base_.affine_scale),
+        max_aspect_ratio=_base_.max_aspect_ratio,
+        # img_scale is (width, height)
+        border=(-_base_.img_scale[0] // 2, -_base_.img_scale[1] // 2),
+        border_val=(114, 114, 114),
+        min_area_ratio=min_area_ratio,
+        use_mask_refine=use_mask2refine),
+    *last_transform
+]
+
+train_pipeline_stage2 = [
+    *pre_transform,
+    dict(type='YOLOv5KeepRatioResize', scale=_base_.img_scale),
+    dict(
+        type='LetterResize',
+        scale=_base_.img_scale,
+        allow_scale_up=True,
+        pad_val=dict(img=114.0)),
+    dict(
+        type='YOLOv5RandomAffine',
+        max_rotate_degree=0.0,
+        max_shear_degree=0.0,
+        scaling_ratio_range=(1 - _base_.affine_scale, 1 + _base_.affine_scale),
+        max_aspect_ratio=_base_.max_aspect_ratio,
+        border_val=(114, 114, 114),
+        min_area_ratio=min_area_ratio,
+        use_mask_refine=use_mask2refine), *last_transform
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+_base_.custom_hooks[1].switch_pipeline = train_pipeline_stage2
--- a/configs/yolov8/yolov8_s_syncbn_fast_8xb16-500e_coco.py
+++ b/configs/yolov8/yolov8_s_syncbn_fast_8xb16-500e_coco.py
@ -161,7 +161,7 @@ model = dict(
            eps=1e-9)),
    test_cfg=model_test_cfg)

-albu_train_transform = [
+albu_train_transforms = [
    dict(type='Blur', p=0.01),
    dict(type='MedianBlur', p=0.01),
    dict(type='ToGray', p=0.01),
@ -176,7 +176,7 @@ pre_transform = [
 last_transform = [
    dict(
        type='mmdet.Albu',
-        transforms=albu_train_transform,
+        transforms=albu_train_transforms,
        bbox_params=dict(
            type='BboxParams',
            format='pascal_voc',
--- a/configs/yolov8/yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco.py
+++ b/configs/yolov8/yolov8_x_mask-refine_syncbn_fast_8xb16-500e_coco.py
@ -0,0 +1,13 @@
+_base_ = './yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py'
+
+# This config use refining bbox and `YOLOv5CopyPaste`.
+# Refining bbox means refining bbox by mask while loading annotations and
+# transforming after `YOLOv5RandomAffine`
+
+deepen_factor = 1.00
+widen_factor = 1.25
+
+model = dict(
+    backbone=dict(deepen_factor=deepen_factor, widen_factor=widen_factor),
+    neck=dict(deepen_factor=deepen_factor, widen_factor=widen_factor),
+    bbox_head=dict(head_module=dict(widen_factor=widen_factor)))
--- a/mmyolo/datasets/transforms/init.py
+++ b/mmyolo/datasets/transforms/init.py
@ -1,12 +1,13 @@
 # Copyright (c) OpenMMLab. All rights reserved.
 from .mix_img_transforms import Mosaic, Mosaic9, YOLOv5MixUp, YOLOXMixUp
 from .transforms import (LetterResize, LoadAnnotations, PPYOLOERandomCrop,
-                         PPYOLOERandomDistort, YOLOv5HSVRandomAug,
+                         PPYOLOERandomDistort, RemoveDataElement,
+                         YOLOv5CopyPaste, YOLOv5HSVRandomAug,
                         YOLOv5KeepRatioResize, YOLOv5RandomAffine)

 __all__ = [
    'YOLOv5KeepRatioResize', 'LetterResize', 'Mosaic', 'YOLOXMixUp',
    'YOLOv5MixUp', 'YOLOv5HSVRandomAug', 'LoadAnnotations',
    'YOLOv5RandomAffine', 'PPYOLOERandomDistort', 'PPYOLOERandomCrop',
-    'Mosaic9'
+    'Mosaic9', 'YOLOv5CopyPaste', 'RemoveDataElement'
 ]
--- a/mmyolo/datasets/transforms/mix_img_transforms.py
+++ b/mmyolo/datasets/transforms/mix_img_transforms.py
@ -317,6 +317,8 @@ class Mosaic(BaseMixImageTransform):
        mosaic_bboxes = []
        mosaic_bboxes_labels = []
        mosaic_ignore_flags = []
+        mosaic_masks = []
+        with_mask = True if 'gt_masks' in results else False
        # self.img_scale is wh format
        img_scale_w, img_scale_h = self.img_scale

@ -370,6 +372,20 @@ class Mosaic(BaseMixImageTransform):
            mosaic_bboxes.append(gt_bboxes_i)
            mosaic_bboxes_labels.append(gt_bboxes_labels_i)
            mosaic_ignore_flags.append(gt_ignore_flags_i)
+            if with_mask and results_patch.get('gt_masks', None) is not None:
+                gt_masks_i = results_patch['gt_masks']
+                gt_masks_i = gt_masks_i.rescale(float(scale_ratio_i))
+                gt_masks_i = gt_masks_i.translate(
+                    out_shape=(int(self.img_scale[0] * 2),
+                               int(self.img_scale[1] * 2)),
+                    offset=padw,
+                    direction='horizontal')
+                gt_masks_i = gt_masks_i.translate(
+                    out_shape=(int(self.img_scale[0] * 2),
+                               int(self.img_scale[1] * 2)),
+                    offset=padh,
+                    direction='vertical')
+                mosaic_masks.append(gt_masks_i)

        mosaic_bboxes = mosaic_bboxes[0].cat(mosaic_bboxes, 0)
        mosaic_bboxes_labels = np.concatenate(mosaic_bboxes_labels, 0)
@ -377,6 +393,9 @@ class Mosaic(BaseMixImageTransform):

        if self.bbox_clip_border:
            mosaic_bboxes.clip_([2 * img_scale_h, 2 * img_scale_w])
+            if with_mask:
+                mosaic_masks = mosaic_masks[0].cat(mosaic_masks)
+                results['gt_masks'] = mosaic_masks
        else:
            # remove outside bboxes
            inside_inds = mosaic_bboxes.is_inside(
@ -384,12 +403,16 @@ class Mosaic(BaseMixImageTransform):
            mosaic_bboxes = mosaic_bboxes[inside_inds]
            mosaic_bboxes_labels = mosaic_bboxes_labels[inside_inds]
            mosaic_ignore_flags = mosaic_ignore_flags[inside_inds]
+            if with_mask:
+                mosaic_masks = mosaic_masks[0].cat(mosaic_masks)[inside_inds]
+                results['gt_masks'] = mosaic_masks

        results['img'] = mosaic_img
        results['img_shape'] = mosaic_img.shape
        results['gt_bboxes'] = mosaic_bboxes
        results['gt_bboxes_labels'] = mosaic_bboxes_labels
        results['gt_ignore_flags'] = mosaic_ignore_flags
+
        return results

    def _mosaic_combine(
@ -876,6 +899,11 @@ class YOLOv5MixUp(BaseMixImageTransform):
            (results['gt_bboxes_labels'], retrieve_gt_bboxes_labels), axis=0)
        mixup_gt_ignore_flags = np.concatenate(
            (results['gt_ignore_flags'], retrieve_gt_ignore_flags), axis=0)
+        if 'gt_masks' in results:
+            assert 'gt_masks' in retrieve_results
+            mixup_gt_masks = results['gt_masks'].cat(
+                [results['gt_masks'], retrieve_results['gt_masks']])
+            results['gt_masks'] = mixup_gt_masks

        results['img'] = mixup_img.astype(np.uint8)
        results['img_shape'] = mixup_img.shape
--- a/mmyolo/datasets/transforms/transforms.py
+++ b/mmyolo/datasets/transforms/transforms.py
@ -1,6 +1,7 @@
 # Copyright (c) OpenMMLab. All rights reserved.
 import math
-from typing import List, Tuple, Union
+from copy import deepcopy
+from typing import List, Sequence, Tuple, Union

 import cv2
 import mmcv
@ -12,6 +13,7 @@ from mmdet.datasets.transforms import LoadAnnotations as MMDET_LoadAnnotations
 from mmdet.datasets.transforms import Resize as MMDET_Resize
 from mmdet.structures.bbox import (HorizontalBoxes, autocast_box_type,
                                   get_box_type)
+from mmdet.structures.mask import PolygonMasks
 from numpy import random

 from mmyolo.registry import TRANSFORMS
@ -240,7 +242,7 @@ class LetterResize(MMDET_Resize):
        results['img_shape'] = image.shape
        if 'pad_param' in results:
            results['pad_param_origin'] = results['pad_param'] * \
-                np.repeat(ratio, 2)
+                                          np.repeat(ratio, 2)
        results['pad_param'] = np.array(padding_list, dtype=np.float32)

    def _resize_masks(self, results: dict):
@ -248,32 +250,29 @@ class LetterResize(MMDET_Resize):
        if results.get('gt_masks', None) is None:
            return

-        # resize the gt_masks
-        gt_mask_height = results['gt_masks'].height * \
-            results['scale_factor'][1]
-        gt_mask_width = results['gt_masks'].width * \
-            results['scale_factor'][0]
-        gt_masks = results['gt_masks'].resize(
-            (int(round(gt_mask_height)), int(round(gt_mask_width))))
+        gt_masks = results['gt_masks']
+        assert isinstance(
+            gt_masks, PolygonMasks
+        ), f'Only supports PolygonMasks, but got {type(gt_masks)}'

-        # padding the gt_masks
-        if len(gt_masks) == 0:
-            padded_masks = np.empty((0, *results['img_shape'][:2]),
-                                    dtype=np.uint8)
-        else:
-            # TODO: The function is incorrect. Because the mask may not
-            #  be able to pad.
-            padded_masks = np.stack([
-                mmcv.impad(
-                    mask,
-                    padding=(int(results['pad_param'][2]),
-                             int(results['pad_param'][0]),
-                             int(results['pad_param'][3]),
-                             int(results['pad_param'][1])),
-                    pad_val=self.pad_val.get('masks', 0)) for mask in gt_masks
-            ])
-        results['gt_masks'] = type(results['gt_masks'])(
-            padded_masks, *results['img_shape'][:2])
+        # resize the gt_masks
+        gt_mask_h = results['gt_masks'].height * results['scale_factor'][1]
+        gt_mask_w = results['gt_masks'].width * results['scale_factor'][0]
+        gt_masks = results['gt_masks'].resize(
+            (int(round(gt_mask_h)), int(round(gt_mask_w))))
+
+        top_padding, _, left_padding, _ = results['pad_param']
+        if int(left_padding) != 0:
+            gt_masks = gt_masks.translate(
+                out_shape=results['img_shape'][:2],
+                offset=int(left_padding),
+                direction='horizontal')
+        if int(top_padding) != 0:
+            gt_masks = gt_masks.translate(
+                out_shape=results['img_shape'][:2],
+                offset=int(top_padding),
+                direction='vertical')
+        results['gt_masks'] = gt_masks

    def _resize_bboxes(self, results: dict):
        """Resize bounding boxes with ``results['scale_factor']``."""
@ -356,19 +355,74 @@ class YOLOv5HSVRandomAug(BaseTransform):
        results['img'] = cv2.cvtColor(im_hsv, cv2.COLOR_HSV2BGR)
        return results

+    def __repr__(self) -> str:
+        repr_str = self.__class__.__name__
+        repr_str += f'(hue_delta={self.hue_delta}, '
+        repr_str += f'saturation_delta={self.saturation_delta}, '
+        repr_str += f'value_delta={self.value_delta})'
+        return repr_str
+

-# TODO: can be accelerated
@TRANSFORMS.register_module()
 class LoadAnnotations(MMDET_LoadAnnotations):
    """Because the yolo series does not need to consider ignore bboxes for the
    time being, in order to speed up the pipeline, it can be excluded in
    advance."""

+    def __init__(self,
+                 mask2bbox: bool = False,
+                 poly2mask: bool = False,
+                 **kwargs) -> None:
+        self.mask2bbox = mask2bbox
+        assert not poly2mask, 'Does not support BitmapMasks considering ' \
+                              'that bitmap consumes more memory.'
+        super().__init__(poly2mask=poly2mask, **kwargs)
+        if self.mask2bbox:
+            assert self.with_mask, 'Using mask2bbox requires ' \
+                                   'with_mask is True.'
+        self._mask_ignore_flag = None
+
+    def transform(self, results: dict) -> dict:
+        """Function to load multiple types annotations.
+
+        Args:
+            results (dict): Result dict from :obj:``mmengine.BaseDataset``.
+
+        Returns:
+            dict: The dict contains loaded bounding box, label and
+            semantic segmentation.
+        """
+        if self.mask2bbox:
+            self._load_masks(results)
+            if self.with_label:
+                self._load_labels(results)
+                self._update_mask_ignore_data(results)
+            gt_bboxes = results['gt_masks'].get_bboxes(dst_type='hbox')
+            results['gt_bboxes'] = gt_bboxes
+        else:
+            results = super().transform(results)
+            self._update_mask_ignore_data(results)
+        return results
+
+    def _update_mask_ignore_data(self, results: dict) -> None:
+        if 'gt_masks' not in results:
+            return
+
+        if 'gt_bboxes_labels' in results and len(
+                results['gt_bboxes_labels']) != len(results['gt_masks']):
+            assert len(results['gt_bboxes_labels']) == len(
+                self._mask_ignore_flag)
+            results['gt_bboxes_labels'] = results['gt_bboxes_labels'][
+                self._mask_ignore_flag]
+
+        if 'gt_bboxes' in results and len(results['gt_bboxes']) != len(
+                results['gt_masks']):
+            assert len(results['gt_bboxes']) == len(self._mask_ignore_flag)
+            results['gt_bboxes'] = results['gt_bboxes'][self._mask_ignore_flag]
+
    def _load_bboxes(self, results: dict):
        """Private function to load bounding box annotations.
-
        Note: BBoxes with ignore_flag of 1 is not considered.
-
        Args:
            results (dict): Result dict from :obj:``mmengine.BaseDataset``.

@ -394,10 +448,8 @@ class LoadAnnotations(MMDET_LoadAnnotations):
        """Private function to load label annotations.

        Note: BBoxes with ignore_flag of 1 is not considered.
-
        Args:
            results (dict): Result dict from :obj:``mmengine.BaseDataset``.
-
        Returns:
            dict: The dict contains loaded label annotations.
        """
@ -408,14 +460,72 @@ class LoadAnnotations(MMDET_LoadAnnotations):
        results['gt_bboxes_labels'] = np.array(
            gt_bboxes_labels, dtype=np.int64)

+    def _load_masks(self, results: dict) -> None:
+        """Private function to load mask annotations.
+
+        Args:
+            results (dict): Result dict from :obj:``mmengine.BaseDataset``.
+        """
+        gt_masks = []
+        gt_ignore_flags = []
+        self._mask_ignore_flag = []
+        for instance in results.get('instances', []):
+            if instance['ignore_flag'] == 0:
+                if 'mask' in instance:
+                    gt_mask = instance['mask']
+                    if isinstance(gt_mask, list):
+                        gt_mask = [
+                            np.array(polygon) for polygon in gt_mask
+                            if len(polygon) % 2 == 0 and len(polygon) >= 6
+                        ]
+                        if len(gt_mask) == 0:
+                            # ignore
+                            self._mask_ignore_flag.append(0)
+                        else:
+                            gt_masks.append(gt_mask)
+                            gt_ignore_flags.append(instance['ignore_flag'])
+                            self._mask_ignore_flag.append(1)
+                    else:
+                        raise NotImplementedError(
+                            'Only supports mask annotations in polygon '
+                            'format currently')
+                else:
+                    # TODO: Actually, gt with bbox and without mask needs
+                    #  to be retained
+                    self._mask_ignore_flag.append(0)
+        self._mask_ignore_flag = np.array(self._mask_ignore_flag, dtype=bool)
+        results['gt_ignore_flags'] = np.array(gt_ignore_flags, dtype=bool)
+
+        h, w = results['ori_shape']
+        gt_masks = PolygonMasks([mask for mask in gt_masks], h, w)
+        results['gt_masks'] = gt_masks
+
+    def __repr__(self) -> str:
+        repr_str = self.__class__.__name__
+        repr_str += f'(with_bbox={self.with_bbox}, '
+        repr_str += f'with_label={self.with_label}, '
+        repr_str += f'with_mask={self.with_mask}, '
+        repr_str += f'with_seg={self.with_seg}, '
+        repr_str += f'mask2bbox={self.mask2bbox}, '
+        repr_str += f'poly2mask={self.poly2mask}, '
+        repr_str += f"imdecode_backend='{self.imdecode_backend}', "
+        repr_str += f'file_client_args={self.file_client_args})'
+        return repr_str
+

@TRANSFORMS.register_module()
 class YOLOv5RandomAffine(BaseTransform):
-    """Random affine transform data augmentation in YOLOv5. It is different
-    from the implementation in YOLOX.
+    """Random affine transform data augmentation in YOLOv5 and YOLOv8. It is
+    different from the implementation in YOLOX.

    This operation randomly generates affine transform matrix which including
    rotation, translation, shear and scaling transforms.
+    If you set use_mask_refine == True, the code will use the masks
+    annotation to refine the bbox.
+    Our implementation is slightly different from the official. In COCO
+    dataset, a gt may have multiple mask tags.  The official YOLOv5
+    annotation file already combines the masks that an object has,
+    but our code takes into account the fact that an object has multiple masks.

    Required Keys:

@ -423,6 +533,7 @@ class YOLOv5RandomAffine(BaseTransform):
    - gt_bboxes (BaseBoxes[torch.float32]) (optional)
    - gt_bboxes_labels (np.int64) (optional)
    - gt_ignore_flags (bool) (optional)
+    - gt_masks (PolygonMasks) (optional)

    Modified Keys:

@ -431,6 +542,7 @@ class YOLOv5RandomAffine(BaseTransform):
    - gt_bboxes (optional)
    - gt_bboxes_labels (optional)
    - gt_ignore_flags (optional)
+    - gt_masks (PolygonMasks) (optional)

    Args:
        max_rotate_degree (float): Maximum degrees of rotation transform.
@ -456,9 +568,11 @@ class YOLOv5RandomAffine(BaseTransform):
        min_area_ratio (float): Threshold of area ratio between
            original bboxes and wrapped bboxes. If smaller than this value,
            the box will be removed. Defaults to 0.1.
+        use_mask_refine (bool): Whether to refine bbox by mask.
        max_aspect_ratio (float): Aspect ratio of width and height
            threshold to filter bboxes. If max(h/w, w/h) larger than this
            value, the box will be removed. Defaults to 20.
+        resample_num (int): Number of poly to resample to.
    """

    def __init__(self,
@ -471,7 +585,9 @@ class YOLOv5RandomAffine(BaseTransform):
                 bbox_clip_border: bool = True,
                 min_bbox_size: int = 2,
                 min_area_ratio: float = 0.1,
-                 max_aspect_ratio: int = 20):
+                 use_mask_refine: bool = False,
+                 max_aspect_ratio: float = 20.,
+                 resample_num: int = 1000):
        assert 0 <= max_translate_ratio <= 1
        assert scaling_ratio_range[0] <= scaling_ratio_range[1]
        assert scaling_ratio_range[0] > 0
@ -483,9 +599,200 @@ class YOLOv5RandomAffine(BaseTransform):
        self.border_val = border_val
        self.bbox_clip_border = bbox_clip_border
        self.min_bbox_size = min_bbox_size
-        self.min_bbox_size = min_bbox_size
        self.min_area_ratio = min_area_ratio
+        self.use_mask_refine = use_mask_refine
        self.max_aspect_ratio = max_aspect_ratio
+        self.resample_num = resample_num
+
+    @autocast_box_type()
+    def transform(self, results: dict) -> dict:
+        """The YOLOv5 random affine transform function.
+
+        Args:
+            results (dict): The result dict.
+
+        Returns:
+            dict: The result dict.
+        """
+        img = results['img']
+        # self.border is wh format
+        height = img.shape[0] + self.border[1] * 2
+        width = img.shape[1] + self.border[0] * 2
+
+        # Note: Different from YOLOX
+        center_matrix = np.eye(3, dtype=np.float32)
+        center_matrix[0, 2] = -img.shape[1] / 2
+        center_matrix[1, 2] = -img.shape[0] / 2
+
+        warp_matrix, scaling_ratio = self._get_random_homography_matrix(
+            height, width)
+        warp_matrix = warp_matrix @ center_matrix
+
+        img = cv2.warpPerspective(
+            img,
+            warp_matrix,
+            dsize=(width, height),
+            borderValue=self.border_val)
+        results['img'] = img
+        results['img_shape'] = img.shape
+        img_h, img_w = img.shape[:2]
+
+        bboxes = results['gt_bboxes']
+        num_bboxes = len(bboxes)
+        if num_bboxes:
+            orig_bboxes = bboxes.clone()
+            if self.use_mask_refine and 'gt_masks' in results:
+                # If the dataset has annotations of mask,
+                # the mask will be used to refine bbox.
+                gt_masks = results['gt_masks']
+
+                gt_masks_resample = self.resample_masks(gt_masks)
+                gt_masks = self.warp_mask(gt_masks_resample, warp_matrix,
+                                          img_h, img_w)
+
+                # refine bboxes by masks
+                bboxes = gt_masks.get_bboxes(dst_type='hbox')
+                # filter bboxes outside image
+                valid_index = self.filter_gt_bboxes(orig_bboxes,
+                                                    bboxes).numpy()
+                results['gt_masks'] = gt_masks[valid_index]
+            else:
+                bboxes.project_(warp_matrix)
+                if self.bbox_clip_border:
+                    bboxes.clip_([height, width])
+
+                # filter bboxes
+                orig_bboxes.rescale_([scaling_ratio, scaling_ratio])
+
+                # Be careful: valid_index must convert to numpy,
+                # otherwise it will raise out of bounds when len(valid_index)=1
+                valid_index = self.filter_gt_bboxes(orig_bboxes,
+                                                    bboxes).numpy()
+                if 'gt_masks' in results:
+                    results['gt_masks'] = PolygonMasks(
+                        results['gt_masks'].masks, img_h, img_w)
+
+            results['gt_bboxes'] = bboxes[valid_index]
+            results['gt_bboxes_labels'] = results['gt_bboxes_labels'][
+                valid_index]
+            results['gt_ignore_flags'] = results['gt_ignore_flags'][
+                valid_index]
+
+        return results
+
+    @staticmethod
+    def warp_poly(poly: np.ndarray, warp_matrix: np.ndarray, img_w: int,
+                  img_h: int) -> np.ndarray:
+        """Function to warp one mask and filter points outside image.
+
+        Args:
+            poly (np.ndarray): Segmentation annotation with shape (n, ) and
+                with format (x1, y1, x2, y2, ...).
+            warp_matrix (np.ndarray): Affine transformation matrix.
+                Shape: (3, 3).
+            img_w (int): Width of output image.
+            img_h (int): Height of output image.
+        """
+        # TODO: Current logic may cause retained masks unusable for
+        #  semantic segmentation training, which is same as official
+        #  implementation.
+        poly = poly.reshape((-1, 2))
+        poly = np.concatenate((poly, np.ones(
+            (len(poly), 1), dtype=poly.dtype)),
+                              axis=-1)
+        # transform poly
+        poly = poly @ warp_matrix.T
+        poly = poly[:, :2] / poly[:, 2:3]
+
+        # filter point outside image
+        x, y = poly.T
+        valid_ind_point = (x >= 0) & (y >= 0) & (x <= img_w) & (y <= img_h)
+        return poly[valid_ind_point].reshape(-1)
+
+    def warp_mask(self, gt_masks: PolygonMasks, warp_matrix: np.ndarray,
+                  img_w: int, img_h: int) -> PolygonMasks:
+        """Warp masks by warp_matrix and retain masks inside image after
+        warping.
+
+        Args:
+            gt_masks (PolygonMasks): Annotations of semantic segmentation.
+            warp_matrix (np.ndarray): Affine transformation matrix.
+                Shape: (3, 3).
+            img_w (int): Width of output image.
+            img_h (int): Height of output image.
+
+        Returns:
+            PolygonMasks: Masks after warping.
+        """
+        masks = gt_masks.masks
+
+        new_masks = []
+        for poly_per_obj in masks:
+            warpped_poly_per_obj = []
+            # One gt may have multiple masks.
+            for poly in poly_per_obj:
+                valid_poly = self.warp_poly(poly, warp_matrix, img_w, img_h)
+                if len(valid_poly):
+                    warpped_poly_per_obj.append(valid_poly.reshape(-1))
+            # If all the masks are invalid,
+            # add [0, 0, 0, 0, 0, 0,] here.
+            if not warpped_poly_per_obj:
+                # This will be filtered in function `filter_gt_bboxes`.
+                warpped_poly_per_obj = [
+                    np.zeros(6, dtype=poly_per_obj[0].dtype)
+                ]
+            new_masks.append(warpped_poly_per_obj)
+
+        gt_masks = PolygonMasks(new_masks, img_h, img_w)
+        return gt_masks
+
+    def resample_masks(self, gt_masks: PolygonMasks) -> PolygonMasks:
+        """Function to resample each mask annotation with shape (2 * n, ) to
+        shape (resample_num * 2, ).
+
+        Args:
+            gt_masks (PolygonMasks): Annotations of semantic segmentation.
+        """
+        masks = gt_masks.masks
+        new_masks = []
+        for poly_per_obj in masks:
+            resample_poly_per_obj = []
+            for poly in poly_per_obj:
+                poly = poly.reshape((-1, 2))  # xy
+                poly = np.concatenate((poly, poly[0:1, :]), axis=0)
+                x = np.linspace(0, len(poly) - 1, self.resample_num)
+                xp = np.arange(len(poly))
+                poly = np.concatenate([
+                    np.interp(x, xp, poly[:, i]) for i in range(2)
+                ]).reshape(2, -1).T.reshape(-1)
+                resample_poly_per_obj.append(poly)
+            new_masks.append(resample_poly_per_obj)
+        return PolygonMasks(new_masks, gt_masks.height, gt_masks.width)
+
+    def filter_gt_bboxes(self, origin_bboxes: HorizontalBoxes,
+                         wrapped_bboxes: HorizontalBoxes) -> torch.Tensor:
+        """Filter gt bboxes.
+
+        Args:
+            origin_bboxes (HorizontalBoxes): Origin bboxes.
+            wrapped_bboxes (HorizontalBoxes): Wrapped bboxes
+
+        Returns:
+            dict: The result dict.
+        """
+        origin_w = origin_bboxes.widths
+        origin_h = origin_bboxes.heights
+        wrapped_w = wrapped_bboxes.widths
+        wrapped_h = wrapped_bboxes.heights
+        aspect_ratio = np.maximum(wrapped_w / (wrapped_h + 1e-16),
+                                  wrapped_h / (wrapped_w + 1e-16))
+
+        wh_valid_idx = (wrapped_w > self.min_bbox_size) & \
+                       (wrapped_h > self.min_bbox_size)
+        area_valid_idx = wrapped_w * wrapped_h / (origin_w * origin_h +
+                                                  1e-16) > self.min_area_ratio
+        aspect_ratio_valid_idx = aspect_ratio < self.max_aspect_ratio
+        return wh_valid_idx & area_valid_idx & aspect_ratio_valid_idx

    @cache_randomness
    def _get_random_homography_matrix(self, height: int,
@ -527,99 +834,6 @@ class YOLOv5RandomAffine(BaseTransform):
            translate_matrix @ shear_matrix @ rotation_matrix @ scaling_matrix)
        return warp_matrix, scaling_ratio

-    @autocast_box_type()
-    def transform(self, results: dict) -> dict:
-        """The YOLOv5 random affine transform function.
-
-        Args:
-            results (dict): The result dict.
-
-        Returns:
-            dict: The result dict.
-        """
-        img = results['img']
-        # self.border is wh format
-        height = img.shape[0] + self.border[1] * 2
-        width = img.shape[1] + self.border[0] * 2
-
-        # Note: Different from YOLOX
-        center_matrix = np.eye(3, dtype=np.float32)
-        center_matrix[0, 2] = -img.shape[1] / 2
-        center_matrix[1, 2] = -img.shape[0] / 2
-
-        warp_matrix, scaling_ratio = self._get_random_homography_matrix(
-            height, width)
-        warp_matrix = warp_matrix @ center_matrix
-
-        img = cv2.warpPerspective(
-            img,
-            warp_matrix,
-            dsize=(width, height),
-            borderValue=self.border_val)
-        results['img'] = img
-        results['img_shape'] = img.shape
-
-        bboxes = results['gt_bboxes']
-        num_bboxes = len(bboxes)
-        if num_bboxes:
-            orig_bboxes = bboxes.clone()
-
-            bboxes.project_(warp_matrix)
-            if self.bbox_clip_border:
-                bboxes.clip_([height, width])
-
-            # filter bboxes
-            orig_bboxes.rescale_([scaling_ratio, scaling_ratio])
-
-            # Be careful: valid_index must convert to numpy,
-            # otherwise it will raise out of bounds when len(valid_index)=1
-            valid_index = self.filter_gt_bboxes(orig_bboxes, bboxes).numpy()
-            results['gt_bboxes'] = bboxes[valid_index]
-            results['gt_bboxes_labels'] = results['gt_bboxes_labels'][
-                valid_index]
-            results['gt_ignore_flags'] = results['gt_ignore_flags'][
-                valid_index]
-
-            if 'gt_masks' in results:
-                raise NotImplementedError('RandomAffine only supports bbox.')
-        return results
-
-    def filter_gt_bboxes(self, origin_bboxes: HorizontalBoxes,
-                         wrapped_bboxes: HorizontalBoxes) -> torch.Tensor:
-        """Filter gt bboxes.
-
-        Args:
-            origin_bboxes (HorizontalBoxes): Origin bboxes.
-            wrapped_bboxes (HorizontalBoxes): Wrapped bboxes
-
-        Returns:
-            dict: The result dict.
-        """
-        origin_w = origin_bboxes.widths
-        origin_h = origin_bboxes.heights
-        wrapped_w = wrapped_bboxes.widths
-        wrapped_h = wrapped_bboxes.heights
-        aspect_ratio = np.maximum(wrapped_w / (wrapped_h + 1e-16),
-                                  wrapped_h / (wrapped_w + 1e-16))
-
-        wh_valid_idx = (wrapped_w > self.min_bbox_size) & \
-                       (wrapped_h > self.min_bbox_size)
-        area_valid_idx = wrapped_w * wrapped_h / (origin_w * origin_h +
-                                                  1e-16) > self.min_area_ratio
-        aspect_ratio_valid_idx = aspect_ratio < self.max_aspect_ratio
-        return wh_valid_idx & area_valid_idx & aspect_ratio_valid_idx
-
-    def __repr__(self) -> str:
-        repr_str = self.__class__.__name__
-        repr_str += f'(max_rotate_degree={self.max_rotate_degree}, '
-        repr_str += f'max_translate_ratio={self.max_translate_ratio}, '
-        repr_str += f'scaling_ratio_range={self.scaling_ratio_range}, '
-        repr_str += f'max_shear_degree={self.max_shear_degree}, '
-        repr_str += f'border={self.border}, '
-        repr_str += f'border_val={self.border_val}, '
-        repr_str += f'bbox_clip_border={self.bbox_clip_border})'
-        return repr_str
-
    @staticmethod
    def _get_rotation_matrix(rotate_degrees: float) -> np.ndarray:
        """Get rotation matrix.
@ -686,6 +900,17 @@ class YOLOv5RandomAffine(BaseTransform):
                                      dtype=np.float32)
        return translation_matrix

+    def __repr__(self) -> str:
+        repr_str = self.__class__.__name__
+        repr_str += f'(max_rotate_degree={self.max_rotate_degree}, '
+        repr_str += f'max_translate_ratio={self.max_translate_ratio}, '
+        repr_str += f'scaling_ratio_range={self.scaling_ratio_range}, '
+        repr_str += f'max_shear_degree={self.max_shear_degree}, '
+        repr_str += f'border={self.border}, '
+        repr_str += f'border_val={self.border_val}, '
+        repr_str += f'bbox_clip_border={self.bbox_clip_border})'
+        return repr_str
+

@TRANSFORMS.register_module()
 class PPYOLOERandomDistort(BaseTransform):
@ -723,7 +948,7 @@ class PPYOLOERandomDistort(BaseTransform):
        self.contrast_cfg = contrast_cfg
        self.brightness_cfg = brightness_cfg
        self.num_distort_func = num_distort_func
-        assert 0 < self.num_distort_func <= 4,\
+        assert 0 < self.num_distort_func <= 4, \
            'num_distort_func must > 0 and <= 4'
        for cfg in [
                self.hue_cfg, self.saturation_cfg, self.contrast_cfg,
@ -809,6 +1034,15 @@ class PPYOLOERandomDistort(BaseTransform):
            results = func(results)
        return results

+    def __repr__(self) -> str:
+        repr_str = self.__class__.__name__
+        repr_str += f'(hue_cfg={self.hue_cfg}, '
+        repr_str += f'saturation_cfg={self.saturation_cfg}, '
+        repr_str += f'contrast_cfg={self.contrast_cfg}, '
+        repr_str += f'brightness_cfg={self.brightness_cfg}, '
+        repr_str += f'num_distort_func={self.num_distort_func})'
+        return repr_str
+

@TRANSFORMS.register_module()
 class PPYOLOERandomCrop(BaseTransform):
@ -837,7 +1071,7 @@ class PPYOLOERandomCrop(BaseTransform):
    Args:
        aspect_ratio (List[float]): Aspect ratio of cropped region. Default to
             [.5, 2].
-        thresholds (List[float]): Iou thresholds for decide a valid bbox crop
+        thresholds (List[float]): Iou thresholds for deciding a valid bbox crop
            in [min, max] format. Defaults to [.0, .1, .3, .5, .7, .9].
        scaling (List[float]): Ratio between a cropped region and the original
            image in [min, max] format. Default to [.3, 1.].
@ -1079,3 +1313,194 @@ class PPYOLOERandomCrop(BaseTransform):
            valid, (cropped_box[:, :2] < cropped_box[:, 2:]).all(axis=1))

        return np.where(valid)[0]
+
+    def __repr__(self) -> str:
+        repr_str = self.__class__.__name__
+        repr_str += f'(aspect_ratio={self.aspect_ratio}, '
+        repr_str += f'thresholds={self.thresholds}, '
+        repr_str += f'scaling={self.scaling}, '
+        repr_str += f'num_attempts={self.num_attempts}, '
+        repr_str += f'allow_no_crop={self.allow_no_crop}, '
+        repr_str += f'cover_all_box={self.cover_all_box})'
+        return repr_str
+
+
+@TRANSFORMS.register_module()
+class YOLOv5CopyPaste(BaseTransform):
+    """Copy-Paste used in YOLOv5 and YOLOv8.
+
+    This transform randomly copy some objects in the image to the mirror
+    position of the image.It is different from the `CopyPaste` in mmdet.
+
+    Required Keys:
+
+    - img (np.uint8)
+    - gt_bboxes (BaseBoxes[torch.float32])
+    - gt_bboxes_labels (np.int64) (optional)
+    - gt_ignore_flags (bool) (optional)
+    - gt_masks (PolygonMasks) (optional)
+
+    Modified Keys:
+
+    - img
+    - gt_bboxes
+    - gt_bboxes_labels (np.int64) (optional)
+    - gt_ignore_flags (optional)
+    - gt_masks (optional)
+
+    Args:
+        ioa_thresh (float): Ioa thresholds for deciding valid bbox.
+        prob (float): Probability of choosing objects.
+            Defaults to 0.5.
+    """
+
+    def __init__(self, ioa_thresh: float = 0.3, prob: float = 0.5):
+        self.ioa_thresh = ioa_thresh
+        self.prob = prob
+
+    @autocast_box_type()
+    def transform(self, results: dict) -> Union[dict, None]:
+        """The YOLOv5 and YOLOv8 Copy-Paste transform function.
+
+        Args:
+            results (dict): The result dict.
+
+        Returns:
+            dict: The result dict.
+        """
+        if len(results.get('gt_masks', [])) == 0:
+            return results
+        gt_masks = results['gt_masks']
+        assert isinstance(gt_masks, PolygonMasks),\
+            'only support type of PolygonMasks,' \
+            ' but get type: %s' % type(gt_masks)
+        gt_bboxes = results['gt_bboxes']
+        gt_bboxes_labels = results.get('gt_bboxes_labels', None)
+        img = results['img']
+        img_h, img_w = img.shape[:2]
+
+        # calculate ioa
+        gt_bboxes_flip = deepcopy(gt_bboxes)
+        gt_bboxes_flip.flip_(img.shape)
+
+        ioa = self.bbox_ioa(gt_bboxes_flip, gt_bboxes)
+        indexes = torch.nonzero((ioa < self.ioa_thresh).all(1))[:, 0]
+        n = len(indexes)
+        valid_inds = random.choice(
+            indexes, size=round(self.prob * n), replace=False)
+        if len(valid_inds) == 0:
+            return results
+
+        if gt_bboxes_labels is not None:
+            # prepare labels
+            gt_bboxes_labels = np.concatenate(
+                (gt_bboxes_labels, gt_bboxes_labels[valid_inds]), axis=0)
+
+        # prepare bboxes
+        copypaste_bboxes = gt_bboxes_flip[valid_inds]
+        gt_bboxes = gt_bboxes.cat([gt_bboxes, copypaste_bboxes])
+
+        # prepare images
+        copypaste_gt_masks = gt_masks[valid_inds]
+        copypaste_gt_masks_flip = copypaste_gt_masks.flip()
+        # convert poly format to bitmap format
+        # example: poly: [[array(0.0, 0.0, 10.0, 0.0, 10.0, 10.0, 0.0, 10.0]]
+        #  -> bitmap: a mask with shape equal to (1, img_h, img_w)
+        # # type1 low speed
+        # copypaste_gt_masks_bitmap = copypaste_gt_masks.to_ndarray()
+        # copypaste_mask = np.sum(copypaste_gt_masks_bitmap, axis=0) > 0
+
+        # type2
+        copypaste_mask = np.zeros((img_h, img_w), dtype=np.uint8)
+        for poly in copypaste_gt_masks.masks:
+            poly = [i.reshape((-1, 1, 2)).astype(np.int32) for i in poly]
+            cv2.drawContours(copypaste_mask, poly, -1, (1, ), cv2.FILLED)
+
+        copypaste_mask = copypaste_mask.astype(bool)
+
+        # copy objects, and paste to the mirror position of the image
+        copypaste_mask_flip = mmcv.imflip(
+            copypaste_mask, direction='horizontal')
+        copypaste_img = mmcv.imflip(img, direction='horizontal')
+        img[copypaste_mask_flip] = copypaste_img[copypaste_mask_flip]
+
+        # prepare masks
+        gt_masks = copypaste_gt_masks.cat([gt_masks, copypaste_gt_masks_flip])
+
+        if 'gt_ignore_flags' in results:
+            # prepare gt_ignore_flags
+            gt_ignore_flags = results['gt_ignore_flags']
+            gt_ignore_flags = np.concatenate(
+                [gt_ignore_flags, gt_ignore_flags[valid_inds]], axis=0)
+            results['gt_ignore_flags'] = gt_ignore_flags
+
+        results['img'] = img
+        results['gt_bboxes'] = gt_bboxes
+        if gt_bboxes_labels is not None:
+            results['gt_bboxes_labels'] = gt_bboxes_labels
+        results['gt_masks'] = gt_masks
+
+        return results
+
+    @staticmethod
+    def bbox_ioa(gt_bboxes_flip: HorizontalBoxes,
+                 gt_bboxes: HorizontalBoxes,
+                 eps: float = 1e-7) -> np.ndarray:
+        """Calculate ioa between gt_bboxes_flip and gt_bboxes.
+
+        Args:
+            gt_bboxes_flip (HorizontalBoxes): Flipped ground truth
+                bounding boxes.
+            gt_bboxes (HorizontalBoxes): Ground truth bounding boxes.
+            eps (float): Default to 1e-10.
+        Return:
+            (Tensor): Ioa.
+        """
+        gt_bboxes_flip = gt_bboxes_flip.tensor
+        gt_bboxes = gt_bboxes.tensor
+
+        # Get the coordinates of bounding boxes
+        b1_x1, b1_y1, b1_x2, b1_y2 = gt_bboxes_flip.T
+        b2_x1, b2_y1, b2_x2, b2_y2 = gt_bboxes.T
+
+        # Intersection area
+        inter_area = (torch.minimum(b1_x2[:, None],
+                                    b2_x2) - torch.maximum(b1_x1[:, None],
+                                                           b2_x1)).clip(0) * \
+                     (torch.minimum(b1_y2[:, None],
+                                    b2_y2) - torch.maximum(b1_y1[:, None],
+                                                           b2_y1)).clip(0)
+
+        # box2 area
+        box2_area = (b2_x2 - b2_x1) * (b2_y2 - b2_y1) + eps
+
+        # Intersection over box2 area
+        return inter_area / box2_area
+
+    def __repr__(self) -> str:
+        repr_str = self.__class__.__name__
+        repr_str += f'(ioa_thresh={self.ioa_thresh},'
+        repr_str += f'prob={self.prob})'
+        return repr_str
+
+
+@TRANSFORMS.register_module()
+class RemoveDataElement(BaseTransform):
+    """Remove unnecessary data element in results.
+
+    Args:
+        keys (Union[str, Sequence[str]]): Keys need to be removed.
+    """
+
+    def __init__(self, keys: Union[str, Sequence[str]]):
+        self.keys = [keys] if isinstance(keys, str) else keys
+
+    def transform(self, results: dict) -> dict:
+        for key in self.keys:
+            results.pop(key, None)
+        return results
+
+    def __repr__(self) -> str:
+        repr_str = self.__class__.__name__
+        repr_str += f'(keys={self.keys})'
+        return repr_str
--- a/tests/test_datasets/test_transforms/test_mix_img_transforms.py
+++ b/tests/test_datasets/test_transforms/test_mix_img_transforms.py
@ -6,7 +6,7 @@ import unittest
 import numpy as np
 import torch
 from mmdet.structures.bbox import HorizontalBoxes
-from mmdet.structures.mask import BitmapMasks
+from mmdet.structures.mask import BitmapMasks, PolygonMasks

 from mmyolo.datasets import YOLOv5CocoDataset
 from mmyolo.datasets.transforms import Mosaic, Mosaic9, YOLOv5MixUp, YOLOXMixUp
@ -23,7 +23,6 @@ class TestMosaic(unittest.TestCase):
        TestCase calls functions in this order: setUp() -> testMethod() ->
        tearDown() -> cleanUp()
        """
-        rng = np.random.RandomState(0)
        self.pre_transform = [
            dict(
                type='LoadImageFromFile',
@ -49,8 +48,6 @@ class TestMosaic(unittest.TestCase):
                     dtype=np.float32),
            'gt_ignore_flags':
            np.array([0, 0, 1], dtype=bool),
-            'gt_masks':
-            BitmapMasks(rng.rand(3, 224, 224), height=224, width=224),
            'dataset':
            self.dataset
        }
@ -107,6 +104,48 @@ class TestMosaic(unittest.TestCase):
        self.assertTrue(results['gt_bboxes'].dtype == torch.float32)
        self.assertTrue(results['gt_ignore_flags'].dtype == bool)

+    def test_transform_with_mask(self):
+        rng = np.random.RandomState(0)
+        pre_transform = [
+            dict(
+                type='LoadImageFromFile',
+                file_client_args=dict(backend='disk')),
+            dict(type='LoadAnnotations', with_bbox=True, with_mask=True)
+        ]
+
+        dataset = YOLOv5CocoDataset(
+            data_prefix=dict(
+                img=osp.join(osp.dirname(__file__), '../../data')),
+            ann_file=osp.join(
+                osp.dirname(__file__), '../../data/coco_sample_color.json'),
+            filter_cfg=dict(filter_empty_gt=False, min_size=32),
+            pipeline=[])
+        results = {
+            'img':
+            np.random.random((224, 224, 3)),
+            'img_shape': (224, 224),
+            'gt_bboxes_labels':
+            np.array([1, 2, 3], dtype=np.int64),
+            'gt_bboxes':
+            np.array([[10, 10, 20, 20], [20, 20, 40, 40], [40, 40, 80, 80]],
+                     dtype=np.float32),
+            'gt_ignore_flags':
+            np.array([0, 0, 1], dtype=bool),
+            'gt_masks':
+            PolygonMasks.random(num_masks=3, height=224, width=224, rng=rng),
+            'dataset':
+            dataset
+        }
+        transform = Mosaic(img_scale=(12, 10), pre_transform=pre_transform)
+        results['gt_bboxes'] = HorizontalBoxes(results['gt_bboxes'])
+        results = transform(results)
+        self.assertTrue(results['img'].shape[:2] == (20, 24))
+        self.assertTrue(results['gt_bboxes_labels'].shape[0] ==
+                        results['gt_bboxes'].shape[0])
+        self.assertTrue(results['gt_bboxes_labels'].dtype == np.int64)
+        self.assertTrue(results['gt_bboxes'].dtype == torch.float32)
+        self.assertTrue(results['gt_ignore_flags'].dtype == bool)
+

 class TestMosaic9(unittest.TestCase):

@ -209,7 +248,6 @@ class TestYOLOv5MixUp(unittest.TestCase):
        TestCase calls functions in this order: setUp() -> testMethod() ->
        tearDown() -> cleanUp()
        """
-        rng = np.random.RandomState(0)
        self.pre_transform = [
            dict(
                type='LoadImageFromFile',
@ -235,8 +273,6 @@ class TestYOLOv5MixUp(unittest.TestCase):
                     dtype=np.float32),
            'gt_ignore_flags':
            np.array([0, 0, 1], dtype=bool),
-            'gt_masks':
-            BitmapMasks(rng.rand(3, 288, 512), height=288, width=512),
            'dataset':
            self.dataset
        }
@ -268,6 +304,48 @@ class TestYOLOv5MixUp(unittest.TestCase):
        self.assertTrue(results['gt_bboxes'].dtype == torch.float32)
        self.assertTrue(results['gt_ignore_flags'].dtype == bool)

+    def test_transform_with_mask(self):
+        rng = np.random.RandomState(0)
+        pre_transform = [
+            dict(
+                type='LoadImageFromFile',
+                file_client_args=dict(backend='disk')),
+            dict(type='LoadAnnotations', with_bbox=True, with_mask=True)
+        ]
+        dataset = YOLOv5CocoDataset(
+            data_prefix=dict(
+                img=osp.join(osp.dirname(__file__), '../../data')),
+            ann_file=osp.join(
+                osp.dirname(__file__), '../../data/coco_sample_color.json'),
+            filter_cfg=dict(filter_empty_gt=False, min_size=32),
+            pipeline=[])
+
+        results = {
+            'img':
+            np.random.random((288, 512, 3)),
+            'img_shape': (288, 512),
+            'gt_bboxes_labels':
+            np.array([1, 2, 3], dtype=np.int64),
+            'gt_bboxes':
+            np.array([[10, 10, 20, 20], [20, 20, 40, 40], [40, 40, 80, 80]],
+                     dtype=np.float32),
+            'gt_ignore_flags':
+            np.array([0, 0, 1], dtype=bool),
+            'gt_masks':
+            PolygonMasks.random(num_masks=3, height=288, width=512, rng=rng),
+            'dataset':
+            dataset
+        }
+
+        transform = YOLOv5MixUp(pre_transform=pre_transform)
+        results = transform(copy.deepcopy(results))
+        self.assertTrue(results['img'].shape[:2] == (288, 512))
+        self.assertTrue(results['gt_bboxes_labels'].shape[0] ==
+                        results['gt_bboxes'].shape[0])
+        self.assertTrue(results['gt_bboxes_labels'].dtype == np.int64)
+        self.assertTrue(results['gt_bboxes'].dtype == np.float32)
+        self.assertTrue(results['gt_ignore_flags'].dtype == bool)
+

 class TestYOLOXMixUp(unittest.TestCase):

--- a/tests/test_datasets/test_transforms/test_transforms.py
+++ b/tests/test_datasets/test_transforms/test_transforms.py
@ -7,14 +7,15 @@ import mmcv
 import numpy as np
 import torch
 from mmdet.structures.bbox import HorizontalBoxes
-from mmdet.structures.mask import BitmapMasks
+from mmdet.structures.mask import BitmapMasks, PolygonMasks

 from mmyolo.datasets.transforms import (LetterResize, LoadAnnotations,
                                        YOLOv5HSVRandomAug,
                                        YOLOv5KeepRatioResize,
                                        YOLOv5RandomAffine)
 from mmyolo.datasets.transforms.transforms import (PPYOLOERandomCrop,
-                                                   PPYOLOERandomDistort)
+                                                   PPYOLOERandomDistort,
+                                                   YOLOv5CopyPaste)


 class TestLetterResize(unittest.TestCase):
@ -30,7 +31,7 @@ class TestLetterResize(unittest.TestCase):
            img=np.random.random((300, 400, 3)),
            gt_bboxes=np.array([[0, 0, 150, 150]], dtype=np.float32),
            batch_shape=np.array([192, 672], dtype=np.int64),
-            gt_masks=BitmapMasks(rng.rand(1, 300, 400), height=300, width=400))
+            gt_masks=PolygonMasks.random(1, height=300, width=400, rng=rng))
        self.data_info2 = dict(
            img=np.random.random((300, 400, 3)),
            gt_bboxes=np.array([[0, 0, 150, 150]], dtype=np.float32))
@ -88,7 +89,6 @@ class TestLetterResize(unittest.TestCase):

        # Test
        transform = LetterResize(scale=(640, 640), pad_val=dict(img=144))
-        rng = np.random.RandomState(0)
        for _ in range(5):
            input_h, input_w = np.random.randint(100, 700), np.random.randint(
                100, 700)
@ -99,8 +99,8 @@ class TestLetterResize(unittest.TestCase):
                img=np.random.random((input_h, input_w, 3)),
                gt_bboxes=np.array([[0, 0, 10, 10]], dtype=np.float32),
                batch_shape=np.array([output_h, output_w], dtype=np.int64),
-                gt_masks=BitmapMasks(
-                    rng.rand(1, input_h, input_w),
+                gt_masks=PolygonMasks(
+                    [[np.array([0., 0., 0., 10., 10., 10., 10., 0.])]],
                    height=input_h,
                    width=input_w))
            results = transform(data_info)
@ -111,15 +111,14 @@ class TestLetterResize(unittest.TestCase):

        # Test without batchshape
        transform = LetterResize(scale=(640, 640), pad_val=dict(img=144))
-        rng = np.random.RandomState(0)
        for _ in range(5):
            input_h, input_w = np.random.randint(100, 700), np.random.randint(
                100, 700)
            data_info = dict(
                img=np.random.random((input_h, input_w, 3)),
                gt_bboxes=np.array([[0, 0, 10, 10]], dtype=np.float32),
-                gt_masks=BitmapMasks(
-                    rng.rand(1, input_h, input_w),
+                gt_masks=PolygonMasks(
+                    [[np.array([0., 0., 0., 10., 10., 10., 10., 0.])]],
                    height=input_h,
                    width=input_w))
            results = transform(data_info)
@ -178,7 +177,8 @@ class TestYOLOv5KeepRatioResize(unittest.TestCase):
        self.data_info1 = dict(
            img=np.random.random((300, 400, 3)),
            gt_bboxes=np.array([[0, 0, 150, 150]], dtype=np.float32),
-            gt_masks=BitmapMasks(rng.rand(1, 300, 400), height=300, width=400))
+            gt_masks=PolygonMasks.random(
+                num_masks=1, height=300, width=400, rng=rng))
        self.data_info2 = dict(img=np.random.random((300, 400, 3)))

    def test_yolov5_keep_ratio_resize(self):
@ -454,3 +454,37 @@ class TestPPYOLOERandomDistort(unittest.TestCase):
        self.assertTrue(results['gt_bboxes_labels'].dtype == np.int64)
        self.assertTrue(results['gt_bboxes'].dtype == torch.float32)
        self.assertTrue(results['gt_ignore_flags'].dtype == bool)
+
+
+class TestYOLOv5CopyPaste(unittest.TestCase):
+
+    def setUp(self):
+        """Set up the data info which are used in every test method.
+
+        TestCase calls functions in this order: setUp() -> testMethod() ->
+        tearDown() -> cleanUp()
+        """
+        self.data_info = dict(
+            img=np.random.random((300, 400, 3)),
+            gt_bboxes=np.array([[0, 0, 10, 10]], dtype=np.float32),
+            gt_masks=PolygonMasks(
+                [[np.array([0., 0., 0., 10., 10., 10., 10., 0.])]],
+                height=300,
+                width=400))
+
+    def test_transform(self):
+        # test transform
+        transform = YOLOv5CopyPaste(prob=1.0)
+        results = transform(copy.deepcopy(self.data_info))
+        self.assertTrue(len(results['gt_bboxes']) == 2)
+        self.assertTrue(len(results['gt_masks']) == 2)
+
+        rng = np.random.RandomState(0)
+        # test with bitmap
+        with self.assertRaises(AssertionError):
+            results = transform(
+                dict(
+                    img=np.random.random((300, 400, 3)),
+                    gt_bboxes=np.array([[0, 0, 10, 10]], dtype=np.float32),
+                    gt_masks=BitmapMasks(
+                        rng.rand(1, 300, 400), height=300, width=400)))