# Tutorial 2: Adding New Dataset
## Customize datasets by reorganizing data
### Reorganize dataset to existing format
The simplest way is to convert your dataset to an existing dataset format (e.g., ImageNet).
For training, ImageNet differentiates classes by folders. The directory structure of the training data is as follows:
```
imagenet
├── ...
├── train
│   ├── n01440764
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   ├── ...
│   ├── ...
│   ├── n15075141
│   │   ├── n15075141_999.JPEG
│   │   ├── n15075141_9993.JPEG
│   │   ├── ...
```
For validation, we provide an annotation list. Each line of the list contains a filename and its corresponding ground-truth label. The format is as follows:
```
ILSVRC2012_val_00000001.JPEG 65
ILSVRC2012_val_00000002.JPEG 970
ILSVRC2012_val_00000003.JPEG 230
ILSVRC2012_val_00000004.JPEG 809
ILSVRC2012_val_00000005.JPEG 516
```
Note: The value of ground-truth labels should fall in the range `[0, num_classes - 1]`.
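After reorganizing the data, the dataset part of a config can point `ImageNet` at the training folder and the validation annotation file. Below is a minimal sketch; the paths, batch-size numbers, and pipeline names (`train_pipeline`, `test_pipeline`) are placeholders to adapt to your setup:
```python
data = dict(
    samples_per_gpu=32,
    workers_per_gpu=2,
    train=dict(
        type='ImageNet',
        data_prefix='data/imagenet/train',  # class folders shown above
        pipeline=train_pipeline),
    val=dict(
        type='ImageNet',
        data_prefix='data/imagenet/val',
        ann_file='data/imagenet/meta/val.txt',  # the annotation list above
        pipeline=test_pipeline))
```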
### An example of customized dataset
You can write a new Dataset class inherited from `BaseDataset`, and override `load_annotations(self)`,
like [CIFAR10](https://github.com/open-mmlab/mmclassification/blob/master/mmcls/datasets/cifar.py) and [ImageNet](https://github.com/open-mmlab/mmclassification/blob/master/mmcls/datasets/imagenet.py).
Typically, this function returns a list, where each sample is a dict containing the necessary data information, e.g., the image path and `gt_label`.
Assume we are going to implement a `Filelist` dataset, which takes filelists for both training and testing. The format of the annotation list is as follows:
```
000001.jpg 0
000002.jpg 1
```
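How the annotation list is produced is up to you. One possible (hypothetical) way to generate it from a folder-per-class layout, assigning label ids by the sorted order of the class folders, is sketched below; `data/train` and `image_list.txt` are placeholder paths:
```python
import os

def write_filelist(root, out_file):
    # Assign each class folder under `root` a label id by its sorted position
    # and write one "relative/path.jpg label" line per image.
    classes = sorted(os.listdir(root))
    with open(out_file, 'w') as f:
        for label, cls in enumerate(classes):
            for name in sorted(os.listdir(os.path.join(root, cls))):
                f.write(f'{os.path.join(cls, name)} {label}\n')

write_filelist('data/train', 'image_list.txt')
```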
We can create a new dataset in `mmcls/datasets/filelist.py` to load the data.
```python
import numpy as np

from .base_dataset import BaseDataset
from .builder import DATASETS


@DATASETS.register_module()
class Filelist(BaseDataset):

    def load_annotations(self):
        assert isinstance(self.ann_file, str)

        data_infos = []
        with open(self.ann_file) as f:
            # Each line holds an image filename and its label, separated by a space.
            samples = [x.strip().split(' ') for x in f.readlines()]
            for filename, gt_label in samples:
                info = {'img_prefix': self.data_prefix}
                info['img_info'] = {'filename': filename}
                info['gt_label'] = np.array(gt_label, dtype=np.int64)
                data_infos.append(info)
            return data_infos
```
Then, to use `Filelist` in the config, you can modify it as follows:
```python
dataset_A_train = dict(
    type='Filelist',
    ann_file='image_list.txt',
    pipeline=train_pipeline
)
```
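The `dataset_A_train` dict is then used wherever a dataset config is expected, for example as the `train` entry of the `data` field (a sketch; the other entries are omitted):
```python
data = dict(
    samples_per_gpu=32,
    workers_per_gpu=2,
    train=dataset_A_train,
    # val=..., test=...
)
```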
## Customize datasets by mixing dataset
MMClassification also supports mixing datasets for training.
Currently it supports concatenating and repeating datasets.
### Repeat dataset
We use `RepeatDataset` as a wrapper to repeat the dataset. For example, suppose the original dataset is `Dataset_A`. To repeat it, the config looks like the following:
```python
dataset_A_train = dict(
    type='RepeatDataset',
    times=N,
    dataset=dict(  # This is the original config of Dataset_A
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
```
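For instance, with the hypothetical `Filelist` dataset defined earlier, repeating it twice per epoch could look like the sketch below; the effective dataset length becomes `times * len(dataset)`:
```python
dataset_A_train = dict(
    type='RepeatDataset',
    times=2,
    dataset=dict(
        type='Filelist',
        ann_file='image_list.txt',
        pipeline=train_pipeline
    )
)
```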
### Class balanced dataset
We use `ClassBalancedDataset` as a wrapper to repeat the dataset based on category
frequency. The dataset to be repeated needs to implement the method `self.get_cat_ids(idx)`
to support `ClassBalancedDataset`.
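For a single-label classification dataset such as the `Filelist` example above, a minimal sketch of this method (assuming, as in `BaseDataset`, that the list returned by `load_annotations` is stored as `self.data_infos`) could be:
```python
@DATASETS.register_module()
class Filelist(BaseDataset):
    ...

    def get_cat_ids(self, idx):
        # Return the category ids of the sample at index `idx` as a list;
        # a single-label sample has exactly one id.
        return [int(self.data_infos[idx]['gt_label'])]
```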
For example, to repeat `Dataset_A` with `oversample_thr=1e-3`, the config looks like the following
```python
dataset_A_train = dict(
    type='ClassBalancedDataset',
    oversample_thr=1e-3,
    dataset=dict(  # This is the original config of Dataset_A
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
```
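Conceptually, the wrapper follows a repeat-factor rule (as in the LVIS paper): categories whose image frequency falls below `oversample_thr` get repeated in proportion to the square root of the ratio. A rough sketch of the per-category computation, not the actual implementation, is:
```python
import math

def category_repeat_factor(category_freq, oversample_thr):
    # category_freq: fraction of images that contain this category.
    # Categories rarer than the threshold are repeated; common ones are not.
    return max(1.0, math.sqrt(oversample_thr / category_freq))

# An image's repeat factor is the maximum over the categories it contains, e.g.:
print(category_repeat_factor(2.5e-4, 1e-3))  # 2.0
```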
You may refer to the [source code](../../mmcls/datasets/dataset_wrappers.py) for details.