# Add Datasets
In this tutorial, we introduce the basic steps to create your customized dataset:
- [Add Datasets](#add-datasets)
  - [An example of customized dataset](#an-example-of-customized-dataset)
  - [Creating the `DataSource`](#creating-the-datasource)
  - [Creating the `Dataset`](#creating-the-dataset)
  - [Modify config file](#modify-config-file)

If your algorithm does not need a customized dataset, you can use the off-the-shelf datasets under [datasets](../../mmselfsup/datasets). To use them, however, you have to convert your data to the format those datasets expect.
## An example of customized dataset
Assume the annotation file of your dataset has the following format, with one image file name and its label per line:
```text
000001.jpg 0
000002.jpg 1
```
To write a new dataset, you need to implement:

- `DataSource`: inherits from `BaseDataSource` and is responsible for loading the annotation files and reading images.
- `Dataset`: inherits from `BaseDataset` and is responsible for applying transformations to the images and packing them.
## Creating the `DataSource`
Assuming your `DataSource` is named `NewDataSource`, create a file named `new_data_source.py` under `mmselfsup/datasets/data_sources` and implement `NewDataSource` in it.
```python
import mmcv
import numpy as np

from ..builder import DATASOURCES
from .base import BaseDataSource


@DATASOURCES.register_module()
class NewDataSource(BaseDataSource):

    def load_annotations(self):
        assert isinstance(self.ann_file, str)
        data_infos = []
        # Write your code here.
        return data_infos
```
Then, add `NewDataSource` to `mmselfsup/datasets/data_sources/__init__.py`.
```python
from .base import BaseDataSource
...
from .new_data_source import NewDataSource

__all__ = [
    'BaseDataSource', ..., 'NewDataSource'
]
```
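As a rough illustration of what the `load_annotations` placeholder might contain for the annotation format shown earlier, the standalone sketch below parses `filename label` lines into a list of dictionaries. The helper name `parse_annotations`, the `data_prefix` argument, and the dictionary keys (`img_prefix`, `img_info`, `gt_label`) are assumptions made for this example; check `BaseDataSource` and the existing data sources in your MMSelfSup version for the fields they actually expect.

```python
def parse_annotations(lines, data_prefix):
    """Turn 'filename label' lines into a list of info dicts.

    Hypothetical helper: inside ``NewDataSource.load_annotations`` the
    lines would be read from the file at ``self.ann_file``.
    """
    data_infos = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        filename, label = line.split()
        data_infos.append({
            'img_prefix': data_prefix,           # assumed key name
            'img_info': {'filename': filename},  # assumed key name
            'gt_label': int(label),              # assumed key name
        })
    return data_infos


infos = parse_annotations(['000001.jpg 0', '000002.jpg 1'], 'data/images')
print(len(infos), infos[1]['img_info']['filename'], infos[1]['gt_label'])
# prints: 2 000002.jpg 1
```

Keeping the parsing in a small pure function like this makes it easy to test without constructing the full data source.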
## Creating the `Dataset`
Assuming your `Dataset` is named `NewDataset`, create a file named `new_dataset.py` under `mmselfsup/datasets` and implement `NewDataset` in it.
```python
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmcv.utils import build_from_cfg
from torchvision.transforms import Compose

from .base import BaseDataset
from .builder import DATASETS, PIPELINES, build_datasource
from .utils import to_numpy


@DATASETS.register_module()
class NewDataset(BaseDataset):

    def __init__(self, data_source, num_views, pipelines, prefetch=False):
        # Write your code here.
        pass

    def __getitem__(self, idx):
        # Write your code here; it should produce `img`.
        return dict(img=img)

    def evaluate(self, results, logger=None):
        return NotImplemented
```
Then, add `NewDataset` to `mmselfsup/datasets/__init__.py`.
```python
from .base import BaseDataset
...
from .new_dataset import NewDataset

__all__ = [
    'BaseDataset', ..., 'NewDataset'
]
```
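To make the `__init__` and `__getitem__` placeholders more concrete, here is a framework-free sketch of one way they could cooperate: each pipeline is repeated according to `num_views`, and every resulting transform is applied to the same source image. `ToyDataSource`, `ToyMultiViewDataset`, `get_img`, and the lambda pipelines are hypothetical stand-ins for this example only; a real `NewDataset` would inherit from `BaseDataset`, build its pipelines from the config, and return tensors rather than numbers.

```python
class ToyDataSource:
    """Hypothetical stand-in for a data source; 'images' are plain numbers."""

    def __init__(self, images):
        self.images = images

    def __len__(self):
        return len(self.images)

    def get_img(self, idx):
        return self.images[idx]


class ToyMultiViewDataset:
    """Sketch of the multi-view idea, without MMSelfSup or PyTorch."""

    def __init__(self, data_source, num_views, pipelines):
        assert len(num_views) == len(pipelines)
        self.data_source = data_source
        # Repeat the i-th pipeline num_views[i] times.
        self.trans = []
        for pipeline, n in zip(pipelines, num_views):
            self.trans.extend([pipeline] * n)

    def __len__(self):
        return len(self.data_source)

    def __getitem__(self, idx):
        img = self.data_source.get_img(idx)
        # One output per view, all derived from the same source image.
        views = [t(img) for t in self.trans]
        return dict(img=views)


dataset = ToyMultiViewDataset(
    data_source=ToyDataSource([10, 20]),
    num_views=[2, 1],
    pipelines=[lambda x: x + 1, lambda x: x * 2],
)
print(dataset[1])  # {'img': [21, 21, 40]}
```

With deterministic toy transforms the two views from the first pipeline are identical; with random augmentations each view would differ, which is usually the point of requesting several views.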
## Modify config file
To use `NewDataset`, modify the config as follows:
```python
data = dict(
    train=dict(
        type='NewDataset',
        data_source=dict(
            type='NewDataSource',
        ),
        num_views=[2],
        pipelines=[train_pipeline],
        prefetch=prefetch,
    ))
```
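The config above refers to `train_pipeline` and `prefetch` without defining them and leaves the `data_source` arguments empty. The sketch below shows one way the surrounding pieces might be filled in. All paths, the two transforms, and the `data_prefix` argument are illustrative assumptions; `ann_file` corresponds to the `self.ann_file` attribute used in `load_annotations`. Adapt the values to your data and to the arguments your `NewDataSource` actually accepts.

```python
# Hypothetical values; replace them with your own settings.
prefetch = False
train_pipeline = [
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomHorizontalFlip'),
]

data = dict(
    train=dict(
        type='NewDataset',
        data_source=dict(
            type='NewDataSource',
            data_prefix='data/my_dataset/train',   # assumed image root
            ann_file='data/my_dataset/train.txt',  # assumed annotation file
        ),
        num_views=[2],               # two views from the single pipeline
        pipelines=[train_pipeline],  # one pipeline per entry in num_views
        prefetch=prefetch,
    ))
```

Note that `num_views` and `pipelines` are both lists; in this sketch they have the same length, so each pipeline is paired with its own view count.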