# Add Datasets

This tutorial introduces the basic steps for creating a customized dataset. Before you start, we recommend reading [datasets.md](datasets.md) to learn the basic concepts of datasets.

- [Add Datasets](#add-datasets)
  - [Step 1: Creating the Dataset](#step-1-creating-the-dataset)
  - [Step 2: Add NewDataset to \_\_init\_\_.py](#step-2-add-newdataset-to-__init__py)
  - [Step 3: Modify the config file](#step-3-modify-the-config-file)

If your algorithm does not need a customized dataset, you can use the off-the-shelf datasets in the [datasets directory](mmselfsup.datasets). To use them, you have to convert your data to the format that the existing dataset expects.

For image pretraining, we recommend following the format of MMClassification.
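
As a rough illustration of that format, the sketch below assumes the annotation-file layout that MMClassification's `CustomDataset` accepts: one sample per line, a relative image path optionally followed by an integer label. The helper `parse_ann_lines` and the sample paths are hypothetical, and the `img_path` / `gt_label` keys are an assumption about the sample dict layout rather than a guaranteed API.

```python
from typing import Dict, List


def parse_ann_lines(lines: List[str]) -> List[Dict]:
    """Parse 'relative/path.jpg [label]' lines into sample dicts.

    Hypothetical helper: assumes paths contain no spaces.
    """
    samples = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        sample = {'img_path': parts[0]}
        if len(parts) > 1:
            # The label is optional, since pretraining often has no labels.
            sample['gt_label'] = int(parts[1])
        samples.append(sample)
    return samples


# Made-up annotation content, for illustration only.
ann_text = """train/cat/001.jpg 0
train/dog/002.jpg 1

train/unlabeled/003.jpg
"""
samples = parse_ann_lines(ann_text.splitlines())
print(samples)
```

If your data already follows this layout, an existing dataset class may be enough and the steps below can be skipped.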
## Step 1: Creating the Dataset
You can implement a new dataset class that inherits from MMClassification's `CustomDataset` for image pretraining.

Assuming your dataset class is named `NewDataset`, create a file named `new_dataset.py` under `mmselfsup/datasets` and implement `NewDataset` in it.

```python
from typing import List, Optional, Union

from mmcls.datasets import CustomDataset

from mmselfsup.registry import DATASETS


@DATASETS.register_module()
class NewDataset(CustomDataset):
    # Image file extensions accepted by this dataset.
    IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif')

    def __init__(self,
                 ann_file: str = '',
                 metainfo: Optional[dict] = None,
                 data_root: str = '',
                 data_prefix: Union[str, dict] = '',
                 **kwargs) -> None:
        # Default to IMG_EXTENSIONS; an explicit `extensions` kwarg overrides it.
        kwargs = {'extensions': self.IMG_EXTENSIONS, **kwargs}
        super().__init__(
            ann_file=ann_file,
            metainfo=metainfo,
            data_root=data_root,
            data_prefix=data_prefix,
            **kwargs)

    def load_data_list(self) -> List[dict]:
        # Rewrite load_data_list() to satisfy your specific requirement.
        # The returned data_list could include any information you need from
        # data or transforms.
        data_list = []  # placeholder so the example is valid; one dict per sample

        # write your code here

        return data_list
```
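
One possible body for `load_data_list()` is sketched below as a standalone function, so it runs without MMClassification installed. It walks a directory and keeps files whose extension is in `IMG_EXTENSIONS`. The function name `scan_images` is hypothetical, and using `img_path` as the dict key is an assumption about what the downstream transforms read.

```python
import os
import tempfile
from typing import List, Tuple

IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif')


def scan_images(root: str,
                extensions: Tuple[str, ...] = IMG_EXTENSIONS) -> List[dict]:
    """Collect image files under `root` into a list of sample dicts."""
    data_list = []
    # Sort directories and file names so the sample order is deterministic.
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                data_list.append({'img_path': os.path.join(dirpath, name)})
    return data_list


# Demo on a temporary directory with two images and one non-image file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'train'))
    for name in ('b.png', 'a.jpg', 'notes.txt'):
        open(os.path.join(root, 'train', name), 'w').close()
    data_list = scan_images(root)
    found = [os.path.basename(item['img_path']) for item in data_list]

print(found)
```

Inside `NewDataset`, the same loop could use the dataset's own data root and extensions in place of the function arguments.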
## Step 2: Add NewDataset to \_\_init\_\_.py

Then, import `NewDataset` in `mmselfsup/datasets/__init__.py`. If it is not imported, `NewDataset` will not be registered.

```python
...
from .new_dataset import NewDataset

__all__ = [
    ..., 'NewDataset'
]
```
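
To see why the import matters, here is a toy registry. It is a simplified stand-in written for this example, not the actual MMEngine `Registry`. The decorator only runs when the module that defines the class is executed, so a class that is never imported never reaches the registry.

```python
class Registry:
    """Toy registry mapping class names to classes (illustrative only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._module_dict = {}

    def register_module(self):
        # Return a decorator that records the class under its own name.
        def _register(cls):
            self._module_dict[cls.__name__] = cls
            return cls

        return _register

    def get(self, key: str):
        return self._module_dict.get(key)


DATASETS = Registry('dataset')

# Before the class definition is executed, the lookup finds nothing.
before = DATASETS.get('NewDataset')


@DATASETS.register_module()
class NewDataset:
    pass


# Executing the definition (which is what an import does) registers it.
after = DATASETS.get('NewDataset')
print(before, after)
```

Importing `NewDataset` in `__init__.py` plays the role of executing the class definition here.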
## Step 3: Modify the config file

To use `NewDataset`, modify the config as follows:

```python
train_dataloader = dict(
    ...
    dataset=dict(
        type='NewDataset',  # the name registered in Step 1
        data_root=your_data_root,  # root directory of your data
        ann_file=your_ann_file,  # annotation file path (placeholder name)
        data_prefix=dict(img_path='train/'),
        pipeline=train_pipeline))
```
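
Conceptually, the `type` key selects the registered class and the remaining keys become constructor arguments. The sketch below illustrates that idea with a plain dict as the registry. `build_from_cfg` here is a simplified, hypothetical re-implementation, and the stub `NewDataset` and the paths are made up for the example.

```python
from typing import Any, Dict, Optional


class NewDataset:
    """Stub dataset that only stores its arguments (illustrative only)."""

    def __init__(self,
                 data_root: str = '',
                 ann_file: str = '',
                 data_prefix: Optional[dict] = None,
                 pipeline: Optional[list] = None) -> None:
        self.data_root = data_root
        self.ann_file = ann_file
        self.data_prefix = data_prefix or {}
        self.pipeline = pipeline or []


REGISTRY: Dict[str, type] = {'NewDataset': NewDataset}


def build_from_cfg(cfg: Dict[str, Any], registry: Dict[str, type]) -> Any:
    """Instantiate the class named by cfg['type'] with the remaining keys."""
    args = dict(cfg)  # copy so the caller's config is not mutated
    cls_name = args.pop('type')
    if cls_name not in registry:
        raise KeyError(f'{cls_name} is not registered')
    return registry[cls_name](**args)


# Hypothetical paths, for illustration only.
dataset_cfg = dict(
    type='NewDataset',
    data_root='data/my_dataset',
    ann_file='meta/train.txt',
    data_prefix=dict(img_path='train/'),
    pipeline=[])
dataset = build_from_cfg(dataset_cfg, REGISTRY)
print(type(dataset).__name__, dataset.data_root)
```

A `type` name that was never registered fails at this lookup, which is the error you would expect to see if Step 2 is skipped.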