mmselfsup/docs/en/advanced_guides/add_datasets.md

# Add Datasets

In this tutorial, we introduce the basic steps to create your customized dataset:

- [Add Datasets](#add-datasets)
  - [An example of customized dataset](#an-example-of-customized-dataset)
  - [Creating the `DataSource`](#creating-the-datasource)
  - [Creating the `Dataset`](#creating-the-dataset)
  - [Modify config file](#modify-config-file)

If your algorithm does not need a customized dataset, you can use the off-the-shelf datasets under [datasets](../../mmselfsup/datasets). To use them, however, you must first convert your data to a format they support.

## An example of customized dataset

Assume the annotation file of your dataset has the following format:

```text
000001.jpg 0
000002.jpg 1
```

To write a new dataset, you need to implement:

- `DataSource`: inherits from `BaseDataSource` and is responsible for loading the annotation files and reading images.
- `Dataset`: inherits from `BaseDataset` and is responsible for applying transformations to images and packing them.

## Creating the `DataSource`

Assuming your `DataSource` is named `NewDataSource`, create a file named `new_data_source.py` under `mmselfsup/datasets/data_sources` and implement `NewDataSource` in it.

```python
import mmcv
import numpy as np

from ..builder import DATASOURCES
from .base import BaseDataSource


@DATASOURCES.register_module()
class NewDataSource(BaseDataSource):

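    def load_annotations(self):
        # The annotation file is expected to be given as a path string.
        assert isinstance(self.ann_file, str)
        data_infos = []
        # Write your code here: parse self.ann_file and fill data_infos.
        return data_infos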
```
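As a rough illustration of what the body of `load_annotations` could do for the annotation format shown earlier, the sketch below parses each `filename label` line into a dictionary. It is a standalone function rather than a method, so it runs without MMSelfSup installed. The helper name `parse_annotation_lines` and the keys `img_prefix`, `img_info` and `gt_label` are assumptions made for this example; check `BaseDataSource` in your version for the fields it actually expects.

```python
def parse_annotation_lines(lines, data_prefix=None):
    """Parse 'filename label' lines into a list of info dicts.

    Hypothetical helper for illustration; the dict keys are assumptions.
    """
    data_infos = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace so that file names
        # containing spaces are kept intact.
        filename, label = line.rsplit(maxsplit=1)
        data_infos.append({
            'img_prefix': data_prefix,
            'img_info': {'filename': filename},
            'gt_label': int(label),
        })
    return data_infos


infos = parse_annotation_lines(['000001.jpg 0', '000002.jpg 1'], 'data/images')
print(infos[1]['img_info']['filename'], infos[1]['gt_label'])
```

Inside `NewDataSource.load_annotations`, the same loop could run over the lines read from `self.ann_file`.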

Then, add `NewDataSource` to `mmselfsup/datasets/data_sources/__init__.py`.

```python
from .base import BaseDataSource
...
from .new_data_source import NewDataSource

__all__ = [
    'BaseDataSource', ..., 'NewDataSource'
]
```
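The import in `__init__.py` matters because registration happens as a side effect of executing the decorated class definition: if the module is never imported, the class never reaches the registry and cannot be referenced by name in a config. The sketch below shows the idea with a deliberately tiny stand-in registry. It is not the real registry implementation, and the `Registry` class and its `build` method here are simplified assumptions.

```python
class Registry:
    """Tiny stand-in for a registry; not the real implementation."""

    def __init__(self, name):
        self.name = name
        self.module_dict = {}

    def register_module(self):
        def _register(cls):
            # Runs when the decorated class definition is executed,
            # i.e. when the module containing it is imported.
            self.module_dict[cls.__name__] = cls
            return cls
        return _register

    def build(self, cfg):
        # Look up the class named by `type`; pass the rest as kwargs.
        args = dict(cfg)
        cls = self.module_dict[args.pop('type')]
        return cls(**args)


DATASOURCES = Registry('datasource')


@DATASOURCES.register_module()
class NewDataSource:

    def __init__(self, ann_file):
        self.ann_file = ann_file


source = DATASOURCES.build(dict(type='NewDataSource', ann_file='train.txt'))
print(type(source).__name__, source.ann_file)
```

In this sketch, building a config whose `type` was never registered raises a `KeyError`. That is the kind of failure to expect when the import line is missing.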

## Creating the `Dataset`

Assuming your `Dataset` is named `NewDataset`, create a file named `new_dataset.py` under `mmselfsup/datasets` and implement `NewDataset` in it.

```python
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmcv.utils import build_from_cfg
from torchvision.transforms import Compose

from .base import BaseDataset
from .builder import DATASETS, PIPELINES, build_datasource
from .utils import to_numpy


@DATASETS.register_module()
class NewDataset(BaseDataset):

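    def __init__(self, data_source, num_views, pipelines, prefetch=False):
        # Write your code here, e.g. build the data source and pipelines.
        self.data_source = build_datasource(data_source)
        self.prefetch = prefetch

    def __getitem__(self, idx):
        # Write your code here: load the image and apply your pipelines.
        img = self.data_source.get_img(idx)
        return dict(img=img)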

    def evaluate(self, results, logger=None):
        return NotImplemented
```
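One plausible way to combine `num_views` and `pipelines` is sketched below. It assumes that `num_views[i]` is the number of views to generate with `pipelines[i]`. The code uses plain Python callables and a list of numbers in place of real transforms and images, so it runs on its own; the function names `expand_views` and `make_views` are hypothetical.

```python
def expand_views(num_views, pipelines):
    """Repeat pipelines[i] num_views[i] times: one transform per view."""
    assert len(num_views) == len(pipelines)
    transforms = []
    for count, pipeline in zip(num_views, pipelines):
        transforms.extend([pipeline] * count)
    return transforms


def make_views(img, transforms):
    """Apply every transform to the same input image."""
    return [transform(img) for transform in transforms]


# Toy "pipelines" acting on a list of pixel values, not a real image.
def flip(pixels):
    return pixels[::-1]


def brighten(pixels):
    return [p + 10 for p in pixels]


transforms = expand_views(num_views=[2, 1], pipelines=[flip, brighten])
views = make_views([1, 2, 3], transforms)
print(views)
```

In a real `__getitem__`, the list of views would then be returned as `dict(img=views)`.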

Then, add `NewDataset` to `mmselfsup/datasets/__init__.py`.

```python
from .base import BaseDataset
...
from .new_dataset import NewDataset

__all__ = [
    'BaseDataset', ..., 'NewDataset'
]
```

## Modify config file

To use `NewDataset`, modify the config as follows:

```python
# `train_pipeline` and `prefetch` are assumed to be defined earlier in the config.
data = dict(
    train=dict(
        type='NewDataset',
        data_source=dict(
            type='NewDataSource',
        ),
        num_views=[2],
        pipelines=[train_pipeline],
        prefetch=prefetch,
    ))
```
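The fragment above refers to `train_pipeline` and `prefetch` without defining them. A fuller sketch might look like the following. The transform entries, the `data_prefix` argument and both paths are placeholders chosen for illustration, not values taken from this tutorial, so replace them with whatever your `NewDataSource` and pipeline actually accept.

```python
# Hypothetical transforms; replace them with your own pipeline.
train_pipeline = [
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomHorizontalFlip'),
]
prefetch = False

data = dict(
    train=dict(
        type='NewDataset',
        data_source=dict(
            type='NewDataSource',
            # Assumed image root and annotation file; adjust to your data.
            data_prefix='data/new_dataset/train',
            ann_file='data/new_dataset/meta/train.txt',
        ),
        num_views=[2],  # one entry per pipeline
        pipelines=[train_pipeline],
        prefetch=prefetch,
    ))

# Sanity check: each pipeline needs a matching view count.
assert len(data['train']['num_views']) == len(data['train']['pipelines'])
```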