# Adding New Dataset

You can write a new dataset class that inherits from `BaseDataset` and overrides `load_data_list(self)`,
like [CIFAR10](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/cifar.py) and [ImageNet](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/imagenet.py).
Typically, this function returns a list in which each sample is a dict containing the necessary data information, e.g., `img` and `gt_label`.

Assume we are going to implement a `Filelist` dataset that takes file lists for both training and testing. The annotation list has the following format:

```text
000001.jpg 0
000002.jpg 1
```
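
To make the format concrete, here is a minimal, framework-free sketch of how such lines could map to sample dicts. `parse_filelist` and its `img_prefix` argument are hypothetical names used only for illustration; they are not part of MMPretrain.

```python
import io


def parse_filelist(lines, img_prefix=''):
    """Turn '<filename> <label>' lines into a list of sample dicts."""
    data_list = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        filename, gt_label = line.split(' ')
        data_list.append({
            'img_path': img_prefix + filename,
            'gt_label': int(gt_label),
        })
    return data_list


# A file-like object stands in for the annotation file on disk.
ann_file = io.StringIO('000001.jpg 0\n000002.jpg 1\n')
samples = parse_filelist(ann_file, img_prefix='data/')
print(samples)
```

Each line becomes one dict, with the label converted from text to an integer.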

## 1. Create Dataset Class

We can create a new dataset in `mmpretrain/datasets/filelist.py` to load the data.

```python
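import os.path as osp

from mmpretrain.registry import DATASETS
from .base_dataset import BaseDataset


@DATASETS.register_module()
class Filelist(BaseDataset):

    def load_data_list(self):
        # The annotation file must be given as a path string.
        assert isinstance(self.ann_file, str)

        data_list = []
        with open(self.ann_file) as f:
            # Each non-empty line has the form "<filename> <label>".
            samples = [x.strip().split(' ') for x in f if x.strip()]
            for filename, gt_label in samples:
                # Join the image root (`self.img_prefix`) with the relative
                # file name; `osp.join` assumes a local filesystem.
                img_path = osp.join(self.img_prefix, filename)
                info = {'img_path': img_path, 'gt_label': int(gt_label)}
                data_list.append(info)
        return data_list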
```

## 2. Add to the Package

Then add this dataset class to `mmpretrain/datasets/__init__.py`:

```python
from .base_dataset import BaseDataset
...
from .filelist import Filelist

__all__ = [
    'BaseDataset', ..., 'Filelist'
]
```
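
The import matters because a registration decorator only runs when the module that defines the class is executed. The following toy registry is a simplified sketch of that mechanism, not MMEngine's actual implementation; the `Registry` class here is written purely for illustration.

```python
class Registry:
    """A toy name-to-class registry (not MMEngine's implementation)."""

    def __init__(self):
        self._modules = {}

    def register_module(self):
        # Return a decorator that stores the class under its own name.
        def decorator(cls):
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def get(self, name):
        return self._modules.get(name)


DATASETS = Registry()

# Before the class definition runs, the name cannot be resolved.
missing = DATASETS.get('Filelist')


@DATASETS.register_module()
class Filelist:
    pass


# Once the defining code has been executed, the lookup succeeds.
found = DATASETS.get('Filelist')
print(missing, found is Filelist)
```

In the same way, a config that refers to `type='Filelist'` can only be resolved after `filelist.py` has been imported somewhere, which is what the entry in `__init__.py` ensures.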

## 3. Modify Related Config

To use `Filelist`, modify the config as follows:

```python
train_dataloader = dict(
    ...
    dataset=dict(
        type='Filelist',
        ann_file='image_list.txt',
        pipeline=train_pipeline,
    )
)
```

All dataset classes that inherit from [`BaseDataset`](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/base_dataset.py) support **lazy loading** and **memory saving**. For details, refer to the documentation of {external+mmengine:doc}`BaseDataset <advanced_tutorials/basedataset>`.
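
As a rough illustration of the lazy loading idea, the toy class below defers reading annotations until they are first needed. It borrows the names `lazy_init` and `full_init()` from MMEngine's `BaseDataset`, but everything else is a simplified assumption and not the real implementation.

```python
class LazyDataset:
    """Toy dataset that defers annotation loading (illustration only)."""

    def __init__(self, lines, lazy_init=False):
        self._lines = lines
        self.data_list = []
        self._fully_initialized = False
        if not lazy_init:
            self.full_init()

    def full_init(self):
        # Load annotations once; later calls are no-ops.
        if self._fully_initialized:
            return
        self.data_list = self.load_data_list()
        self._fully_initialized = True

    def load_data_list(self):
        pairs = (line.split(' ') for line in self._lines)
        return [dict(img_path=name, gt_label=int(label))
                for name, label in pairs]

    def __len__(self):
        # Accessing the data triggers the deferred loading.
        self.full_init()
        return len(self.data_list)


dataset = LazyDataset(['000001.jpg 0', '000002.jpg 1'], lazy_init=True)
loaded_before = dataset._fully_initialized
num_samples = len(dataset)
loaded_after = dataset._fully_initialized
print(loaded_before, num_samples, loaded_after)
```

Constructing the dataset is cheap here; the annotation list is only built when the length is first requested.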

```{note}
If the dictionary of the data sample contains 'img_path' but not 'img', then the 'LoadImageFromFile' transform must be added to the pipeline.
```
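
Since the `Filelist` samples above carry only `img_path`, a pipeline for them would start with the image-loading transform. The following is a sketch of what a fuller config might look like; the transform list follows the style of MMPretrain's ImageNet base configs, while `data_prefix='data/images'`, the batch size, and the worker count are illustrative assumptions.

```python
train_pipeline = [
    # Reads the file at sample['img_path'] and stores the array in sample['img'].
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', scale=224),
    dict(type='RandomFlip', prob=0.5, direction='horizontal'),
    dict(type='PackInputs'),
]

train_dataloader = dict(
    batch_size=32,  # illustrative value
    num_workers=4,  # illustrative value
    dataset=dict(
        type='Filelist',
        ann_file='image_list.txt',
        data_prefix='data/images',  # hypothetical image root directory
        pipeline=train_pipeline,
    ),
    sampler=dict(type='DefaultSampler', shuffle=True),
)
```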