# Adding New Dataset
You can write a new dataset class that inherits from `BaseDataset` and overrides `load_data_list(self)`, as [CIFAR10](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/cifar.py) and [ImageNet](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/imagenet.py) do.
Typically, this function returns a list in which each sample is a dict containing the necessary data information, e.g., `img` and `gt_label`.

Assume we are going to implement a `Filelist` dataset that takes file lists for both training and testing. The annotation list has the following format:
```text
000001.jpg 0
000002.jpg 1
```
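As a quick sanity check of this format, the following standalone sketch parses such a list with plain Python. The helper name `parse_filelist` is hypothetical and not part of MMPretrain; it assumes one `filename label` pair per line, with the label as the last space-separated field.

```python
import tempfile
from pathlib import Path


def parse_filelist(ann_file):
    """Parse lines of the form '<filename> <label>' into (str, int) pairs.

    Hypothetical helper for illustration; not part of MMPretrain.
    """
    samples = []
    with open(ann_file) as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            # Split on the last space so the label is always the final field.
            filename, gt_label = line.rsplit(' ', 1)
            samples.append((filename, int(gt_label)))
    return samples


# Write the two-line example to a temporary file and parse it.
with tempfile.TemporaryDirectory() as tmp_dir:
    ann_file = Path(tmp_dir) / 'image_list.txt'
    ann_file.write_text('000001.jpg 0\n000002.jpg 1\n')
    parsed = parse_filelist(ann_file)

print(parsed)  # [('000001.jpg', 0), ('000002.jpg', 1)]
```

Splitting on the last space rather than the first keeps the sketch tolerant of file names that themselves contain spaces.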
## 1. Create Dataset Class
We can create a new dataset in `mmpretrain/datasets/filelist.py` to load the data.
```python
import os.path as osp

from mmpretrain.registry import DATASETS
from .base_dataset import BaseDataset


@DATASETS.register_module()
class Filelist(BaseDataset):

    def load_data_list(self):
        assert isinstance(self.ann_file, str), '`ann_file` must be a str.'

        data_list = []
        with open(self.ann_file) as f:
            # Each non-empty line has the form "<filename> <label>".
            samples = [x.strip().split(' ') for x in f if x.strip()]
        for filename, gt_label in samples:
            # Prepend the image prefix to get the full image path.
            img_path = osp.join(self.img_prefix, filename)
            info = {'img_path': img_path, 'gt_label': int(gt_label)}
            data_list.append(info)
        return data_list
```
## 2. Add to the package
Then add the new dataset class to `mmpretrain/datasets/__init__.py`:
```python
from .base_dataset import BaseDataset
...
from .filelist import Filelist

__all__ = [
    'BaseDataset', ..., 'Filelist'
]
```
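Importing the module matters because the `register_module()` decorator only runs when the defining file is imported. The toy registry below sketches that idea in plain Python. `ToyRegistry`, `TOY_DATASETS` and `ToyFilelist` are hypothetical names, and this is not MMEngine's actual implementation.

```python
class ToyRegistry:
    """A toy stand-in for a registry; NOT MMEngine's implementation."""

    def __init__(self):
        self._modules = {}

    def register_module(self):
        # Return a decorator that records the class under its own name.
        def decorator(cls):
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def build(self, cfg):
        # Copy the config so the caller's dict is left untouched.
        cfg = dict(cfg)
        cls = self._modules[cfg.pop('type')]
        return cls(**cfg)


TOY_DATASETS = ToyRegistry()


@TOY_DATASETS.register_module()
class ToyFilelist:
    def __init__(self, ann_file):
        self.ann_file = ann_file


# The class can be looked up by name only because the decorator above ran.
dataset = TOY_DATASETS.build(
    dict(type='ToyFilelist', ann_file='image_list.txt'))
print(type(dataset).__name__, dataset.ann_file)  # ToyFilelist image_list.txt
```

If the class were never imported, the lookup by `type` name would fail, which is the situation the `__init__.py` import is meant to avoid.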
## 3. Modify Related Config
To use `Filelist`, modify the dataset settings in the config as follows:
```python
train_dataloader = dict(
    ...
    dataset=dict(
        type='Filelist',
        ann_file='image_list.txt',
        pipeline=train_pipeline,
    )
)
```
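A slightly fuller sketch of such a config is shown below. It assumes the dataset is configured through the `data_root` and `data_prefix` arguments of `BaseDataset`; the directory names, batch size and worker count are placeholders chosen for illustration, not values from this guide.

```python
train_pipeline = []  # placeholder; fill in your own transforms

train_dataloader = dict(
    batch_size=32,   # placeholder value
    num_workers=4,   # placeholder value
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type='Filelist',
        data_root='data/my_dataset',  # hypothetical dataset root
        ann_file='image_list.txt',    # looked up relative to data_root
        data_prefix='images',         # image folder under data_root
        pipeline=train_pipeline,
    ),
)
```

With this layout, the image paths built in `load_data_list` would be expected to point into `data/my_dataset/images/`.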
All dataset classes that inherit from [`BaseDataset`](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/base_dataset.py) support **lazy loading** and **memory saving**. For details, refer to the {external+mmengine:doc}`BaseDataset <advanced_tutorials/basedataset>` documentation.
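The idea behind lazy loading can be sketched with a toy class: annotation loading is deferred until the data is first needed. `ToyLazyDataset` is a hypothetical illustration written for this guide, not MMEngine's implementation, and it does not attempt to show the memory-saving feature.

```python
class ToyLazyDataset:
    """Toy illustration of lazy loading; NOT MMEngine's implementation."""

    def __init__(self, records, lazy_init=False):
        self._records = records
        self.data_list = []
        self._fully_initialized = False
        if not lazy_init:
            self.full_init()

    def load_data_list(self):
        # Stand-in for reading an annotation file.
        return [
            dict(img_path=name, gt_label=label)
            for name, label in self._records
        ]

    def full_init(self):
        # Load annotations only once, however often this is called.
        if self._fully_initialized:
            return
        self.data_list = self.load_data_list()
        self._fully_initialized = True

    def __len__(self):
        self.full_init()  # first real access triggers loading
        return len(self.data_list)


dataset = ToyLazyDataset([('000001.jpg', 0), ('000002.jpg', 1)],
                         lazy_init=True)
loaded_before = dataset._fully_initialized  # nothing loaded yet
num_samples = len(dataset)                  # triggers full_init()
loaded_after = dataset._fully_initialized
print(loaded_before, num_samples, loaded_after)  # False 2 True
```

Deferring the work this way keeps construction cheap when only the dataset's configuration or meta information is needed.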
```{note}
If a data sample dict contains `img_path` but not `img`, the `LoadImageFromFile` transform must be added to the pipeline.
```
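For a dataset like `Filelist`, which only provides `img_path`, a pipeline could start with the loading transform as sketched below. The transforms after `LoadImageFromFile` and their arguments are illustrative assumptions; adjust them to your task.

```python
train_pipeline = [
    # Reads the file at 'img_path' and stores the decoded image as 'img'.
    dict(type='LoadImageFromFile'),
    # Example augmentation; the scale value is a placeholder.
    dict(type='RandomResizedCrop', scale=224),
    # Packs the results into the format expected by the model.
    dict(type='PackInputs'),
]
```

Placing the loading transform first ensures that later transforms receive an actual image rather than just a path.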