# Tutorial 2: Adding New Dataset

## Customize datasets by reorganizing data

### Customize loading annotations

You can write a new dataset class inherited from `BaseFewShotDataset` and override `load_annotations(self)`,
as in [CUB](https://github.com/open-mmlab/mmfewshot/blob/main/mmfewshot/classification/datasets/cub.py) and [MiniImageNet](https://github.com/open-mmlab/mmfewshot/blob/main/mmfewshot/classification/datasets/mini_imagenet.py).
Typically, this function returns a list of dicts, where each dict carries the necessary information for one sample, e.g., `img` and `gt_label`.

Assume we are going to implement a `Filelist` dataset, which takes filelists for both training and testing. The format of the annotation list is as follows:

```
000001.jpg 0
000002.jpg 1
```

We can create a new dataset in `mmfewshot/classification/datasets/filelist.py` to load the data.

```python
import numpy as np

from mmcls.datasets.builder import DATASETS

from .base import BaseFewShotDataset


@DATASETS.register_module()
class Filelist(BaseFewShotDataset):

    def load_annotations(self):
        assert isinstance(self.ann_file, str)

        data_infos = []
        with open(self.ann_file) as f:
            # each line is "<filename> <label>"
            samples = [x.strip().split(' ') for x in f.readlines()]
            for filename, gt_label in samples:
                info = {'img_prefix': self.data_prefix}
                info['img_info'] = {'filename': filename}
                info['gt_label'] = np.array(gt_label, dtype=np.int64)
                data_infos.append(info)
        return data_infos
```
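
For the first line of the annotation file above, the resulting entry in `data_infos` would look like the following (assuming an illustrative `data_prefix='data/images/'`):

```python
{
    'img_prefix': 'data/images/',            # illustrative image root
    'img_info': {'filename': '000001.jpg'},
    'gt_label': np.array(0, dtype=np.int64)  # class index parsed from the list
}
```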

And add this dataset class in `mmfewshot/classification/datasets/__init__.py`:

```python
from .base import BaseFewShotDataset
...
from .filelist import Filelist

__all__ = [
    'BaseFewShotDataset', ..., 'Filelist'
]
```

Then, to use `Filelist` in the config, you can modify it as follows:

```python
train = dict(
    type='Filelist',
    ann_file='image_list.txt',
    pipeline=train_pipeline
)
```
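
Since `load_annotations` reads `self.data_prefix`, the image root is typically set in the same config entry (a sketch; the path is illustrative, and `data_prefix` is assumed to be a supported dataset argument as in mmcls):

```python
train = dict(
    type='Filelist',
    ann_file='image_list.txt',
    data_prefix='data/images/',  # assumed image root, joined with each filename
    pipeline=train_pipeline
)
```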

### Customize different subsets

To support different subsets, we first predefine the classes of each subset.
Then we override `get_classes` to return the classes of the selected subset.

```python
import numpy as np

from mmcls.datasets.builder import DATASETS

from .base import BaseFewShotDataset


@DATASETS.register_module()
class Filelist(BaseFewShotDataset):

    TRAIN_CLASSES = ['train_a', ...]
    VAL_CLASSES = ['val_a', ...]
    TEST_CLASSES = ['test_a', ...]

    def __init__(self, subset, *args, **kwargs):
        ...
        self.subset = subset
        super().__init__(*args, **kwargs)

    def get_classes(self):
        if self.subset == 'train':
            class_names = self.TRAIN_CLASSES
        ...
        return class_names
```
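
The subset can then be selected from the config (a sketch based on the `__init__` signature above; the annotation file name is illustrative):

```python
train = dict(
    type='Filelist',
    subset='train',  # selects TRAIN_CLASSES via get_classes
    ann_file='train_list.txt',
    pipeline=train_pipeline
)
```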

## Customize dataset sampling

### EpisodicDataset

We use `EpisodicDataset` as a wrapper to perform N-way K-shot sampling.
For example, suppose the original dataset is `Dataset_A`; the config looks like the following:

```python
dataset_A_train = dict(
    type='EpisodicDataset',
    num_episodes=100000,  # number of total episodes = length of dataset wrapper
    # each call of `__getitem__` will return
    # {'support_data': [(num_ways * num_shots) images],
    #  'query_data': [(num_ways * num_queries) images]}
    num_ways=5,  # number of ways (different classes)
    num_shots=5,  # number of support shots of each class
    num_queries=5,  # number of query shots of each class
    dataset=dict(  # this is the original config of Dataset_A
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
```
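
Built through the dataset builder, each index into the wrapper yields one episode (a sketch; it assumes `build_dataset` from `builder.py` is re-exported at package level):

```python
from mmfewshot.classification.datasets import build_dataset

dataset = build_dataset(dataset_A_train)
episode = dataset[0]
# with the config above:
# len(episode['support_data']) == num_ways * num_shots == 25
# len(episode['query_data']) == num_ways * num_queries == 25
```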

### Customize sampling logic

An example of customizing the data sampling logic for training:

#### Create a new dataset wrapper

We can create a new dataset wrapper in `mmfewshot/classification/datasets/dataset_wrappers.py` to customize the sampling logic.

```python
class MyDatasetWrapper:

    def __init__(self, dataset, args_a, args_b, ...):
        self.dataset = dataset
        ...
        self.episode_idxes = self.generate_episodic_idxes()

    def generate_episodic_idxes(self):
        episode_idxes = []
        # sample each episode
        for _ in range(self.num_episodes):
            episodic_a_idx, episodic_b_idx, episodic_c_idx = [], [], []
            # customized sampling logic:
            # select the indices of data_infos from the original dataset
            ...
            episode_idxes.append({
                'a': episodic_a_idx,
                'b': episodic_b_idx,
                'c': episodic_c_idx,
            })
        return episode_idxes

    def __len__(self):
        # the length of the wrapper is the total number of episodes
        return self.num_episodes

    def __getitem__(self, idx):
        # the keys can be arbitrary, but the forward function of the
        # model needs to accept matching argument names.
        return {
            'a_data': [self.dataset[i] for i in self.episode_idxes[idx]['a']],
            'b_data': [self.dataset[i] for i in self.episode_idxes[idx]['b']],
            'c_data': [self.dataset[i] for i in self.episode_idxes[idx]['c']]
        }
```

#### Update dataset builder

We need to add the build code in `mmfewshot/classification/datasets/builder.py` for our customized dataset wrapper.

```python
def build_dataset(cfg, default_args=None):
    if isinstance(cfg, (list, tuple)):
        dataset = ConcatDataset([build_dataset(c, default_args) for c in cfg])
    ...
    elif cfg['type'] == 'MyDatasetWrapper':
        dataset = MyDatasetWrapper(
            build_dataset(cfg['dataset'], default_args),
            # pass the customized arguments
            args_a=cfg['args_a'],
            args_b=cfg['args_b'],
            ...)
    else:
        dataset = build_from_cfg(cfg, DATASETS, default_args)
    return dataset
```

#### Update the arguments in model

The argument names in the `forward` function need to be consistent with the keys returned by the customized dataset wrapper.

```python
class MyClassifier(BaseFewShotClassifier):
    ...

    def forward(self, img=None, a_data=None, b_data=None, c_data=None,
                mode='train', **kwargs):
        # dispatch according to mode; the keyword arguments carry the
        # modified names produced by the dataset wrapper.
        if mode == 'train':
            return self.forward_train(
                a_data=a_data, b_data=b_data, c_data=c_data, **kwargs)
        elif mode == 'query':
            return self.forward_query(img=img, **kwargs)
        elif mode == 'support':
            return self.forward_support(img=img, **kwargs)
        elif mode == 'extract_feat':
            return self.extract_feat(img=img)
        else:
            raise ValueError(f'invalid mode: {mode}')
```
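
During training, the keys produced by `MyDatasetWrapper.__getitem__` arrive at `forward` as keyword arguments, so one training step roughly amounts to the following call (a sketch; `batch` stands for one collated episode):

```python
losses = model(mode='train',
               a_data=batch['a_data'],
               b_data=batch['b_data'],
               c_data=batch['c_data'])
```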

#### Using the customized dataset wrapper in config

Then, to use `MyDatasetWrapper` in the config, you can modify it as follows:

```python
dataset_A_train = dict(
    type='MyDatasetWrapper',
    args_a=None,
    args_b=None,
    dataset=dict(  # this is the original config of Dataset_A
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
```