# Tutorial 2: Customize Datasets
## Data configuration

`data` in the config file is the variable for data configuration; it defines the arguments used to build the datasets and dataloaders.
Here is an example of data configuration:
```python
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/training',
        ann_dir='annotations/training',
        pipeline=train_pipeline),
    val=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/validation',
        ann_dir='annotations/validation',
        pipeline=test_pipeline),
    test=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/validation',
        ann_dir='annotations/validation',
        pipeline=test_pipeline))
```

- `train`, `val` and `test`: The [`config`](https://github.com/open-mmlab/mmcv/blob/master/docs/en/understand_mmcv/config.md)s to build dataset instances for model training, validation and testing by using the [`build and registry`](https://github.com/open-mmlab/mmcv/blob/master/docs/en/understand_mmcv/registry.md) mechanism (see the sketch after this list).

- `samples_per_gpu`: How many samples per batch and per GPU to load during model training. The training `batch_size` equals `samples_per_gpu` times the number of GPUs, e.g. when using 8 GPUs for distributed data parallel training with `samples_per_gpu=4`, the `batch_size` is `8*4=16`. If you would like to define the `batch_size` for testing and validation, please use `test_dataloader` and `val_dataloader` with mmseg >= 0.24.1.

- `workers_per_gpu`: How many subprocesses per GPU to use for data loading. `0` means that the data will be loaded in the main process.
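
For reference, here is a rough sketch of how these keys are consumed, using the `build_dataset` and `build_dataloader` helpers from mmseg 0.x (the config path below is only a placeholder):

```python
from mmcv import Config
from mmseg.datasets import build_dataset, build_dataloader

# Load a config file that contains a `data` dict like the one above
# (the path here is a placeholder).
cfg = Config.fromfile('configs/my_config.py')

# `build_dataset` looks up `type` (e.g. 'ADE20KDataset') in the dataset
# registry and instantiates it with the remaining keys.
dataset = build_dataset(cfg.data.train)

# `build_dataloader` wraps the dataset with the batch and worker settings.
loader = build_dataloader(
    dataset,
    samples_per_gpu=cfg.data.samples_per_gpu,
    workers_per_gpu=cfg.data.workers_per_gpu,
    dist=False,
    shuffle=True)
```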

**Note:** `samples_per_gpu` only works for model training; for model testing and validation, mmseg defaults `samples_per_gpu` to 1 (batch inference is not supported yet).

**Note:** before v0.24.1, any key in `data` other than `train`, `val`, `test`, `samples_per_gpu` and `workers_per_gpu` had to be an input keyword argument of the PyTorch `DataLoader`, and the dataloaders used for model training, validation and testing shared the same input arguments. Since v0.24.1, mmseg supports `train_dataloader`, `test_dataloader` and `val_dataloader` to specify different keyword arguments; the overall arguments definition is still supported, but the specific dataloader settings have higher priority.

Here is an example of specific dataloader settings:
```python
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    shuffle=True,
    train=dict(type='xxx', ...),
    val=dict(type='xxx', ...),
    test=dict(type='xxx', ...),
    # Use a different batch size during validation and testing.
    val_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False),
    test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False))
```

Assume only one GPU is used for model training and testing. Because the overall arguments definition has lower priority, the `batch_size` for training is `4` and the training dataset is shuffled, while the `batch_size` for testing and validation is `1` and those datasets are not shuffled.

To make the data configuration clearer, we recommend using specific dataloader settings instead of the overall dataloader setting after v0.24.1, like this:
```python
data = dict(
    train=dict(type='xxx', ...),
    val=dict(type='xxx', ...),
    test=dict(type='xxx', ...),
    # Use specific dataloader settings.
    train_dataloader=dict(samples_per_gpu=4, workers_per_gpu=4, shuffle=True),
    val_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False),
    test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False))
```

**Note:** during model training, the default dataloader values in mmseg are `shuffle=True` and `drop_last=True`; during model validation and testing, the defaults are `shuffle=False` and `drop_last=False`.
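
These defaults can be overridden through the specific dataloader settings as well. A minimal sketch, assuming mmseg >= 0.24.1 forwards such keyword arguments to the underlying PyTorch `DataLoader`:

```python
data = dict(
    train=dict(type='xxx', ...),
    # Keep the last incomplete training batch; this assumes `drop_last` is
    # forwarded to the PyTorch DataLoader, overriding the training default.
    train_dataloader=dict(
        samples_per_gpu=4, workers_per_gpu=4, shuffle=True, drop_last=False))
```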
## Customize datasets by reorganizing data
The simplest way is to convert your dataset so that its data is organized into the expected folder structure, as sketched below.
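
For example, the ADE20K configuration shown earlier implies a layout like the following (a sketch reconstructed from the `data_root`, `img_dir` and `ann_dir` values above):

```none
data/ade/ADEChallengeData2016
├── images
│   ├── training
│   └── validation
└── annotations
    ├── training
    └── validation
```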