## Test a model

- single GPU
- CPU
- single node multiple GPU
- multiple node

You can use the following commands to run inference on a dataset.

```shell
# single-gpu
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

# CPU: disable GPUs and run single-gpu testing script
export CUDA_VISIBLE_DEVICES=-1
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

# multi-gpu
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [optional arguments]

# multi-node in slurm environment
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments] --launcher slurm
```


Examples:

For classification, test Baseline on CUB under the 5-way 1-shot setting.

```shell
python ./tools/classification/test.py \
  configs/classification/baseline/cub/baseline_conv4_1xb64_cub_5way-1shot.py \
  checkpoints/SOME_CHECKPOINT.pth
```

For detection, test TFA on VOC split1 under the 1-shot setting.

```shell
python ./tools/detection/test.py \
  configs/detection/tfa/voc/split1/tfa_r101_fpn_voc-split1_1shot-fine-tuning.py \
  checkpoints/SOME_CHECKPOINT.pth --eval mAP
```

## Train a model

### Train with a single GPU

```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```

To specify the working directory on the command line, add the argument `--work-dir ${YOUR_WORK_DIR}`.

### Train on CPU

Training on CPU follows the same process as single-GPU training. You only need to disable GPUs before starting the training process.

```shell
export CUDA_VISIBLE_DEVICES=-1
python tools/train.py ${CONFIG_FILE} [optional arguments]
```

**Note**:

We do not recommend training on CPU because it is too slow. This feature is supported so that users can debug on machines without a GPU.


### Train with multiple GPUs

```shell
./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```

Optional arguments are:

- `--no-validate` (**not recommended**): By default, the codebase performs evaluation during training. To disable this behavior, use `--no-validate`.
- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
- `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file.

Difference between `resume-from` and `load-from`:

- `resume-from` loads both the model weights and the optimizer status, and the epoch is also inherited from the specified checkpoint. It is usually used to resume a training process that was interrupted accidentally.
- `load-from` loads only the model weights, and the training epoch starts from 0. It is usually used for fine-tuning.
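The difference can be sketched with a toy checkpoint. This is a simplified illustration and not the actual runner code: the layout of the `checkpoint` dictionary and the helper names `load_from` and `resume_from` are hypothetical.

```python
# Toy illustration only: the checkpoint layout and helper names are
# hypothetical and do not mirror the real runner implementation.
import copy

checkpoint = {
    "state_dict": {"backbone.weight": [0.1, 0.2]},  # model weights
    "optimizer": {"lr": 0.01, "momentum": 0.9},     # optimizer status
    "meta": {"epoch": 12},                          # training progress
}


def load_from(ckpt):
    """Take only the weights; the optimizer is fresh and the epoch restarts at 0."""
    return {
        "state_dict": copy.deepcopy(ckpt["state_dict"]),
        "optimizer": None,
        "epoch": 0,
    }


def resume_from(ckpt):
    """Take the weights, the optimizer status, and the stored epoch."""
    return {
        "state_dict": copy.deepcopy(ckpt["state_dict"]),
        "optimizer": copy.deepcopy(ckpt["optimizer"]),
        "epoch": ckpt["meta"]["epoch"],
    }


finetune_state = load_from(checkpoint)
resumed_state = resume_from(checkpoint)
print(finetune_state["epoch"], resumed_state["epoch"])  # prints: 0 12
```

In both cases the model weights are identical; only the optimizer status and the epoch counter differ.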

### Train with multiple machines

If you run MMFewShot on a cluster managed with [Slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single-machine training.)

```shell
[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}
```

You can check [slurm_train.sh](https://github.com/open-mmlab/mmclassification/blob/master/tools/slurm_train.sh) for full arguments and environment variables.

If you only have multiple machines connected via Ethernet, you can refer to the
PyTorch [launch utility](https://pytorch.org/docs/stable/distributed_deprecated.html#launch-utility).
Training is usually slow without high-speed networking such as InfiniBand.

### Launch multiple jobs on a single machine

If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
you need to specify a different port (29500 by default) for each job to avoid communication conflicts.

If you use `dist_train.sh` to launch training jobs, you can set the port in the command.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
```

If you launch training jobs with Slurm, you need to modify the config files (usually the 6th line from the bottom) to set different communication ports.

In `config1.py`,

```python
dist_params = dict(backend='nccl', port=29500)
```

In `config2.py`,

```python
dist_params = dict(backend='nccl', port=29501)
```

Then you can launch two jobs with `config1.py` and `config2.py`.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
```