# Training and Test ## Training ### Training with your PC You can use `tools/train.py` to train a model on a single machine with a CPU and optionally a GPU. Here is the full usage of the script: ```shell python tools/train.py ${CONFIG_FILE} [ARGS] ``` ````{note} By default, MMClassification prefers GPU to CPU. If you want to train a model on CPU, please empty `CUDA_VISIBLE_DEVICES` or set it to -1 to make GPU invisible to the program. ```bash CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [ARGS] ``` ```` | ARGS | Description | | ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `CONFIG_FILE` | The path to the config file. | | `--work-dir WORK_DIR` | The target folder to save logs and checkpoints. Defaults to a folder with the same name of the config file under `./work_dirs`. | | `--resume [RESUME]` | Resume training. If specify a path, resume from it, while if not specify, try to auto resume from the latest checkpoint. | | `--amp` | Enable automatic-mixed-precision training. | | `--no-validate` | **Not suggested**. Disable checkpoint evaluation during training. | | `--auto-scale-lr` | Auto scale the learning rate according to the actual batch size and the original batch size. | | `--cfg-options CFG_OPTIONS` | Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form of either `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no white space is allowed. | | `--launcher {none,pytorch,slurm,mpi}` | Options for job launcher. | ### Training with multiple GPUs We provide a shell script to start a multi-GPUs task with `torch.distributed.launch`. ```shell bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS] ``` | ARGS | Description | | ------------- | ------------------------------------------------------------------------------------- | | `CONFIG_FILE` | The path to the config file. | | `GPU_NUM` | The number of GPUs to be used. | | `[PY_ARGS]` | The other optional arguments of `tools/train.py`, see [here](#training-with-your-pc). | You can also specify extra arguments of the launcher by environment variables. For example, change the communication port of the launcher to 29666 by the below command: ```shell PORT=29666 bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS] ``` If you want to startup multiple training jobs and use different GPUs, you can launch them by specifying different ports and visible devices. ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_train.sh ${CONFIG_FILE1} 4 [PY_ARGS] CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_train.sh ${CONFIG_FILE2} 4 [PY_ARGS] ``` ### Training with multiple machines #### Multiple machines in the same network If you launch a training job with multiple machines connected with ethernet, you can run the following commands: On the first machine: ```shell NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS ``` On the second machine: ```shell NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS ``` Comparing with multi-GPUs in a single machine, you need to specify some extra environment variables: | ENV_VARS | Description | | ------------- | ---------------------------------------------------------------------------- | | `NNODES` | The total number of machines. | | `NODE_RANK` | The index of the local machine. | | `PORT` | The communication port, it should be the same in all machines. | | `MASTER_ADDR` | The IP address of the master machine, it should be the same in all machines. | Usually it is slow if you do not have high speed networking like InfiniBand. #### Multiple machines managed with slurm If you run MMClassification on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `tools/slurm_train.sh`. ```shell [ENV_VARS] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS] ``` Here are the arguments description of the script. | ARGS | Description | | ------------- | ------------------------------------------------------------------------------------- | | `PARTITION` | The partition to use in your cluster. | | `JOB_NAME` | The name of your job, you can name it as you like. | | `CONFIG_FILE` | The path to the config file. | | `WORK_DIR` | The target folder to save logs and checkpoints. | | `[PY_ARGS]` | The other optional arguments of `tools/train.py`, see [here](#training-with-your-pc). | Here are the environment variables can be used to configure the slurm job. | ENV_VARS | Description | | --------------- | ---------------------------------------------------------------------------------------------------------- | | `GPUS` | The number of GPUs to be used. Defaults to 8. | | `GPUS_PER_NODE` | The number of GPUs to be allocated per node.. | | `CPUS_PER_TASK` | The number of CPUs to be allocated per task (Usually one GPU corresponds to one task). Defaults to 5. | | `SRUN_ARGS` | The other arguments of `srun`. Available options can be found [here](https://slurm.schedmd.com/srun.html). | ## Test ### Test with your PC You can use `tools/test.py` to test a model on a single machine with a CPU and optionally a GPU. Here is the full usage of the script: ```shell python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS] ``` ````{note} By default, MMClassification prefers GPU to CPU. If you want to test a model on CPU, please empty `CUDA_VISIBLE_DEVICES` or set it to -1 to make GPU invisible to the program. ```bash CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS] ``` ```` | ARGS | Description | | ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `CONFIG_FILE` | The path to the config file. | | `CHECKPOINT_FILE` | The path to the checkpoint file (It can be a http link, and you can find checkpoints [here](https://mmclassification.readthedocs.io/en/1.x/modelzoo_statistics.html)). | | `--work-dir WORK_DIR` | The directory to save the file containing evaluation metrics. | | `--out OUT` | The path to save the file containing evaluation metrics. | | `--dump DUMP` | The path to dump all outputs of the model for offline evaluation. | | `--cfg-options CFG_OPTIONS` | Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form of either `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no white space is allowed. | | `--show-dir SHOW_DIR` | The directory to save the result visualization images. | | `--show` | Visualize the prediction result in a window. | | `--interval INTERVAL` | The interval of samples to visualize. | | `--wait-time WAIT_TIME` | The display time of every window (in seconds). Defaults to 1. | | `--launcher {none,pytorch,slurm,mpi}` | Options for job launcher. | ### Test with multiple GPUs We provide a shell script to start a multi-GPUs task with `torch.distributed.launch`. ```shell bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS] ``` | ARGS | Description | | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `CONFIG_FILE` | The path to the config file. | | `CHECKPOINT_FILE` | The path to the checkpoint file (It can be a http link, and you can find checkpoints [here](https://mmclassification.readthedocs.io/en/1.x/modelzoo_statistics.html)). | | `GPU_NUM` | The number of GPUs to be used. | | `[PY_ARGS]` | The other optional arguments of `tools/test.py`, see [here](#test-with-your-pc). | You can also specify extra arguments of the launcher by environment variables. For example, change the communication port of the launcher to 29666 by the below command: ```shell PORT=29666 bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS] ``` If you want to startup multiple test jobs and use different GPUs, you can launch them by specifying different port and visible devices. ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_test.sh ${CONFIG_FILE1} ${CHECKPOINT_FILE} 4 [PY_ARGS] CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_test.sh ${CONFIG_FILE2} ${CHECKPOINT_FILE} 4 [PY_ARGS] ``` ### Test with multiple machines #### Multiple machines in the same network If you launch a test job with multiple machines connected with ethernet, you can run the following commands: On the first machine: ```shell NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS ``` On the second machine: ```shell NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS ``` Comparing with multi-GPUs in a single machine, you need to specify some extra environment variables: | ENV_VARS | Description | | ------------- | ---------------------------------------------------------------------------- | | `NNODES` | The total number of machines. | | `NODE_RANK` | The index of the local machine. | | `PORT` | The communication port, it should be the same in all machines. | | `MASTER_ADDR` | The IP address of the master machine, it should be the same in all machines. | Usually it is slow if you do not have high speed networking like InfiniBand. #### Multiple machines managed with slurm If you run MMClassification on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `tools/slurm_test.sh`. ```shell [ENV_VARS] ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS] ``` Here are the arguments description of the script. | ARGS | Description | | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `PARTITION` | The partition to use in your cluster. | | `JOB_NAME` | The name of your job, you can name it as you like. | | `CONFIG_FILE` | The path to the config file. | | `CHECKPOINT_FILE` | The path to the checkpoint file (It can be a http link, and you can find checkpoints [here](https://mmclassification.readthedocs.io/en/1.x/modelzoo_statistics.html)). | | `[PY_ARGS]` | The other optional arguments of `tools/test.py`, see [here](#test-with-your-pc). | Here are the environment variables can be used to configure the slurm job. | ENV_VARS | Description | | --------------- | ---------------------------------------------------------------------------------------------------------- | | `GPUS` | The number of GPUs to be used. Defaults to 8. | | `GPUS_PER_NODE` | The number of GPUs to be allocated per node. | | `CPUS_PER_TASK` | The number of CPUs to be allocated per task (Usually one GPU corresponds to one task). Defaults to 5. | | `SRUN_ARGS` | The other arguments of `srun`. Available options can be found [here](https://slurm.schedmd.com/srun.html). |