# C++ FastReID-TensorRT

An implementation of reid models that uses the TensorRT network definition APIs to build the whole network, so we don't use any parsers here.

### How to Run

1. Generate the `.wts` file from pytorch with `model_best.pth`

    See [How_to_Generate.md](tools/How_to_Generate.md)

2. Config your model

    See [Tensorrt Model Config](#ConfigSection)

3. (Optional) Build the `third party` libs

    See [Build third_party section](#third_party)

4. Build the `fastrt` executable

    ```
    mkdir build
    cd build
    cmake -DBUILD_FASTRT_ENGINE=ON \
          -DBUILD_DEMO=ON \
          -DUSE_CNUMPY=ON ..
    make
    ```

5. <a name="step5"></a>Run `fastrt`

    Put `model_best.wts` into `FastRT/`, then:

    ```
    ./demo/fastrt -s  // serialize model & save as 'xxx.engine' file
    ```

    ```
    ./demo/fastrt -d  // deserialize 'xxx.engine' file and run inference
    ```

    (A minimal sketch of what the `-d` path does with the TensorRT API follows this list.)

6. Verify the output against pytorch

7. (Optional) Once you have verified the result, you can enable FP16 for a speed-up

    ```
    mkdir build
    cd build
    cmake -DBUILD_FASTRT_ENGINE=ON \
          -DBUILD_DEMO=ON \
          -DBUILD_FP16=ON ..
    make
    ```

    then go to [step 5](#step5)

8. (Optional) You can use INT8 quantization for a speed-up

    Prepare a calibration dataset and set its path via cmake (the path must end with `/`):

    ```
    mkdir build
    cd build
    cmake -DBUILD_FASTRT_ENGINE=ON \
          -DBUILD_DEMO=ON \
          -DBUILD_INT8=ON \
          -DINT8_CALIBRATE_DATASET_PATH="/data/Market-1501-v15.09.15/bounding_box_test/" ..
    make
    ```

    then go to [step 5](#step5)

9. (Optional) Build the tensorrt model as shared libs

    ```
    mkdir build
    cd build
    cmake -DBUILD_FASTRT_ENGINE=ON \
          -DBUILD_DEMO=OFF \
          -DBUILD_FP16=ON ..
    make
    make install
    ```

    You should find the libs in `FastRT/libs/FastRTEngine/`. Now build your application executable:

    ```
    cmake -DBUILD_FASTRT_ENGINE=OFF -DBUILD_DEMO=ON ..
    make
    ```

    then go to [step 5](#step5)

10. (Optional) Build the tensorrt model with the python interface, so that you can use the FastRT model in python.

    ```
    mkdir build
    cd build
    cmake -DBUILD_FASTRT_ENGINE=ON \
          -DBUILD_DEMO=ON \
          -DBUILD_PYTHON_INTERFACE=ON ..
    make
    ```

    You should get a `.so` file, `FastRT/build/pybind_interface/ReID.cpython-37m-x86_64-linux-gnu.so`. Then go to [step 5](#step5) to create the engine file. After that you can import this `.so` file in python and deserialize the engine file to run inference in python. You can find usage examples in `pybind_interface/test.py` and `pybind_interface/market_benchmark.py`.

    ```
    import numpy as np

    # PATH_TO_SO_FILE, GPU_ID, PATH_TO_YOUR_ENGINEFILE and CV2_FRAME are placeholders
    from PATH_TO_SO_FILE import ReID

    model = ReID(GPU_ID)
    model.build(PATH_TO_YOUR_ENGINEFILE)
    numpy_feature = np.array([model.infer(CV2_FRAME)])
    ```

    * `pybind_interface/test.py` uses `pybind_interface/docker/trt7cu100/Dockerfile` (without pytorch installed)
    * `pybind_interface/market_benchmark.py` uses `pybind_interface/docker/trt7cu102_torch160/Dockerfile` (with pytorch installed)
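For orientation, the deserialize-and-infer path behind `./demo/fastrt -d` in step 5 follows the standard TensorRT 7 C++ pattern. Below is a minimal sketch, not the demo's actual code: the logger, the engine file name, and the omitted buffer setup are illustrative.

```
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>
#include "NvInfer.h"

// TensorRT requires a logger; this one prints warnings and errors only
class Logger : public nvinfer1::ILogger {
public:
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} gLogger;

int main() {
    // read the engine file produced by `./demo/fastrt -s` (error handling omitted)
    std::ifstream file("./sbs_R50-ibn.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    // deserialize the engine and create an execution context
    // (the third argument, IPluginFactory*, is the deprecated TRT 7 form)
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    // ... allocate input/output GPU buffers, copy the normalized image in,
    // then run: context->enqueue(batchSize, buffers, stream, nullptr);

    context->destroy();
    engine->destroy();
    runtime->destroy();
    return 0;
}
```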
### `Tensorrt Model Config` <a name="ConfigSection"></a>

Edit `FastRT/demo/inference.cpp` according to your model config. The config corresponds to [How_to_Generate.md](tools/How_to_Generate.md).

+ Ex1. `sbs_R50-ibn`

  ```
  static const std::string WEIGHTS_PATH = "../sbs_R50-ibn.wts";
  static const std::string ENGINE_PATH = "./sbs_R50-ibn.engine";

  static const int MAX_BATCH_SIZE = 4;
  static const int INPUT_H = 384;
  static const int INPUT_W = 128;
  static const int OUTPUT_SIZE = 2048;
  static const int DEVICE_ID = 0;

  static const FastreidBackboneType BACKBONE = FastreidBackboneType::r50;
  static const FastreidHeadType HEAD = FastreidHeadType::EmbeddingHead;
  static const FastreidPoolingType HEAD_POOLING = FastreidPoolingType::gempoolP;
  static const int LAST_STRIDE = 1;
  static const bool WITH_IBNA = true;
  static const bool WITH_NL = true;
  static const int EMBEDDING_DIM = 0;
  ```

+ Ex2. `sbs_R50`

  ```
  static const std::string WEIGHTS_PATH = "../sbs_R50.wts";
  static const std::string ENGINE_PATH = "./sbs_R50.engine";

  static const int MAX_BATCH_SIZE = 4;
  static const int INPUT_H = 384;
  static const int INPUT_W = 128;
  static const int OUTPUT_SIZE = 2048;
  static const int DEVICE_ID = 0;

  static const FastreidBackboneType BACKBONE = FastreidBackboneType::r50;
  static const FastreidHeadType HEAD = FastreidHeadType::EmbeddingHead;
  static const FastreidPoolingType HEAD_POOLING = FastreidPoolingType::gempoolP;
  static const int LAST_STRIDE = 1;
  static const bool WITH_IBNA = false;
  static const bool WITH_NL = true;
  static const int EMBEDDING_DIM = 0;
  ```

+ Ex3. `sbs_r34_distill`

  ```
  static const std::string WEIGHTS_PATH = "../sbs_r34_distill.wts";
  static const std::string ENGINE_PATH = "./sbs_r34_distill.engine";

  static const int MAX_BATCH_SIZE = 4;
  static const int INPUT_H = 384;
  static const int INPUT_W = 128;
  static const int OUTPUT_SIZE = 512;
  static const int DEVICE_ID = 0;

  static const FastreidBackboneType BACKBONE = FastreidBackboneType::r34_distill;
  static const FastreidHeadType HEAD = FastreidHeadType::EmbeddingHead;
  static const FastreidPoolingType HEAD_POOLING = FastreidPoolingType::gempoolP;
  static const int LAST_STRIDE = 1;
  static const bool WITH_IBNA = false;
  static const bool WITH_NL = false;
  static const int EMBEDDING_DIM = 0;
  ```

+ Ex4. `kd-r34-r101_ibn`

  ```
  static const std::string WEIGHTS_PATH = "../kd_r34_distill.wts";
  static const std::string ENGINE_PATH = "./kd_r34_distill.engine";

  static const int MAX_BATCH_SIZE = 4;
  static const int INPUT_H = 384;
  static const int INPUT_W = 128;
  static const int OUTPUT_SIZE = 512;
  static const int DEVICE_ID = 0;

  static const FastreidBackboneType BACKBONE = FastreidBackboneType::r34_distill;
  static const FastreidHeadType HEAD = FastreidHeadType::EmbeddingHead;
  static const FastreidPoolingType HEAD_POOLING = FastreidPoolingType::gempoolP;
  static const int LAST_STRIDE = 1;
  static const bool WITH_IBNA = false;
  static const bool WITH_NL = false;
  static const int EMBEDDING_DIM = 0;
  ```

+ Ex5. `kd-r18-r101_ibn`

  ```
  static const std::string WEIGHTS_PATH = "../kd-r18-r101_ibn.wts";
  static const std::string ENGINE_PATH = "./kd_r18_distill.engine";

  static const int MAX_BATCH_SIZE = 16;
  static const int INPUT_H = 384;
  static const int INPUT_W = 128;
  static const int OUTPUT_SIZE = 512;
  static const int DEVICE_ID = 1;

  static const FastreidBackboneType BACKBONE = FastreidBackboneType::r18_distill;
  static const FastreidHeadType HEAD = FastreidHeadType::EmbeddingHead;
  static const FastreidPoolingType HEAD_POOLING = FastreidPoolingType::gempoolP;
  static const int LAST_STRIDE = 1;
  static const bool WITH_IBNA = true;
  static const bool WITH_NL = false;
  static const int EMBEDDING_DIM = 0;
  ```

### Supported conversion

* Backbone: resnet50, resnet34, distill-resnet50, distill-resnet34, distill-resnet18
* Heads: embedding_head
* Plugin layers: ibn, non-local
* Pooling layers: maxpool, avgpool, GeneralizedMeanPooling, GeneralizedMeanPoolingP (see the note below)
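For reference, GeneralizedMeanPooling computes a generalized mean over the spatial positions $\mathcal{X}$ of each channel, and GeneralizedMeanPoolingP (`gempoolP` in the configs above) is the variant in which the exponent $p$ is a learned parameter:

$$f = \Big(\frac{1}{|\mathcal{X}|}\sum_{x \in \mathcal{X}} x^{p}\Big)^{1/p}$$

Setting $p = 1$ recovers average pooling, while $p \to \infty$ approaches max pooling.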
### Benchmark

| Model | Engine | Batch size | Image size | Embedding dim | Time |
|:-:|:-:|:-:|:-:|:-:|:-:|
| Vanilla R34 | Python/Pytorch1.6 fp32 | 1 | 256x128 | 512 | 6.49ms |
| Vanilla R34 | Python/Pytorch1.6 fp32 | 4 | 256x128 | 512 | 7.16ms |
| Vanilla R34 | C++/trt7 fp32 | 1 | 256x128 | 512 | 2.34ms |
| Vanilla R34 | C++/trt7 fp32 | 4 | 256x128 | 512 | 3.99ms |
| Vanilla R34 | C++/trt7 fp16 | 1 | 256x128 | 512 | 1.83ms |
| Vanilla R34 | C++/trt7 fp16 | 4 | 256x128 | 512 | 2.38ms |
| Distill R34 | Python/Pytorch1.6 fp32 | 1 | 256x128 | 512 | 5.68ms |
| Distill R34 | Python/Pytorch1.6 fp32 | 4 | 256x128 | 512 | 6.26ms |
| Distill R34 | C++/trt7 fp32 | 1 | 256x128 | 512 | 2.36ms |
| Distill R34 | C++/trt7 fp32 | 4 | 256x128 | 512 | 4.05ms |
| Distill R34 | C++/trt7 fp16 | 1 | 256x128 | 512 | 1.86ms |
| Distill R34 | C++/trt7 fp16 | 4 | 256x128 | 512 | 2.68ms |
| R50-NL-IBN | Python/Pytorch1.6 fp32 | 1 | 256x128 | 2048 | 14.86ms |
| R50-NL-IBN | Python/Pytorch1.6 fp32 | 4 | 256x128 | 2048 | 15.14ms |
| R50-NL-IBN | C++/trt7 fp32 | 1 | 256x128 | 2048 | 4.67ms |
| R50-NL-IBN | C++/trt7 fp32 | 4 | 256x128 | 2048 | 6.15ms |
| R50-NL-IBN | C++/trt7 fp16 | 1 | 256x128 | 2048 | 2.87ms |
| R50-NL-IBN | C++/trt7 fp16 | 4 | 256x128 | 2048 | 3.81ms |

* Time: preprocessing (normalization) + inference, averaged over 100 runs
* GPU: RTX 2080 Ti

### Test Environment

1. fastreid v1.0.0 / 2080TI / Ubuntu18.04 / Nvidia driver 435 / cuda10.0 / cudnn7.6.5 / trt7.0.0 / nvinfer7.0.0 / opencv3.2
2. fastreid v1.0.0 / 2080TI / Ubuntu18.04 / Nvidia driver 450 / cuda10.2 / cudnn7.6.5 / trt7.0.0 / nvinfer7.0.0 / opencv3.2

### Installation

* Set up with Docker

  for cuda10.0

  ```
  cd docker/trt7cu100
  sudo docker build -t trt7:cuda100 .
  sudo docker run --gpus all -it --name fastrt -v /home/YOURID/workspace:/workspace -d trt7:cuda100
  // then put the repo into /home/YOURID/workspace/ before you enter the container
  ```

  for cuda10.2

  ```
  cd docker/trt7cu102
  sudo docker build -t trt7:cuda102 .
  sudo docker run --gpus all -it --name fastrt -v /home/YOURID/workspace:/workspace -d trt7:cuda102
  // then put the repo into /home/YOURID/workspace/ before you enter the container
  ```

* [Installation reference](https://github.com/wang-xinyu/tensorrtx/blob/master/tutorials/install.md)

### Build third party <a name="third_party"></a>

* for read/write numpy (a usage sketch follows below)

  ```
  cd third_party/cnpy
  cmake -DCMAKE_INSTALL_PREFIX=../../libs/cnpy -DENABLE_STATIC=OFF . && make -j4 && make install
  ```
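When built with `-DUSE_CNUMPY=ON` (step 4), cnpy can also be used from your own code to dump the output embedding as a `.npy` file for the pytorch comparison in step 6. A minimal sketch; the 2048-d feature size and the file name `feat.npy` are illustrative, not taken from the demo:

```
#include <iostream>
#include <vector>
#include "cnpy.h"

int main() {
    // stand-in for the embedding read back from the engine's output buffer
    std::vector<float> feat(2048, 0.0f);

    // save as a 1 x 2048 float32 array; mode "w" overwrites an existing file
    cnpy::npy_save("feat.npy", feat.data(), {1, feat.size()}, "w");

    // load it back to check the round trip
    cnpy::NpyArray arr = cnpy::npy_load("feat.npy");
    const float* data = arr.data<float>();
    std::cout << arr.shape[0] << " x " << arr.shape[1]
              << ", first value: " << data[0] << std::endl;
    return 0;
}
```

On the python side, `np.load("feat.npy")` returns the same array, so you can compare it (e.g. via cosine similarity) against the feature produced by the original pytorch model.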