diff --git a/docs/en/advanced_tutorials/distributed.md b/docs/en/advanced_tutorials/distributed.md
index 46078706..c620bfad 100644
--- a/docs/en/advanced_tutorials/distributed.md
+++ b/docs/en/advanced_tutorials/distributed.md
@@ -1,3 +1,57 @@
 # Distribution communication
 
-Coming soon. Please refer to [chinese documentation](https://mmengine.readthedocs.io/zh_CN/latest/advanced_tutorials/data_element.html).
+In distributed training, different processes sometimes need to apply different logic depending on their rank, local rank, etc.
+They also need to communicate with each other and synchronize data.
+These demands rely on distributed communication.
+PyTorch provides a set of basic distributed communication primitives.
+On top of these primitives, MMEngine provides higher-level APIs to meet more diverse demands.
+Using the APIs provided by MMEngine, modules can:
+
+- ignore the differences between distributed and non-distributed environments
+- transfer data of various types apart from Tensor
+- ignore the frameworks or backends used for communication
+
+These APIs are roughly categorized into 3 types:
+
+- Initialization: `init_dist` for setting up the distributed environment for the runner
+- Query & control: functions such as `get_world_size` for querying `world_size`, `rank`, and other distributed information
+- Collective communication: collective communication functions such as `all_reduce`
+
+We will detail these APIs in the following sections.
+
+## Initialization
+
+- [init_dist](mmengine.dist.init_dist): Launch function for distributed training. It currently supports 3 launchers: pytorch, slurm, and MPI. It also sets up the given communication backend, which defaults to NCCL.
+
+## Query and control
+
+The query and control functions all take no arguments.
+They can be used in both distributed and non-distributed environments.
+Their functionalities are listed below:
+
+- [get_world_size](mmengine.dist.get_world_size): Returns the number of processes in the current process group. Returns 1 when non-distributed
+- [get_rank](mmengine.dist.get_rank): Returns the global rank of the current process in the current process group. Returns 0 when non-distributed
+- [get_backend](mmengine.dist.get_backend): Returns the communication backend used by the current process group. Returns `None` when non-distributed
+- [get_local_rank](mmengine.dist.get_local_rank): Returns the local rank of the current process in the current process group. Returns 0 when non-distributed
+- [get_local_size](mmengine.dist.get_local_size): Returns the number of processes that are both in the current process group and on the same machine as the current process. Returns 1 when non-distributed
+- [get_dist_info](mmengine.dist.get_dist_info): Returns the world_size and rank of the current process group. Returns world_size = 1 and rank = 0 when non-distributed
+- [is_main_process](mmengine.dist.is_main_process): Returns `True` if the current process is rank 0 in the current process group, otherwise `False`. Always returns `True` when non-distributed
+- [master_only](mmengine.dist.master_only): A function decorator. Functions decorated with `master_only` only execute on the rank 0 process
+- [barrier](mmengine.dist.barrier): A synchronization primitive. Every process blocks until all processes in the current process group reach the same barrier location
+
+## Collective communication
+
+Collective communication functions are used for data transfer between processes in the same process group.
+We provide the following APIs based on PyTorch native functions such as all_reduce, all_gather, gather, and broadcast.
+These APIs are compatible with non-distributed environments and support more data types than just Tensor.
+
+- [all_reduce](mmengine.dist.all_reduce): AllReduce operation on Tensors in the current process group
+- [all_gather](mmengine.dist.all_gather): AllGather operation on Tensors in the current process group
+- [gather](mmengine.dist.gather): Gather Tensors in the current process group to a destination rank
+- [broadcast](mmengine.dist.broadcast): Broadcast a Tensor to all processes in the current process group
+- [sync_random_seed](mmengine.dist.sync_random_seed): Synchronize the random seed between processes in the current process group
+- [broadcast_object_list](mmengine.dist.broadcast_object_list): Broadcast a list of Python objects. The objects must be serializable by Pickle
+- [all_reduce_dict](mmengine.dist.all_reduce_dict): AllReduce operation on a dict. It is based on broadcast and all_reduce
+- [all_gather_object](mmengine.dist.all_gather_object): AllGather operation on any Python object that can be serialized by Pickle. It is based on all_gather
+- [gather_object](mmengine.dist.gather_object): Gather Python objects that can be serialized by Pickle
+- [collect_results](mmengine.dist.collect_results): Unified API for collecting a list of data from the current process group. It supports both CPU and GPU communication
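+
+A minimal usage sketch combining the query and collective APIs above (assuming the distributed environment, if any, has already been launched via `init_dist`; in a non-distributed run every call simply falls back to the single-process behavior described above):
+
+```python
+import torch
+
+from mmengine.dist import (all_gather_object, all_reduce, get_dist_info,
+                           is_main_process)
+
+# rank = 0 and world_size = 1 when running without a distributed launch.
+rank, world_size = get_dist_info()
+
+# Sum a tensor over all processes in place (a no-op for a single process).
+loss = torch.tensor([float(rank)])
+all_reduce(loss)
+
+# Gather an arbitrary picklable object from every rank into a list.
+all_meta = all_gather_object({'rank': rank})
+
+# True on rank 0, and always True when non-distributed.
+if is_main_process():
+    print(f'world_size={world_size}, loss={loss.item()}, meta={all_meta}')
+```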
diff --git a/docs/zh_cn/advanced_tutorials/distributed.md b/docs/zh_cn/advanced_tutorials/distributed.md
index 049ce28c..ffa2e5ee 100644
--- a/docs/zh_cn/advanced_tutorials/distributed.md
+++ b/docs/zh_cn/advanced_tutorials/distributed.md
@@ -42,6 +42,6 @@ PyTorch 提供了一套基础的通信原语用于多进程之间张量的通信
 - [sync_random_seed](mmengine.dist.sync_random_seed):同步进程之间的随机种子
 - [broadcast_object_list](mmengine.dist.broadcast_object_list):支持对任意可被 Pickle 序列化的 Python 对象列表进行广播,基于 broadcast 接口实现
 - [all_reduce_dict](mmengine.dist.all_reduce_dict):对 dict 中的内容进行 all_reduce 操作,基于 broadcast 和 all_reduce 接口实现
-- [all_gather_object](mmengine.dist.all_gather_object):基于 all_gather 实现对任意可以被 Pickle 序列化的 Python 对象进行 all_tather 操作
+- [all_gather_object](mmengine.dist.all_gather_object):基于 all_gather 实现对任意可以被 Pickle 序列化的 Python 对象进行 all_gather 操作
 - [gather_object](mmengine.dist.gather_object):将 group 里每个 rank 中任意可被 Pickle 序列化的 Python 对象 gather 到指定的目标 rank
 - [collect_results](mmengine.dist.collect_results):支持基于 CPU 通信或者 GPU 通信对不同进程间的列表数据进行收集