Introduction
MMEngine is a foundational library for training deep learning models based on PyTorch. It supports running on Linux, Windows, and macOS. It has the following three features:
- Universal and powerful executor:
  - Supports training different tasks with minimal code, such as training ImageNet with just 80 lines of code (the original PyTorch example requires 400 lines).
  - Easily compatible with models from popular algorithm libraries such as TIMM, TorchVision, and Detectron2.
- Open architecture with unified interfaces:
  - Handles different tasks with a unified API: you can implement a method once and apply it to all compatible models.
  - Supports various backend devices through a simple, high-level abstraction. Currently, MMEngine supports model training on NVIDIA CUDA, Apple MPS, AMD, MLU, and other devices.
- Customizable training process:
  - Defines a highly modular training engine with "Lego"-like composability.
  - Offers a rich set of components and strategies.
  - Gives total control over the training process through APIs at different levels of abstraction.
Architecture
The above diagram illustrates the hierarchy of MMEngine in OpenMMLab 2.0. MMEngine implements a next-generation training architecture for the OpenMMLab algorithm library, providing a unified execution foundation for over 30 algorithm libraries within OpenMMLab. Its core components include the training engine, evaluation engine, and module management.
Module Introduction
MMEngine abstracts the components involved in the training process and their relationships. Components of the same type in different algorithm libraries share the same interface definition.
Core Modules and Related Components
The core module of the training engine is the Runner. The Runner is responsible for executing training, testing, and inference tasks and for managing the various components required during these processes. At specific points during the execution of training, testing, and inference tasks, the Runner sets up Hooks that allow users to extend, insert, and execute custom logic. The Runner primarily invokes the following components to complete the training and inference loops:
- Dataset: Responsible for constructing datasets for training, testing, and inference tasks, and for feeding the data to the model. In practice, it is wrapped by a PyTorch DataLoader, which launches multiple subprocesses to load the data.
- Model: Accepts data and outputs the loss during training; accepts data and performs predictions during testing and inference. In a distributed environment, the model is wrapped by a Model Wrapper (e.g., MMDistributedDataParallel).
- Optimizer Wrapper: Performs backpropagation to optimize the model during training, and supports mixed-precision training and gradient accumulation through a unified interface.
- Parameter Scheduler: Dynamically adjusts optimizer hyperparameters such as the learning rate and momentum during training.
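The relationship between the Runner, its Hooks, and a parameter scheduler can be illustrated with a self-contained sketch. This is plain Python, not the actual MMEngine API; all class and method names here are illustrative:

```python
class Hook:
    """Base hook: subclasses override only the call points they care about."""
    def before_train_epoch(self, runner): pass
    def after_train_iter(self, runner): pass


class ToyParamSchedulerHook(Hook):
    """Toy scheduler: halves the learning rate at the start of each epoch."""
    def before_train_epoch(self, runner):
        runner.lr = runner.base_lr * (0.5 ** runner.epoch)


class ToyRunner:
    """Minimal runner: owns the loop and fires hooks at fixed points."""
    def __init__(self, data, hooks, base_lr=0.1):
        self.data, self.hooks = data, hooks
        self.base_lr = self.lr = base_lr
        self.epoch = 0
        self.log = []

    def call_hook(self, name):
        # dispatch the named call point to every registered hook
        for hook in self.hooks:
            getattr(hook, name)(self)

    def train(self, max_epochs=2):
        for epoch in range(max_epochs):
            self.epoch = epoch
            self.call_hook("before_train_epoch")
            for batch in self.data:
                # a real runner would forward the model and step the
                # optimizer wrapper here; we only record what happened
                self.log.append((self.epoch, self.lr, batch))
                self.call_hook("after_train_iter")


runner = ToyRunner(data=[1, 2], hooks=[ToyParamSchedulerHook()])
runner.train()
# runner.log shows lr 0.1 across epoch 0 and 0.05 across epoch 1
```

The point of the design is that the loop itself never changes: custom behavior is attached by registering more hooks, not by editing the runner.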
During training intervals or the testing phase, the Metrics & Evaluator are responsible for evaluating model performance. The Evaluator evaluates the model's predictions against the dataset. Within the Evaluator there is an abstraction called Metrics, which computes metrics such as recall and accuracy.
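The Evaluator/Metrics split can be sketched as an accumulate-then-compute interface. This is an illustrative toy, not the real MMEngine Metric class, and the method names are assumptions:

```python
class ToyAccuracy:
    """Toy metric: accumulate per-batch results, then compute at the end,
    so evaluation works the same whether the dataset fits in one batch
    or is streamed over many."""
    def __init__(self):
        self.correct = 0
        self.total = 0

    def process(self, predictions, labels):
        # called once per batch with the model's predictions
        self.correct += sum(p == l for p, l in zip(predictions, labels))
        self.total += len(labels)

    def compute(self):
        # called once after all batches have been processed
        return {"accuracy": self.correct / self.total}


metric = ToyAccuracy()
metric.process([0, 1, 1], [0, 1, 0])  # batch 1: 2 of 3 correct
metric.process([1], [1])              # batch 2: 1 of 1 correct
print(metric.compute())               # {'accuracy': 0.75}
```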
To ensure a unified interface, the communication interfaces between the evaluators, models, and data in various algorithm libraries within OpenMMLab 2.0 are encapsulated using Data Elements.
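The idea behind a data element can be sketched as a single object that carries inputs, ground truth, and predictions together, so the model, evaluator, and visualizer all exchange the same type. This is an illustrative stand-in, not the real MMEngine data element class:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDataSample:
    """Illustrative data element: one object bundles everything about a
    sample, so components communicate via a single shared interface."""
    inputs: list
    gt_label: Optional[int] = None    # filled by the dataset
    pred_label: Optional[int] = None  # filled by the model
    metainfo: dict = field(default_factory=dict)


sample = ToyDataSample(inputs=[0.2, 0.8], gt_label=1, metainfo={"img_id": 7})
sample.pred_label = 1  # the model writes its prediction into the same object
# an evaluator can now compare pred_label against gt_label directly
```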
During training and inference, the components above can use the logging management module and the Visualizer to store and visualize structured and unstructured logs.

Logging Modules: Responsible for managing the various kinds of log information generated while the Runner executes. The Message Hub implements data sharing between components, runners, and log processors, while the Log Processor processes the log information. The processed logs are then sent to the Logger and the Visualizer for management and display. The Visualizer is responsible for visualizing the model's feature maps, prediction results, and the structured logs generated during training. It supports multiple visualization backends, such as TensorBoard and WandB.
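The role of a message hub can be sketched as a globally reachable store that decouples the component producing a value (the training loop) from the component consuming it (a log processor). A minimal sketch, assuming invented names throughout; this is not the actual MMEngine interface:

```python
class ToyMessageHub:
    """Toy message hub: named instances are globally reachable, so the
    training loop can record scalars without holding a reference to the
    log processor that later reads them."""
    _instances = {}

    @classmethod
    def get(cls, name="default"):
        # return the existing hub with this name, creating it on first use
        return cls._instances.setdefault(name, cls())

    def __init__(self):
        self._scalars = {}

    def update_scalar(self, key, value):
        self._scalars.setdefault(key, []).append(value)

    def get_scalar(self, key):
        return self._scalars[key]


# the training loop records losses...
hub = ToyMessageHub.get()
hub.update_scalar("loss", 1.0)
hub.update_scalar("loss", 0.5)

# ...and a log processor elsewhere reads the same hub and smooths them
history = ToyMessageHub.get().get_scalar("loss")
smoothed = sum(history) / len(history)
print(smoothed)  # 0.75 — what might be handed to the logger or visualizer
```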
Common Base Modules
MMEngine also implements various common base modules required during the execution of algorithmic models, including:
- Config: In the OpenMMLab algorithm library, users can configure the training, testing process, and related components by writing a configuration file (config).
- Registry: Responsible for managing modules with similar functionality within an algorithm library. Based on this abstraction of algorithm library modules, MMEngine defines a set of root registries; registries within each algorithm library can inherit from them, enabling modules to be invoked and shared seamlessly across algorithm libraries within the OpenMMLab framework.
- File I/O: Provides a unified interface for file read/write operations in various modules, supporting multiple file backend systems and formats in a consistent manner, with extensibility.
- Distributed Communication Primitives: Handles communication between different processes during distributed program execution. This interface abstracts the differences between distributed and non-distributed environments and automatically handles data devices and communication backends.
- Other Utilities: There are also utility modules, such as ManagerMixin, which implements a way to create and access global variables. The base class for many globally accessible objects within the Runner is ManagerMixin.
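The root-registry inheritance described above can be illustrated with a minimal sketch. This is not the actual MMEngine Registry API; the class, method names, and parent-lookup behavior here are simplified assumptions:

```python
class ToyRegistry:
    """Minimal registry with parent fallback: a child library's registry
    can build modules registered in a root registry it inherits from."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self._modules = {}

    def register_module(self, cls):
        # usable as a decorator: @MODELS.register_module
        self._modules[cls.__name__] = cls
        return cls

    def build(self, cfg):
        cfg = dict(cfg)          # don't mutate the caller's config
        name = cfg.pop("type")
        scope = self
        while scope is not None:  # walk up to parent registries
            if name in scope._modules:
                return scope._modules[name](**cfg)
            scope = scope.parent
        raise KeyError(f"{name!r} is not registered in {self.name!r}")


ROOT_MODELS = ToyRegistry("models")                       # root registry
DET_MODELS = ToyRegistry("det_models", parent=ROOT_MODELS)  # downstream library


@ROOT_MODELS.register_module
class LinearHead:
    def __init__(self, in_dim, out_dim):
        self.shape = (in_dim, out_dim)


# a downstream library builds a module registered at the root from a config dict
head = DET_MODELS.build({"type": "LinearHead", "in_dim": 512, "out_dim": 10})
print(head.shape)  # (512, 10)
```

This mirrors how a config's `type` field selects a class by name, while parent lookup lets downstream libraries reuse modules without re-registering them.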
Users can further read the tutorials to understand the advanced usage of these modules or refer to the design documents to understand their design principles and details.