[Doc] Refine evaluation docs (#701)

* refine evaluation docs
* fix link
* Update docs/zh_cn/design/evaluation.md
  Co-authored-by: Qian Zhao <112053249+C1rN09@users.noreply.github.com>
* Update docs/zh_cn/tutorials/evaluation.md
  Co-authored-by: RangiLyu <lyuchqi@gmail.com>
* Update docs/zh_cn/tutorials/evaluation.md
  Co-authored-by: RangiLyu <lyuchqi@gmail.com>
* resolve comments
* Apply suggestions from code review
* Update docs/zh_cn/tutorials/evaluation.md
  Co-authored-by: Qian Zhao <112053249+C1rN09@users.noreply.github.com>
  Co-authored-by: RangiLyu <lyuchqi@gmail.com>
  Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
2025-06-03 21:54:44 +08:00 · 2022-11-10 15:17:07 +08:00 · 2022-11-10 15:17:07 +08:00 · 1323251c02
commit 1323251c02
parent ee75ad4396
2 changed files with 151 additions and 102 deletions
--- a/docs/zh_cn/design/evaluation.md
+++ b/docs/zh_cn/design/evaluation.md
@@ -1,9 +1,73 @@
 # Model Accuracy Evaluation

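-Model validation and testing usually require a quantitative evaluation of model accuracy. MMEngine implements [metrics](mmengine.evaluator.BaseMetric) and [evaluators](mmengine.evaluator.Evaluator) for this purpose.
+## Metrics and Evaluators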

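-**Metrics** compute model accuracy under a specific measure from the model's input data and predictions. Metrics are decoupled from datasets, so users can freely combine test data with the metrics they need. For example, [COCOMetric](Todo:coco-metric-doc-link) computes metrics such as AP and AR on the COCO dataset, and it can also be used on other object detection datasets.
-**Evaluators** are the module one level above metrics and usually contain one or more metrics. During evaluation, an evaluator performs the necessary data format conversion and calls the metrics to compute model accuracy. Evaluators are usually built by the [runner](../tutorials/runner.md) or by a test script, for online and offline evaluation respectively.
+Model validation and testing usually require a quantitative evaluation of model accuracy. MMEngine implements metrics (Metric) and evaluators (Evaluator) for this purpose.
+
+- **Metrics** compute specific accuracy measures from the test data and the model's predictions. Each OpenMMLab algorithm library provides the common metrics for its tasks. For example, [MMClassification](https://github.com/open-mmlab/mmclassification) provides [Accuracy](https://mmclassification.readthedocs.io/en/1.x/api/generated/mmcls.evaluation.Accuracy.html#mmcls.evaluation.Accuracy) for the top-k accuracy of classification models, and MMDetection provides [COCOMetric](https://github.com/open-mmlab/mmdetection/blob/3.x/mmdet/evaluation/metrics/coco_metric.py) for metrics such as AP and AR of object detection models. Metrics are decoupled from datasets: COCOMetric, for instance, can also be used on object detection datasets other than COCO.
+
+- **Evaluators** are the module one level above metrics and usually contain one or more metrics. During evaluation, an evaluator performs the necessary data format conversion and calls the metrics to compute model accuracy. Evaluators are usually built by the [runner](../tutorials/runner.md) or by a test script, for online and offline evaluation respectively.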
+
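+### Metric Base Class `BaseMetric`
+
+The metric base class `BaseMetric` is an abstract class with the following initialization parameters:
+
+- `collect_device`: the name of the device used to synchronize results in distributed evaluation, such as `'cpu'` or `'gpu'`.
+- `prefix`: a prefix for metric names, used to distinguish multiple metrics with the same name. If it is not given, the class attribute `default_prefix` is tried as the prefix.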
+
+```python
+class BaseMetric(metaclass=ABCMeta):
+
+    default_prefix: Optional[str] = None
+
+    def __init__(self,
+                 collect_device: str = 'cpu',
+                 prefix: Optional[str] = None) -> None:
+        ...
+```
+
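+`BaseMetric` has the following 2 important methods that must be overridden in subclasses:
+
+- **`process()`** processes each batch of test data and model predictions. The processed results should be stored in the `self.results` list, which is used to compute the metrics after all test data has been processed. The method takes the following 2 parameters:
+
+  - `data_batch`: a batch of test data samples, usually taken directly from the dataloader
+  - `data_samples`: the corresponding model predictions
+    The method has no return value. Its interface is defined as follows: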
+
+  ```python
+  @abstractmethod
+  def process(self, data_batch: Any, data_samples: Sequence[dict]) -> None:
+      """Process one batch of data samples and predictions. The processed
+      results should be stored in ``self.results``, which will be used to
+      compute the metrics when all batches have been processed.
+      Args:
+          data_batch (Any): A batch of data from the dataloader.
+          data_samples (Sequence[dict]): A batch of outputs from the model.
+      """
+  ```
+
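+- **`compute_metrics()`** computes the metrics and returns them in a dictionary. The method takes 1 parameter:
+
+  - `results`: a list holding the results that `process()` produced for every batch of test data
+    The method returns a dictionary that maps each metric name to its value. Its interface is defined as follows: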
+
+  ```python
+  @abstractmethod
+  def compute_metrics(self, results: list) -> dict:
+      """Compute the metrics from processed results.
+
+      Args:
+          results (list): The processed results of each batch.
+
+      Returns:
+          dict: The computed metrics. The keys are the names of the metrics,
+          and the values are corresponding results.
+      """
+  ```
+
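+`compute_metrics()` is called inside the `evaluate()` method. Before computing the metrics, `evaluate()` collects and merges the intermediate results from the different ranks in distributed testing.
+
+Note that the concrete type stored in `self.results` depends on the metric subclass. For example, when the test samples or model outputs are too large to keep entirely in memory (as in semantic segmentation or image generation), `self.results` can hold the metrics computed for each batch, which `compute_metrics()` then aggregates. Alternatively, the intermediate results of each batch can be written to temporary files, with `self.results` holding the file paths, so that `compute_metrics()` reads the data back from the files and computes the metrics.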

 ## Model Accuracy Evaluation Workflow

@@ -21,91 +85,4 @@

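 The OpenMMLab algorithm libraries already implement the common metrics for their respective fields. For example, MMDetection provides the COCO metric, and MMClassification provides metrics such as Accuracy and F1Score.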

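-Users can also add custom metrics. A custom metric must inherit from the metric base class [BaseMetric](mmengine.evaluator.BaseMetric) provided by MMEngine and implement its abstract methods.
-
-### Metric Base Class
-
-The metric base class `BaseMetric` is an abstract class with the following 2 abstract methods:
-
-- `process()`: processes each batch of test data and model predictions. The processed results should be stored in the `self.results` list, which is used to compute the metrics after all test data has been processed.
-- `compute_metrics()`: computes the metrics and returns them in a dictionary.
-
-`compute_metrics()` is called inside the `evaluate()` method. Before computing the metrics, `evaluate()` collects and merges the intermediate results from the different ranks in distributed testing.
-
-Note that the concrete type stored in `self.results` depends on the metric subclass. For example, when the test samples or model outputs are too large to keep entirely in memory (as in semantic segmentation or image generation), `self.results` can hold the metrics computed for each batch, which `compute_metrics()` then aggregates. Alternatively, the intermediate results of each batch can be written to temporary files, with `self.results` holding the file paths, so that `compute_metrics()` reads the data back from the files and computes the metrics.
-
-### Custom Metric Class
-
-We use a classification accuracy metric as an example to show how to define a custom metric.
-
-First, the metric class should inherit from `BaseMetric` and be added to the `METRICS` registry (see the [registry documentation](../advanced_tutorials/registry.md)).
-
-The `process()` method takes 2 input parameters: a batch of test data samples, `data_batch`, and the model predictions, `predictions`. We extract the class labels and the classification predictions from them and store both in `self.results`.
-
-The `compute_metrics()` method takes 1 input parameter, `results`, which holds the results that `process()` produced for every batch of test data. From these we extract the class labels and the predictions and compute the classification accuracy `acc`. Finally, the computed metric is returned as a dictionary.
-
-In addition, we recommend assigning a value to the class attribute `default_prefix` in the subclass. If `prefix` is not specified in the initialization arguments (that is, in the config), `default_prefix` is automatically used as the prefix of the metric names. The docstring should also state the `default_prefix` value of the metric class and the names of all returned metrics.
-
-The implementation is as follows: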
-
-```python
-from mmengine.evaluator import BaseMetric
-from mmengine.registry import METRICS
-
-import numpy as np
-
-
-@METRICS.register_module()  # register the Accuracy class in the METRICS registry
-class Accuracy(BaseMetric):
-    """ Accuracy Evaluator
-
-    Default prefix: ACC
-
-    Metrics:
-        - accuracy (float): classification accuracy
-    """
-
-    default_prefix = 'ACC'  # set default_prefix
-
-    def process(self, data_batch: Sequence[dict], predictions: Sequence[dict]):
-        """Process one batch of data and predictions. The processed
-        Results should be stored in `self.results`, which will be used
-        to computed the metrics when all batches have been processed.
-
-        Args:
-            data_batch (Sequence[Tuple[Any, dict]]): A batch of data
-                from the dataloader.
-            predictions (Sequence[dict]): A batch of outputs from
-                the model.
-        """
-
-        # extract the classification predictions and the class labels
-        result = {
-            'pred': predictions['pred_label'],
-            'gt': data_batch['data_sample']['gt_label']
-        }
-
-        # store the results of the current batch in self.results
-        self.results.append(result)
-
-    def compute_metrics(self, results: List):
-        """Compute the metrics from processed results.
-
-        Args:
-            results (dict): The processed results of each batch.
-
-        Returns:
-            Dict: The computed metrics. The keys are the names of the metrics,
-            and the values are corresponding results.
-        """
-
-        # gather the predictions and class labels of all samples
-        preds = np.concatenate([res['pred'] for res in results])
-        gts = np.concatenate([res['gt'] for res in results])
-
-        # compute the classification accuracy
-        acc = (preds == gts).sum() / preds.size
-
-        # return the metric results
-        return {'accuracy': acc}
-```
+Users can also add custom metrics. See the example in the [tutorial](../tutorials/evaluation.md#自定义评测指标) for details.
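
The design doc's note on `self.results` (store per-batch statistics rather than raw outputs, then aggregate in `compute_metrics()`) could look roughly like the sketch below. It is a minimal illustration, not MMEngine code: `RunningAccuracy` is a hypothetical stand-in that does not inherit from the real `BaseMetric`, and the sample keys `pred_label` and `gt_label` are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


class RunningAccuracy:
    """Hypothetical stand-in for a BaseMetric subclass (no MMEngine import)."""

    def __init__(self) -> None:
        # Each entry is a (num_correct, num_samples) pair for one batch.
        self.results: List[Tuple[int, int]] = []

    def process(self, data_batch: object,
                data_samples: Sequence[dict]) -> None:
        # Reduce the batch to two integers instead of keeping raw outputs.
        # The keys 'pred_label' and 'gt_label' are assumed for this sketch.
        correct = sum(
            1 for s in data_samples if s['pred_label'] == s['gt_label'])
        self.results.append((correct, len(data_samples)))

    def compute_metrics(
            self, results: List[Tuple[int, int]]) -> Dict[str, float]:
        # Aggregate the per-batch counts into a single accuracy value.
        total_correct = sum(c for c, _ in results)
        total = sum(n for _, n in results)
        return {'accuracy': total_correct / total if total else 0.0}


metric = RunningAccuracy()
metric.process(None, [{'pred_label': 1, 'gt_label': 1},
                      {'pred_label': 0, 'gt_label': 2}])
metric.process(None, [{'pred_label': 3, 'gt_label': 3}])
scores = metric.compute_metrics(metric.results)
```

With this layout the memory held in `self.results` grows with the number of batches, not with the size of the model outputs.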
--- a/docs/zh_cn/tutorials/evaluation.md
+++ b/docs/zh_cn/tutorials/evaluation.md
@@ -1,21 +1,21 @@
 # Model Accuracy Evaluation

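-Model validation and testing usually require a quantitative evaluation of model accuracy. MMEngine implements the metric (Metric) and evaluator (Evaluator) modules for this purpose:
-
-- Metric: computes specific accuracy measures from the test data and the model's predictions. Each OpenMMLab algorithm library provides the common metrics for its tasks. For example, [MMClassification](https://github.com/open-mmlab/mmclassification) provides the [classification accuracy metric (Accuracy)](https://mmclassification.readthedocs.io/zh_CN/dev-1.x/generated/mmcls.evaluation.Accuracy.html) for the top-k accuracy of classification models.
-
-- Evaluator: the module one level above metrics. It performs the necessary format conversion before data is passed to the metrics and provides distributed support. During model training and testing, the evaluator is built automatically by the [runner (Runner)](runner.md). Users can also create an evaluator manually for offline evaluation.
+Model validation and testing usually require a quantitative evaluation of model accuracy. This can be done by specifying metrics (Metric) in the config file.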

 ## Evaluation During Training or Testing

-### Metric Configuration
+### Using a Single Metric

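-When training or testing a model with MMEngine, the runner automatically builds an evaluator for evaluation. Users only need to specify the metrics for the validation and test stages through the 2 config fields `val_evaluator` and `test_evaluator`. For example, to evaluate top-1 and top-5 accuracy during validation when training a classification model with [MMClassification](https://github.com/open-mmlab/mmclassification), configure it as follows:
+When training or testing a model with MMEngine, users only need to specify the metrics for the validation and test stages through the 2 config fields `val_evaluator` and `test_evaluator`. For example, to evaluate top-1 and top-5 accuracy during validation when training a classification model with [MMClassification](https://github.com/open-mmlab/mmclassification), configure it as follows: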

 ```python
 val_evaluator = dict(type='Accuracy', top_k=(1, 5))  # use the classification accuracy metric
 ```

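+For the parameters of a specific metric, see the documentation of the corresponding algorithm library, such as the [Accuracy documentation](https://mmclassification.readthedocs.io/en/1.x/api/generated/mmcls.evaluation.Accuracy.html#mmcls.evaluation.Accuracy) for the example above.
+
+### Using Multiple Metrics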
+
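 To evaluate several metrics at once, set `val_evaluator` or `test_evaluator` to a list in which each item is the config of one metric. For example, to evaluate both object detection accuracy (COCO AP/AR) and panoptic segmentation accuracy in the test stage when training a panoptic segmentation model with [MMDetection](https://github.com/open-mmlab/mmdetection), configure it as follows: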

 ```python
@@ -37,11 +37,83 @@ test_evaluator = [

 ### Custom Metrics

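-If the common metrics provided by the algorithm libraries do not meet your needs, you can also add custom metrics. See [Metric and Evaluator Design](../design/evaluation.md) for details.
+If the common metrics provided by the algorithm libraries do not meet your needs, you can also add custom metrics. We use a simplified classification accuracy as an example to show how to implement one:
+
+1. A new metric class must inherit from the base class [`BaseMetric`](mmengine.evaluator.BaseMetric) (see the [design document](../design/evaluation.md) for an introduction to it). The metric class must also be registered with the `METRICS` registry (see the [Registry documentation](./registry.md)).
+
+2. Implement the `process()` method. It takes 2 input parameters: a batch of test data samples, `data_batch`, and the model predictions, `data_samples`. We extract the class labels and the classification predictions from them and store both in `self.results`.
+
+3. Implement the `compute_metrics()` method. It takes 1 input parameter, `results`, which holds the results that `process()` produced for every batch of test data. From these we extract the class labels and the predictions and compute the classification accuracy `acc`. Finally, the computed metric is returned as a dictionary.
+
+4. (Optional) Assign a value to the class attribute `default_prefix`. It is automatically used as the prefix of the output metric names (for example, with `default_prefix='my_metric'` the output metric name is `'my_metric/acc'`), which further distinguishes different metrics. The prefix can also be overridden through the `prefix` argument in the config file. We recommend stating the `default_prefix` value of the metric class and the names of all returned metrics in the docstring.
+
+The implementation is as follows: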
+
+```python
+from mmengine.evaluator import BaseMetric
+from mmengine.registry import METRICS
+
+import numpy as np
+
+
+@METRICS.register_module()  # register the Accuracy class in the METRICS registry
+class SimpleAccuracy(BaseMetric):
+    """ Accuracy Evaluator
+
+    Default prefix: ACC
+
+    Metrics:
+        - accuracy (float): classification accuracy
+    """
+
+    default_prefix = 'ACC'  # set default_prefix
+
+    def process(self, data_batch: Sequence[dict], data_samples: Sequence[dict]):
+        """Process one batch of data and predictions. The processed
+        Results should be stored in `self.results`, which will be used
+        to computed the metrics when all batches have been processed.
+
+        Args:
+            data_batch (Sequence[Tuple[Any, dict]]): A batch of data
+                from the dataloader.
+            data_samples (Sequence[dict]): A batch of outputs from
+                the model.
+        """
+
+        # extract the classification predictions and the class labels
+        result = {
+            'pred': data_samples['pred_label'],
+            'gt': data_samples['data_sample']['gt_label']
+        }
+
+        # store the results of the current batch in self.results
+        self.results.append(result)
+
+    def compute_metrics(self, results: List):
+        """Compute the metrics from processed results.
+
+        Args:
+            results (dict): The processed results of each batch.
+
+        Returns:
+            Dict: The computed metrics. The keys are the names of the metrics,
+            and the values are corresponding results.
+        """
+
+        # gather the predictions and class labels of all samples
+        preds = np.concatenate([res['pred'] for res in results])
+        gts = np.concatenate([res['gt'] for res in results])
+
+        # compute the classification accuracy
+        acc = (preds == gts).sum() / preds.size
+
+        # return the metric results
+        return {'accuracy': acc}
+```

 ## Evaluation with Offline Results

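-Another common way to evaluate a model is offline evaluation with model predictions saved to a file in advance. Because no runner exists in this case, users must build the evaluator manually and call its interface to run the evaluation. The following is an offline evaluation example:
+Another common way to evaluate a model is offline evaluation with model predictions saved to a file in advance. In this case, users must build the **evaluator** manually and call its interface to run the evaluation. For details on offline evaluation and on the relationship between evaluators and metrics, see the [design document](/docs/zh_cn/design/evaluation.md). Here we only give an offline evaluation example: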

 ```python
 from mmengine.evaluator import Evaluator
@@ -55,10 +127,10 @@ data = load('test_data.pkl')

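 # Load the model predictions from a file. They are produced by running inference with the algorithm under evaluation on the test dataset.
 # The data format depends on the metric being used.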
-predictions = load('prediction.pkl')
+data_samples = load('prediction.pkl')

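 # Call the evaluator's offline evaluation interface to obtain the results
 # chunk_size is the number of samples processed at a time; adjust it to the available memory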
-results = evaluator.offline_evaluate(data, predictions, chunk_size=128)
+results = evaluator.offline_evaluate(data, data_samples, chunk_size=128)

 ```
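
The four custom-metric steps in the tutorial diff can be tried without MMEngine installed using the sketch below. It rests on stated assumptions: `MiniBaseMetric` is a hypothetical, single-process stand-in for `BaseMetric` (the real `evaluate()` also collects results across ranks, which is omitted here), registration with `METRICS` is skipped, and the per-sample keys `pred_label` and `gt_label` are illustrative. Unlike the snippet in the diff, `process()` iterates over `data_samples`, since a `Sequence[dict]` cannot be indexed by a string key.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Sequence


class MiniBaseMetric(ABC):
    """Hypothetical single-process stand-in for BaseMetric."""

    default_prefix: Optional[str] = None

    def __init__(self, prefix: Optional[str] = None) -> None:
        # An explicit prefix overrides the class-level default.
        self.prefix = prefix or self.default_prefix
        self.results: List[Any] = []

    @abstractmethod
    def process(self, data_batch: Any,
                data_samples: Sequence[dict]) -> None:
        """Store the processed results of one batch in self.results."""

    @abstractmethod
    def compute_metrics(self, results: list) -> dict:
        """Compute the metrics from all processed results."""

    def evaluate(self) -> Dict[str, float]:
        # Compute the metrics, reset the buffer, then apply the name prefix.
        metrics = self.compute_metrics(self.results)
        self.results.clear()
        if self.prefix:
            metrics = {f'{self.prefix}/{k}': v for k, v in metrics.items()}
        return metrics


class SimpleAccuracy(MiniBaseMetric):
    """Default prefix: ACC. Metrics: accuracy (float)."""

    default_prefix = 'ACC'

    def process(self, data_batch, data_samples):
        # One entry per sample; the key names are assumed for this sketch.
        for sample in data_samples:
            self.results.append(
                {'pred': sample['pred_label'], 'gt': sample['gt_label']})

    def compute_metrics(self, results):
        correct = sum(1 for r in results if r['pred'] == r['gt'])
        return {'accuracy': correct / len(results) if results else 0.0}


batch = [{'pred_label': 0, 'gt_label': 0}, {'pred_label': 1, 'gt_label': 2}]

metric = SimpleAccuracy()
metric.process(None, batch)
default_scores = metric.evaluate()

renamed = SimpleAccuracy(prefix='val')
renamed.process(None, batch)
renamed_scores = renamed.evaluate()
```

Here `default_scores` is keyed by `'ACC/accuracy'` and `renamed_scores` by `'val/accuracy'`, matching the `prefix/name` naming described in step 4.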