📊 Evaluation Protocol on CLEAR

How to faithfully evaluate a continual learning algorithm on CLEAR?

Each of the 1st to 10th buckets comes with an annotated, labeled trainset; for each bucket we also release a held-out testset drawn from the same timespan (now downloadable here).

Evaluating on CLEAR is the same as on any other continual learning benchmark: we measure the performance of the model at each timestamp on all 10 testsets. This produces a 10x10 accuracy matrix.
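Below is a minimal sketch of this protocol in Python. The `update_model` and `evaluate` helpers are hypothetical placeholders (not part of the CLEAR release) standing in for your own training and testing code; here they simply return a dummy model and random accuracies so the loop runs end to end.

```python
import numpy as np

NUM_BUCKETS = 10

def update_model(model, bucket_id):
    """Placeholder: continually update `model` on the trainset of bucket `bucket_id`."""
    return model

def evaluate(model, bucket_id):
    """Placeholder: return accuracy of `model` on the held-out testset of bucket `bucket_id`."""
    return np.random.rand()

acc = np.zeros((NUM_BUCKETS, NUM_BUCKETS))
model = None
for i in range(NUM_BUCKETS):                      # train sequentially, one bucket at a time
    model = update_model(model, bucket_id=i)
    for j in range(NUM_BUCKETS):                  # evaluate on every held-out testset
        acc[i, j] = evaluate(model, bucket_id=j)  # row i: model after bucket i; column j: testset j
```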

Different parts of the accuracy matrix capture different aspects of a model's performance. For example, the diagonal entries represent performance on a testset sampled from the same distribution as the current bucket (assuming each bucket is a locally i.i.d. distribution). The lower triangular part of the matrix instead captures test performance on previously seen buckets, which is the focus of most state-of-the-art works combating the forgetting issue. We therefore introduce 4 simplified evaluation metrics to summarize the accuracy matrix (see the sketch after the list below):

  1. In-Domain Accuracy: The average of diagonal entries, i.e., test performance within the same domain of current bucket.

  2. Next-Domain Accuracy: The average of super-diagonal entries, i.e., test performance on the immediate next domain.

  3. Backward Transfer: The average of lower triangular entries, i.e., test performance on previously seen domains.

  4. Forward Transfer: The average of upper triangular entries, i.e., test performance on future unseen domains.
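The following sketch computes these four summaries from a 10x10 accuracy matrix, assuming the diagonal is excluded from both the lower and upper triangles (so Backward and Forward Transfer average strictly off-diagonal entries). The function name and the random example matrix are illustrative, not part of the official codebase.

```python
import numpy as np

def summarize_accuracy_matrix(acc: np.ndarray) -> dict:
    """Summarize an NxN accuracy matrix (row i: model after training on bucket i,
    column j: testset of bucket j) into the four CLEAR metrics."""
    n = acc.shape[0]
    rows, cols = np.indices((n, n))
    return {
        # Diagonal: test on the same bucket the model was just trained on.
        "in_domain": acc[rows == cols].mean(),
        # Super-diagonal: test on the immediately following bucket.
        "next_domain": acc[cols == rows + 1].mean(),
        # Lower triangle (diagonal excluded): previously seen buckets.
        "backward_transfer": acc[rows > cols].mean(),
        # Upper triangle (diagonal excluded): future unseen buckets.
        "forward_transfer": acc[rows < cols].mean(),
    }

# Example usage with a random 10x10 matrix standing in for real results.
metrics = summarize_accuracy_matrix(np.random.rand(10, 10))
print(metrics)
```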

We hope that our proposed metrics simplify evaluation on CLEAR and similar domain-incremental benchmarks; nonetheless, CLEAR can also be repurposed for task-/class-incremental scenarios, which would be exciting future work.
