The CLEAR Benchmark


Evaluation Protocol on CLEAR

How to faithfully evaluate a continual learning algorithm on CLEAR?

For each of the 1st through 10th buckets, which come with an annotated trainset, we also release **a held-out testset** covering the same timespan (now downloadable here).

Evaluation on CLEAR works as on any other continual learning benchmark: after training on each timestamp (bucket), we measure the model's performance on all 10 testsets. This produces a *10x10 accuracy matrix*, with one row per training timestamp and one column per testset.
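The evaluation loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the official CLEAR evaluation code: `build_accuracy_matrix`, `evaluate`, and `toy_evaluate` are hypothetical names, and we assume that row `i` corresponds to the model trained up to bucket `i` and column `j` to the testset of bucket `j`.

```python
def build_accuracy_matrix(evaluate, num_buckets=10):
    """Return a num_buckets x num_buckets accuracy matrix as a list of rows.

    `evaluate(train_idx, test_idx)` is a hypothetical callable that returns
    the accuracy of the model obtained after training up to bucket
    `train_idx`, measured on the held-out testset of bucket `test_idx`.
    """
    matrix = []
    for train_idx in range(num_buckets):
        # One row per training timestamp, one column per testset.
        row = [evaluate(train_idx, test_idx) for test_idx in range(num_buckets)]
        matrix.append(row)
    return matrix


def toy_evaluate(train_idx, test_idx):
    # Made-up stand-in for a real evaluation: accuracy decays with the
    # temporal distance between the training bucket and the test bucket.
    return 0.9 - 0.02 * abs(train_idx - test_idx)


acc_matrix = build_accuracy_matrix(toy_evaluate)
print(len(acc_matrix), len(acc_matrix[0]))  # 10 10
```

In practice, `evaluate` would wrap whatever training and testing code your continual learning algorithm uses; the toy function here only exists so that the sketch runs on its own.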

Different parts of the accuracy matrix capture different aspects of a model's performance. For example, the diagonal entries represent performance on a testset sampled from the same distribution as the current bucket (assuming each bucket is a locally iid distribution). The lower triangular part of the matrix instead captures test performance on previously seen buckets, which is the focus of most state-of-the-art works that combat forgetting. We therefore introduce 4 simplified evaluation metrics to summarize the accuracy matrix:

A visual illustration of the 4 evaluation metrics on a 4x4 accuracy matrix.

1. `In-Domain Accuracy:` The average of diagonal entries, i.e., test performance *within the same domain* of the current bucket.
2. `Next-Domain Accuracy:` The average of super-diagonal entries, i.e., test performance on the *immediate next domain*.
3. `Backward Transfer:` The average of lower triangular entries, i.e., test performance on *previously seen domains*.
4. `Forward Transfer:` The average of upper triangular entries, i.e., test performance on *future unseen domains*.
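The four metrics above can be computed from an accuracy matrix as in the following sketch. It is not the official implementation: `clear_metrics` and the 4x4 example values are hypothetical, rows are assumed to index the training timestamp and columns the test bucket, and we assume the lower and upper triangular averages exclude the diagonal (the text does not state this explicitly).

```python
from statistics import mean


def clear_metrics(acc):
    """Summarize a square accuracy matrix (list of rows) into 4 metrics.

    Assumption: acc[i][j] is the accuracy of the model trained up to
    bucket i, evaluated on the testset of bucket j.
    """
    n = len(acc)
    if n < 2 or any(len(row) != n for row in acc):
        raise ValueError("expected a square matrix with at least 2 buckets")

    return {
        # Diagonal: test bucket equals the current training bucket.
        "in_domain": mean(acc[i][i] for i in range(n)),
        # Super-diagonal: test bucket immediately follows the training bucket.
        "next_domain": mean(acc[i][i + 1] for i in range(n - 1)),
        # Strictly lower triangle (assumed): previously seen buckets.
        "backward_transfer": mean(acc[i][j] for i in range(n) for j in range(i)),
        # Strictly upper triangle (assumed): future unseen buckets.
        "forward_transfer": mean(
            acc[i][j] for i in range(n) for j in range(i + 1, n)
        ),
    }


# Made-up 4x4 accuracy matrix, for illustration only.
example = [
    [0.90, 0.80, 0.70, 0.60],
    [0.85, 0.90, 0.80, 0.70],
    [0.80, 0.85, 0.90, 0.80],
    [0.75, 0.80, 0.85, 0.90],
]

metrics = clear_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

If your convention includes the diagonal in Backward Transfer, change `range(i)` to `range(i + 1)` in the corresponding line.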

We hope that our proposed metrics can simplify evaluation on CLEAR and similar domain-incremental benchmarks. Nonetheless, CLEAR can also be repurposed for task-incremental or class-incremental scenarios, which could be exciting future work.
