📊Evaluation Protocol on CLEAR
How to faithfully evaluate a continual learning algorithm on CLEAR?
For each of the 1st through 10th buckets, which come with an annotated, labeled trainset, we also release a held-out testset covering the same timespan (now downloadable here).

Evaluating on CLEAR is the same as on any other continual learning benchmark: we measure the performance of the model at each timestamp on all 10 testsets, which produces a 10x10 accuracy matrix.
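Below is a minimal sketch of this evaluation loop, not the official CLEAR code: `train_fn` and `eval_fn` are placeholders for your own continual-training step and accuracy computation, and only the looping structure reflects the protocol described above.

```python
import numpy as np

def build_accuracy_matrix(model, train_buckets, test_buckets, train_fn, eval_fn):
    """Row i = model snapshot after training through bucket i;
    column j = accuracy of that snapshot on the testset of bucket j."""
    n = len(train_buckets)                       # 10 buckets for CLEAR
    acc = np.zeros((n, len(test_buckets)))
    for i, train_set in enumerate(train_buckets):
        model = train_fn(model, train_set)       # continual update on bucket i
        for j, test_set in enumerate(test_buckets):
            acc[i, j] = eval_fn(model, test_set)
    return acc
```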
Different parts of the accuracy matrix capture different aspects of a model's performance. For example, the diagonal entries measure performance on a testset drawn from the same distribution as the current bucket (assuming each bucket is a locally iid distribution). The lower triangular part of the matrix instead measures test performance on previously seen buckets, which is the focus of most state-of-the-art work combating forgetting. We therefore introduce 4 simplified evaluation metrics to summarize the accuracy matrix (a code sketch computing them follows the list):
In-Domain Accuracy:
The average of diagonal entries, i.e., test performance on the same domain as the current training bucket.
Next-Domain Accuracy:
The average of super-diagonal entries, i.e., test performance on the immediate next domain.
Backward Transfer:
The average of lower triangular entries, i.e., test performance on previously seen domains.
Forward Transfer:
The average of upper triangular entries, i.e., test performance on future unseen domains.
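The following is a minimal numpy sketch, not the official CLEAR code, of the four summary metrics. It assumes `acc` is the 10x10 matrix from above (rows indexed by training timestamp, columns by test bucket) and that the transfer metrics average the strict triangles, excluding the diagonal.

```python
import numpy as np

def summarize(acc):
    n = acc.shape[0]
    return {
        "in_domain":   np.diag(acc).mean(),                   # diagonal entries
        "next_domain": np.diag(acc, k=1).mean(),              # super-diagonal entries
        "backward":    acc[np.tril_indices(n, k=-1)].mean(),  # below the diagonal (seen buckets)
        "forward":     acc[np.triu_indices(n, k=1)].mean(),   # above the diagonal (future buckets)
    }

# Example with a random matrix standing in for real results:
print(summarize(np.random.rand(10, 10)))
```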
We hope that our proposed metrics simplify evaluation on CLEAR and similar domain-incremental benchmarks; nonetheless, CLEAR can also be repurposed for task-/class-incremental scenarios, which could be an exciting direction for future work.