The CORD-19 data set is a dynamic document set reflecting the real-world structure of the COVID-19 literature. While later releases of CORD-19 are generally larger than earlier releases, a later release is not strictly a superset of an earlier one: articles can be dropped from a release, for example because the article is no longer available from the original source or because it no longer qualifies for the collection under the CORD-19 construction processes. Sometimes a "dropped" article has actually just been given a new CORD-UID, as can happen when a preprint is published and thus appears in PMC.
These changes complicate the use of relevance judgments for evaluating TREC-COVID runs, especially in light of residual collection scoring, because of the bookkeeping required to make sure the right judgments are used to score a given run. We use a naming scheme for the qrels (relevance judgment) files to self-document what judgments the file contains. [This naming scheme is courtesy of Chris Buckley as suggested on the TREC-COVID mailing list.]

The name of a qrels file is composed of three parts: the header ("qrels-covid"); the document round (e.g., "d3"); and a range of judgment rounds (e.g., "j0.5-2"). (Recall that the judgments made between TREC-COVID rounds are designated as half rounds. So judgment set 1.5 contains documents judged in the week between Rounds 1 and 2; those documents were selected from Round 1 submissions but were used to score Round 2 submissions.)

The document round refers to the CORD-19 release that was used in the given TREC-COVID round, and all of the cord_uid's in that qrels file are with respect to that release. Specifically, this means that judged-but-dropped documents are not included in that qrels file, and judged-but-changed-id documents have the id of the document in the target CORD-19 release. Note that two qrels files covering the same judgment rounds (say, 0.5-2) but referring to different document rounds (2 vs. 3) can therefore contain different numbers of judged documents, depending on which documents were dropped from each document round. Also note that the CORD-UID used in a qrels file is not necessarily the CORD-UID the article had at the time it was judged; it is the id of the article in the document round's CORD-19 release.
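The three-part naming scheme above can be picked apart mechanically. The following is a minimal sketch, assuming underscores separate the parts (e.g., a name like "qrels-covid_d3_j0.5-2"); adjust the pattern to match the actual file names in your copy of the judgments.

```python
import re

# Assumed shape: qrels-covid_d<doc round>_j<first judgment round>-<last judgment round>
# Half rounds such as 0.5 are allowed on either end of the judgment range.
QRELS_NAME = re.compile(
    r"qrels-covid_d(?P<doc_round>\d+)"
    r"_j(?P<j_start>\d+(?:\.\d+)?)-(?P<j_end>\d+(?:\.\d+)?)"
)

def parse_qrels_name(name):
    """Return (document round, first judgment round, last judgment round)."""
    m = QRELS_NAME.match(name)
    if m is None:
        raise ValueError(f"not a recognized qrels file name: {name}")
    return m.group("doc_round"), m.group("j_start"), m.group("j_end")
```

For example, parse_qrels_name("qrels-covid_d3_j0.5-2") returns ("3", "0.5", "2"): judgments from rounds 0.5 through 2, with cord_uid's mapped to the Round 3 CORD-19 release.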
To get valid evaluation scores, you must ensure that you are using the correct qrels file for the document set the run was produced from. In the case of official Round 5 submissions, the changed CORD_UIDs were not known at submission time, so almost all submissions contained some previously judged (but "disguised") documents. These disguised documents were removed before scoring, so most Round 5 submissions have fewer than 1000 documents retrieved per topic.
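Residual collection scoring amounts to dropping every previously judged document from a run before handing it to the scorer. A minimal sketch of that filtering step, assuming standard six-column TREC run lines (topic, Q0, cord_uid, rank, score, tag) and a precomputed map from topic to already-judged ids:

```python
def residual_run(run_lines, judged_ids_by_topic):
    """Drop previously judged documents from a run (residual collection scoring).

    run_lines: iterable of TREC run lines: "topic Q0 cord_uid rank score tag"
    judged_ids_by_topic: dict mapping topic id -> set of cord_uid's judged
        in earlier rounds (after mapping to the run's CORD-19 release)
    """
    kept = []
    for line in run_lines:
        topic, _q0, cord_uid, _rank, _score, _tag = line.split()
        if cord_uid not in judged_ids_by_topic.get(topic, set()):
            kept.append(line)
    return kept
```

Note that the judged-id sets must use the cord_uid's of the same CORD-19 release the run was produced from; otherwise a renamed ("disguised") document will slip through the filter.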
The format of a relevance judgments file ("qrels") is lines of
topic-id iteration cord-id judgment
where judgment is 0 for not relevant, 1 for partially relevant, and 2 for
fully relevant; and iteration records the round in which the document
was judged. trec_eval does not make use of the iteration field (though
it expects it to be present for historical reasons), and TREC-COVID is
using it for bookkeeping. Since annotators continued to work during
weeks when a round was active, the iteration field contains "half rounds"
as well as whole rounds. A document judged in round X.5 was selected
from a Round X run but was used to score Round X+1 runs.
(For round 0.5, the documents were selected from runs
produced by the organizers that were not part of official runs.)
For most measures, documents not in the qrels file because they were
not judged are assumed to be not relevant.
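The four-column format above is simple to load directly. A minimal sketch (the function and file names are illustrative, not part of the distributed tooling):

```python
def load_qrels(path):
    """Read a qrels file into {(topic_id, cord_uid): judgment}.

    Each line is: topic-id iteration cord-id judgment
    where judgment is 0 (not relevant), 1 (partially relevant),
    or 2 (fully relevant). The iteration field is bookkeeping only
    and is discarded here, as trec_eval also ignores it.
    """
    qrels = {}
    with open(path) as f:
        for line in f:
            topic, _iteration, cord_uid, judgment = line.split()
            qrels[(topic, cord_uid)] = int(judgment)
    return qrels

def judgment(qrels, topic, cord_uid):
    # Unjudged documents are treated as not relevant by most measures.
    return qrels.get((topic, cord_uid), 0)
```

The default of 0 in judgment() mirrors the convention stated above: for most measures, a document absent from the qrels file is assumed not relevant.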
Judgment file used to score Round 5 runs. It contains the judgments on documents selected from Round 4 runs (judgment set 4.5), mapped to Round 5 CORD_UIDs and with any dropped documents removed, in union with the Round 5 judgments. The Round 4.5 judgment sets were created by pooling all Round 4 submissions: for topics 1-35, pooling was to depth 15, and for topics 36-45, pooling was to depth 30. Note, however, that Round 4 itself was judged by pooling the top 2 priority runs to depth 8 for topics 1-35 and depth 20 for topics 36-45, so many documents in the 4.5 pools had already been judged in Round 4 and are not included in the 4.5 judgment set.
Since Round 5 is the final round, the judgment sets for Round 5 are the last judgments to be made for TREC-COVID. That meant the judgment sets needed to accomplish two somewhat conflicting goals: allow a reasonable evaluation of Round 5 runs under residual collection scoring, and give the overall TREC-COVID final test collection roughly comparable numbers of documents judged for each topic. To this end, the judgment sets for Round 5 are much larger than those of any other single (half) round of judging, and the process of building them was much more complicated than in earlier rounds. The judgment sets were built in three parts depending on the topic number as follows: