The CORD-19 data set is a dynamic document set reflecting the real-world structure of the COVID-19 literature. While later releases of CORD-19 are generally larger than earlier releases, they are not strict supersets of those earlier releases, because articles can be dropped from a release. An article may be dropped, for example, because it is no longer available from the original source or because it no longer qualifies as part of the collection according to the CORD-19 construction processes. Sometimes a "dropped" article has actually just been given a new CORD-UID, as can happen when a preprint is published and thus appears in PMC.
These changes complicate the use of relevance judgments for evaluating TREC-COVID runs, especially in light of residual collection scoring, because of the bookkeeping required to make sure the right judgments are used to score a given run. We use a naming scheme for the qrels (relevance judgment) files so that each file name documents which judgments the file contains. [This naming scheme is courtesy of Chris Buckley as suggested on the TREC-COVID mailing list.] The name of a qrels file is composed of three parts: the header ("qrels-covid"), the document round (e.g., "d3"), and a range of judgment rounds (e.g., "j0.5-2"). (Recall that the judgments made between TREC-COVID rounds are designated as half-rounds. So, judgment set 1.5 consists of the documents judged in the week between Rounds 1 and 2. The documents to be judged in that set were selected from Round 1 submissions, but were used to score Round 2 submissions.) The document round refers to the CORD-19 release that was used in the given TREC-COVID round, and all of the cord_uids in that qrels file are with respect to that release. Specifically, this means that judged-but-dropped documents are not included in that qrels file and judged-but-changed-id documents have the id of the document in the target CORD-19 release. Note that this means that two qrels files that cover the same judgment rounds (say, 0.5-2) but refer to different document rounds (2 vs. 3) can contain different numbers of judged documents, depending on which documents were dropped from each document round's release. Also note that the CORD-UID used in a qrels file is not necessarily the CORD-UID of the article at the time it was judged; it is the id of the article in the document round's CORD-19 release.
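As an illustration of the naming scheme, the following minimal sketch splits a qrels file name into its document round and judgment-round range. The text above names the three parts but not how they are joined, so the underscore separators and the optional ".txt" extension are assumptions of this sketch, as are the names QrelsName and parse_qrels_name.

```python
import re
from typing import NamedTuple


class QrelsName(NamedTuple):
    doc_round: int               # CORD-19 release, e.g. 3 for "d3"
    first_judgment_round: float  # e.g. 0.5 for "j0.5-2"
    last_judgment_round: float   # e.g. 2.0 for "j0.5-2"


# Assumed layout: qrels-covid_d<doc round>_j<first>-<last>[.txt]
# The separators and the extension are assumptions, not stated in the text.
_PATTERN = re.compile(
    r"^qrels-covid_d(\d+)_j(\d+(?:\.5)?)-(\d+(?:\.5)?)(?:\.txt)?$"
)


def parse_qrels_name(name: str) -> QrelsName:
    """Split a qrels file name into document round and judgment-round range."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognized qrels file name: {name!r}")
    doc_round, first, last = match.groups()
    return QrelsName(int(doc_round), float(first), float(last))
```

Under these assumptions, parse_qrels_name("qrels-covid_d3_j0.5-2.txt") reports document round 3 and judgment rounds 0.5 through 2, which makes it easy to check that a qrels file matches the CORD-19 release a run was built from.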
To get valid evaluation scores, you must ensure that you are using a correct qrels file for the document set the run was produced from. For official Round 4 submissions, the changed CORD-UIDs were not known at submission time, so almost all submissions contained some previously-judged (but "disguised") documents. These disguised documents were removed before scoring, so most Round 4 submissions have fewer than 1000 documents retrieved per topic.
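The removal of previously judged documents from a run can be sketched as follows. This is only an illustration, not the organizers' actual tooling: the function name remove_judged and the dictionary-based run representation are hypothetical, and the sketch assumes that the ids in both the run and the judged set already refer to the same CORD-19 release.

```python
from typing import Dict, List, Set


def remove_judged(
    run: Dict[str, List[str]],
    judged: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Drop, per topic, every document already judged for that topic.

    `run` maps a topic id to its ranked list of document ids;
    `judged` maps a topic id to the ids judged in earlier rounds.
    The rank order of the remaining documents is preserved.
    """
    return {
        topic: [doc for doc in ranking if doc not in judged.get(topic, set())]
        for topic, ranking in run.items()
    }


# Made-up ids for illustration only.
run = {"1": ["docA", "docB", "docC"], "2": ["docB", "docD"]}
judged = {"1": {"docB"}}
residual = remove_judged(run, judged)
```

Because documents are removed per topic, a run that originally retrieved 1000 documents for a topic ends up with fewer once its previously judged documents are filtered out.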
The format of a relevance judgments file ("qrels") is lines of
topic-id iteration cord-id judgment
where judgment is 0 for not relevant, 1 for partially relevant, and 2 for
fully relevant; and iteration records the round in which the document
was judged. trec_eval does not make use of the iteration field (though
it expects it to be present for historical reasons), and TREC-COVID is
using it for bookkeeping. Because annotators continue to work during
weeks when a round is active, the iteration field contains "half rounds"
as well as whole rounds. A document judged in round X.5 was selected
to be judged from a run in round X but won't be used in scoring until
round X+1. (For round 0.5, the documents were selected from runs
produced by the organizers that were not part of official runs.)
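A minimal sketch of reading this format is shown below. It assumes the four fields are separated by whitespace and that the iteration field is numeric (for example "0.5" or "1"); the names Judgment and parse_qrels, and the ids in the sample lines, are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Judgment(NamedTuple):
    topic_id: str
    iteration: float  # judgment round, possibly a half round such as 1.5
    cord_uid: str
    relevance: int    # 0 = not relevant, 1 = partially, 2 = fully relevant


def parse_qrels(lines: Iterable[str]) -> List[Judgment]:
    """Parse 'topic-id iteration cord-id judgment' lines (whitespace-separated)."""
    judgments = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        topic_id, iteration, cord_uid, relevance = fields
        judgments.append(
            Judgment(topic_id, float(iteration), cord_uid, int(relevance))
        )
    return judgments


# Made-up sample lines for illustration only.
sample = ["1 0.5 abc12345 2", "1 1.5 def67890 0"]
parsed = parse_qrels(sample)
```

Keeping the iteration as a number makes it straightforward to select, say, only the judgments from rounds 0.5 through 2.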
For most measures, documents not in the qrels file because they were
not judged are assumed to be not relevant.
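The effect of this convention can be seen in a small sketch of precision at rank k. It is an illustration under stated assumptions, not trec_eval's implementation: the function name precision_at_k is hypothetical, and counting both partially relevant (1) and fully relevant (2) documents as relevant is a choice made for this sketch.

```python
from typing import Dict, List


def precision_at_k(ranking: List[str], judgments: Dict[str, int], k: int) -> float:
    """Fraction of the top-k documents judged relevant (judgment >= 1).

    Documents missing from `judgments` are unjudged and therefore
    counted as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranking[:k]
    relevant = sum(1 for doc in top if judgments.get(doc, 0) >= 1)
    return relevant / k


# Made-up ids: "c" is unjudged, so it counts as not relevant.
score = precision_at_k(["a", "b", "c", "d"], {"a": 2, "b": 0, "d": 1}, k=4)
```

Here the unjudged document "c" lowers the score in exactly the same way as the judged non-relevant document "b", which is why using a qrels file that matches the run's document set matters.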