TREC-COVID Round 4 Task Guidelines

Round 4 vs. previous rounds

Most of the Guidelines for Round 4 are the same as the Guidelines for the previous rounds. The differences are simply new versions of the data and topic sets to be used, and a new submission deadline. The submission deadline for Round 4 runs is 7:00AM EDT on Monday, July 6.

For a summary of the outcomes of Round 1 see

TREC-COVID: Rationale and Structure of an Information Retrieval Shared Task for COVID-19. (Journal of the American Medical Informatics Association. May, 2020.) and
TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. (In ACM SIGIR Forum. Volume 54, Number 1. June, 2020.)

There is also a short note on the stability of the Round 1 ranking of systems when more relevance judgments were obtained, which was then updated to include Round 2.0 judgments.

Overview

The systems' task in TREC-COVID is a classic ad hoc search task. Systems are given a document set and a set of information needs called topics. They produce a ranked list of documents per topic where each list is ordered by decreasing likelihood that the document matches the information need (this is called a run). Human annotators will judge a fraction of the documents for relevance, and those relevance judgments will be used to score runs. Participants will describe how the run was created at the time it is submitted. At the conclusion of a round, the runs, descriptions, and scores will be posted to the TREC-COVID web site as a data archive in support of furthering research on pandemic search systems. For more information on ad hoc retrieval evaluation tasks, see the Overview papers in the TREC proceedings, such as this one.

Organizers anticipate that there will be a series of five rounds in TREC-COVID. Each round will be a separate evaluation exercise. Later rounds will use document sets that are largely (but not strictly) supersets of earlier rounds, and topic sets for later rounds are strict supersets of earlier rounds. You may choose to participate in whatever rounds you desire, though you will need to register for each round before submitting any runs to that round. A participant may submit at most three runs to any single round.

Document Set

The document set to be used in TREC-COVID is the CORD-19 data set. This data set, released and maintained by the Allen Institute for Artificial Intelligence, contains scholarly articles about COVID-19 and the coronavirus family of viruses. The articles are drawn from peer-reviewed literature in PMC, as well as archival sites such as medRxiv and bioRxiv. The data set is free to download from its website subject to a data license; see the website for more details about the data set and formats.

The Allen Institute for Artificial Intelligence periodically updates CORD-19. The TREC-COVID data page will specify the date of the version of the CORD-19 data set to be used in each round. It is important to ensure you are using the correct version of the data set for the round. Round 4 will use the June 19, 2020 version of CORD-19.

Each version of CORD-19 contains a Metadata file. This file is a comma-separated-values file that lists information about each document in the collection. The first column of the file is the cord_uid, an alpha-numeric string that is to be used as the document id in TREC-COVID submissions. (This is the only valid document id for runs.)

Because CORD-19 is a dynamic data set that reflects the state of the COVID-19 literature at the date of the release, different releases can differ from one another in the set of valid cord_uid's, including some cord_uid's that appear in earlier releases but are dropped from later releases. The set of cord_uid's in the metadata file of a release is the sole definition of the documents contained within the collection built on that release.

Topics

A topic is the written expression of an information need. The topics used in TREC-COVID have been written by the organizers with biomedical training with inspiration drawn from consumer questions submitted to NLM, discussions by "medical influencers" on social media, and suggestions solicited on Twitter via the #COVIDSearch tag.

The TREC-COVID data page points to the topic set to be used in each round. The Round 4 topic set contains 45 topics, 30 topics (numbers 1--30) that were used in all previous rounds, five topics (31--35) that were new for Round 2, another five topics (36--40) that were new for Round 3, and a third set of five topics (41--45) that are new for Round 4. There are relevance judgments available for the first 40 topics, but none for the last five since they are new to this round.

The topic file is an xml file that contains all of the topics to be used in the round. The format of a topic is as follows (this is an example topic, not part of the official topic set):

      <topic number="1000">
      <query>covid effects, muggles vs. wizards</query>
      <question>Are wizards and muggles affected differently by COVID-19?</question>
      <narrative> 
      Seeking comparison of specific outcomes regarding infections in
      wizards vs. muggles population groups.
      </narrative>
      </topic>

The query field provides a short keyword statement of the information need, while the question field provides a more complete description of the information need. The narrative provides extra clarification, and is not necessarily a super-set of the information in the question.

Runs

To participate in a round of TREC-COVID you must submit at least one run by the stated deadline (after registering). A run consists of a set of document lists, one per topic. Each list must contain at least one and no more than 1000 entries in the format below. Lists are ordered by decreasing likelihood that the document is relevant to the topic; that is, the document at rank 1 is the document the system thinks is the best match to the topic, the next-best match at rank 2, etc. You must retrieve at least one document for every topic. If your system retrieves no documents for some topic, create a dummy list consisting of a single arbitrary document for that topic. Your run may contain fewer than 1000 documents for a topic, though be aware that most of the ranked-based evaluation measures used by trec_eval treat empty ranks as not relevant, so you cannot harm your score, and could conceivably help it, by retrieving the maximum 1000 documents for each topic.

There are many possible methods for converting the supplied topics into queries that your system can execute. Following TREC, TREC-COVID defines two broad categories of methods, "automatic" and "manual", based on whether manual intervention is used. Automatic construction is when there is no human involvement of any sort in the query construction process; manual construction is everything else. Note that this is a very broad definition of manual construction, including both runs in which the queries are constructed manually and then run without looking at the results, and runs in which the results are used to alter the queries using some manual operation. If you make any change to your retrieval system based on the content of the TREC-COVID topics (say add words to a dictionary or modify a routine after looking at retrieved results), then your runs are manual runs.

The availability of relevance judgments for (most of) the topics means that systems can do relevance feedback (aka supervised training) to create a run. This creates a third category of methods: automatic except for the use of official relevance judgments when available. A run in this category is produced completely automatically, but does make use of the (human-produced) official relevance judgments. If you make and use your own judgments, then the run is manual. If you modified your system based on the content of the topics, then the run is manual.

The public availability of relevance judgments for the exact same topic run on a collection containing those judged documents means that steps must be taken to prevent invalid experimental methodology. It is easy to explicitly game evaluation scores in such a case, obviously, but judgments can creep into results in subtle ways, too. The history of IR research includes numerous studies on how best to evaluate relevance feedback systems---that is, systems that can make use of prior relevance judgments for a topic. A technique that is methodologically valid and easy to implement is residual collection evaluation (see Salton and Buckley's paper). In residual collection evaluation, any document that has already been judged for a topic is conceptually removed from the collection before scoring. TREC-COVID will use residual collection scoring for all rounds after the first. Thus, the ranked lists that you submit should not contain any documents that exist in a qrels file for the topic (even if you did not make use of the judgments). Any pre-judged documents that are submitted will be automatically removed upon submission, and thus you will have effectively returned fewer documents than you might have.

The dynamic nature of the CORD-19 data set complicates residual collection evaluation. One complication is that some of the dropped documents were judged, so cumulative qrels could contain judged documents that no longer exist in the current collection. A more troubling complication is judged documents that appear in the collection with a new (different) cord_uid (which can happen, for example, when a preprint gets published and is added to PMC). Since it is the work itself that is judged, it is the work that should be removed regardless of its cord_uid. However, a mapping between old and new cord_uid's generally arrives too late for participants' submissions. This happended for Round 3 and may happen again in Round 4. When it happens, NIST removes the new-id document from submissions before scoring.

The Round 3 qrels page describes this situation in more detail and also provides a cumulative qrels file with respect to the Round 3 CORD-19 release (May 19). A mapping between Round 3 and Round 4 (May 19 and June 19) cord_uid's is not yet available. Any topic-docid combination that is in the cumulative qrels is a pre-judged document that will be removed from Round 4 submissions.

When submitting a run, you will be asked to indicate the category of the run because that information is helpful to fully understand the results. You will also be asked to provide a description of your run when you submit it. The quality of this description is crucial to the ability of the run archive to support future research---no one will be able to build upon your method if they don't know what that method is. Please cite papers that describe your method, if available, or provide other references to the method. The description should also include the kinds of manual processing that took place, if any.

Each run must be contained in a single text file. Each line in the file must be in the form

         topicid Q0 docid rank score run-tag

where

topicid	is the topic number (1..40)
Q0	is the literal 'Q0' (currently unused, but trec_eval expects this column to be present)
docid	is the cord_uid of the document retrieved in this position. It must be a valid cord_id in the June 19 release of CORD-19. If it has already been judged for this topic, it will be removed.
rank	is the rank position of this document in the list
score	is the similarity score computed by the system for this document. When your run is processed (to create judgment sets and to score it using trec_eval), the run will be sorted by decreasing score and the assigned ranks will be ignored. In particular, trec_eval will sort documents with tied scores in an arbitrary order. If you want the precise ranking you submit to be used, that ranking must be reflected in the assigned scores, not the ranks.
run-tag	is a name assigned to the run. Tags must be unique across both your own runs and all other participants' runs, so you may have to choose a new tag if the one you used is already taken. Tags are strings of no more than 20 characters and may use letters, numbers, underscore (_), hyphen (-), and period (.) only. It will be best if you make the run tag semantically meaningful to identify you (e.g., if your team ID is NIST, then 'NIST-prise-tfidf' is a much better run tag than 'run1'). Every line in the submission file must end with the same run-tag. In particular, make sure your file does not contain a header line; the file must contain all and only lines that form the submission itself.

You will submit your runs by filling out the submission form to upload the run file. Each uploaded file must contain exactly one complete run. You can upload a file that has been compressed using gzip, but you cannot upload archive files (such as tar or zip). One of the fields on the submission form asks for the run tag. The tag entered into the form must exactly match the run tag contained in the run file.

After you click submit, the submission system will run a validation script that will test the submission file for various kinds of formatting errors. A copy of this (perl) validation script is provided on the Tools page as check_sub.pl. Over the years NIST has found that strict checking of the "sanity" of an input file leads to far fewer problems down the line as it catches a lot of mistakes in the run at a time that the submitter can actually correct them. You are strongly encouraged to use the script to test your submission file prior to uploading the file to NIST. If any errors are found by the script at the time the run is submitted, the submission system will reject the run. Rejected runs are not considered to be submitted; indeed, no information is retained about rejected runs.

Invoke the script giving the run file name as the argument to the script and an error log file will be created. The error log will contain error messages if any errors exist, and will say that the run was successfully processed otherwise (note all output is directed to this log file, none to STDOUT). The script uses a list of valid document ids to make sure the run contains only documents contained in the appropriate version of the collection. The list is posted to the Data page. If you modify the check script on line 20 so that the "docno_loc" variable points to the valid docids file, your checking of a submission file can also detect invalid documents. Note that this document list is for the entire collection, i.e., it is not topic-specific. Thus this check will not catch valid documents that have already been judged for a specific topic and thus should not be in the topic's ranked list.

The submission deadline for Round 4 runs is 7:00 a.m. EDT on ~~July 2~~ July 6, 2020. At that time, the submission system will be turned off and it will not be re-opened. Only valid runs that were submitted through the submission system prior to the deadline will be counted as TREC-COVID Round 4 runs. Each participant can submit at most three runs.

After you submit a run, you will receive an automated reply from the submission system stating that the run was received and echoing back the information gathered from the submission form as a confirmation. This confirmation is sent by email to the address entered on the submission form (so please be sure to enter a valid address!). The email message is the only response you will receive about a submitted run until after the relevance judgments are complete and the run is scored. The trec_eval report containing the scores will be mailed to the same submission email address as the confirmation is sent to.

You cannot delete or modify a run once it is submitted. In particular, you cannot submit a "corrected" version of a run by uploading a new run that uses the same run tag. This prohibition against remote removal of runs is a safety precaution to ensure no one mistakenly (or deliberately!) overwrites someone else's run. Since there is a three-run limit on submissions from a participant, please carefully check your runs before you submit them to avoid problems. Nonetheless, the submission system will accept more-than-the-limit number of runs from a team to accommodate submitters who discover bugs in the submitted runs prior to the deadline. If you need to correct a run, submit a new run with a different run tag, and notify NIST stating which run the new run should replace. Please use this as a last resort since all such fixes require manual intervention by NIST staff. If too many runs are submitted without explanation, NIST will choose an arbitrary three runs and discard the rest.

Relevance Judgments and Scoring

The relevance judgments are what turns a document and topic set into a test collection. However, interesting document sets, including CORD-19, are much too large to get complete judgments---a human assessment of relevance for every document for each topic. Instead, collection builders must sample the collection so that a small fraction of the entire document set is judged for a topic but (most of) the relevant documents are nonetheless identified. TREC pioneered the use of pooling and other sampling techniques to create the TREC test collections.

Since TREC-COVID consists of multiple rounds with quick turn-around within rounds, organizers started with the target of approximately 100 documents per topic per week of judging, with two weeks of judging per round. In practice, judging has been proceeding at a somewhat faster rate, and the first two rounds produced aproximately 20,000 judgments across 35 topics. Round 3 judging added appriximately 12,000 judgments across topics 1--40.

In the first round, the number of runs and the diversity (lack of overlap) among the runs meant the target number of judgments was reached by using depth-7 pools from one run per participant. These judgments sets are sufficiently incomplete that run comparisons are likely to be unstable for many measures. While full trec_eval output was returned to participants, the focus was on high-precision metrics such as ndcg@10 and precision@5 as well as on measures that do not rely on complete judgments such as Bpref.

Since the topic set for each round is the cumulative set of topics, most topics will be judged in many rounds and the relevance judgments will accrete. Thus, the final pandemic test collection judgment set can be expected to be more stable. Each individual round is still likely to be affected by relativly small judgment sets, however, since the use of residual collection evaluation means that each round's runs can only be scored using that round's judgments.

For Round 2, the judgment sets were produced from ranks 8-14 of Round 1 runs (judged in the week that Round 2 kicked off so labeled round 1.5 in the qrels file) plus ranks 1-7 of one run per participant for topics 1--30 and ranks 1--15 of that same run for topics 31--35 (labeled as 2.0 in the qrels file). For Round 3, the judgment sets were produced by pooling ranks 1-10 of priority 1 and 2 Round 2 runs for topics 1--30 and pooling to depth 20 for topics 31-35 (labeled 2.5 in the qrels); in addition, priority 1 Round 3 runs and seven Kaggle runs were pooled to depth 10 for all topics. Because of the changes in CORD-19 between the Round 2 and Round 3 releases, some of the 2.5 judged documents were dropped and others had new cord_uid's by the time Round 3 runs were scored. For the qrels used to score Round 3 runs, the dropped documents were dropped from the qrels [a qrels with invalid documents listed as relevant will cause trec_eval to compute depressed scores], and the documents with new cord_uid's were included with the new (i.e., Round 3) id. Note also that because of some bugs, about 250 documents that were intended to be judged in round 3.0 were not in fact judged. These documents are now slated to be judged in Round 3.5.

The relevance judgment period following Round 4 submissions will be two weeks long rather than one week as was the case for Round 3. It remains likely that the overall total of runs received in Round 4 will be too large for all runs to be judged, so we will have to restrict the number of runs from each participant that can contribute to the judgment sets to fewer than three. To prepare for such an eventuality, the submission form asks the submitter to assign a priority (1--3) to the run being submitted. Priority 1 runs will be the first runs to contribute to the judgment sets, followed by priority 2 and then priority 3 runs. If a participant assigns the same priority to multiple runs, NIST will choose from among those runs arbitrarily.

The relevance judgments will be made by human annotators that have biomedical expertise. Annotators will use a three-way scale:

Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.
Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
Not Relevant: everything else.

Relevance is known to be highly subjective; as in all test collections, the opinion of the particular human annotator making the judgment will be final.

Scores will be computed using trec_eval. The more common measures computed by trec_eval are described in the appendix to the TREC proceedings, such as here.

Timeline

June 24	Round 4 kick-off; topics released
Monday, ~~Thursday, July 2~~ July 6, 7:00am EDT	Round 4 submission deadline
~~July 6--July 16~~ July 8--July 19	Annotators make relevance judgments
(anticipated) July 22	Relevance judgments posted; evaluation scores returned to participant. Final round kick-off.
(anticipated) ~~July 29~~ Aug 3, 7:00am EDT	Round 5 submission deadline.

Round 4 Task Definition