TREC-COVID Round 1 Task Guidelines

Overview

The systems' task in TREC-COVID is a classic ad hoc search task. Systems are given a document set and a set of information needs called topics. They produce a ranked list of documents per topic, where each list is ordered by decreasing likelihood that the document matches the information need; the full set of ranked lists, one per topic, is called a run. Human annotators will judge a fraction of the documents for relevance, and those relevance judgments will be used to score runs. Participants will describe how the run was created at the time it is submitted. At the conclusion of a round, the runs, descriptions, and scores will be posted to the TREC-COVID web site as a data archive in support of furthering research on pandemic search systems. For more information on ad hoc retrieval evaluation tasks, see the Overview papers in the TREC proceedings, such as this one.

Organizers anticipate that there will be a series of about five rounds in TREC-COVID. Each round will be a separate evaluation exercise, but later rounds will use document and topic sets that are supersets of earlier rounds. You may choose to participate in whatever rounds you desire, though you will need to register for each round before submitting any runs to that round. A participant may submit at most three runs to any single round.

Document Set

The document set to be used in TREC-COVID is the CORD-19 data set. This data set, released and maintained by the Allen Institute for Artificial Intelligence, contains more than 45,000 scholarly articles about COVID-19 and the coronavirus family of viruses. The articles are drawn from peer-reviewed literature in PMC, as well as archival sites such as medRxiv and bioRxiv. The data set is free to download from its website subject to a data license; see the website for more details about the data set and formats.

The Allen Institute for Artificial Intelligence will update the CORD-19 data set each Friday. The TREC-COVID data page will specify the date of the version of the CORD-19 data set to be used in each round. It is important to ensure you are using the correct version of the data set for the round. Round 1 will use the April 10, 2020 version of CORD-19.

Each version of CORD-19 contains a metadata file. This file is a comma-separated-values file that lists information about each document in the collection. The first column of the file is the cord_uid, an alphanumeric string that is to be used as the document id in TREC-COVID submissions. (This is the only valid document id for runs.)
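
As an illustration of how the cord_uid column might be collected into a set of valid document ids, here is a minimal sketch in Python. It assumes the metadata file has a header row that names the column cord_uid; the "title" column, the sample ids, and the function name load_cord_uids are made up for the example and are not part of the guidelines.

```python
import csv
import io


def load_cord_uids(csv_file):
    """Return the set of cord_uid values found in a metadata CSV stream."""
    reader = csv.DictReader(csv_file)
    return {row["cord_uid"] for row in reader if row.get("cord_uid")}


# Tiny in-memory stand-in for the metadata file; ids and titles are invented.
sample = io.StringIO(
    "cord_uid,title\n"
    "abc12345,Example title one\n"
    'xyz67890,"Example title, with a comma"\n'
)
valid_ids = load_cord_uids(sample)
print(sorted(valid_ids))
```

With the real collection, the same function could be given an open file handle instead of the in-memory sample (opened with newline="" as the csv module recommends). Using the csv module rather than splitting on commas matters because quoted fields may themselves contain commas.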

Topics

A topic is the written expression of an information need. The topics used in TREC-COVID have been written by organizers with biomedical training, with inspiration drawn from consumer questions submitted to NLM, discussions by "medical influencers" on social media, and suggestions solicited on Twitter via the #COVIDSearch tag.

The TREC-COVID data page will also point to the topic set to be used in each round. The topic set for the first round contains 30 topics. Each subsequent round will use all topics from the previous rounds and add approximately five new topics.

The topic file is an xml file that contains all of the topics to be used in the round. The format of a topic is as follows (this is an example topic, not part of the official topic set):

      <topic number="1000">
      <query>covid effects, muggles vs. wizards</query>
      <question>Are wizards and muggles affected differently by COVID-19?</question>
      <narrative> 
      Seeking comparison of specific outcomes infections in
      wizards vs. muggles population groups.
      </narrative>
      </topic>

The query field provides a short keyword statement of the information need, while the question field provides a more complete description of the information need. The narrative provides extra clarification, and is not necessarily a superset of the information in the question.
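
The three fields can be pulled out of a topic file with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on the example topic shown above. The enclosing <topics> element is an assumption (the guidelines show only a single <topic>), and parse_topics is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The <topics> wrapper is assumed; only the <topic> element is documented above.
SAMPLE = """<topics>
<topic number="1000">
<query>covid effects, muggles vs. wizards</query>
<question>Are wizards and muggles affected differently by COVID-19?</question>
<narrative>
Seeking comparison of specific outcomes infections in
wizards vs. muggles population groups.
</narrative>
</topic>
</topics>"""


def parse_topics(xml_text):
    """Map each topic number to its query, question, and narrative text."""
    root = ET.fromstring(xml_text)
    topics = {}
    for topic in root.iter("topic"):
        topics[topic.get("number")] = {
            # Collapse line breaks and runs of spaces inside each field.
            field: " ".join((topic.findtext(field) or "").split())
            for field in ("query", "question", "narrative")
        }
    return topics


topics = parse_topics(SAMPLE)
print(topics["1000"]["query"])
```

Which field (or combination of fields) to turn into a query is left to the participant; the parser simply keeps all three available.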

Runs

To participate in a round of TREC-COVID you must submit at least one run by the stated deadline (after registering). A run consists of a set of document lists, one per topic. Each list must contain at least one and no more than 1000 entries in the format below. Lists are ordered by decreasing likelihood that the document is relevant to the topic; that is, the document at rank 1 is the document the system thinks is the best match to the topic, the next-best match is at rank 2, etc. You must retrieve at least one document for every topic. If your system retrieves no documents for some topic, create a dummy list consisting of a single arbitrary document for that topic. Your run may contain fewer than 1000 documents for a topic, though be aware that most of the rank-based evaluation measures used by trec_eval treat empty ranks as not relevant, so you cannot harm your score, and could conceivably help it, by retrieving the maximum 1000 documents for each topic.

There are many possible methods for converting the supplied topics into queries that your system can execute. Following TREC, TREC-COVID defines two broad categories of methods, "automatic" and "manual", based on whether manual intervention is used. Automatic construction is when there is no human involvement of any sort in the query construction process; manual construction is everything else. Note that this is a very broad definition of manual construction, including both runs in which the queries are constructed manually and then run without looking at the results, and runs in which the results are used to alter the queries using some manual operation. If you make any change to your retrieval system based on the content of the TREC-COVID topics (say add words to a dictionary or modify a routine after looking at retrieved results), then your runs are manual runs. When submitting a run, you will be asked whether the run is automatic or manual because that information is helpful to fully understand the results.

You will also be asked to provide a description of your run when you submit it. The quality of this description is crucial to the ability of the run archive to support future research---no one will be able to build upon your method if they don't know what that method is. Please cite papers that describe your method, if available, or provide other references to the method. The description should also include the kinds of manual processing that took place, if any.

Each run must be contained in a single text file. Each line in the file must be in the form

         topicid Q0 docid rank score run-tag

where

topicid	is the topic number (1..30)
Q0	is the literal 'Q0' (currently unused, but trec_eval expects this column to be present)
docid	is the cord_uid of the document retrieved in this position
rank	is the rank position of this document in the list
score	is the similarity score computed by the system for this document. When your run is processed (to create judgment sets and to score it using trec_eval), the run will be sorted by decreasing score and the assigned ranks will be ignored. In particular, trec_eval will sort documents with tied scores in an arbitrary order. If you want the precise ranking you submit to be used, that ranking must be reflected in the assigned scores, not the ranks.
run-tag	is a name assigned to the run. Tags must be unique across both your own runs and all other participants' runs, so you may have to choose a new tag if the one you used is already taken. Tags are strings of no more than 20 characters and may use letters, numbers, underscore (_), hyphen (-), and period (.) only. It will be best if you make the run tag semantically meaningful to identify you (e.g., if your team ID is NIST, then 'NIST-prise-tfidf' is a much better run tag than 'run1'). Every line in the submission file must end with the same run-tag.
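
Because ranks are ignored and ties are broken arbitrarily, one way to guarantee that the submitted order survives is to write scores that strictly decrease with rank. The following Python sketch does that by emitting a synthetic score derived from the rank; this is only one possible approach, not a requirement of the guidelines. The function name format_run, the document ids, and the run tag in the example are all made up.

```python
def format_run(ranked, run_tag, max_docs=1000):
    """Build run-file lines from a mapping of topic id to docids, best first."""
    lines = []
    for topic_id, docids in ranked.items():
        docids = docids[:max_docs]  # at most 1000 entries per topic
        n = len(docids)
        for rank, docid in enumerate(docids, start=1):
            # Synthetic score that strictly decreases with rank, so sorting
            # by score reproduces the intended order without ties.
            score = float(n - rank + 1)
            lines.append(f"{topic_id} Q0 {docid} {rank} {score} {run_tag}")
    return lines


# Invented ids and tag, for illustration only.
run_lines = format_run({1: ["abc12345", "xyz67890", "def24680"]}, "NIST-demo")
print("\n".join(run_lines))
```

If you prefer to submit your system's real similarity scores, the same idea applies: check that no two documents for a topic share a score before writing the file.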

You will submit your runs by filling out the submission form to upload the run file. Each uploaded file must contain exactly one complete run. You can upload a file that has been compressed using gzip, but you cannot upload archive files (such as tar or zip). One of the fields on the submission form asks for the run tag. The tag entered into the form must exactly match the run tag contained in the run file.

After you click submit, the submission system will run a validation script that will test the submission file for various kinds of formatting errors. A copy of this (Perl) validation script is provided on the Tools page as check_sub.pl. Over the years NIST has found that strict checking of the "sanity" of an input file leads to far fewer problems down the line, as it catches a lot of mistakes in the run at a time when the submitter can actually correct them. You are strongly encouraged to use the script to test your submission file prior to uploading the file to NIST. If any errors are found by the script at the time the run is submitted, the submission system will reject the run. Rejected runs are not considered to be submitted; indeed, no information is retained about rejected runs.

Invoke the script giving the run file name as the argument to the script and an error log file will be created. The error log will contain error messages if any errors exist, and will say that the run was successfully processed otherwise (note all output is directed to this log file, none to STDOUT). The script uses a list of valid document ids to make sure the run contains only documents contained in the appropriate version of the collection. The list is posted to the Data page. If you modify the check script on line 20 so that the "docno_loc" variable points to the valid docids file, your checking of a submission file can also detect invalid documents.
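
For a quick local pre-check before running the official script, the format rules stated in these guidelines can be tested with a few lines of Python. The sketch below is not check_sub.pl and may not catch everything that script does; it only encodes the rules written on this page (six fields, the literal Q0, topic numbers 1..30, a numeric score, a single run tag of at most 20 permitted characters, and between 1 and 1000 documents per topic). The name check_run and its parameters are hypothetical.

```python
import re
from collections import defaultdict

# Letters, numbers, underscore, hyphen, and period; at most 20 characters.
RUN_TAG_RE = re.compile(r"[A-Za-z0-9_.-]{1,20}")


def check_run(lines, topic_ids=range(1, 31), valid_docids=None, max_docs=1000):
    """Return a list of error messages; an empty list means no problem found."""
    errors = []
    counts = defaultdict(int)
    tags = set()
    wanted = {str(t) for t in topic_ids}
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            errors.append(f"line {n}: expected 6 fields, found {len(fields)}")
            continue
        topic, q0, docid, _rank, score, tag = fields
        if topic not in wanted:
            errors.append(f"line {n}: unknown topic {topic}")
        if q0 != "Q0":
            errors.append(f"line {n}: second field must be Q0")
        if valid_docids is not None and docid not in valid_docids:
            errors.append(f"line {n}: unknown document id {docid}")
        try:
            float(score)
        except ValueError:
            errors.append(f"line {n}: score {score} is not a number")
        if not RUN_TAG_RE.fullmatch(tag):
            errors.append(f"line {n}: invalid run tag {tag}")
        tags.add(tag)
        counts[topic] += 1
    if len(tags) > 1:
        errors.append("run tag is not the same on every line")
    for topic in sorted(wanted, key=int):
        if counts[topic] == 0:
            errors.append(f"topic {topic}: no documents retrieved")
        elif counts[topic] > max_docs:
            errors.append(f"topic {topic}: more than {max_docs} documents")
    return errors
```

Passing the set of valid cord_uid values as valid_docids enables the document id check; leaving it as None skips that check. The official script remains the authority on what the submission system accepts.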

The submission deadline for Round 1 runs is 7:00 a.m. EDT on April 23, 2020. At that time, the submission system will be turned off. Only valid runs that were submitted through the submission system prior to the deadline will be counted as TREC-COVID Round 1 runs. Each participant can submit at most three runs.

After you submit a run, you will receive an automated reply from the submission system stating that the run was received and echoing back the information gathered from the submission form as a confirmation. This confirmation is sent by email to the address entered on the submission form (so please be sure to enter a valid address!). The email message is the only response you will receive about a submitted run until after the relevance judgments are complete and the run is scored (about 10 days after the submission deadline). The trec_eval report containing the scores will be mailed to the same submission email address as the confirmation is sent to.

You cannot delete or modify a run once it is submitted. In particular, you cannot submit a "corrected" version of a run by uploading a new run that uses the same run tag. This prohibition against remote removal of runs is a safety precaution to ensure no one mistakenly (or deliberately!) overwrites someone else's run. Since there is a three-run limit on submissions from a participant, please carefully check your runs before you submit them to avoid problems. Nonetheless, the submission system will accept more than three runs from a team to accommodate submitters who discover bugs in their submitted runs prior to the deadline. If you need to correct a run, submit a new run with a different run tag, and notify NIST stating which run the new run should replace. Please use this as a last resort since all such fixes require manual intervention by NIST staff. If too many runs are submitted without explanation, NIST will choose an arbitrary three runs and discard the rest.

Relevance Judgments and Scoring

The relevance judgments are what turn a document and topic set into a test collection. However, interesting document sets, including CORD-19, are much too large to get complete judgments---a human assessment of relevance for every document for each topic. Instead, collection builders must sample the collection so that a small fraction of the entire document set is judged for a topic but (most of) the relevant documents are nonetheless identified. TREC pioneered the use of pooling and other sampling techniques to create the TREC test collections.

Since TREC-COVID consists of multiple rounds with quick turn-around within rounds, organizers are targeting an average of about 100 document judgments per topic in the initial round. Exactly what sampling strategy will be used to create the sets of documents to be judged is still to be determined---it will depend on the number and diversity of runs received. Any such strategy will use the ranks at which documents are retrieved as a factor (the earlier a document is retrieved by some run the more likely it will be to enter the judgment set). If the overall total of runs received is too large, we will have to restrict the number of runs from each participant that can contribute to the judgment sets to fewer than three. To prepare for such an eventuality, the submission form asks the submitter to assign a priority (1--3) to the run being submitted. Priority 1 runs will be the first runs to contribute to the judgment sets, followed by priority 2 and then priority 3 runs. If a participant assigns the same priority to multiple runs, NIST will choose from among those runs arbitrarily.

The relevance judgments will be made by human annotators who have biomedical expertise. Annotators will use a three-way scale:

Relevant: the article is fully responsive to the information need as expressed by the topic, i.e., it answers the question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.
Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
Not Relevant: everything else.

Relevance is known to be highly subjective; as in all test collections, the opinion of the particular human annotator making the judgment will be final.

Scores will be computed using trec_eval. The more common measures computed by trec_eval are described in the appendix to the TREC proceedings, such as here. Since a relatively small number of documents will be judged per topic, recall-based measures are unlikely to be reliable in the first round, so measures focused on early ranks (e.g., Precision@10 or ndcg@10) will be more meaningful.
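
To make the two early-rank measures concrete, here is a small Python sketch of Precision@10 and nDCG@10 for a single topic. It is an illustration only; official scores come from trec_eval. Several things are assumptions: the three-way scale is mapped to gains 2 (Relevant), 1 (Partially Relevant), and 0 (Not Relevant); unjudged documents count as not relevant; any gain above 0 counts as relevant for precision; and DCG uses the common gain / log2(rank + 1) discount. The document ids are invented.

```python
import math


def precision_at_k(ranking, qrels, k=10):
    """Fraction of the top k ranks holding a document judged above 0."""
    return sum(1 for doc in ranking[:k] if qrels.get(doc, 0) > 0) / k


def ndcg_at_k(ranking, qrels, k=10):
    """DCG of the top k divided by the DCG of an ideal top k."""

    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0


# Invented judgments for one topic: docid -> assumed gain value.
qrels = {"d1": 2, "d2": 1, "d3": 0}
ranking = ["d1", "d3", "d2", "d9"]  # d9 is unjudged
print(precision_at_k(ranking, qrels))  # 2 relevant in the top 10 -> 0.2
print(ndcg_at_k(ranking, qrels))
```

Note that precision divides by k even when fewer than k documents were retrieved, which mirrors the earlier point that empty ranks are treated as not relevant.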

Timeline

April 15	Round 1 kick-off; topics released
April 23, 7:00am EDT	Round 1 submission deadline
April 25--May 3	Annotators make relevance judgments
May 4	Relevance judgments posted; evaluation scores returned to participants. Round 2 kick-off
May 13	Round 2 submission deadline
