TREC Washington Post Corpus

The TREC Washington Post Corpus contains 728,626 news articles and blog posts from January 2012 through December 2020. Each article is stored in JSON format and includes:

  • title
  • byline
  • date of publication
  • kicker (a section header)
  • article text broken into paragraphs
  • links to embedded images and multimedia (for 2012-2017 documents)

Compressed, the tarball is about 2.4GB; decompressed, the data is 14GB.
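
As a rough illustration of working with JSON-formatted articles carrying the fields listed above, here is a minimal Python sketch. It assumes one JSON object per line, and the key names ("title", "byline", "date", "kicker", "paragraphs") are hypothetical placeholders chosen to mirror the list; the actual layout and key names in the distributed files may differ, so check a real record before relying on them. The sample record is made up for the example.

```python
import json
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed article per non-empty line.

    Assumption: the data is laid out as one JSON object per line.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(article: dict) -> dict:
    """Pull out the fields described on this page.

    The key names below are hypothetical, not taken from the corpus.
    """
    paragraphs = article.get("paragraphs") or []
    return {
        "title": article.get("title"),
        "byline": article.get("byline"),
        "date": article.get("date"),
        "kicker": article.get("kicker"),
        "n_paragraphs": len(paragraphs),
    }


# A made-up record standing in for one line of the decompressed data.
sample_lines = [
    '{"title": "Example headline", "byline": "A. Reporter", '
    '"date": "2012-01-01", "kicker": "Local", '
    '"paragraphs": ["First paragraph.", "Second paragraph."]}',
    "",  # blank lines are skipped
]

summaries = [summarize(a) for a in iter_articles(sample_lines)]
for s in summaries:
    print(s)
```

Because the decompressed data is 14GB, reading it line by line through a generator, for example by passing an open file object to iter_articles, avoids loading the whole corpus into memory at once.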

You need to have an organizational agreement on file with NIST to download these files.

NEWS: Version 4 has been released and includes articles from 2020.

This page created on November 8, 2017
Last updated on April 23, 2020
Contact: [email protected]