Index of /cdip
This directory contains ISO images of the six DVDs comprising the IIT CDIP
collection, used in the first several years of the TREC legal track.
We are offering it here freely for download, since no one is distributing
these DVDs anymore.
Below is the text of the web page from IIT describing the collection.
The IIT CDIP Test Collection
IIT CDIP 1.0 (Illinois Institute of Technology Complex Document
Information Processing Test Collection, version 1.0) is a data set
supporting research in information retrieval, document analysis,
computational linguistics, data mining, and related fields. It
consists of:
  - IIT CDIP Records 1.0: 6,910,192 XML records describing documents that
    were released in various lawsuits against the US tobacco companies and
    research institutes
  - IIT CDIP Queries 1.0: 40 topic descriptions used in the TREC 2006
    Legal Track (see below)
  - IIT CDIP Assessments 1.0: Relevance judgments on the 40 topics against
    pooling-based samples of the records
The records contain both text and metadata. The text was produced by
applying optical character recognition (OCR) to document images in
TIFF format. The metadata was produced by the tobacco organizations
using a variety of techniques. It includes a title, a listing of the
senders and recipients of the document, important names mentioned in
the document, controlled vocabulary categories, geographical and
organizational context data, and other information. Not all metadata
fields are available for all documents, the formatting is
inconsistent, and there is an unknown level of errors and
omissions.
IIT CDIP 1.0 was used in text retrieval experiments in the TREC 2006
Legal Track evaluation, as described in this paper:
Baron, J.; Lewis, D.; Oard, D. TREC 2006 Legal Track Overview. The
Fifteenth Text REtrieval Conference Proceedings (TREC 2006). To
Appear.
Further details on the collection can be found in this paper:
Lewis, D.; Agam, G.; Argamon, S.; Frieder, O.; Grossman, D.; and
Heard, J. Building a Test Collection for Complex Document Information
Processing. SIGIR '06, pp. 665-666.
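Because not every metadata field is present in every record and the
formatting is inconsistent, code that reads the records probably needs to
tolerate missing or empty fields. The following is a minimal sketch of
that idea in Python. The element names (record, title, sender, recipient,
text) and the sample values are hypothetical placeholders chosen for
illustration; they are not the actual IIT CDIP schema, which should be
checked against the records on the DVDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout: these element names and values are
# illustrative assumptions, NOT the real IIT CDIP schema.
SAMPLE_RECORD = """<record>
  <title>Example title</title>
  <recipient>Example recipient</recipient>
  <text>Example OCR text</text>
</record>"""

# Hypothetical field list; "sender" is deliberately absent from the sample.
FIELDS = ("title", "sender", "recipient", "text")


def extract_fields(xml_string, fields=FIELDS):
    """Map each field name to its stripped text, or None if missing or empty."""
    root = ET.fromstring(xml_string)
    result = {}
    for name in fields:
        value = root.findtext(name)  # None when the element is absent
        value = value.strip() if value else ""
        result[name] = value or None  # treat whitespace-only as missing
    return result


record = extract_fields(SAMPLE_RECORD)
missing = [name for name, value in record.items() if value is None]
print("missing fields:", missing)
```

Returning None for absent fields, rather than raising an error, lets a
caller count how often each field is missing before deciding which
fields are reliable enough to index.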