The BETTER program aims to dramatically compress the information discovery cycle by designing systems that extract personalized, semantic information from text and leverage this information to substantially improve search capabilities.
The BETTER program created datasets for information extraction and cross-language information retrieval. These datasets were built using content from the CommonCrawl news collection. The linguistic annotations were provided by MITRE and ARLIS, and relevance assessment annotations by NIST. The datasets were structured around the following six evaluation tasks:
Abstract: identify events in a sentence, and mark them as material or verbal, helpful or harmful.
Basic: events have a type, an agent, a patient, and possibly related events. You might think of Basic as stripped-down MUC events.
Granular: events are templates that consist of several component Basic events.
Cross-language retrieval by example: fully automatic retrieval, given a small number of example documents and passages in place of a query.
Human-in-the-loop (HITL) retrieval: the user is shown a small number of requests with narrative descriptions, and is allowed to tune the system for future requests on the same topic.
"Automatic" HITL: systems are shown the small number of requests with narrative descriptions, and automatically adapt to future requests, for example by creating background queries.
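Abstract events consist of an agent, a patient, an event anchor, and a quad-class. The quad-class is a two-dimensional event type: the first dimension is Material, Verbal, or both, and the second is Helpful or Harmful (the example table below also uses Neutral for the second dimension).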
As an example, consider the sentence below and the abstract events identified in the table below it. This sentence mentions four events that would be captured in the abstract extraction task.
“According to several witnesses, the boiler explosion in the factory injured three people, but did not melt any of the nearby fuel lines.”
Event ID | Agent(s) | Anchor(s) | Patient(s) | Quad-class |
1 | "several witnesses" | "according" | "injure", "melt" | Verbal-Neutral |
2 | "boiler" | "explosion" | "boiler" | Material-Harmful |
3 | "explosion" | "injure" | "three people" | Material-Harmful |
4 | "explosion" | "melt" | "the nearby fuel lines" | Material-Harmful |
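To make the structure of the table above concrete, here is a minimal Python sketch that holds the four example events in memory. The class and field names (`AbstractEvent`, `Kind`, `Effect`, `by_quad_class`) are hypothetical and chosen for illustration only; they are not the schema of the `.bp.json` files, and the `BOTH` label is an assumed way to represent an event that is both Material and Verbal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Kind(Enum):
    """First quad-class dimension (names are illustrative)."""
    MATERIAL = "Material"
    VERBAL = "Verbal"
    BOTH = "Both"  # assumed label for "Material and Verbal"


class Effect(Enum):
    """Second quad-class dimension; Neutral appears in the example table."""
    HELPFUL = "Helpful"
    HARMFUL = "Harmful"
    NEUTRAL = "Neutral"


@dataclass
class AbstractEvent:
    event_id: int
    agents: List[str]
    anchors: List[str]
    patients: List[str]
    kind: Kind
    effect: Effect

    @property
    def quad_class(self) -> str:
        # Render the two dimensions the way the table does, e.g. "Material-Harmful".
        return f"{self.kind.value}-{self.effect.value}"


# The four events from the example table, copied as given.
EVENTS = [
    AbstractEvent(1, ["several witnesses"], ["according"], ["injure", "melt"],
                  Kind.VERBAL, Effect.NEUTRAL),
    AbstractEvent(2, ["boiler"], ["explosion"], ["boiler"],
                  Kind.MATERIAL, Effect.HARMFUL),
    AbstractEvent(3, ["explosion"], ["injure"], ["three people"],
                  Kind.MATERIAL, Effect.HARMFUL),
    AbstractEvent(4, ["explosion"], ["melt"], ["the nearby fuel lines"],
                  Kind.MATERIAL, Effect.HARMFUL),
]


def by_quad_class(events: List[AbstractEvent], label: str) -> List[AbstractEvent]:
    """Return the events whose rendered quad-class equals `label`."""
    return [event for event in events if event.quad_class == label]
```

Note that the patients of event 1 are the anchors of events 3 and 4, which is how the table expresses that the witnesses' report is about the other events.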
Abstract documentation | (pdf format) |
abstract-eng.bp.json | English abstract data. This data was hidden in the BETTER evaluation. |
abstract-arb.bp.json | Arabic abstract data. This was the phase 1 evaluation test set. |
abstract-fas.bp.json | Farsi abstract data. This was the phase 2 evaluation test data. There is also a README file. |
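A first look at one of these files might be done as in the sketch below. It assumes only that the file is valid JSON encoded as UTF-8; it deliberately assumes nothing about the keys inside, so consult the Abstract documentation for the actual schema. The function name `summarize_json` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def summarize_json(path: Path) -> Dict[str, str]:
    """Map each top-level key of a JSON file to the type name of its value."""
    with path.open(encoding="utf-8") as handle:
        data: Any = json.load(handle)
    if not isinstance(data, dict):
        # The root is a list or scalar, so report its type under a placeholder key.
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}
```

For example, `summarize_json(Path("abstract-eng.bp.json"))` would return the top-level keys of the English file with the Python type of each value, which is usually enough to decide how to traverse the rest.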
Description of the Basic annotation and task goes here.
basic-eng.full.bp.json | English Basic data (full). In the BETTER evaluation, this set was split into train, devtest, analysis, and hidden subsets. |
basic-arb.bp.json | Arabic Basic data. |
basic-fas.bp.json | Farsi Basic data. |
Description of granular annotation and task goes here.
granular-eng-p1.full.bp.json | English Granular data, phase 1 (full). In the BETTER evaluation, this set was split into train, devtest, analysis, and hidden subsets. |
granular-eng-p2.full.bp.json | English Granular data, phase 2. |
granular-arb.bp.json | Arabic Granular data. |
granular-fas.bp.json | Farsi Granular data. |
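The relationship between the two annotation levels, as described in the task list (a Basic event has a type, an agent, a patient, and possibly related events; a Granular event is a template built from several Basic events), can be sketched as follows. All class names, field names, and the event and template type strings are hypothetical illustrations, not the annotation ontology or the `.bp.json` schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BasicEvent:
    event_id: str
    event_type: str
    agents: List[str] = field(default_factory=list)
    patients: List[str] = field(default_factory=list)
    # Related events are referenced by ID rather than nested.
    related_event_ids: List[str] = field(default_factory=list)


@dataclass
class GranularTemplate:
    template_type: str
    component_event_ids: List[str]

    def resolve(self, events: Dict[str, BasicEvent]) -> List[BasicEvent]:
        """Look up the component Basic events, preserving their listed order."""
        return [events[event_id] for event_id in self.component_event_ids]


# Invented example values, loosely based on the boiler sentence used earlier.
basic_events = {
    "e1": BasicEvent("e1", "explosion", agents=["boiler"], related_event_ids=["e2"]),
    "e2": BasicEvent("e2", "injury", agents=["explosion"], patients=["three people"]),
}
template = GranularTemplate("industrial-accident", ["e1", "e2"])
components = template.resolve(basic_events)
```

Referencing component events by ID keeps each Basic event stored once even if several templates share it; this is a design choice of the sketch, not a statement about how the released data is organized.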
Description of information retrieval annotation and task goes here.
BETTER-Phase1-IR-HITL-package.tar.gz | Phase 1 IR collection. This includes an English training corpus, English queries with Basic annotations, and an Arabic target corpus with relevance judgments and Basic annotations. |
BETTER-Phase2-IR-HITL-package.tar.gz | Phase 2 IR collection. This includes an English training corpus, English queries with Basic annotations, and a Farsi target corpus with relevance judgments and Basic annotations. |
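Before unpacking one of these packages, it can help to list what it contains. The sketch below uses Python's standard `tarfile` module and assumes only that the download is a gzip-compressed tar archive, as the `.tar.gz` extension suggests. The function name `list_archive` and the `limit` parameter are hypothetical, and the internal layout of the packages is not assumed.

```python
import tarfile
from pathlib import Path
from typing import List


def list_archive(path: Path, limit: int = 20) -> List[str]:
    """Return up to `limit` member names from a gzip-compressed tar archive."""
    names: List[str] = []
    with tarfile.open(path, mode="r:gz") as archive:
        for member in archive:
            if len(names) >= limit:
                break
            names.append(member.name)
    return names
```

Calling `list_archive(Path("BETTER-Phase1-IR-HITL-package.tar.gz"))` would return the first 20 member names without extracting anything to disk.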
For more information, contact Ian Soboroff.