Background

In 2016, we ran the first FEIII Data Challenge. This challenge is now over. From this page, you can read the details of the challenge and get the data.

Data

Here is the 2016 challenge data. The zip file contains:

Data and Metadata: these are the data files that were made available to challenge participants. The files are in both CSV and Excel format.
Resources: These are slides and documents provided to participants as light background information.
Ground Truth: These are the answer keys for the data challenge. Included are all entity match adjudications including those which remained "ambiguous" despite significant effort to determine the right answer.

Guidelines

(The following are the instructions given to challenge participants. Follow this information if you want to pretend to do the 2016 challenge. If you just want to understand the data, you should read this as well.)

(guidelines v1.1, 5 December 2015)

Overview

We provide four data files, giving information on lists of financial entities. The task is to identify matching entities -- records (rows) that indicate the same entity -- across two of the files. This document describes the file formats, the provenance of the data, the specific evaluation tasks, how to participate, and how we will measure matching effectiveness.

Input Data

We provide four input data files; all are in csv format, as follows:

FFIEC.csv: This file comes from the Federal Financial Institution Examination Council and provides information about banks and other financial institutions that are regulated by agencies affiliated with the Council.

LEI.csv: This file contains Legal Entity Identifiers (LEI) for a wide range of institutions.

SEC.csv: This file comes from the Securities and Exchange Commission and contains entity information for entities registered with the SEC.

FHLB.csv: This file comes from the Federal Home Loan Banks and contains information for banks within the FHLB system.

The files are all in CSV format. As usual, the first row in each file contains column names. The columns are described in the data dictionary document.

The majority of the information in the files is obtained from the indicated source and has not been altered or cleaned. We have added some additional columns with clean versions of some of the original data columns, and the data dictionary indicates such added fields.

In general, many records align trivially, but there are a number of factors that make certain cases complicated.

The different regulators keep different data on each organization. For one, an address might be a single field, whereas for another, the address might be broken into three columns, and in another might only have a zip code.

There are often inconsistencies in how entity names and addresses are entered, in addition to outright errors and typos.
There is implicit semantic knowledge included in a name, e.g., a name may contain “National Association” or “State Bank of” in its name. This complicates matching based on a similarity score that is obtained using some edit distance metric.

We will provide some sample ground truth matches for the evaluation tasks in the same format as the expected result submission files. These ground truth sample files will be made available on January 15, 2016.

Ideally, the record matches for each task should be produced automatically (no human in the loop), using only information in the provided files indicated for each task. Our baseline ground-truthing was done using the Duke record deduplication tool, which you should feel free to use as a starting point.

External data: You may choose to make use of additional information external to the data files provided. If you do so, we will ask you to indicate this and to describe these resources with your challenge entry.

Manual intervention: You can examine the data but the results that you submit to the challenge must be the result of an automated match and must not include results from a manual match. You may not hand-tune your matches to the challenge data. We will ask you to sign a statement about this when you submit a match.

Evaluation Tasks

There are four tasks in the FEIII Challenge. The first two are required for participation, the second two are optional.

For each task, you will submit your results in CSV format using a filename following the convention ORG_TASK_X, where ORG is a short organization name you provide when you sign up to participate, TASK is as indicated below, and X is a sequence number which allows you to submit more than one set of results to each task.

Task 1: FFIEC -> LEI

The first task is to align records (rows) between the FFIEC and LEI files. You should produce a file ORG_FFIEC_LEI_TP_X in CSV format with the following two columns:

"FFIEC_IDRSSD" and "LEI_LEI" where the entry in each row is the respective identifier from the FFIEC and LEI files. Aside from a header row, this file should only contain matches between the two files; do not include blank cells for rows that do not have a match in the other file.

If you wish you may produce a second file ORG_FFIEC_LEI_TN_X in CSV format with the following column: "FFIEC_IDRSSD"; this will contain entries from the FFIEC file where you are certain that there is no match in the LEI file.

Task 2: FFIEC -> SEC

The second task is to align records (rows) between the FFIEC and SEC tables. You should produce a file ORG_FFIEC_SEC_TP_X in CSV format with the following two columns: "FFIEC_IDRSSD" and "SEC_CIK" where the entry in each row is the respective identifier from the FFIEC and SEC files. You may also wish to produce a file ORG_FFIEC_SEC_TN_X that contains entries from the FFIEC_IDRSSD column where you are certain there is no match in the SEC file.

Optional Task 3: FFIEC -> LEI, SEC

This task is to find FFIEC entities that are contained in both the LEI and SEC files. The output file ORG_FFIEC_LEI_SEC_TP_X should include three columns, "FFIEC_IDRSSD", "LEI_LEI", and "SEC_CIK".

Optional Task 4: LEI -> SEC

This is an alignment of the LEI and SEC files. The output file ORG_LEI_SEC_TP_X

should include two columns, "LEI_CEI", and "SEC_CIK". This can be a many-to-many match so there may be multiple matching entries for an entry from the LEI or SEC files.

Scoring

Using the ground truth matches (see below), we can compute quality measures that reflect how effective your matching algorithm is at performing each task. The measures we plan to compute for each task entry are as follows:

Matches submitted: The count of how many match entries are in your XXX_TP file.

Ground truth matches: The count of true positive matches in the "ground truth" match.

Precision: The fraction of matches submitted in your XXX_TP file that are correct with respect to the "ground truth" match.

Recall: The fraction of "ground truth" matches that appear in your XXX_TP file.

For TN submissions, we will compute the same metrics based on the rows in the source file that are known not to have a match in the second.

Submission Process

Each group may submit a maximum of three entries per task.

As indicated above, you may submit more than one set of results for each task. You may submit a maximum of four result files for each task. This might reflect different parameter settings in your matching algorithm.

Following the submission deadline on March 15, 2016, we will compute the scores for each submission, and send you your scores, along with the distribution of scores across all participants.

We will invite a subset of participants to prepare a one-page description of the methods used by your entry. A subset of these will be invited to speak at the SIGMOD workshop.

At the report-out workshop at SIGMOD, we will release all entries, scores, and ground truth data.

Ground truth process

To build the ground truth, we wrote a baseline algorithm which outputs a match score for each pair of rows. We chose a score threshold to separate the baseline output into "possible true positives" and "possible false positives". All possible true positive matches, and a sample of possible false positives matches were reviewed by a team of experts to determine the correct answer. Non-adjudicated matches are assumed to be false positive matches for purposes of the evaluation.

Following the submission deadline, we will sample from the non-adjudicated false positives that may appear in any of the XXX_TP submission files, to form a second adjudication set. These will be also reviewed by the experts. After this second round of review is complete, final scores will be computed and sent to challenge participants.

Baseline for matching

A baseline system for matching was customized using the Duke deduplication engine (https://github.com/larsga/Duke). Some details of this baseline matching process are in these slides.

Timeline

Data release: November 16, 2015
Task guidelines: December 14, 2015
Example ground truth release: January 15, 2016
Deadline for submission of results: March 31, 2016 (deadline extended!)
Evaluation scores and ground truth released to participants: April 15, 2016
One page report from invited participants: May 6, 2016
Workshop report from organizers: May 6 2016
Workshop event: Friday July 1, 2016 in conjunction with SIGMOD

Information on the 2016 Challenge