Information on the 2020 Challenge

FEIII 2020 Dataset README

Download:
https://drive.google.com/file/d/1aaXx7vrrvkalXbBUtJ7mPLQmGWZQ-fL6/view?usp=sharing

The dataset is built around 1000 public companies traded on the NYSE.

For each company, we provide basic identifying information, including the company website, the 6-digit NAICS code, and the industry sector. The 1000 companies are spread across 15+ industry sectors.
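As a rough illustration of how the NAICS code relates to a sector, the sketch below takes the first two digits of a 6-digit code, which in the NAICS scheme designate the sector, and counts companies per prefix. The function name and the sample codes are hypothetical and are not taken from the dataset files. Some NAICS sectors span several prefixes (for example 31-33 for Manufacturing), so the industry sector field shipped with the dataset may group companies differently.

```python
from collections import Counter


def naics_sector_prefix(naics_code):
    """Return the 2-digit sector prefix of a 6-digit NAICS code.

    Hypothetical helper for illustration; not part of the dataset.
    """
    code = str(naics_code).strip()
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a 6-digit NAICS code, got {naics_code!r}")
    return code[:2]


# Made-up sample codes, used only to show the grouping step.
sample_codes = ["541511", "541512", "522110", "334111"]
sector_counts = Counter(naics_sector_prefix(c) for c in sample_codes)
print(sector_counts)
```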

The NSF-sponsored BOKN project has created a historical archive, covering the past 20+ years, of the websites of approximately one million US-based companies, both public and private. We have created a 300-dimensional vector embedding centered around each company. The FEIII 2020 dataset includes a subset of this embedding for the 1000 public companies.

To create the dataset, we started with several seed companies across the industry sectors. Using the embedding, we obtained the Top K most similar neighbor companies for each seed company. The value of K was varied to reflect the distribution of the companies across the sectors.
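The Top K step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the dataset: it assumes cosine similarity as the similarity measure (the README does not say which measure was used), and the function names, company identifiers, and 3-dimensional toy vectors are all hypothetical stand-ins for the real 300-dimensional embedding.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def top_k_neighbors(seed, embeddings, k):
    """Return the k companies most similar to `seed`, best first.

    `embeddings` maps a company identifier to its vector.
    The seed itself is excluded from the result.
    """
    seed_vec = embeddings[seed]
    scored = [
        (cosine_similarity(seed_vec, vec), name)
        for name, vec in embeddings.items()
        if name != seed
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy vectors (assumption: real vectors have 300 dimensions).
toy = {
    "seed_co": [1.0, 0.0, 0.0],
    "near_co": [0.9, 0.1, 0.0],
    "mid_co": [0.5, 0.5, 0.0],
    "far_co": [0.0, 0.0, 1.0],
}
print(top_k_neighbors("seed_co", toy, 2))  # ['near_co', 'mid_co']
```

Varying K per seed, as the protocol describes, would amount to calling `top_k_neighbors` with a different `k` for each seed company.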

We invite FEIII Participants to identify an interesting prediction challenge centered around this dataset. You are welcome to combine this dataset with additional data. Participants are invited to share their findings at the DSMM Workshop in conjunction with ACM SIGMOD.
https://sigmod2020.org/sigmod_workshops.shtml

Tentative timeline

February 17: Release datasets
April 15: 1-page abstract to the DSMM Workshop http://dsmmworkshop.org
May 15: Camera-ready 2-4 page submission to DSMM Workshop
Sunday June 14: DSMM Workshop