Information on the 2020 Challenge

FEIII 2020 Dataset README

Download:
https://drive.google.com/file/d/1aaXx7vrrvkalXbBUtJ7mPLQmGWZQ-fL6/view?usp=sharing

The dataset is built around 1000 public companies traded on the NYSE.

For each company, we provide basic identification information, including the company website, the 6-digit NAICS code, and the industry sector. The 1000 companies are spread across 15+ industry sectors.

The NSF-sponsored BOKN project has created a historical archive (covering the past 20+ years) of the websites of approximately a million US-based companies, both public and private. We have created a 300-dimensional vector embedding centered around each company. The FEIII 2020 dataset includes a subset of this embedding for the 1000 public companies.

The protocol to create the dataset was to start with several seed companies across the industry sectors. Using the embedding, we obtained the Top K most-similar neighbor companies for each seed company. The value of K was varied to reflect the distribution of the companies across the sectors.
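
The Top K neighbor step described above can be sketched as follows. This is a minimal illustration only, not the code used to build the dataset: it assumes cosine similarity as the similarity measure, uses randomly generated vectors in place of the real embedding, and the function name top_k_neighbors is hypothetical.

```python
import numpy as np


def top_k_neighbors(embeddings, seed_index, k):
    """Return the indices of the k rows most cosine-similar to the seed row.

    Hypothetical helper for illustration; cosine similarity is an assumption.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    # Scale every row to unit length so a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[seed_index]
    # Exclude the seed company itself from its own neighbor list.
    sims[seed_index] = -np.inf
    # Sort by descending similarity and keep the first k indices.
    return np.argsort(-sims)[:k]


# Synthetic stand-in for the real data: 1000 companies x 300 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 300))

neighbors = top_k_neighbors(embeddings, seed_index=0, k=10)
print(neighbors)
```

Choosing a different k per seed company, as the protocol describes, only requires calling the helper with a different k for each seed.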

The data is provided as a NumPy (.npy) file: FEIII_2020_WTNIC_2015_data_and_embeddings.npy
We provide a Jupyter notebook, explore_data_1, that illustrates how the embedding can be used to compute the pairwise similarity between two companies.
Examples of using the embedding for prediction, e.g., of the sector or NAICS code, are available in the sim_viz_prediction_2 notebook.
BOKN project: https://www.nsf.gov/awardsearch/showAward?AWD_ID=1937153
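
For readers who want a feel for the two uses of the embedding mentioned above before opening the notebooks, here is a minimal sketch. It is not taken from the notebooks: it assumes cosine similarity for the pairwise comparison and a simple nearest-neighbor majority vote for prediction, and the function names, toy vectors, and sector labels are all hypothetical.

```python
from collections import Counter

import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def predict_label(embeddings, labels, query_index, k=5):
    """Predict a label by majority vote among the k most similar other rows."""
    embeddings = np.asarray(embeddings, dtype=float)
    # Unit-normalize rows so dot products are cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[query_index]
    # The query company must not vote for itself.
    sims[query_index] = -np.inf
    nearest = np.argsort(-sims)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-dimensional vectors and made-up sector labels (the real embedding
# has 300 dimensions per company).
toy_embeddings = np.array([
    [1.0, 0.1], [1.0, 0.2], [0.9, 0.0],
    [0.1, 1.0], [0.0, 0.9], [0.2, 1.0],
])
toy_sectors = ["Energy", "Energy", "Energy", "Finance", "Finance", "Finance"]

print(cosine_similarity(toy_embeddings[0], toy_embeddings[1]))
print(predict_label(toy_embeddings, toy_sectors, query_index=0, k=2))
```

The same voting scheme applies to NAICS codes by passing the codes as labels instead of sectors.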

We invite FEIII participants to identify an interesting prediction challenge centered around this dataset. You are welcome to combine this dataset with additional data. Participants are invited to share their findings at the DSMM Workshop, held in conjunction with ACM SIGMOD.
https://sigmod2020.org/sigmod_workshops.shtml

Tentative timeline

February 17	Release datasets
April 15	1-page abstract to the DSMM Workshop http://dsmmworkshop.org
May 15	Camera-ready 2-4 page submission to DSMM Workshop
Sunday June 14	DSMM Workshop