Movies and TV
Source
The Internet Movie Database (IMDb) provides these data sets:
These data sets are refreshed on a daily basis.
Description of the Data
There are 7 data files provided by IMDb:
name.basics.tsv.gz
title.akas.tsv.gz
title.basics.tsv.gz
title.crew.tsv.gz
title.episode.tsv.gz
title.principals.tsv.gz
title.ratings.tsv.gz
The "Rotten Tomatoes movies and critic reviews dataset" is likely from here:
The Office dialogue is likely from here:
Transformations to the original data source
Kevin performed several transformations of the data.
He created comma-separated versions of the data from the IMDb tsv files:
akas.csv, crew.csv, episodes.csv, people.csv, ratings.csv, titles.csv
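A conversion of this kind can be sketched in Python as follows. This is only an illustration, not the script Kevin actually used. The function names convert_rows and tsv_gz_to_csv are hypothetical, and the sketch assumes the IMDb files are UTF-8 and do not use quoting, so any double quotes in a title are treated as literal characters.

```python
import csv
import gzip


def convert_rows(tsv_lines, csv_out):
    """Copy tab-separated rows from an iterable of lines to a CSV writer.

    Assumption: the input does not use quoting, so QUOTE_NONE keeps
    any double quotes in a field as literal characters.
    """
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(csv_out)  # quotes fields containing commas or quotes
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count  # number of rows written, header included


def tsv_gz_to_csv(src_path, dst_path):
    """Convert one gzipped TSV file to a CSV file (hypothetical helper)."""
    with gzip.open(src_path, "rt", encoding="utf-8", newline="") as fin, \
            open(dst_path, "w", encoding="utf-8", newline="") as fout:
        return convert_rows(fin, fout)
```

For example, tsv_gz_to_csv("title.ratings.tsv.gz", "ratings.csv") would write one CSV file. The pairing of each IMDb file with a CSV name is a guess here, since the mapping is not stated above.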
He also created a database file:
imdb.db
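The format of imdb.db is not stated here. Assuming it is a SQLite database (an assumption, not something confirmed by the source), a minimal Python sketch for listing its tables could look like this. The function name list_tables is hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return the table names in a SQLite database, sorted by name.

    Assumption: the file at db_path is a SQLite database.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]
```

Calling list_tables("imdb.db") would then show which tables the database file contains, if the SQLite assumption holds.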
These two files are from the Rotten Tomatoes dataset:
rotten_tomatoes_movies.csv, rotten_tomatoes_reviews.csv
The dialogue from The Office is stored in:
the_office_dialogue.csv
Note that the source website mentions "55130 observations of 12 variables", but one record in the csv file appears to span 3 lines (i.e., it adds 2 extra lines), and there is also a header line. Thus, the file the_office_dialogue.csv has a total of 55130 + 2 + 1 = 55133 lines.
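One way to check this kind of discrepancy is to compare the number of physical lines with the number of parsed CSV records. The sketch below uses only the Python standard library. The function name count_lines_and_records is hypothetical, and the sketch assumes the file is UTF-8 and that the multi-line record is a quoted field containing newlines.

```python
import csv


def count_lines_and_records(path, encoding="utf-8"):
    """Return (physical_lines, csv_records) for a CSV file.

    A quoted field that contains newlines counts as several physical
    lines but only one CSV record. The header counts as one record.
    """
    # Count physical lines by iterating over the raw bytes.
    with open(path, "rb") as f:
        physical_lines = sum(1 for _ in f)
    # Count records with the csv module, which handles quoted newlines.
    with open(path, newline="", encoding=encoding) as f:
        csv_records = sum(1 for _ in csv.reader(f))
    return physical_lines, csv_records
```

If the description above is accurate, count_lines_and_records("the_office_dialogue.csv") would be expected to report 55133 physical lines and 55131 records (55130 observations plus the header), a difference of 2.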
We can download the IMDb data using: