Movie data analysis


Intro

I decided to dig into my film-watching habits as a data exploration exercise.

I used to watch a lot of films. And while watching movies is no longer a significant pastime for me, it was fun at the time. Aside from the apparent interest in the films themselves, the extensive pop-culture knowledge I gained acted as some social currency. Back then, I would make the (outrageous) claim that I’d probably watched every film that would ever come up in casual conversation.

After each film I watched, I would immediately give it a rating out of 10 on IMDb. I didn’t think I was a critic - this was just the best way to get recommendations for even more films in the pre-Netflix era. A pleasant side effect of this ritual is that I now have a record of every movie I watched in that time, along with when I watched it, what I thought of it and other metadata. With the help of IMDb’s data export feature, I now have my hands on my own little cinematic time capsule! This blog post takes a look at that data in pursuit of insights.

Some of the questions I want to answer

  • What were my viewing habits?
  • Did I have interesting or unusual tastes, or was I a poser?
  • How do my ratings compare to other IMDb users?

Housekeeping notes

  • I’ve written this post as a Jupyter notebook, with inline code snippets, and exported as markdown (you can see the original notebook here).
  • If you’re not familiar with Jupyter, it’s a notebook environment that enables literate programming.
  • I will keep my dataset private (see here for why!), but you can grab your own dataset here.
  • I’ll be using the terms ‘movies’ and ‘films’ interchangeably throughout. I’m sure there’s some technical distinction, but generally I think the former is an Americanism.
  • I’ll be making use of the two canonical Python libraries: pandas (for data analysis) and seaborn (for visualisation).

Setup

First, some (non-interesting) initial setup

# Imports
import pandas as pd
import seaborn as sns

# Configuration
pd.options.display.float_format = '{:.1f}'.format
pd.options.mode.chained_assignment = None  # default='warn'
sns.set(rc={'figure.figsize':(16,12)})
# Let's read in the data
INPUT_FILE_ENCODING = "ISO-8859-1"
input_data_path = "imdb_ratings.csv"
imdb_data = pd.read_csv(input_data_path, encoding=INPUT_FILE_ENCODING)
# ... and preview it
imdb_data.head(1)
| | Const | Your Rating | Date Rated | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0100053 | 1 | 2013-01-18 | Loose Cannons | https://www.imdb.com/title/tt0100053/ | movie | 4.9 | 94.0 | 1990 | Action, Comedy, Crime, Thriller | 4394.0 | 1990-02-09 | Bob Clark |
# Let's perform some minor cleanup of the data
# ... filtering out the non-movie entries (TV shows, video games, etc.)
# (.copy() gives us an independent frame, so adding columns to it later is safe)
movies = imdb_data[imdb_data['Title Type'] == 'movie'].copy()
# ... and formatting the date field
movies['Date Rated'] = pd.to_datetime(movies['Date Rated'])

Great, now we’re ready to explore!

Summary

Next, let’s look at some summary statistics to get an overview of the dataset. Think of this as a tl;dr.

We have different datatypes in our dataset, so we’ll summarize these separately:

# Numeric fields
movies.describe()
| | Your Rating | IMDb Rating | Runtime (mins) | Year | Num Votes |
|---|---|---|---|---|---|
| count | 869.0 | 868.0 | 868.0 | 869.0 | 868.0 |
| mean | 5.1 | 6.8 | 111.6 | 2000.3 | 273587.3 |
| std | 2.2 | 1.1 | 22.2 | 11.7 | 320816.3 |
| min | 1.0 | 2.1 | 60.0 | 1941.0 | 88.0 |
| 25% | 4.0 | 6.2 | 96.0 | 1995.0 | 65191.5 |
| 50% | 6.0 | 6.9 | 107.0 | 2002.0 | 166071.5 |
| 75% | 7.0 | 7.6 | 123.2 | 2009.0 | 361292.5 |
| max | 10.0 | 9.3 | 210.0 | 2018.0 | 2301033.0 |
# Object fields
movies.describe(include=[object])
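| | Const | Title | URL | Title Type | Genres | Release Date | Directors |
|---|---|---|---|---|---|---|---|
| count | 869 | 869 | 869 | 869 | 868 | 868 | 869 |
| unique | 869 | 867 | 869 | 1 | 300 | 835 | 533 |
| top | tt0277371 | The Omen | https://www.imdb.com/title/tt0061695/ | movie | Comedy | 2007-06-12 | Steven Spielberg |
| freq | 1 | 2 | 1 | 869 | 56 | 3 | 14 |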

OK, so I’ve watched (at least) 869 films. A quick glance at the IMDb website shows that figure includes 96 of their Top 250 films (and also 6 of their bottom 100 films). It goes without saying that 869 is a lower bound for my total lifetime count, since I’ve continued to watch films without rating them, and of course I watched films before I discovered IMDb.

Questions

Now, let’s ask some questions of the data

What do my ratings look like?

Zooming in on just the ratings, I’ll use a histogram to visualise the frequency distribution:

sns.catplot(data=movies, kind="count",  x='Your Rating')

[Figure: count plot of films by ‘Your Rating’]

A lot more ‘1’ ratings than I expected! I must have been hard to please.

We can also plot the KDE for an estimate of the probability density (the y-axis here is ‘density’):

sns.set(rc={'figure.figsize':(8,6)})
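# histplot replaces the deprecated distplot; discrete=True gives each rating its own bar
sns.histplot(x=movies['Your Rating'], discrete=True, stat='density', kde=True)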

[Figure: density histogram of ‘Your Rating’ with KDE curve]

It looks like about half of my ratings were either a 6 or a 7 out of 10. How very diplomatic(!)
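
The ‘about half’ estimate can also be checked numerically rather than read off the plot. Here is a minimal sketch, using a small made-up `ratings` series as a stand-in for `movies['Your Rating']` (the values are illustrative, not my data):

```python
import pandas as pd

# Hypothetical stand-in for movies['Your Rating']; not the real dataset
ratings = pd.Series([1, 1, 4, 6, 6, 6, 7, 7, 8, 10], name="Your Rating")

# Share of films at each rating value, ordered by rating
shares = ratings.value_counts(normalize=True).sort_index()

# Combined share of the 6s and 7s
share_6_or_7 = shares.loc[[6, 7]].sum()
print(f"{share_6_or_7:.0%} of ratings are a 6 or a 7")
```

On the real data, swapping `ratings` for `movies['Your Rating']` should give the exact figure.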

How do my ratings compare to other IMDb users?

Let’s calculate Spearman’s rank correlation coefficient as a measure of inter-rater reliability. Its values range from -1 (fully opposed rankings) to 1 (identical rankings). We’re using Spearman here because our data (ratings out of 10) is ordinal.

movies['Your Rating'].corr(movies['IMDb Rating'], method='spearman')
0.4413428717210753

A moderate positive correlation, then: films that IMDb users rate higher tend to get a higher rating from me too, but far from always.
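
For intuition about what that number measures: Spearman’s coefficient is the ordinary Pearson correlation computed on ranks instead of raw values. A small sketch of that equivalence, on two made-up series (`mine` and `theirs` are hypothetical stand-ins for the two rating columns):

```python
import pandas as pd

# Hypothetical ratings for six films; stand-ins for the two real columns
mine = pd.Series([1, 6, 7, 7, 9, 10])
theirs = pd.Series([4.9, 6.2, 6.9, 8.1, 7.6, 9.3])

# Built-in Spearman correlation
rho = mine.corr(theirs, method="spearman")

# The same value by hand: rank each series (ties share an average rank),
# then take the ordinary Pearson correlation of the ranks
rho_by_hand = mine.rank().corr(theirs.rank())

print(round(rho, 3), round(rho_by_hand, 3))
```

Because only the ordering matters, it makes no difference that my ratings are whole numbers while the IMDb averages are decimals.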

Let’s visualize this relationship. I’ll use a catplot here, which draws a jittered strip plot by default, to get around the overplotting you’d get from a plain scatter plot of categorical data:

sns.catplot(x='Your Rating', y='IMDb Rating', data=movies)

[Figure: strip plot of ‘IMDb Rating’ against ‘Your Rating’]

What day did I tend to watch films?

# non-interesting date formatting
movies['Weekday'] = movies['Date Rated'].dt.day_name()
movies['WeekdayNumeric'] = movies['Date Rated'].dt.weekday
movies = movies.sort_values('WeekdayNumeric')
# Let's plot
sns.catplot(x="Weekday", kind="count", data=movies, height=5, aspect=12/9)

[Figure: count plot of films by weekday rated]

The weekends of course make sense. But Monday is unexpected - TGIM?
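
Sorting by a helper numeric column is one way to get the weekdays into calendar order. An alternative sketch, assuming only pandas, is to make the weekday an ordered categorical so that the counts come out Monday-to-Sunday on their own. The dates below are made up for illustration, and `WEEKDAY_ORDER` is just a name I’ve picked:

```python
import pandas as pd

# Hypothetical rating dates; stand-ins for movies['Date Rated']
dates = pd.to_datetime(
    ["2008-01-05", "2008-01-06", "2008-01-07", "2008-01-12", "2008-01-14"]
)

WEEKDAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# An ordered categorical remembers the calendar order of its labels
weekdays = pd.Series(
    pd.Categorical(dates.day_name(), categories=WEEKDAY_ORDER, ordered=True)
)

# Sorting the index of a categorical follows category order, not the alphabet;
# days with no films still appear, with a count of zero
counts = weekdays.value_counts().sort_index()
print(counts)
```

When plotting with seaborn, passing `order=WEEKDAY_ORDER` to `catplot` should achieve the same ordering without sorting the frame first.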

What year did I watch the most films?

movies['YearWatched'] = movies['Date Rated'].dt.year
sns.catplot(x="YearWatched", kind="count", data=movies, height=6, aspect=12/9)

[Figure: count plot of films by year watched]

2008, clearly! That was about 1 film per weekday! It looks like the intensity waned over time: 2010, in contrast, was my first year in college - studying obviously took precedence!

What is the least well-known film I’ve watched? (by number of ratings by other users)

movies.sort_values('Num Votes').head(1)
| | Const | Your Rating | Date Rated | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors | Weekday | WeekdayNumeric | YearWatched |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 878 | tt0097073 | 1 | 2007-08-11 | Code Name Vengeance | https://www.imdb.com/title/tt0097073/ | movie | 4.1 | 96.0 | 1987 | Action, Adventure, Drama, Thriller | 88.0 | 1987-12-31 | David Winters | Saturday | 5 | 2007 |

Frankly, this movie looks terrible. Thankfully, I don’t remember watching it (!)

What is the most well-known film I’ve watched? (by number of ratings by other users)

movies.sort_values('Num Votes', ascending=False).head(1)
| | Const | Your Rating | Date Rated | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors | Weekday | WeekdayNumeric | YearWatched |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | tt0111161 | 10 | 2007-08-11 | The Shawshank Redemption | https://www.imdb.com/title/tt0111161/ | movie | 9.3 | 142.0 | 1994 | Drama | 2301033.0 | 1994-09-10 | Frank Darabont | Saturday | 5 | 2007 |

This one needs no introduction!

What was the least liked film I’ve watched? (by ratings by other users)

movies.sort_values('IMDb Rating').head(1)
| | Const | Your Rating | Date Rated | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors | Weekday | WeekdayNumeric | YearWatched |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 704 | tt0473024 | 1 | 2007-08-11 | Crossover | https://www.imdb.com/title/tt0473024/ | movie | 2.1 | 95.0 | 2006 | Action, Sport | 8910.0 | 2006-07-22 | Preston A. Whitmore II | Saturday | 5 | 2007 |

Ah, this doesn’t look that bad.

What was the most liked film I’ve watched? (by ratings by other users)

movies.sort_values('IMDb Rating', ascending=False).head(1)
| | Const | Your Rating | Date Rated | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors | Weekday | WeekdayNumeric | YearWatched |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | tt0111161 | 10 | 2007-08-11 | The Shawshank Redemption | https://www.imdb.com/title/tt0111161/ | movie | 9.3 | 142.0 | 1994 | Drama | 2301033.0 | 1994-09-10 | Frank Darabont | Saturday | 5 | 2007 |

A certified classic.

Biggest disparity in my vote vs IMDb?

  • What was the popularly-least-liked film I’ve enjoyed?
  • And the popularly-most-liked film I didn’t?
movies['Vote Disparity - IMDb Liked more'] = movies['IMDb Rating'] - movies['Your Rating']
movies.sort_values(['Vote Disparity - IMDb Liked more', 'IMDb Rating', 'Num Votes'], ascending=False).head(1)
| | Const | Your Rating | Date Rated | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors | Weekday | WeekdayNumeric | YearWatched | Vote Disparity - IMDb Liked more |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95 | tt0110912 | 1 | 2008-08-19 | Pulp Fiction | https://www.imdb.com/title/tt0110912/ | movie | 8.9 | 154.0 | 1994 | Crime, Drama | 1796406.0 | 1994-05-21 | Quentin Tarantino | Tuesday | 1 | 2008 | 7.9 |

Yeah, Pulp Fiction is overrated.

movies['Vote Disparity - I Liked more'] = movies['Your Rating'] - movies['IMDb Rating']
movies.sort_values(['Vote Disparity - I Liked more', 'IMDb Rating', 'Num Votes'], ascending=False).head(1)
| | Const | Your Rating | Date Rated | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors | Weekday | WeekdayNumeric | YearWatched | Vote Disparity - IMDb Liked more | Vote Disparity - I Liked more |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 469 | tt0250310 | 7 | 2009-02-19 | Corky Romano | https://www.imdb.com/title/tt0250310/ | movie | 4.7 | 86.0 | 2001 | Comedy, Crime | 12425.0 | 2001-10-12 | Rob Pritts | Thursday | 3 | 2009 | -2.3 | 2.3 |

Corky Romano is a masterpiece!
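
The two disparity columns are mirror images of each other, so a single signed column would be enough to answer both questions. A sketch of that idea on four rows taken from the outputs in this post (the frame `sample` and the column name `Disparity` are illustrative names only):

```python
import pandas as pd

# Four films from the outputs in this post, with just the columns we need
sample = pd.DataFrame({
    "Title": ["Pulp Fiction", "Corky Romano",
              "The Shawshank Redemption", "Crossover"],
    "Your Rating": [1, 7, 10, 1],
    "IMDb Rating": [8.9, 4.7, 9.3, 2.1],
})

# Positive: I liked it more than IMDb users did; negative: they liked it more
sample["Disparity"] = sample["Your Rating"] - sample["IMDb Rating"]

# idxmax / idxmin return the row label of the largest / smallest value
i_liked_more = sample.loc[sample["Disparity"].idxmax(), "Title"]
imdb_liked_more = sample.loc[sample["Disparity"].idxmin(), "Title"]
print(i_liked_more, "|", imdb_liked_more)
```

One difference to be aware of: `idxmax` and `idxmin` simply return the first row on a tie, whereas the sort used earlier breaks ties by IMDb rating and then by number of votes.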

General patterns

Finally, let’s visualize some general patterns

Film durations

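# histplot replaces the deprecated distplot; kde=True overlays the density estimate
sns.histplot(x=movies['Runtime (mins)'].dropna(), stat='density', kde=True)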

[Figure: distribution of film runtimes in minutes]

Film Release Years

sns.catplot(x="Year", kind="count", data=movies, height=10, aspect=21/9)

[Figure: count plot of films by release year]

Film genres

# One row per (film, genre): split the comma-separated strings, then flatten
genres = (
    movies['Genres']
    .dropna()  # astype('str') would turn a missing value into a bogus 'nan' genre
    .str.split(',')
    .explode()
    .str.strip()
    .to_frame('Genres')
    .reset_index(drop=True)
)
sns.catplot(x="Genres", kind="count", data=genres, height=9, aspect=21/9, order=genres['Genres'].value_counts().index)

[Figure: count plot of films by genre, most frequent first]

Directors

# One row per (film, director): co-directors are comma-separated
directors = (
    movies['Directors']
    .dropna()
    .str.split(',')
    .explode()
    .str.strip()
    .to_frame('Directors')
)
# Count films per director, most prolific first
directors = directors.groupby('Directors').size().reset_index(name='size')
directors = directors.sort_values('size', ascending=False)
directors.head(10)
| | Directors | size |
|---|---|---|
| 514 | Steven Spielberg | 15 |
| 453 | Robert Zemeckis | 10 |
| 111 | David Fincher | 8 |
| 345 | Martin Scorsese | 8 |
| 415 | Peter Jackson | 7 |
| 423 | Quentin Tarantino | 7 |
| 256 | John Hughes | 7 |
| 85 | Christopher Nolan | 6 |
| 464 | Ron Howard | 6 |
| 155 | Ethan Coen | 6 |

What’s the perfect film for me?

Based on frequent attributes, it looks like that would be a Drama or Comedy, directed by Spielberg, made in 2011, that is about 100 minutes long! ‘War Horse’, anybody?

If we base it on rating instead, by restricting this to just the films I rated >= 8: a Drama, directed by Spielberg, made in 1999, that’s 120 minutes long. Perhaps ‘Saving Private Ryan’?
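
The ‘frequent attributes’ here are just the most common value in each column. A sketch of how such a profile could be pulled out with pandas, on a made-up handful of films: the frame `films`, the placeholder directors ‘A’, ‘B’ and ‘C’, the helper names and the choice to round runtimes to the nearest 10 minutes are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical films; directors 'A', 'B', 'C' are placeholders, not real data
films = pd.DataFrame({
    "Genres": ["Drama", "Comedy, Drama", "Drama, War", "Drama", "Comedy", "Action"],
    "Directors": ["A", "B", "A", "A", "C", "B"],
    "Year": [2011, 2011, 1999, 1999, 2011, 2005],
    "Runtime (mins)": [96.0, 104.0, 121.0, 118.0, 98.0, 133.0],
    "Your Rating": [6, 7, 9, 8, 4, 8],
})


def most_common(series):
    """Return the most frequent value in a Series (the smallest one on ties)."""
    return series.mode().iloc[0]


def profile(df):
    """Summarise a set of films by their most common attributes."""
    # One entry per (film, genre), so multi-genre films count once per genre
    genres = df["Genres"].str.split(",").explode().str.strip()
    # Round runtimes to the nearest 10 minutes so similar lengths group together
    runtime = (df["Runtime (mins)"] / 10).round() * 10
    return {
        "genre": most_common(genres),
        "director": most_common(df["Directors"]),
        "year": int(most_common(df["Year"])),
        "runtime": int(most_common(runtime)),
    }


overall = profile(films)
favourites = profile(films[films["Your Rating"] >= 8])
print(overall)
print(favourites)
```

The second call shows the ‘based on rating’ variant: the same summary, applied only to the rows rated 8 or higher.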

This would make an interesting small ML project, however: building a model that predicts my numeric rating for a given film, based on my previous ratings.
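
To give a flavour of what the simplest version might look like, here is a baseline sketch: an ordinary least-squares fit that predicts my rating from a film’s IMDb rating and runtime. I haven’t built this. The training rows are made up, the two features are an arbitrary choice, and `predict` is a hypothetical helper.

```python
import numpy as np

# Hypothetical training data: (IMDb rating, runtime in minutes) -> my rating
features = np.array([
    [4.9, 94.0], [6.2, 101.0], [6.9, 107.0],
    [7.6, 123.0], [8.9, 154.0], [9.3, 142.0],
])
my_ratings = np.array([1.0, 4.0, 6.0, 7.0, 8.0, 10.0])

# Add an intercept column, then fit ordinary least squares
X = np.column_stack([np.ones(len(features)), features])
coef, *_ = np.linalg.lstsq(X, my_ratings, rcond=None)


def predict(imdb_rating, runtime_mins):
    """Predict my rating for one film, clipped to the 1-10 scale."""
    raw = coef @ np.array([1.0, imdb_rating, runtime_mins])
    return float(np.clip(raw, 1.0, 10.0))


print(round(predict(7.0, 110.0), 1))
```

A real attempt would need more features (genres and directors, for a start) and a held-out set of films to check that the predictions generalise.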
