INTRODUCTION

Top 3 profitable movies from the dataset

Whether it be Marvel's ascension into the spotlight in the 2010s, the continued longevity of the Star Wars franchise, or stand-alone sensations like Avatar, the film industry has soared since the turn of the century and become one of the most profitable industries in entertainment. The success of movie studios is evident, but what factors have created the continued "box office hits" that turn viewers into loyal fandoms for years to come? The profit trend of the James Bond Collection over the years (shown in the figure below) reflects this broader history: the industry has become more and more profitable as watching films has become a fashion and a part of people's daily lives.

The profit change of the 'James Bond Collection' over seasons/time

Now, keeping the question above in mind, consider a scenario where we are on an experienced team of film studio investors trying to decide on the next profitable film. In a world full of up-and-coming filmmakers, the market for investment and financier appeal is competitive. Many notable film studios of today's world, such as Universal Pictures, Warner Bros., and Walt Disney Pictures, have mastered the process of pumping out profitable blockbuster hits every year, but is there a formula they use to replicate this success?

The Exploratory Data Analysis helped shed some light on this guiding question as it allowed us to dive into a dataset with over 45,000 different movies and 26 million ratings from 270,000+ users. We tested many different variables that could have a possible impact on the revenue of a movie. Understanding the factors that truly push a movie to success is key in choosing movies that are more likely to blow up at the box office.

10 Highest-Earning Movies in the Dataset

| original_title | revenue | belongs_to_collection | runtime | vote_average | vote_count | genres | production_companies | spoken_languages |
|---|---|---|---|---|---|---|---|---|
| Avatar | 2787965087 | Avatar Collection | 162 | 7.2 | 12114 | ['Action', 'Adventure', 'Fantasy', 'Science Fiction'] | ['Ingenious Film Partners', 'Twentieth Century Fox Film Corporation', 'Dune Entertainment', 'Lightstorm Entertainment'] | ['English', 'Español'] |
| Star Wars: The Force Awakens | 2068223624 | Star Wars Collection | 136 | 7.5 | 7993 | ['Action', 'Adventure', 'Science Fiction', 'Fantasy'] | ['Lucasfilm', 'Truenorth Productions', 'Bad Robot'] | ['English'] |
| Titanic | 1845034188 | | 194 | 7.5 | 7770 | ['Drama', 'Romance', 'Thriller'] | ['Paramount Pictures', 'Twentieth Century Fox Film Corporation', 'Lightstorm Entertainment'] | ['English', 'Français', 'Deutsch', 'svenska', 'Italiano', 'Pусский'] |
| The Avengers | 1519557910 | The Avengers Collection | 143 | 7.4 | 12000 | ['Science Fiction', 'Action', 'Adventure'] | ['Paramount Pictures', 'Marvel Studios'] | ['English'] |
| Jurassic World | 1513528810 | Jurassic Park Collection | 124 | 6.5 | 8842 | ['Action', 'Adventure', 'Science Fiction', 'Thriller'] | ['Universal Studios', 'Amblin Entertainment', 'Legendary Pictures', 'Fuji Television Network', 'Dentsu'] | ['English'] |
| Furious 7 | 1506249360 | The Fast and the Furious Collection | 137 | 7.3 | 4253 | ['Action'] | ['Universal Pictures', 'Original Film', 'Fuji Television Network', 'Dentsu', 'One Race Films', 'China Film Co.', 'Québec Production Services Tax Credit', 'Media Rights Capital (MRC)', 'Abu Dhabi Film Commission', 'Colorado Office of Film, Television & Media'] | ['English'] |
| Avengers: Age of Ultron | 1405403694 | The Avengers Collection | 141 | 7.3 | 6908 | ['Action', 'Adventure', 'Science Fiction'] | ['Marvel Studios', 'Prime Focus', 'Revolution Sun Studios'] | ['English'] |
| Harry Potter and the Deathly Hallows: Part 2 | 1342000000 | Harry Potter Collection | 130 | 7.9 | 6141 | ['Family', 'Fantasy', 'Adventure'] | ['Warner Bros.', 'Heyday Films'] | ['English'] |
| Frozen | 1274219009 | Frozen Collection | 102 | 7.3 | 5440 | ['Animation', 'Adventure', 'Family'] | ['Walt Disney Pictures', 'Walt Disney Animation Studios'] | ['English'] |
| Beauty and the Beast | 1262886337 | | 129 | 6.8 | 5530 | ['Family', 'Fantasy', 'Romance'] | ['Walt Disney Pictures', 'Mandeville Films'] | ['English'] |

A core part of generating revenue for a film is connecting a target audience to an idea and keeping them engaged with it into the future. Collections of movie series demonstrate this idea most strongly. When profiling the highest-earning movies of all time, we noticed that 8 of the 10 returned selections belonged to a collection. Marvel's success in superhero cinema is evidence of this trend, as its wide universe of heroes appeals to many different groups of people, all of which create high-grossing films and series.

After choosing the movie, producing it, and finally releasing it, the studio assesses the success of its new film and has to make a huge decision: should it make a sequel? More specifically, given the success of a first movie, should we expect the sequel or future installments of the series to be increasingly profitable? Movies are not cheap and can be huge risks, so the studio wants to ensure it is investing in a worthwhile project. Our team will use data and analysis to choose an initial movie and then decide whether to pursue a sequel after the first movie's release.

DATA

The dataset we chose was compiled on Kaggle by Rounak Banik, a Data Science Fellow at McKinsey & Company and an Electronics and Communication Engineering graduate of IIT Roorkee. It includes information about roughly 45,000 movies released by July 2017 that were available on MovieLens, a movie recommendation service run by GroupLens Research at the University of Minnesota. The data was then supplemented with metadata collected from TMDB, which provides an API for accessing a variety of information about each movie.

After cleaning the original dataset, we created a condensed version built around 18 of the original variables, to which we added several derived variables. The following table describes what each variable measures in the context of this dataset:

| Variable Name | Description |
|---|---|
| belongs_to_collection | Stringified dictionary with information on the movie series (collection) the film belongs to |
| budget | The budget of the movie in dollars |
| genres | Stringified list of dictionaries listing all genres associated with the movie |
| id | The unique ID of the movie in the dataset |
| imdb_id | The IMDB ID of the movie |
| original_language | The language in which the movie was originally shot |
| original_title | The original title of the movie |
| overview | A brief blurb describing the movie |
| popularity | Popularity score assigned by TMDB |
| production_companies | Stringified list of production companies involved in making the movie |
| production_countries | Stringified list of countries where the movie was shot/produced |
| release_date | Theatrical release date of the movie |
| revenue | The total revenue of the movie in dollars |
| runtime | Length of the movie in minutes |
| spoken_languages | Stringified list of languages spoken in the film |
| title | The official title of the movie |
| vote_average | The average rating of the movie |
| vote_count | The number of user votes, as counted by TMDB |
| profit_in_million | Profit in millions of dollars: (revenue - budget) / 1,000,000 |
| budget_in_million | Budget in millions of dollars: budget / 1,000,000 |
| in_collection | True/False: whether the movie belongs to a collection |
| english | True/False: whether the movie's original language is English |
| num_languages_spoken | Number of languages in the stringified spoken_languages list |
| num_genres | Number of genres in the stringified genres list |
| num_keywords | Number of keywords in the stringified keywords list |
| age_days | Age of the movie in days: date of the last data collection minus the release date |
| num_companies | Number of production companies in the stringified production_companies list |
| num_countries | Number of production countries in the stringified production_countries list |
| title_length | Number of characters in the movie title |
| title_consistent | True/False: whether original_title matches title |

The Exploratory Data Analysis portion of the project allowed us to lay out 12 different questions examining these variables' relationships with one another. For instance, we identified the highest-earning movie collections; we found that a higher budget tends to lead to higher revenue, that title length has little relationship with revenue, that movies make more money as the years go forward, that movies are most profitable when released in June, and that certain keywords lead to higher ratings on average. Along the way we also found some anomalies in the data, specifically movies with missing (zero) values for budget and revenue, which are needed to calculate a movie's profit. For example, the dataset shows that the sci-fi film "What Happened to Monday" has 0 revenue and 0 budget:

| title | budget | revenue |
|---|---|---|
| What Happened to Monday | 0 | 0 |

However, Wikipedia shows that the movie earned $28 million from a $20 million budget. Observations such as this one had to be removed from the dataset, which left us with a total of 5,319 observations.
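To make this step concrete, here is a minimal sketch of the cleaning and derived-variable construction, assuming a data frame named `movies` with the raw columns described in the table above (names and exact steps are illustrative, not the project's code):

```r
# Minimal sketch of the cleaning step (illustrative, not the project's exact code).
# Assumes a data frame `movies` with the raw columns described in the table above.
library(dplyr)

movies_clean <- movies %>%
  filter(budget > 0, revenue > 0) %>%              # drop rows with missing (zero) financials
  mutate(
    budget_in_million = budget / 1e6,              # budget in millions of dollars
    profit_in_million = (revenue - budget) / 1e6,  # profit in millions of dollars
    in_collection     = !is.na(belongs_to_collection) & belongs_to_collection != "",
    english           = original_language == "en"  # TMDB stores ISO 639-1 language codes
  )

nrow(movies_clean)   # 5,319 rows remained in our case
```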

The figure below shows the distributions of the key variables used in this project.

RESULTS

Result 1:

Composing a model from this data is difficult because of the number of variables involved. Since we cannot identify the best model manually, we used the bestglm function to select the best models over the data using the AIC information criterion. AIC (Akaike Information Criterion) measures model performance while accounting for model complexity; the smaller the AIC, the better. During this process, we found it difficult to create a satisfactory model over the entire data set with a relatively small MAE or AIC. This is likely because, across the full dataset, movie profits vary widely with many different characteristics. However, after restricting to the subset of movies whose profit exceeded $100M, the models' accuracy improved according to the AIC criterion.
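For reference, a minimal sketch of this selection step with the bestglm package is shown below, assuming the cleaned data frame from the Data section; the candidate predictor list and object names are illustrative:

```r
# Sketch of best-subset selection by AIC with bestglm (illustrative, not the exact project code).
# bestglm() expects a plain data frame whose LAST column is the response.
library(bestglm)
library(dplyr)

candidates <- movies_clean %>%
  mutate(in_collection = as.integer(in_collection),
         english       = as.integer(english)) %>%
  select(popularity, runtime, vote_average, vote_count, english, in_collection,
         num_genres, num_keywords, num_companies, num_countries,
         profit_in_million)                        # response goes last

fit <- bestglm(as.data.frame(candidates), IC = "AIC")

fit$BestModel        # the single best linear model according to AIC
head(fit$BestModels) # the top few candidate models and their AIC values
```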

Moreover, when we further restrict the data to English-original-language movies that belong to a collection, the models become still more accurate. The following are the best three models selected from each of these data sets (all of the models below have significant p-values).

Best three models selected from the 'ENTIRE' dataset:

  • Best Model 1: profit_in_million = popularity + runtime + vote_count + in_collection + num_genres + num_keywords + num_companies + num_countries

  • Best Model 2: profit_in_million = popularity + runtime + vote_count + in_collection + num_genres + num_keywords + num_countries

  • Best Model 3: profit_in_million = popularity + english + vote_count + in_collection + num_genres + num_keywords + num_companies + num_countries

Best three models selected from the 'PROFIT > 100M' dataset:

  • Best Model 1: profit_in_million = popularity + runtime + vote_count + in_collection + num_genres + num_countries

  • Best Model 2: profit_in_million = popularity + runtime + vote_count + in_collection + num_genres + num_companies + num_countries

  • Best Model 3: profit_in_million = popularity + vote_count + in_collection + num_genres + num_countries

Best three models selected from the 'PROFIT > 100M, English-original-language, in-collection' dataset:

  • Best Model 1: profit_in_million = popularity + runtime + vote_count + num_genres + num_countries

  • Best Model 2: profit_in_million = popularity + runtime + vote_average + vote_count + num_genres + num_countries

  • Best Model 3: profit_in_million = popularity + runtime + vote_count + num_genres + num_companies + num_countries

Models and Errors over the 'ENTIRE' data set

| Model | MAE | RMSE | Bias | Avg. Profit ($M) | Relative Error (MAE / Avg. Profit) | AIC |
|---|---|---|---|---|---|---|
| Best Model 1 (eight predictors) | 49.3164 | 90.4925 | 3.185e-13 | 63.3743 | 0.7782 | 45050.7665 |
| Best Model 2 (seven predictors) | 49.6759 | 90.5707 | 5.297e-13 | 63.3743 | 0.7838 | 45052.3734 |
| Best Model 3 (eight predictors) | 49.1803 | 90.5286 | 1.147e-13 | 63.3743 | 0.776 | 45052.4881 |

1. Best Model 1: profit_in_million = popularity + runtime + vote_count + in_collection + num_genres + num_keywords + num_companies + num_countries
2. Best Model 2: profit_in_million = popularity + runtime + vote_count + in_collection + num_genres + num_keywords + num_countries
3. Best Model 3: profit_in_million = popularity + english + vote_count + in_collection + num_genres + num_keywords + num_companies + num_countries

Models and Errors over the 'PROFIT > 100M' data set

| Model | MAE | RMSE | Bias | Avg. Profit ($M) | Relative Error (MAE / Avg. Profit) | AIC |
|---|---|---|---|---|---|---|
| Best Model 1 (six predictors) | 105.1146 | 157.2542 | 3.32e-14 | 274.2788 | 0.3832 | 9571.1213 |
| Best Model 2 (seven predictors) | 103.8709 | 157.0336 | 2.313e-13 | 274.2788 | 0.3787 | 9571.954 |
| Best Model 3 (five predictors) | 105.8548 | 159.4557 | -3.161e-14 | 274.2788 | 0.3859 | 9572.4367 |

1. Best Model 1: profit_in_million = popularity + runtime + vote_count + in_collection + num_genres + num_countries
2. Best Model 2: profit_in_million = popularity + runtime + vote_count + in_collection + num_genres + num_companies + num_countries
3. Best Model 3: profit_in_million = popularity + vote_count + in_collection + num_genres + num_countries

Models and Errors over the 'PROFIT > 100M, English-original-language, in-collection' data set

| Model | MAE | RMSE | Bias | Avg. Profit ($M) | Relative Error (MAE / Avg. Profit) | AIC |
|---|---|---|---|---|---|---|
| Best Model 1 (five predictors) | 124.1941 | 183.0815 | -1.811e-13 | 334.8733 | 0.3709 | 4981.4499 |
| Best Model 2 (six predictors) | 124.192 | 181.4148 | -1.846e-13 | 334.8733 | 0.3709 | 4981.9257 |
| Best Model 3 (six predictors) | 122.7311 | 181.3394 | -2.015e-13 | 334.8733 | 0.3665 | 4982.5413 |

1. Best Model 1: profit_in_million = popularity + runtime + vote_count + num_genres + num_countries
2. Best Model 2: profit_in_million = popularity + runtime + vote_average + vote_count + num_genres + num_countries
3. Best Model 3: profit_in_million = popularity + runtime + vote_count + num_genres + num_companies + num_countries

According to the three tables above, the AIC drops from extremely large (around 45,051) to comparatively small (around 4,981) as the data used to fit the models is narrowed down, which indicates that the corresponding models become more accurate. It also supports our guess that highly profitable movies are less variable across these characteristics. Within each table, however, the AIC values are nearly identical, meaning the best models returned by the bestglm function are nearly indistinguishable by AIC alone. Therefore, we needed to create our own criterion to decide which model is best.

To build our own criterion, we use 10-fold cross-validation to obtain the MAE for each of the best models, along with a bias function to obtain the prediction bias. Analyzing MAE alone is not appropriate, since the average of our response variable (profit_in_million) can be very large or very small depending on the data subset, so we define our own criterion, the relative error, as \(\text{Relative Error} = \left|\frac{\text{MAE}}{\text{Average Profit in millions of dollars}}\right|\).
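A minimal sketch of how this criterion can be computed with a manual 10-fold cross-validation in base R is shown below (the project may have used a helper package; function and object names here are illustrative):

```r
# Sketch: 10-fold cross-validated MAE, bias, and relative error for one candidate model.
# `form` is a model formula, e.g. profit_in_million ~ popularity + runtime + vote_count,
# and `dat` is the data subset being evaluated.
cv_relative_error <- function(form, dat, k = 10, seed = 1) {
  set.seed(seed)
  folds <- sample(rep(1:k, length.out = nrow(dat)))   # random fold assignment
  resid_all <- numeric(nrow(dat))

  for (i in 1:k) {
    fit <- lm(form, data = dat[folds != i, ])                       # fit on k - 1 folds
    resid_all[folds == i] <- dat$profit_in_million[folds == i] -
      predict(fit, newdata = dat[folds == i, ])                     # held-out residuals
  }

  mae  <- mean(abs(resid_all))            # mean absolute error
  bias <- mean(resid_all)                 # average signed prediction error
  avg  <- mean(dat$profit_in_million)     # average profit in millions
  c(MAE = mae, Bias = bias, RelativeError = abs(mae / avg))
}
```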

As shown in the three tables above, the relative error decreases as the data is narrowed down, especially when moving from the entire data set to the 'PROFIT > 100M' data set. This indicates that our criterion is useful for this data. In each table, we highlighted the model with the lowest relative error, leaving us with the best model from each of the three data sets.

Now, we want to see whether our model predicts the majority of the data correctly. We use the third data set to create the prediction-result graph since it has the lowest relative error. For an individual prediction, we define the relative error as the absolute residual divided by the actual profit, and we classify predictions as follows:

  • Good Prediction: 0 ≤ relative_error ≤ 0.3 (relative_error=0 is a perfect prediction)

  • Somewhat Good Prediction: 0.3 < relative_error ≤ 0.7

  • Somewhat Bad Prediction: 0.7 < relative_error ≤ 1.2

  • Bad Prediction: relative_error > 1.2

We also split the third data set into TRAIN (approximately 80% of the data) and TEST (approximately 20% of the data) sets. We fit the model formula on the TRAIN set, which gives us the coefficients of the model, and then apply the fitted model to the TEST set and create the prediction-result plot to see how accurate our model is.
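A sketch of this split-and-evaluate step is shown below, assuming the third data subset is stored in a data frame `dat3` and using one of the selected formulas (names are illustrative, not the project's exact code):

```r
# Sketch: 80/20 train/test split, fit on TRAIN, classify prediction quality on TEST.
set.seed(42)
idx   <- sample(nrow(dat3), size = floor(0.8 * nrow(dat3)))
train <- dat3[idx, ]
test  <- dat3[-idx, ]

fit <- lm(profit_in_million ~ popularity + runtime + vote_count +
            num_genres + num_countries, data = train)

test$predicted <- predict(fit, newdata = test)
test$rel_error <- abs(test$profit_in_million - test$predicted) / abs(test$profit_in_million)

# Bucket each prediction using the thresholds defined above
test$quality <- cut(test$rel_error,
                    breaks = c(0, 0.3, 0.7, 1.2, Inf),
                    labels = c("Good", "Somewhat Good", "Somewhat Bad", "Bad"),
                    include.lowest = TRUE)
table(test$quality)
```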

The first figure shows that the majority of our predictions are "good."

The second figure likewise shows that the model's predictions are mostly good.

With these verified models, we can now use them to choose the movie with the highest predicted profit from a set of candidates.

The movies our studio has to decide between producing are "Joker," "Jumanji: The Next Level," and "Knives Out." We collected the information for these movies from TMDB. Because the popularity score and the number of keywords in the original data set are assigned by the data set's compiler, we exclude these two variables and rerun the same model-selection procedure as before. Our final results are:

  • Best model over the ‘PROFIT > 100M’ dataset:

model1: profit_in_million = vote_count + in_collection + num_genres + age_days + num_countries

  • Best model over the ‘PROFIT > 100M & English-original-language and in-collection PART’ dataset:

model2: profit_in_million = runtime + vote_count + num_genres + num_countries

  • Best model over the ‘PROFIT > 100M & English-original-language and out-of-collection PART’ dataset:

model3: profit_in_million = vote_count + num_genres + age_days + num_companies

Using the models to predict profits for out-of-dataset movies

| Movie | Model | Actual Profit ($M) | Predicted Profit ($M) | Residual | Relative Error (abs(Residual) / Actual Profit) |
|---|---|---|---|---|---|
| Joker (2019) | model1 | 1019.2513 | 1507.9335 | -488.6822 | 0.4795 |
| Joker (2019) | model3 | 1019.2513 | 1126.5284 | -107.2771 | 0.1053 |
| Jumanji: The Next Level (2019) | model1 | 675.0597 | 628.069 | 46.9907 | 0.0696 |
| Jumanji: The Next Level (2019) | model2 | 675.0597 | 677.7523 | -2.6926 | 0.004 |
| Knives Out (2019) | model1 | 269.2328 | 688.0188 | -418.786 | 1.5555 |
| Knives Out (2019) | model3 | 269.2328 | 535.6398 | -266.407 | 0.9895 |

1. model1: profit_in_million = vote_count + in_collection + num_genres + age_days + num_countries
2. model2: profit_in_million = runtime + vote_count + num_genres + num_countries
3. model3: profit_in_million = vote_count + num_genres + age_days + num_companies

We use these models to produce the expected profits for each of our candidate movies. The relative errors shown in the table above vary considerably. For "Jumanji: The Next Level," the predictions are nearly perfect, and for "Joker," the better of the two models comes within roughly 10% of the actual profit. For "Knives Out," however, both models badly overpredict, with relative errors near or above 1. Therefore, we cannot say our models predict profit reliably; a better model would likely require more complexity and/or different variables.
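For completeness, a sketch of how the predictions in the table above could be produced is shown below; the input file, data frame, and model-object names are illustrative, with the candidate movies' predictor values assembled by hand from TMDB:

```r
# Sketch (illustrative names): predict profit for the candidate 2019 films.
# `candidate_movies_tmdb.csv` is a hypothetical hand-assembled file with one row
# per film containing the predictor columns required by the refit models.
candidates <- read.csv("candidate_movies_tmdb.csv")

# model1_fit is an lm object fit with the model1 formula above; model2_fit and
# model3_fit are fit analogously on their respective data subsets.
candidates$predicted <- predict(model1_fit, newdata = candidates)

data.frame(
  title     = candidates$original_title,
  actual    = candidates$profit_in_million,
  predicted = candidates$predicted,
  residual  = candidates$profit_in_million - candidates$predicted,
  rel_error = abs(candidates$profit_in_million - candidates$predicted) /
              candidates$profit_in_million
)
```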

Result 2:

Another important decision that movie studios have to make is whether or not to expand a franchise by producing a sequel, or many sequels, to a movie. We want to supplement this decision with some data about how the collections of movies in our data set performed over time, which can help us make decisions about whether or not to produce a sequel, as well as what the ideal number of sequels would be and how long we should wait between installments. If sequels tend to be less profitable the more we produce, then it makes more financial sense to focus on making movies that aren’t part of a franchise.

The following two figures show how profit changes for sequel movies over seasons (over time), which gives us a sense of the relationship between the decision to produce a sequel and the profit earned in that period.

We also wanted to determine how profits are affected by the number of movies in the franchise and by how long it has been since the most recent movie was released. To do this, we filtered the data set based on each movie's position within its franchise. The original movies are included only to provide a comparison point for the second movie in the franchise, since we want to focus primarily on sequels for this question. The second movie released in each collection is considered the first sequel and is labeled sequel number 1; the third movie is sequel number 2, and so on. For each sequel number, we plotted the change in profit from the previous movie against the number of days since the previous movie in the collection was released, and we included a trendline to show whether the amount of time between movies made a difference. This was all compiled into an animated scatter plot to show how each collection moved over time; a sketch of the underlying transformation is given below.
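The sketch below shows one way this transformation could be written (dplyr-style, using the column names from the variable table; the helper names and exact steps are illustrative, not the project's code):

```r
# Sketch: number each movie within its collection by release date, then compute the
# change in profit and the gap in days relative to the previous installment.
library(dplyr)

sequels <- movies_clean %>%
  filter(in_collection) %>%
  group_by(belongs_to_collection) %>%
  arrange(release_date, .by_group = TRUE) %>%
  mutate(
    sequel_number   = row_number() - 1,                            # 0 = original film
    profit_change   = profit_in_million - lag(profit_in_million),  # vs. previous movie
    days_since_prev = as.numeric(as.Date(release_date) - lag(as.Date(release_date)))
  ) %>%
  ungroup() %>%
  filter(sequel_number >= 1)   # keep only the sequels for the plot
```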

Change in profits for movies in collections, based on the order of movies

The data seems to show a couple of important details, although the relatively small sample size for collections with more than 4 movies makes it hard to draw significant conclusions. The first trend to notice is that the first couple of sequels in each collection show highly variable results in terms of whether they are more or less profitable than the movie before them. They also do not show much of a trend with respect to the time between movies, indicating that it might still make sense to produce a sequel to a movie that came out a long time ago. However, collections containing a higher number of movies seem to earn consistently higher profits, likely because they have established a committed fanbase and can generate much more excitement from the same amount of marketing.

CONCLUSION

In our analysis, we focused on two questions that were relevant to our data and useful for the film industry. First, we tried to create appropriate models to assess whether we could predict profitability for films outside our dataset. After working through different modeling techniques and choosing significant predictors, we found that our "training" models' fitted values tracked actual profitability closely. Using models built on either the whole dataset or subsets restricted to highly profitable films, we made two successful predictions and one unsuccessful prediction on "test" movies of our choosing, using data from the official TMDB website. Since all of our original dataset's ratings and other important predictor values were taken directly from TMDB, we kept things consistent by taking our test movies' predictor values from there as well. Throughout our work with this data, we found that predicting profitability does not require a large array of variables: although profitability depends on many characteristics of a film, our model selection led us to conclude that fewer than half of our candidate predictors were needed to build our models.

The second question of this analysis built on the first by asking: if a given movie is profitable, will the sequel and the following installments of the collection be increasingly profitable? To analyze this question, we began by laying out the most profitable movie collections and tracking each installment's profit in comparison to the previous installments. Another important consideration was the disparity between successive sequels' profits. Viewed through the lens of a movie studio, both elements of this question matter for determining the future of a potential movie franchise and the revenue associated with it. While the findings presented in the results are worth considering, they would be more conclusive if a larger sample of movie collections had existed at the time this data was collected.

Ultimately, we need to connect our results to the broader world and to consumers. For film studios, profits are more often than not a direct determinant in deciding whether to pursue another film in the same series. Ask yourself: as a film studio executive, would you want to invest in a sequel when your film made little to no profit? Probably not, and if you did, it would not make financial sense. Secondly, as much as film studios can benefit from awards, critical praise, and other metrics, the ability to model and predict profits is a useful aid we created for decision-making among studios. Our findings also give passionate filmgoers a tool for keeping up with current box office trends and the overall state of the film industry.

When we think of films, many of us think of going to the movie theater, but the landscape of "going to the box office" has shifted. One important element of the post-COVID film industry is that many films have been released on streaming services to subscribers at no extra cost or for an additional fee, while some studios have opted for a limited theatrical release or a combination of both. Since our dataset includes films only up to 2017, we were unable to account for the effects of the pandemic, but those who find the data relevant might expand on our existing models to get a more accurate picture of the future. Possibilities include adding films released during or after the COVID-19 pandemic and incorporating how wide a film's release was, that is, the number of theaters that showed a particular film. Whether the question is profitability or the future of a film series, this data gives us better confidence to make those decisions.

Extending this analysis into the future, it is important to consider external factors in the movie industry that the analyzed data set does not capture. The first is the impact of the COVID-19 pandemic, as noted above. 2020 saw a significant downturn in movie releases as studios postponed films and looked elsewhere for sources of revenue, such as streaming services. The latter half of 2021 marked the beginning of a rebound, but there are real concerns about whether movies will again be released at the increasing rate seen during the 2010s. Another factor is the growth of streaming services, into which prominent studios have poured time and money since the onset of the pandemic. With it now easier than ever to stream a movie from the comfort of home, many movies not bound for box office success are likely to end up on streaming platforms. Going forward, it will be imperative for movie studios to account for effects such as these when deciding whether to begin production on individual movies and when creating successive installments of existing franchises.