# Best Selling Disney Movies from 1935 until 2016 Python Business Analysis – With Code!

By Andrea Ciufo

Everything started with this small DataCamp project on linear regression analysis: “Disney Movies and Box Office Success”.

It was based on a modified dataset from this repository on data.world.

The dataset contains four CSV files with the following information:

• Disney Movies Revenues Adjusted for Inflation with Movie Titles and Year Release
• Disney Revenues from movies, parks, tv network, videogames
• Disney Movies Directors
• Disney Movies Voices

You can also find the original dataset in my Disney repository on GitHub.

After finishing the guided exercise I decided to dive deep into the dataset to understand its reliability. This is the most important output from the analysis.

## Why do the two plots differ?

Some directors were responsible for more than one movie so I decided to group the information and determine the average value.
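The grouping step described above can be sketched with pandas as follows. This is a minimal illustration, not the original notebook code: the toy rows and the column names `director` and `inflation_adjusted_gross` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data; titles, directors, and column names are
# placeholders, not the real dataset schema.
movies = pd.DataFrame({
    "movie_title": ["Film A", "Film B", "Film C"],
    "director": ["Director X", "Director X", "Director Y"],
    "inflation_adjusted_gross": [100.0, 300.0, 150.0],
})

# One row per director: the mean adjusted gross over all of their movies,
# sorted so the best-performing director comes first.
avg_by_director = (
    movies.groupby("director")["inflation_adjusted_gross"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_director)
```

Averaging collapses several movies into a single value per director, which is why a plot of individual movies and a plot of directors can look different.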

## Diving Deep Into Data

As stated before, I wanted to understand how reliable the data are. Applying some Python techniques is just a matter of practice; understanding whether the data are consistent and reliable requires one more step.

The first two important things that I must highlight:

• I cannot figure out how the inflation adjustment was calculated, or where the authors took the inflation time series used to calculate the present value
• The Dataset is incomplete

## How did I discover the dataset was incomplete?

Quite easy with

The DataFrame with all the movie titles and revenues has 579 rows.

On the other hand, the DataFrame containing all the directors has 56 rows.

At first sight, we can say that we are missing 523 (579 - 56) films, right?
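As a rough sketch of this first check, the row counts can be compared directly. The two frames below are stand-ins built only to reproduce the sizes reported above (579 and 56); the names `revenues` and `directors` and the column names are assumptions, and this is not necessarily the command used in the original analysis.

```python
import pandas as pd

# Stand-in frames that only reproduce the reported row counts.
revenues = pd.DataFrame({"movie_title": [f"movie {i}" for i in range(579)]})
directors = pd.DataFrame({"name": [f"movie {i}" for i in range(56)]})

# shape is (rows, columns); len() returns the number of rows.
print(revenues.shape, directors.shape)

# Naive estimate of how many films have no director information.
naive_gap = len(revenues) - len(directors)
print(naive_gap)
```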

Correct, but our analysis can go even deeper and we can discover something more.

To get the best selling directors I made an inner join between the first DataFrame containing the movies’ revenues and the DataFrame containing the list of directors. This step uncovered another problem with the dataset.

## What emerged from this inner join?

The resulting DataFrame had only 49 rows, so in total I was missing 530 titles (579 - 49), not 523.
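A minimal sketch of such an inner join with `pandas.merge` is shown below. The toy rows, the placeholder director names, and the key columns `movie_title` and `name` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: five revenue rows, three director rows,
# and only two titles in common.
revenues = pd.DataFrame({
    "movie_title": ["Pinocchio", "Cinderella", "Aladdin", "Mulan", "Frozen"],
    "inflation_adjusted_gross": [5.0, 4.0, 3.0, 2.0, 1.0],
})
directors = pd.DataFrame({
    "name": ["Pinocchio", "Cinderella", "Bambi"],
    "director": ["Director A", "Director B", "Director C"],
})

# An inner join keeps only the titles present in BOTH frames.
joined = revenues.merge(
    directors, left_on="movie_title", right_on="name", how="inner"
)

# Revenue rows that found no director are the real gap.
unmatched = len(revenues) - len(joined)
print(len(joined), unmatched)
```

Comparing `len(joined)` with the size of the smaller frame is a quick way to notice that some keys did not match, as happened here with 49 rows instead of 56.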

## Which movies was the first DataFrame missing?

The DataFrame with the list of the Disney Movies and their revenues was missing the following titles:

• Saludos Amigos
• Bambi
• Robin Hood
• Peter Pan
• Dumbo
• Make Mine Music
• The Three Caballeros
• Melody Time
• Fantasia 2000
• Fun and Fancy Free

All the titles in the list above do appear in the directors DataFrame.
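One way to list the titles that exist in one DataFrame but not in the other is a left join with the `indicator` option of `pandas.merge`. The sketch below uses toy rows with placeholder director names; the column names are assumptions, and this is not necessarily how the original list was produced.

```python
import pandas as pd

# Hypothetical toy data; only the technique matters here.
revenues = pd.DataFrame({"movie_title": ["Pinocchio", "Cinderella", "Aladdin"]})
directors = pd.DataFrame({
    "name": ["Pinocchio", "Cinderella", "Bambi"],
    "director": ["Director A", "Director B", "Director C"],
})

# indicator=True adds a "_merge" column that says where each row came from.
check = directors.merge(
    revenues, left_on="name", right_on="movie_title",
    how="left", indicator=True,
)

# "left_only" rows are titles with a director but no revenue record.
missing_titles = check.loc[check["_merge"] == "left_only", "name"].tolist()
print(missing_titles)
```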

But what if this was just caused by a typo or stray whitespace that corrupted the inner join?

To be sure, I ran a regex analysis to double-check whether these movies (especially Bambi, Robin Hood, Peter Pan, and Dumbo) were truly missing from the movies’ revenues DataFrame.

This check confirmed that we were missing this data.
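A regex check of this kind could look like the sketch below. It searches for a title anywhere in the column, ignoring case, so extra whitespace or a suffix cannot hide a match. The helper `find_title`, the toy rows, and the exact spelling of the IMAX entry are hypothetical.

```python
import re

import pandas as pd

# Hypothetical toy data, including stray whitespace and a suffixed title.
revenues = pd.DataFrame({
    "movie_title": ["Pinocchio ", "Fantasia 2000 (IMAX)", "Cinderella"],
})


def find_title(df, title, column="movie_title"):
    """Return every value in `column` that contains `title` as whole words."""
    # re.escape protects any special characters in the title;
    # \b keeps "Dumbo" from matching inside a longer word.
    pattern = r"\b" + re.escape(title) + r"\b"
    mask = df[column].str.contains(pattern, case=False, regex=True)
    return df.loc[mask, column].tolist()


bambi_hits = find_title(revenues, "Bambi")
fantasia_hits = find_title(revenues, "Fantasia 2000")
print(bambi_hits)
print(fantasia_hits)
```

An empty result for a title means no variant of it is present, so the failed join is not just a formatting problem.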

Curiously, Fantasia 2000 was also missing from the movies DataFrame: it contained only the revenues from the Fantasia 2000 IMAX distribution.

I didn’t know anything about the IMAX distribution and I was really curious about it.

From Wikipedia: “Fantasia 2000 premiered on December 17, 1999 at Carnegie Hall in New York City as part of a concert tour that also visited London, Paris, Tokyo, and Pasadena, California. The film was then released in 75 IMAX theaters worldwide from January 1 to April 30, 2000, marking the first animated feature-length film to be released in the format”

“IMAX is a proprietary system of high-resolution cameras, film formats, film projectors, and theaters known for having very large screens with a tall aspect ratio (approximately either 1.43:1 or 1.90:1) and steep stadium seating.”

## Why are we missing so much data?

There are two possible reasons.

One could be the scraping method used to build the dataset. The other could be the original data source that was scraped: it may already have lacked this data when the dataset was built.

## Lessons Learned

1. Garbage in, garbage out: when you run an analysis, you should always have a framework to check how “dirty” your data are. Otherwise, you will not be aware of how biased your analysis is going to be.
2. Why do I never see Lady and the Tramp?

What are Disney's revenue streams? How has Disney made money in recent years?

Can we create a linear regression to explain the relationship between movie revenues and genre?

If you liked this post or think it is useful, feel free to share it.

With just a tweet or a like on LinkedIn, new opportunities might arise.