Everything started with this small DataCamp project on linear regression analysis: “Disney Movies and Box Office Success”.
The dataset contains four CSV files with the following information:
- Disney movie revenues adjusted for inflation, with movie titles and release dates
- Disney revenues from movies, parks, TV networks and video games
- Disney Movies Directors
- Disney Movies Voices
You can also find the original dataset in my Disney repository on GitHub.
After finishing the guided exercise I decided to dive deeper into the dataset to understand how reliable it is. These are the most important results of the analysis.
Best Selling Movies
|Movie title||Release date||Total gross||Inflation-adjusted gross|
|Snow White and the Seven Dwarfs||Dec 21 1937||$ 184.925.485,00||$ 5.228.953.251,00|
|Pinocchio||Feb 9 1940||$ 84.300.000,00||$ 2.188.229.052,00|
|Fantasia||Nov 13 1940||$ 83.320.000,00||$ 2.187.090.808,00|
|101 Dalmatians||Jan 25 1961||$ 153.000.000,00||$ 1.362.870.985,00|
|Lady and the Tramp||Jun 22 1955||$ 93.600.000,00||$ 1.236.035.515,00|
|Song of the South||Nov 12 1946||$ 65.000.000,00||$ 1.078.510.579,00|
|Star Wars Ep. VII: The Force Awakens||Dec 18 2015||$ 936.662.225,00||$ 936.662.225,00|
|Cinderella||Feb 15 1950||$ 85.000.000,00||$ 920.608.730,00|
|The Jungle Book||Oct 18 1967||$ 141.843.000,00||$ 789.612.346,00|
|The Lion King||Jun 15 1994||$ 422.780.140,00||$ 761.640.898,00|
Best Selling Directors
|Director||Average total gross||Average inflation-adjusted gross|
|David Hand||$ 184.925.485,00||$ 5.228.953.251,00|
|Ben Sharpsteen||$ 84.300.000,00||$ 2.188.229.052,00|
|full credits||$ 83.320.000,00||$ 2.187.090.808,00|
|Hamilton Luske||$ 93.600.000,00||$ 1.236.035.515,00|
|Roger Allers||$ 422.780.140,00||$ 761.640.898,00|
|Wilfred Jackson||$ 143.075.676,50||$ 560.880.042,00|
|Wolfgang Reitherman||$ 107.334.398,00||$ 381.435.547,00|
|Chris Buck||$ 285.914.914,00||$ 349.448.714,00|
|Byron Howard||$ 341.268.248,00||$ 341.268.248,00|
|Don Hall||$ 222.527.828,00||$ 229.249.222,00|
Why do the two plots differ?
Some directors were responsible for more than one movie, so I grouped the movies by director and took the average gross per director.
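A minimal sketch of this grouping step with pandas. The rows below are made-up toy values, not the real dataset, and the column names (`director`, `total_gross`, `inflation_adjusted_gross`) are my assumptions about how the merged data might look.

```python
import pandas as pd

# Toy data for illustration only: titles, directors and amounts are invented,
# and the column names are assumptions, not necessarily those of the CSV files.
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C"],
    "director": ["Director X", "Director X", "Director Y"],
    "total_gross": [100.0, 300.0, 50.0],
    "inflation_adjusted_gross": [1000.0, 600.0, 50.0],
})

# One row per director, holding the mean of each revenue column,
# sorted so the highest average adjusted gross comes first.
by_director = (
    movies.groupby("director")[["total_gross", "inflation_adjusted_gross"]]
    .mean()
    .sort_values("inflation_adjusted_gross", ascending=False)
)
print(by_director)
```

A director with a single hit keeps that movie's full value, while a director with many movies is pulled toward the middle by the average, which is why the second ranking is not a simple copy of the first.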
Diving Deep Into Data
As stated before, I wanted to understand how reliable the data are. Applying some Python techniques is just a matter of practice; understanding whether the data are consistent and reliable requires one more step.
The first two important things that I must highlight:
- I cannot figure out how the inflation adjustment was calculated, or where the authors took the inflation time series used to calculate the present value
- The dataset is incomplete
How did I discover that the dataset was incomplete?
Quite easy with
The DataFrame with all the movie titles and revenues has 579 rows.
The DataFrame containing all the directors, on the other hand, has only 56 rows.
At first sight we could say that we are missing 523 (579 - 56) films, right?
Correct, but the analysis can go even deeper and uncover something more.
To get the best selling directors I made an inner join between the first DataFrame containing the movies’ revenues and the DataFrame containing the list of directors. This step uncovered another problem with the dataset.
What emerged from this inner join?
The resulting DataFrame had only 49 rows, so in total I was missing 530 (579 - 49) titles, not 523.
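The difference between the two counts can be reproduced on a tiny example. This is only a sketch: the rows are invented, and the column names `movie_title` and `name` are assumptions about the two files.

```python
import pandas as pd

# Toy data for illustration only (invented titles and directors).
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "total_gross": [10.0, 20.0, 30.0, 40.0, 50.0],
})
directors = pd.DataFrame({
    "name": ["Movie A", "Movie B", "Movie Z"],  # "Movie Z" has no revenue row
    "director": ["Director X", "Director Y", "Director Y"],
})

# An inner join keeps only the titles present in BOTH DataFrames.
joined = movies.merge(directors, left_on="movie_title", right_on="name", how="inner")

# Comparing raw row counts underestimates the gap...
naive_missing = len(movies) - len(directors)
# ...because some director rows match no movie at all.
actual_missing = len(movies) - len(joined)

print(len(joined), naive_missing, actual_missing)
```

With the numbers from this post the same arithmetic gives 579 - 56 = 523 for the naive count and 579 - 49 = 530 after the join.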
Which movies was the first DataFrame missing?
The DataFrame with the list of the Disney Movies and their revenues was missing the following titles:
- Saludos Amigos
- Robin Hood
- The Adventures of Ichabod and Mr. Toad
- Peter Pan
- Make Mine Music
- The Three Caballeros
- Melody Time
- Fantasia 2000
- Fun and Fancy Free
All of these titles, however, do appear in the directors DataFrame.
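One way to list titles that exist only on the directors side is a left join with pandas' `indicator=True` option. The sketch below uses a handful of titles from this post; the column names and the placeholder director values are assumptions.

```python
import pandas as pd

# Small illustrative subset; director values are placeholders, not real credits.
movies = pd.DataFrame({
    "movie_title": ["Cinderella", "Pinocchio"],
})
directors = pd.DataFrame({
    "name": ["Cinderella", "Peter Pan", "Pinocchio", "Robin Hood"],
    "director": ["Director X", "Director Y", "Director Z", "Director W"],
})

# Keep every director row and flag whether a matching movie was found.
check = directors.merge(
    movies, left_on="name", right_on="movie_title", how="left", indicator=True
)

# "left_only" marks titles present in directors but absent from movies.
only_in_directors = check.loc[check["_merge"] == "left_only", "name"].tolist()
print(only_in_directors)
```

The `_merge` column added by `indicator=True` takes the values `both`, `left_only` or `right_only`, so the same pattern also works in the other direction.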
What if this was just a typo or some stray whitespace that broke the inner join?
To be sure, I ran a regex search to double-check whether these movies (especially Bambi, Robin Hood, Peter Pan and Dumbo) were truly missing from the movie revenues DataFrame.
The check confirmed that we were indeed missing these data.
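A regex check of this kind could look like the sketch below. It is not the exact code from my notebook: the helper `find_titles`, the toy titles and the column name `movie_title` are all assumptions made for illustration.

```python
import pandas as pd

# Toy titles with deliberate whitespace and casing noise (illustration only).
movies = pd.DataFrame({
    "movie_title": ["  Cinderella ", "pinocchio", "The Lion King"],
})

def find_titles(df: pd.DataFrame, pattern: str) -> pd.DataFrame:
    """Return the rows whose title matches the regex, ignoring case."""
    mask = df["movie_title"].str.contains(pattern, case=False, regex=True, na=False)
    return df[mask]

# A substring search is not fooled by leading or trailing spaces or by casing,
# so an empty result means the title is really absent.
pattern = r"bambi|robin\s+hood|peter\s+pan|dumbo"
hits = find_titles(movies, pattern)
print(hits)

# Sanity check: a title that IS present is found despite the extra spaces.
print(find_titles(movies, r"cinderella"))
```

Running the sanity check on a title known to be present is what gives the empty result its meaning: the search itself works, so the movies are really not there.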
Curiously, Fantasia 2000 itself was missing from the movies DataFrame: it contained only the revenues from the Fantasia 2000 IMAX distribution.
I didn’t know anything about the IMAX distribution and I was really curious about it.
From Wikipedia: “Fantasia 2000 premiered on December 17, 1999 at Carnegie Hall in New York City as part of a concert tour that also visited London, Paris, Tokyo, and Pasadena, California. The film was then released in 75 IMAX theaters worldwide from January 1 to April 30, 2000, marking the first animated feature-length film to be released in the format”
“IMAX is a proprietary system of high-resolution cameras, film formats, film projectors, and theaters known for having very large screens with a tall aspect ratio (approximately either 1.43:1 or 1.90:1) and steep stadium seating.”
Why are we missing so much data?
Two possible reasons.
One could be the scraping method used to build the dataset. The other could be the original data source that was scraped: it may already have lacked these data when the dataset was built through scraping.
- Garbage in, garbage out: when you run an analysis you should always have a framework to check how “dirty” your data are. Otherwise you will not be aware of how biased your analysis is going to be.
- Why have I never seen Lady and the Tramp?
Two questions will be answered in two more posts on this blog in the coming days:
What are Disney's revenue streams? How has Disney made money in recent years?
Can we create a linear regression to explain the relationship between movie revenues and genre?
If you liked this post or think it is useful, feel free to share it.
With just a tweet or a like on LinkedIn, new opportunities might arise.
Thank you for your time,
Here you can find the entire notebook attached.