Some weeks ago I created a list of projects that I wanted to build.
The goal of the list is to assess how my Data Science and Python knowledge has changed over the years.
The key topic of one of the projects on this list is the Linear Regression Model.
For this topic I wanted something different from the classic house price exercise on Coursera, and my research landed on the project section of Datacamp.com.
Obviously, for me the Coursera exercise and the whole Machine Learning course are still invaluable: the best introduction you can have to Machine Learning (and it is totally free).
As some friends already know, I am a huge fan of DataCamp.com, because it is where I learned Python applied to Machine Learning and Data Science (after taking the Coursera course).
Here I found an interesting project to complete. They used a Disney Movies dataset to train a Linear Regression.
Based on that, I decided to give the project a try.
The title of the original project was “Disney Movies and Box Office Success” and the subtitle was “Explore Disney movie data, then build a linear regression model to predict box office success.”
I completed it.
Despite this I was not satisfied; something was still off.
So I dived deep into the data and the theory of Linear Regression Models (not GLM), and discovered and proved that in this situation the use of linear regression was meaningless.
Why was it meaningless to use a Linear Regression?
Because they used a categorical variable to train a linear regression model (again, not a Generalized Linear Model), and this is wrong.
Later in the post I am going to explain why.
Based on that, my thesis is the following:
It is not possible to use a linear regression to model the relationship between a continuous dependent variable and a categorical independent variable. Even if you can do it with workarounds such as one-hot encoding, which converts the categorical variable into numerical ones, it is still meaningless.
Obviously, I am open to discussing the thesis above, provided the discussion is supported by a strong and robust argument.
How it starts
The project starts by exploring the dataset in both graphical and tabular form.
The dataset contains:
- Disney Film Titles since 1935
- Movies inflation-adjusted revenues
- Movies rating
- Movies genre
The dataset is freely available here. Be aware that the dataset is not complete: for example, Dumbo, Robin Hood and Bambi are missing.
In my opinion, the author was a little sloppy in the graphical representation of the inflation-adjusted gross revenue, for three main reasons:
- The plot is missing a title
- The Y axis is missing the unit of measurement. Moreover, the values are labeled in scientific notation. Reading revenues in scientific notation requires more cognitive effort, so it is distracting
- The goal is to show a trend in the data, but with 12 genres to show, especially in the most recent years, it is tough to get a clear idea of what is happening to each movie genre. Instead, what you get is that global movie revenues dropped significantly over time. This last point raises another question: how did the authors of the original dataset calculate the inflation-adjusted gross revenues? I have already asked here.
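The three points above can be addressed with a few lines of matplotlib. This is only a minimal sketch on made-up numbers (not the Disney data): the toy DataFrame, the `millions` helper and the choice of showing millions of USD are my own assumptions, not something taken from the DataCamp notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter

# Hypothetical toy data, NOT the Disney dataset
toy = pd.DataFrame({
    "year": [1990, 2000, 2010, 1990, 2000, 2010],
    "genre": ["Comedy"] * 3 + ["Drama"] * 3,
    "inflation_adjusted_gross": [2.1e8, 1.5e8, 0.9e8, 1.2e8, 0.8e8, 0.6e8],
})


def millions(value, _pos=None):
    """Format a dollar amount in millions instead of scientific notation."""
    return f"{value / 1e6:.0f} M"


fig, ax = plt.subplots()
# One line per genre
for genre, group in toy.groupby("genre"):
    ax.plot(group["year"], group["inflation_adjusted_gross"], marker="o", label=genre)

ax.set_title("Inflation-adjusted gross by genre (toy data)")  # point 1: a title
ax.set_xlabel("Year")
ax.set_ylabel("Inflation-adjusted gross (million USD)")       # point 2: the unit
ax.yaxis.set_major_formatter(FuncFormatter(millions))         # point 2: no scientific notation
ax.legend(title="Genre")
fig.canvas.draw()
```

The third point (too many genres on one chart) is not solved by formatting alone; plotting only a few genres at a time is one option.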
Then the project continues by implementing a linear regression. The class used is LinearRegression() from scikit-learn (sklearn).
The goal is to understand the relationship between movie genre and box office gross.
“Since linear regression requires numerical variables and the genre variable is a categorical variable, we’ll use a technique called one-hot encoding to convert the categorical variables to numerical. This technique transforms each category value into a new column and assigns a 1 or 0 to the column.”
“Now that we have dummy variables, we can build a linear regression model to predict the adjusted gross using these dummy variables.”
“From the regression model, we can check the effect of each genre by looking at its coefficient given in units of box office gross dollars. We will focus on the impact of action and adventure genres here. (Note that the intercept and the first coefficient values represent the effect of action and adventure genres respectively). We expect that movies like the Lion King or Star Wars would perform better for box office.” -Quoting DataCamp Notebook
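To make the quoted steps concrete, here is a minimal sketch on made-up numbers (not the Disney data). It assumes the dummies are built with `pd.get_dummies(..., drop_first=True)`, which would explain why the notebook says that the intercept represents the Action genre; the toy DataFrame and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, NOT the Disney dataset
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Adventure", "Adventure", "Horror", "Horror"],
    "inflation_adjusted_gross": [100.0, 140.0, 200.0, 260.0, 40.0, 60.0],
})

# One-hot encode the genre. drop_first=True drops "Action" (first in
# alphabetical order), so Action becomes the baseline absorbed by the intercept.
genre_dummies = pd.get_dummies(toy["genre"], drop_first=True, dtype=float)

regr = LinearRegression()
regr.fit(genre_dummies, toy["inflation_adjusted_gross"])

print("Intercept (baseline genre):", regr.intercept_)
print("Coefficients:", dict(zip(genre_dummies.columns, regr.coef_)))

# Prediction for a horror film: Adventure dummy = 0, Horror dummy = 1
horror_row = pd.DataFrame([[0.0, 1.0]], columns=genre_dummies.columns)
horror_pred = regr.predict(horror_row)[0]
print("Predicted gross for a horror film:", horror_pred)
```

With this encoding the intercept is the mean gross of the baseline genre, and each coefficient is the difference between a genre's mean and the baseline mean.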
So what is the author doing?
The equation the author defined can be expressed as follows:
Where Theta is the coefficient and X the possible values for each movie genre. Because they are using dummy variables, the X value can only be 0 or 1.
Based on that, the estimated inflation-adjusted gross revenues for a horror film will be:
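A sketch of both equations in LaTeX, assuming (as the quoted note about the intercept suggests) that the Action dummy was dropped and acts as the baseline. The index $n$ and the subscript names are my own notation:

```latex
% General model: one dummy variable x_i per non-baseline genre
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n,
\qquad x_i \in \{0, 1\}

% Horror film: only the Horror dummy is 1, every other x_i is 0
\hat{y}_{\text{Horror}} = \theta_0 + \theta_{\text{Horror}}
```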
On the graphical side, a sketched and intuitive way to visualize this model:
What’s wrong with LinearRegression()?
The answer is already contained in the underlying assumptions of the Linear Model:
- Residuals should be normally distributed
- Independent variables should have a linear and additive relationship with the target
- There should be no linear relationship among the independent variables (no multicollinearity)
- Residuals should not be autocorrelated
- Heteroskedasticity should not be present in the data (the error variance should be constant)
The second point holds the main reason why it is not possible to model a linear regression between a continuous variable (target) and a categorical variable (independent).
We must evaluate the relationship between the target and the independent variables.
A possible way is the Pearson Correlation Coefficient:
The Pearson Correlation Coefficient is the covariance between two variables normalized by the product of the standard deviations of the two random variables.
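In symbols, using the standard definition (here $X$ is the independent variable and $Y$ the target; the symbol names are my choice):

```latex
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
           = \frac{\mathbb{E}\big[(X - \mu_X)(Y - \mu_Y)\big]}{\sigma_X \, \sigma_Y}
```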
The Pearson Correlation Coefficient will be 0 in that situation. This is because the independent variable can take only two values, 0 and 1.
What have I done?
Besides the idea that using a Linear Regression was the wrong approach, I also thought that an individual estimator of the expected value of the next film, one per movie genre, would be more accurate than a global one.
To understand whether my assumption was right:
- I calculated the R² score (the coefficient of determination returned by score()); be aware that it is different from the Pearson Correlation Coefficient. See the sklearn documentation.
- I calculated the Mean Squared Error of the DataCamp Linear Regression Model
- I compared the Mean Squared Error of my Individual Linear Model with the one of the DataCamp Model
    # Evaluating the results of the DataCamp regression model
    # regr is the linear model trained with DataCamp HP
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    prediction = regr.predict(genre_dummies)
    score = regr.score(genre_dummies, gross['inflation_adjusted_gross'])
    print("R Score is:\n", np.round(score, 3))
    reg_mean_squared_error = mean_squared_error(gross['inflation_adjusted_gross'], prediction)
    # For better visualization I suppressed the scientific notation
    print(f"{reg_mean_squared_error:.1f}")
    R Score is:
     0.109
    72801697571099360.0
    # Here I started training my model, just to understand whether
    # an individual predictor has a lower Mean Squared Error.
    # We fit the linear regressions individually, one for each genre,
    # then get the results and make our assumptions.
    genre_list = list(set(gross['genre'].dropna()))
    print(type(genre_list))
    score_df = pd.DataFrame(genre_list, columns=['genre_list'])
    score_df = score_df.assign(r2=np.zeros(len(genre_list)))
    score_df = score_df.assign(MSE=np.zeros(len(genre_list)))
    print(score_df)
    for idx, x in enumerate(score_df['genre_list']):
        y_vector = gross[gross['genre'] == x]['inflation_adjusted_gross']
        # A column of ones: each per-genre model is intercept-only
        x_vector = np.ones([len(y_vector), 1])
        regr = LinearRegression()
        regr.fit(x_vector, y_vector)
        y_pred = regr.predict(x_vector)
        score_df.at[idx, 'r2'] = regr.score(x_vector, y_vector)
        score_df.at[idx, 'MSE'] = np.round(mean_squared_error(y_vector, y_pred))
    # Comparing the Andrea Ciufo estimator with the DataCamp estimator
    print(score_df.info())
    score_df = score_df.assign(dc_mse=reg_mean_squared_error)
    score_df = score_df.assign(dc_mse_grt_mse=score_df['dc_mse'] > score_df['MSE'])
    print(score_df.sort_values(by=['MSE'], ascending=False))
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 12 entries, 0 to 11
    Data columns (total 4 columns):
     #   Column      Non-Null Count  Dtype
    ---  ------      --------------  -----
     0   genre_list  12 non-null     object
     1   r2          12 non-null     float64
     2   MSE         12 non-null     float64
     3   dc_mse      12 non-null     float64
    dtypes: float64(3), object(1)
    memory usage: 512.0+ bytes
    None
                 genre_list   r2  ...                 dc_mse  dc_mse_grt_mse
    4               Musical  0.0  ...  7.280169757109936e+16           False
    9             Adventure  0.0  ...  7.280169757109936e+16            True
    7                 Drama  0.0  ...  7.280169757109936e+16            True
    11               Action  0.0  ...  7.280169757109936e+16            True
    3                Comedy  0.0  ...  7.280169757109936e+16            True
    6     Thriller/Suspense  0.0  ...  7.280169757109936e+16            True
    5       Romantic Comedy  0.0  ...  7.280169757109936e+16            True
    0               Western  0.0  ...  7.280169757109936e+16            True
    10         Black Comedy  0.0  ...  7.280169757109936e+16            True
    2   Concert/Performance  0.0  ...  7.280169757109936e+16            True
    8                Horror  0.0  ...  7.280169757109936e+16            True
    1           Documentary  0.0  ...  7.280169757109936e+16            True
With the exception of the Musical genre, the Individual Estimator always has a lower MSE.
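A note on why r2 is 0.0 for every genre: a LinearRegression fitted on a column of ones can only learn an intercept, so it always predicts the genre mean. Its MSE is then the within-genre variance, and R² is 0 by construction. The sketch below shows the same quantities computed directly with a groupby, on made-up numbers (not the Disney data); the toy DataFrame and the `population_mse` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the Disney dataset
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Adventure", "Adventure", "Horror", "Horror"],
    "inflation_adjusted_gross": [100.0, 140.0, 200.0, 260.0, 40.0, 60.0],
})


def population_mse(values):
    """MSE of predicting the group mean for every film in the group."""
    return float(np.mean((values - values.mean()) ** 2))


# One row per genre: the prediction (mean) and the MSE of that prediction
per_genre = toy.groupby("genre")["inflation_adjusted_gross"].agg(
    prediction="mean",
    mse=population_mse,
)
print(per_genre)
```

This gives the same predictions and MSE values as fitting one intercept-only model per genre, without the loop.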
I also posted the question on https://stats.stackexchange.com/
Improve the readability of the plots. It is something I already started to do in my previous post, but I can push further. Using a Slope Chart or a Bump Chart could be an idea to explore. A slope chart shows changes between two points; in our situation it could be an interesting representation of how movie genre revenues changed between the last two decades.
Better understand how to implement an ANOVA for this particular case and how to model a Generalized Linear Model.
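As a starting point for the ANOVA item, a one-way ANOVA can be run with `scipy.stats.f_oneway`, which takes one array of observations per group. This is only a minimal sketch on made-up numbers (not the Disney data), and I have not yet checked whether its assumptions hold for the real dataset.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data, NOT the Disney dataset
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Adventure", "Adventure", "Horror", "Horror"],
    "inflation_adjusted_gross": [100.0, 140.0, 200.0, 260.0, 40.0, 60.0],
})

# One array of gross values per genre
groups = [
    group["inflation_adjusted_gross"].to_numpy()
    for _, group in toy.groupby("genre")
]

# Null hypothesis: all genres have the same mean gross
f_stat, p_value = f_oneway(*groups)
print("F statistic:", f_stat)
print("p-value:", p_value)
```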
Thanks for reading. Readers like you drive my passion for data and data dissemination.