A French startup (https://navee.co/) contacted me on LinkedIn about a challenging job interview (spoiler: it didn’t go well).
They identify potential online fraud by applying Natural Language Processing and Image Recognition to online marketplaces.
E.g., a listing whose house picture also appears on a stock photo website, paired with a standard message, could be classified as a scam by the startup’s technology.
The technical interview was a project that I had to develop in one week: an SMS fraud detector.
Detection would be based on Natural Language Processing techniques.
The dataset was about 5,500 SMS messages from Kaggle.
I knew I was not very confident with Natural Language Processing.
Until that moment I had only worked with some regex and less complex analysis, which is why I decided to go ahead with the interview.
When you have a deadline everything moves faster, learning included.
The workflow was the following:
- Imported the necessary modules and libraries
- Defined two functions
  - The first one to train the classifier and get its predictions
  - The second one to evaluate classifier performance
- Imported the CSV as a DataFrame, taking care of the encoding
- Explored the data in tabular format
- Plotted the % of “Genuine SMS” and “Fraud SMS” with a pie chart
- Defined a list of classifiers to train with different hyperparameters
- Trained and tested the classifiers
- Sorted the results by a performance metric; I decided to use the F1 score
Which parameters did I test for every classifier?
Basically n-grams.
An n-gram is a contiguous subsequence of n elements of a given sequence.
E.g. the phrase “Andrea Ciufo is really nice” is composed of:
- 5 1-grams [“Andrea” “Ciufo” “is” “really” “nice”]
- 4 2-grams [“Andrea Ciufo” “Ciufo is” “is really” “really nice”]
- 3 3-grams [“Andrea Ciufo is” “Ciufo is really” “is really nice”]
In the task, I evaluated only 1-grams, 2-grams and their combination (in two cases I trained the classifier on a single n-gram size, in the third case on both sizes together).
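To make the definition concrete, here is a minimal sketch in plain Python that reproduces the three lists above. The helper name `word_ngrams` is made up for this illustration; it splits on whitespace only, whereas scikit-learn's CountVectorizer (used later in the notebook) also lowercases the text and, with stop_words='english', drops common words before building the n-grams, so its output on real SMS will differ.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams of `text` (whitespace tokenization)."""
    tokens = text.split()
    # Slide a window of n tokens over the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


phrase = "Andrea Ciufo is really nice"
for n in (1, 2, 3):
    grams = word_ngrams(phrase, n)
    print(len(grams), f"{n}-grams:", grams)
```

A sentence of 5 words always yields 5 - n + 1 n-grams, which is where the counts 5, 4 and 3 come from.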
After I delivered the work, the interview itself was not long, one hour or less. During it, I explained the whole workflow to the CEO, and he dug into some issues, such as how I cleaned and preprocessed the data and the performance metrics that I chose.
After the job interview, I asked for structured feedback in case I was rejected.
In the following days, they replied explaining my mistakes and gaps, which I really appreciated:
- You saw that the dataset was very unbalanced but you didn’t do anything about it. Take a look at class weights and sample weights methods
- You have to be clear on what metric to use to compare different algorithms. You were computing every metric possible but a lot of them were irrelevant. “ROC AUC” was the best in this case
- You did not do any pre-processing: remove punctuation, lowercase everything, group numbers … [I really forgot to do that]
- Add other features like: length of the text, ratio of capitals, use of “!”, presence of a URL, email …
- Take a look at the xgboost algorithm, which usually performs better than any other method, or at some deep learning methods
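The pre-processing and extra-feature points of the feedback can be sketched with the standard library alone. This is only one possible reading of the feedback, not what the startup had in mind: the function names, the regular expressions and the "num" placeholder are my own assumptions, and the sample SMS is invented.

```python
import re

# Deliberately simple patterns, good enough for a sketch
URL_RE = re.compile(r"(https?://\S+|www\.\S+)", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def extra_features(text):
    """Hand-crafted features computed on the RAW text (before cleaning)."""
    letters = [c for c in text if c.isalpha()]
    n_upper = sum(c.isupper() for c in letters)
    return {
        "length": len(text),
        "capital_ratio": n_upper / len(letters) if letters else 0.0,
        "n_exclamations": text.count("!"),
        "has_url": bool(URL_RE.search(text)),
        "has_email": bool(EMAIL_RE.search(text)),
    }


def preprocess(text):
    """Lowercase, group every number into one token, strip punctuation."""
    text = text.lower()
    text = re.sub(r"\d+", " num ", text)   # group numbers
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    return " ".join(text.split())          # collapse whitespace


sms = "FREE entry!! Call 08001234567 now, visit www.example.com"
print(extra_features(sms))
print(preprocess(sms))
```

The features are computed before cleaning on purpose: capitals and exclamation marks are exactly what the cleaning step throws away.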
Without looking for excuses, ALL the notes above are extremely useful and I will take them into account in my next projects.
I must thank Pier, an Innlaber and researcher at the Aldo Moro University in Bari, who has also collaborated with the Alan Turing Institute.
He is well grounded in NLP (Natural Language Processing); during the 7 days I kept asking him for tips and showing him my results and the algorithms I used.
In the following posts, I will address all the feedback that I received, with some code included.
You can find the code on GitHub, or attached below.
If you liked this post or found it useful, share it on social networks; with just one click you can create an opportunity.
If you think something needs to be fixed, or you found a typo, write to me!
Thanks for reading my article!
Andrea
#Importing all the modules that we will use
#to read the csv and manipulate dataframes
import pandas as pd
#to work with arrays
import numpy as np
#to do some data visualization
import matplotlib.pyplot as plt
#importing the classes that convert the SMSs to a matrix of token features
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
#Note that the count vectorizer just counts the word frequencies,
#while the TF-IDF vectorizer assigns a score that down-weights words common to many SMSs
from sklearn.model_selection import train_test_split
#Importing the classifiers that we will test
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
#Importing the performance metrics to evaluate our models
from sklearn.metrics import confusion_matrix,auc,roc_auc_score
from sklearn.metrics import recall_score, precision_score, accuracy_score, f1_score
#We define the two main functions for our task:
#one to get the predictions from our classifier,
#the other one to evaluate the performance of the model based on different
#key performance metrics
def get_predictions(clf, count_train, y_train, count_test):
    # fit the classifier to the training data
    clf.fit(count_train,y_train)
    # predict using test data
    y_pred = clf.predict(count_test)
    # compute predicted probabilities (one column per class): y_pred_prob
    y_pred_prob = clf.predict_proba(count_test)
    #for fun: train-set predictions
    #train_pred = clf.predict(count_train)
    #print('train-set confusion matrix:\n', confusion_matrix(y_train,train_pred))
    return y_pred, y_pred_prob
def print_scores(y_test,y_pred,y_pred_prob):
    conf_matrix=confusion_matrix(y_test,y_pred)
    recall= recall_score(y_test,y_pred)
    precision= precision_score(y_test,y_pred)
    f1=f1_score(y_test,y_pred)
    accuracy= accuracy_score(y_test,y_pred)
    # column 1 holds the predicted probability of the positive class (True)
    roc= roc_auc_score(y_test, y_pred_prob[:,1])
    print('test-set confusion matrix:\n',conf_matrix )
    print("recall score: ",recall )
    print("precision score: ", precision)
    print("f1 score: ",f1 )
    print("accuracy score: ", accuracy)
    print("ROC AUC: {}".format(roc))
    # collect the metrics in a one-row DataFrame so that runs can be compared later
    kpi=pd.DataFrame([[recall,precision,f1,accuracy,roc]], columns=['Recall','Precision','F1_Score','Accuracy','ROC-AUC'])
    return kpi
#Name of the file in the directory
name_path='data_spam.csv'
#reading the csv and converting in a df
csv_raw=pd.read_csv(name_path,index_col=0,encoding="ISO-8859-1")
#We need to inspect our dataframe
print(csv_raw.head(2))
print(csv_raw.info())
  label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5572 entries, 0 to 5571
Data columns (total 2 columns):
label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 130.6+ KB
None
#We count how many genuine and fraud SMS are in the dataset
#We are going to group the dataset by label
#and plot the result as a pie chart
group_label=csv_raw.groupby(by='label').count()
print('The dataset is composed by')
print(group_label)
# We plot our dataset
print("Our Dataset as pie chart:")
fig, ax = plt.subplots(1, 1)
#ax.pie expects 1D data, so we pass the single 'SMS' count column (order: ham, spam)
ax.pie(group_label['SMS'],autopct='%1.1f%%', labels=['Genuine','Fraud'], colors=['yellow','r'])
plt.axis('equal')
plt.ylabel('')
The dataset is composed by
        SMS
label
ham    4825
spam    747
Our Dataset as pie chart:
[Figure: pie chart of the share of Genuine and Fraud SMS]
A probabilistic note: if we pick an SMS at random, from a frequentist point of view we have a 13.4% probability of picking a fraudulent SMS (747 out of 5572).
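The same counts also show why accuracy alone is a weak yardstick on this dataset. The short calculation below uses only the two counts printed above; the "always answer genuine" baseline is a thought experiment of mine, not something I ran in the notebook.

```python
n_ham, n_spam = 4825, 747   # counts printed by the groupby above
total = n_ham + n_spam

# Frequentist probability of drawing a fraudulent SMS at random
p_spam = n_spam / total

# Accuracy of a useless "classifier" that labels every SMS as genuine
majority_accuracy = n_ham / total

print(f"P(fraud)          = {p_spam:.1%}")
print(f"baseline accuracy = {majority_accuracy:.1%}")
```

A model that never flags anything would already be right about 86.6% of the time, so any accuracy figure has to be read against that floor.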
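#We create a boolean dummy variable for our dataset, for faster training/testing and later manipulation
#Note: with this mapping the positive class (True) is 'ham',
#so recall, precision and F1 below describe the genuine SMS, not the fraudulent ones
df=csv_raw.copy()
b={'ham':True,'spam':False}
df['Status']=df['label'].map(b)
print(df.head())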
  label                                                SMS  Status
0   ham  Go until jurong point, crazy.. Available only ...    True
1   ham                      Ok lar... Joking wif u oni...    True
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...   False
3   ham  U dun say so early hor... U c already then say...    True
4   ham  Nah I don't think he goes to usf, he lives aro...    True
list_classifier=[MultinomialNB(),LogisticRegression(),RandomForestClassifier(),RandomForestClassifier(criterion='entropy')]
names=['MultinomialNaiveBayes','Logistic Regression', 'Random Forest Classifier', 'Random Forest Classifier crit=Entropy']
y=df['Status']
X_train, X_test, y_train, y_test = train_test_split(df['SMS'],y,test_size=0.33)
Naive Bayes –> Generative classifiers learn a model of the joint probability p(x, y) and use Bayes’ rule to calculate p(y | x) to make a prediction
Logistic Regression –> Discriminative models learn the posterior probability p(y | x) “directly”
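Written out, the two routes to the same posterior look as follows. This is the textbook formulation, with x the feature vector of an SMS (its n-gram counts x_1, ..., x_d) and y its class; the symbols w and b for the logistic regression weights are my notation, not something that appears in the notebook.

```latex
% Generative route (Naive Bayes): model the joint, then apply Bayes' rule
p(y \mid x) = \frac{p(x, y)}{p(x)}
            = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')},
\qquad
p(x \mid y) = \prod_{j=1}^{d} p(x_j \mid y) \quad \text{(naive independence assumption)}

% Discriminative route (logistic regression): model the posterior directly
p(y = 1 \mid x) = \sigma(w^{\top} x + b),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```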
#In this section we will use the count vectorizer for the four classifiers, selecting keywords (1-grams) and bi-grams
performance_rows=[]
for i in [[1,1],[2,2],[1,2]]:
    #recent scikit-learn versions expect ngram_range to be a tuple
    count_vectorizer=CountVectorizer(stop_words='english',ngram_range=tuple(i))
    #fit the vocabulary on the training set only, then reuse it on the test set
    count_train=count_vectorizer.fit_transform(X_train.values)
    count_test=count_vectorizer.transform(X_test.values)
    for name, clf in zip(names,list_classifier):
        print('Classifier used',name)
        print('n-gram range is',i)
        print()
        y_pred, y_pred_prob = get_predictions(clf, count_train, y_train, count_test)
        loop_performance=print_scores(y_test,y_pred,y_pred_prob)
        loop_performance['n_grams']=str(i)
        loop_performance['Classifier']=name
        performance_rows.append(loop_performance)
        print('__________________')
#DataFrame.append was removed in pandas 2.0, so we concatenate the rows at the end
performance=pd.concat(performance_rows)
Classifier used MultinomialNaiveBayes
n-gram range is [1, 1]
test-set confusion matrix:
[[ 225   24]
 [  11 1579]]
recall score: 0.9930817610062893
precision score: 0.9850280723643169
f1 score: 0.9890385217663639
accuracy score: 0.9809679173463839
ROC AUC: 0.9847414816498699
__________________
Classifier used Logistic Regression
n-gram range is [1, 1]
test-set confusion matrix:
[[ 210   39]
 [   2 1588]]
recall score: 0.9987421383647799
precision score: 0.9760295021511985
f1 score: 0.9872552067143301
accuracy score: 0.977705274605764
ROC AUC: 0.98672930716577
__________________
Classifier used Random Forest Classifier
n-gram range is [1, 1]
test-set confusion matrix:
[[ 212   37]
 [   7 1583]]
recall score: 0.9955974842767296
precision score: 0.9771604938271605
f1 score: 0.9862928348909658
accuracy score: 0.9760739532354541
ROC AUC: 0.9797858099062918
__________________
Classifier used Random Forest Classifier crit=Entropy
n-gram range is [1, 1]
test-set confusion matrix:
[[ 201   48]
 [   6 1584]]
recall score: 0.9962264150943396
precision score: 0.9705882352941176
f1 score: 0.9832402234636872
accuracy score: 0.9706362153344209
ROC AUC: 0.9735495440883029
__________________
Classifier used MultinomialNaiveBayes
n-gram range is [2, 2]
test-set confusion matrix:
[[ 200   49]
 [   0 1590]]
recall score: 1.0
precision score: 0.9701037217815741
f1 score: 0.9848250232270053
accuracy score: 0.9733550842849374
ROC AUC: 0.9582657674724052
__________________
Classifier used Logistic Regression
n-gram range is [2, 2]
test-set confusion matrix:
[[ 132  117]
 [   0 1590]]
recall score: 1.0
precision score: 0.9314586994727593
f1 score: 0.9645131938125568
accuracy score: 0.9363784665579119
ROC AUC: 0.9600111136369377
__________________
Classifier used Random Forest Classifier
n-gram range is [2, 2]
test-set confusion matrix:
[[ 157   92]
 [   1 1589]]
recall score: 0.9993710691823899
precision score: 0.9452706722189174
f1 score: 0.9715683277285234
accuracy score: 0.9494290375203915
ROC AUC: 0.9276426460559218
__________________
Classifier used Random Forest Classifier crit=Entropy
n-gram range is [2, 2]
test-set confusion matrix:
[[ 157   92]
 [   0 1590]]
recall score: 1.0
precision score: 0.9453032104637337
f1 score: 0.9718826405867971
accuracy score: 0.9499728113104948
ROC AUC: 0.9222601096208733
__________________
Classifier used MultinomialNaiveBayes
n-gram range is [1, 2]
test-set confusion matrix:
[[ 227   22]
 [   6 1584]]
recall score: 0.9962264150943396
precision score: 0.9863013698630136
f1 score: 0.9912390488110138
accuracy score: 0.9847743338771071
ROC AUC: 0.9812129019221542
__________________
Classifier used Logistic Regression
n-gram range is [1, 2]
test-set confusion matrix:
[[ 209   40]
 [   2 1588]]
recall score: 0.9987421383647799
precision score: 0.9754299754299754
f1 score: 0.9869484151646987
accuracy score: 0.9771615008156607
ROC AUC: 0.9874542193932964
__________________
Classifier used Random Forest Classifier
n-gram range is [1, 2]
test-set confusion matrix:
[[ 195   54]
 [   2 1588]]
recall score: 0.9987421383647799
precision score: 0.9671132764920828
f1 score: 0.9826732673267327
accuracy score: 0.9695486677542142
ROC AUC: 0.9724179737819201
__________________
Classifier used Random Forest Classifier crit=Entropy
n-gram range is [1, 2]
test-set confusion matrix:
[[ 199   50]
 [   5 1585]]
recall score: 0.9968553459119497
precision score: 0.9694189602446484
f1 score: 0.9829457364341084
accuracy score: 0.9700924415443176
ROC AUC: 0.9787944229749186
__________________
#Now we have to tidy our output
print(performance.sort_values(['F1_Score'],ascending=False))
     Recall  Precision  F1_Score  Accuracy   ROC-AUC n_grams  Classifier
0  0.996226   0.986301  0.991239  0.984774  0.981213  [1, 2]  MultinomialNaiveBayes
0  0.993082   0.985028  0.989039  0.980968  0.984741  [1, 1]  MultinomialNaiveBayes
0  0.998742   0.976030  0.987255  0.977705  0.986729  [1, 1]  Logistic Regression
0  0.998742   0.975430  0.986948  0.977162  0.987454  [1, 2]  Logistic Regression
0  0.995597   0.977160  0.986293  0.976074  0.979786  [1, 1]  Random Forest Classifier
0  1.000000   0.970104  0.984825  0.973355  0.958266  [2, 2]  MultinomialNaiveBayes
0  0.996226   0.970588  0.983240  0.970636  0.973550  [1, 1]  Random Forest Classifier crit=Entropy
0  0.996855   0.969419  0.982946  0.970092  0.978794  [1, 2]  Random Forest Classifier crit=Entropy
0  0.998742   0.967113  0.982673  0.969549  0.972418  [1, 2]  Random Forest Classifier
0  1.000000   0.945303  0.971883  0.949973  0.922260  [2, 2]  Random Forest Classifier crit=Entropy
0  0.999371   0.945271  0.971568  0.949429  0.927643  [2, 2]  Random Forest Classifier
0  1.000000   0.931459  0.964513  0.936378  0.960011  [2, 2]  Logistic Regression
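One caveat when reading these numbers: because the notebook maps ham to True, scikit-learn treats the genuine SMS as the positive class, so the recall, precision and F1 printed above describe how well genuine messages are recognised. The sketch below recomputes the same metrics from the fraud side, using only the confusion matrix printed for the first run (MultinomialNaiveBayes, n-gram range [1, 1], count vectorizer). It assumes scikit-learn's usual layout, rows = true class and columns = predicted class, in sorted label order (False/spam first, True/ham second).

```python
# Confusion matrix of the first run, copied from the output above
conf_matrix = [[225, 24],
               [11, 1579]]

spam_as_spam, spam_as_ham = conf_matrix[0]   # true spam row
ham_as_spam, ham_as_ham = conf_matrix[1]     # true ham row

# Metrics as printed in the notebook (positive class = ham)
ham_recall = ham_as_ham / (ham_as_ham + ham_as_spam)
ham_precision = ham_as_ham / (ham_as_ham + spam_as_ham)

# The same metrics with spam as the positive class
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"ham  recall={ham_recall:.4f} precision={ham_precision:.4f}")
print(f"spam recall={spam_recall:.4f} precision={spam_precision:.4f} f1={spam_f1:.4f}")
```

Seen from the fraud side, this model misses 24 of the 249 fraudulent SMS in the test set, which the ham-side recall of 0.993 does not reveal. In scikit-learn the same switch can be made by passing the pos_label argument to recall_score, precision_score and f1_score.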
X_train, X_test, y_train, y_test = train_test_split(df['SMS'],y,test_size=0.33)
#In this section we will use the Tfidf vectorizer for the four classifiers, selecting keywords (1-grams) and bi-grams
performance_rows=[]
for i in [[1,1],[2,2],[1,2]]:
    #recent scikit-learn versions expect ngram_range to be a tuple
    tfidf_vectorizer=TfidfVectorizer(stop_words='english',ngram_range=tuple(i))
    #fit the vocabulary and IDF weights on the training set only, then reuse them on the test set
    tfidf_train=tfidf_vectorizer.fit_transform(X_train.values)
    tfidf_test=tfidf_vectorizer.transform(X_test.values)
    for name, clf in zip(names,list_classifier):
        print('Classifier used',name)
        print('n-gram range is',i)
        print()
        y_pred, y_pred_prob = get_predictions(clf, tfidf_train, y_train, tfidf_test)
        loop_performance=print_scores(y_test,y_pred,y_pred_prob)
        loop_performance['n_grams']=str(i)
        loop_performance['Classifier']=name
        performance_rows.append(loop_performance)
        print('__________________')
#DataFrame.append was removed in pandas 2.0, so we concatenate the rows at the end
performance=pd.concat(performance_rows)
Classifier used MultinomialNaiveBayes
n-gram range is [1, 1]
test-set confusion matrix:
[[ 184   60]
 [   0 1595]]
recall score: 1.0
precision score: 0.9637462235649547
f1 score: 0.9815384615384616
accuracy score: 0.967373572593801
ROC AUC: 0.9870574027442314
__________________
Classifier used Logistic Regression
n-gram range is [1, 1]
test-set confusion matrix:
[[ 173   71]
 [   2 1593]]
recall score: 0.9987460815047022
precision score: 0.9573317307692307
f1 score: 0.9776004909481435
accuracy score: 0.9603045133224578
ROC AUC: 0.9891926614933964
__________________
Classifier used Random Forest Classifier
n-gram range is [1, 1]
test-set confusion matrix:
[[ 206   38]
 [   3 1592]]
recall score: 0.9981191222570532
precision score: 0.9766871165644172
f1 score: 0.9872868217054264
accuracy score: 0.977705274605764
ROC AUC: 0.9783377871421963
__________________
Classifier used Random Forest Classifier crit=Entropy
n-gram range is [1, 1]
test-set confusion matrix:
[[ 194   50]
 [   7 1588]]
recall score: 0.9956112852664577
precision score: 0.9694749694749695
f1 score: 0.9823693164243736
accuracy score: 0.9690048939641109
ROC AUC: 0.9824310087877075
__________________
Classifier used MultinomialNaiveBayes
n-gram range is [2, 2]
test-set confusion matrix:
[[ 118  126]
 [   0 1595]]
recall score: 1.0
precision score: 0.926786751888437
f1 score: 0.9620024125452352
accuracy score: 0.9314845024469821
ROC AUC: 0.9611914795210442
__________________
Classifier used Logistic Regression
n-gram range is [2, 2]
test-set confusion matrix:
[[  21  223]
 [   0 1595]]
recall score: 1.0
precision score: 0.8773377337733773
f1 score: 0.9346615880457075
accuracy score: 0.8787384448069603
ROC AUC: 0.9613636363636363
__________________
Classifier used Random Forest Classifier
n-gram range is [2, 2]
test-set confusion matrix:
[[ 152   92]
 [   1 1594]]
recall score: 0.9993730407523511
precision score: 0.9454329774614472
f1 score: 0.971654983236818
accuracy score: 0.9494290375203915
ROC AUC: 0.9390320674238142
__________________
Classifier used Random Forest Classifier crit=Entropy
n-gram range is [2, 2]
test-set confusion matrix:
[[ 141  103]
 [   0 1595]]
recall score: 1.0
precision score: 0.9393404004711425
f1 score: 0.9687215305192833
accuracy score: 0.9439912996193583
ROC AUC: 0.9389369957346215
__________________
Classifier used MultinomialNaiveBayes
n-gram range is [1, 2]
test-set confusion matrix:
[[ 164   80]
 [   0 1595]]
recall score: 1.0
precision score: 0.9522388059701492
f1 score: 0.9755351681957186
accuracy score: 0.9564980967917346
ROC AUC: 0.9840664987923327
__________________
Classifier used Logistic Regression
n-gram range is [1, 2]
test-set confusion matrix:
[[ 150   94]
 [   2 1593]]
recall score: 0.9987460815047022
precision score: 0.944279786603438
f1 score: 0.9707495429616088
accuracy score: 0.9477977161500816
ROC AUC: 0.9891438408962436
__________________
Classifier used Random Forest Classifier
n-gram range is [1, 2]
test-set confusion matrix:
[[ 194   50]
 [   9 1586]]
recall score: 0.9943573667711598
precision score: 0.969437652811736
f1 score: 0.9817393995666975
accuracy score: 0.9679173463839043
ROC AUC: 0.978073128115525
__________________
Classifier used Random Forest Classifier crit=Entropy
n-gram range is [1, 2]
test-set confusion matrix:
[[ 187   57]
 [   2 1593]]
recall score: 0.9987460815047022
precision score: 0.9654545454545455
f1 score: 0.9818181818181819
accuracy score: 0.9679173463839043
ROC AUC: 0.9781964643609641
__________________
#Now we have to tidy our output
print(performance.sort_values(['F1_Score'],ascending=False))
     Recall  Precision  F1_Score  Accuracy   ROC-AUC n_grams  Classifier
0  0.998119   0.976687  0.987287  0.977705  0.978338  [1, 1]  Random Forest Classifier
0  0.995611   0.969475  0.982369  0.969005  0.982431  [1, 1]  Random Forest Classifier crit=Entropy
0  0.998746   0.965455  0.981818  0.967917  0.978196  [1, 2]  Random Forest Classifier crit=Entropy
0  0.994357   0.969438  0.981739  0.967917  0.978073  [1, 2]  Random Forest Classifier
0  1.000000   0.963746  0.981538  0.967374  0.987057  [1, 1]  MultinomialNaiveBayes
0  0.998746   0.957332  0.977600  0.960305  0.989193  [1, 1]  Logistic Regression
0  1.000000   0.952239  0.975535  0.956498  0.984066  [1, 2]  MultinomialNaiveBayes
0  0.999373   0.945433  0.971655  0.949429  0.939032  [2, 2]  Random Forest Classifier
0  0.998746   0.944280  0.970750  0.947798  0.989144  [1, 2]  Logistic Regression
0  1.000000   0.939340  0.968722  0.943991  0.938937  [2, 2]  Random Forest Classifier crit=Entropy
0  1.000000   0.926787  0.962002  0.931485  0.961191  [2, 2]  MultinomialNaiveBayes
0  1.000000   0.877338  0.934662  0.878738  0.961364  [2, 2]  Logistic Regression
# We didn't analyse overfitting/underfitting in depth; the next step will use k-fold cross-validation
#As we saw in the pie chart, the classes are imbalanced. One test to do is to undersample the dataset
#in order to get a less imbalanced dataset and analyze the results
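As a preview of those next steps, here is a minimal sketch of stratified k-fold cross-validation combined with class weights, scored with ROC AUC as the feedback suggested. It is not part of the work I delivered. The tiny corpus is invented purely so that the snippet runs on its own (1 = fraud, 0 = genuine), and the choice of logistic regression with class_weight='balanced' is just one of the options the feedback pointed at; undersampling would be an alternative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy corpus, NOT the Kaggle data: 9 genuine SMS and 3 fraudulent ones
texts = [
    "are we still meeting for lunch today",
    "ok see you at home later",
    "can you call me when you are free",
    "i will be late for dinner sorry",
    "happy birthday have a great day",
    "did you finish the homework",
    "see you tomorrow at the station",
    "thanks for the lift last night",
    "what time does the movie start",
    "win a free prize now call this number",
    "free entry claim your cash prize today",
    "urgent you won a free holiday call now",
]
labels = [0] * 9 + [1] * 3

# Keeping the vectorizer inside the pipeline means it is re-fitted on each
# training fold, so no vocabulary information leaks from the validation fold
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    # 'balanced' weights each class inversely to its frequency
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratification keeps the fraud/genuine ratio the same in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="roc_auc")
print("ROC AUC per fold:", scores)
print("mean ROC AUC:", scores.mean())
```

On the real dataset the only changes needed would be passing df['SMS'] and the labels instead of the toy lists, and probably using more folds.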