The best selling drugs in Italy are the ones that could be advertised
On the Ministry of Health Website, there is a open data section where you can find the information in a *.csv format on the top-50best selling drugs in Italy.
I decided to investigate this dataset, grouping some information (Python Code Attached Below, I will also upload on Git Hub) and visualize this data better.
Often plots are better than data in a tabular format.
Drugs are divided into two categories:
- SOP, “Senza Obbligo di Prescrizione” that means without a prescription
- Over The Counter or Self-Care/Medication drugs that could be advertised
If you clean a bit the Data from the Ministry you could plot an interesting graph.
I realized a scatter plot where SOPs are the blue dots and OTCs are the red dots.
Strictly speaking, this drugs are not the best selling, but the most “distributed” this is the definition on the dataset metadata, but it is a good proxy sales.
If we group the two categories (SOP and OTC) and we sum “Boxes quantity provided to pharmacy and drugstore” we obtain that in the first semester 2016*:
30.3 Millions of OCT boxes where distributed against 12.6 Millions of SOP boxes
I am sorry if I did not format in a readable way Y-axis, it is expressed ad 10^7 power ( 3*10^7= 30Milions)
Based on this results I can say:
- Equivalent drugs are little used, even though Italian pharmacists are obliged during the purchase to inform the customer about this option
- There might be an excessive drugs consumption with a negative impact on health, this could be related on advertising and not a use from the General Practitioner, but I am not a General Practitioner so this consideration is beyond my knowledge
- It’s fundamental for a rigorous analysis that Ministry of Helth publishes :
- 2017 and 2018 time series on 50 best selling drugs
- Data on General Drugs, in order to evaluate correlation and run a data comparison
For an intellectual integrity, I wrote to the Ministry Offices asking an updated dataset and to get data on equivalent drugs, if they will reply me I will update you
At the end of the article, you can find all the Python work for data selection, analysis, and visualization from the starting dataset.
After the aggregation through a groupby method, I added a column with a new column with the label drugs “Prescription Modality” This column was added through a union (Merge) with a DataFrame containing the drugs set and the “Prescription Modality” extracted from the original DataFrame.
Thanks for reading the article!
If you like it and you find it useful share it! If you think a fix is needed or any improvement, text me : )
Andrea
# Import pandas
import pandas as pd
import numpy as np
# Import plotting module
import matplotlib.pyplot as plt
#Import regex module
import re
file_name='C_17_dataset_15_download_itemDownload_0_upFile.csv'
csv=pd.read_csv(file_name,sep=';',encoding="ISO-8859-1",skiprows=1)
csv.info()
#We explore drugs name
#A necessary step for
#the next phase of cleaning and aggregation
csv_mod=csv.dropna().copy()
csv_mod['Farmaco']=None
#A drug "Rinazina" started with ** because the Ministry have to confirm the data
#for semplicity i rempoved **
csv_mod['Denominazione della confezione']=csv_mod['Denominazione della confezione'].str.replace('*','')
#We add a Column "Farmaco" to our Dataframe
#Why? This column help us for the following group operations
csv_mod['Farmaco']=None
def primaparola(colonna_di_testo):
pattern=r'\W*(\w[^,. !?"]*)'
return re.match(pattern,colonna_di_testo).group()
print(type(csv_mod['Denominazione della confezione']))
estratto=csv_mod['Denominazione della confezione'].apply(primaparola)
csv_mod['Farmaco']=estratto
csv_mod['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali']=pd.to_numeric(csv_mod['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'].str.replace('.',''))
#In this section we create a DataFrame
#Contaiining the unique "Farmaco" type
#Es. TACHIPIRINA label could be in multiple rows inside
#farmaco column because
#there are at least two version
#containg or 500mg or 1000 mg paracetamolo
#For doing so
#We drop duplicates
selezione= csv_mod.drop_duplicates(subset=['Farmaco']).copy()
#We select only te relevant columns
#for building our DataFrame
selezione=selezione[['Modalità\nPrescrizione','Farmaco']]
#Here we start our aggragation job
#here we also visualize in descending order
#the boxes distributed
grouped=csv_mod.groupby(by='Farmaco').sum()
grouped=grouped.sort_values(by='Quantità confezioni fornite alle farmacie ed agli esercizi commerciali',ascending=False)
print("Grouped Prima del reset dell'index")
print(grouped.head(2))
#we reset index in order to get
#drugs name as a column and not as an index
grouped=grouped.reset_index()
print("Grouped Dopo il reset dell' index")
print(grouped.head(2))
#We merge the two DataFrame
#on the "Farmaco" column
df_farma=grouped.merge(selezione,on='Farmaco')
print(df_farma.head(3))
print(grouped.info())
##### Now we added a column to know
#wich drugs are SOP and wich are OTC
#Now we can create our scatter plot with different colour
df_farma=df_farma.sort_values(by='Quantità confezioni fornite alle farmacie ed agli esercizi commerciali',ascending=False)
label=df_farma['Farmaco']
ln=np.arange(0,len(df_farma))
#We create two DataFrame one for the OTC and one for the SOP
df_otc=df_farma[df_farma['Modalità\nPrescrizione']=='OTC']
df_sop=df_farma[df_farma['Modalità\nPrescrizione']=='SOP']
#DataFrame index is necessary for
#displaying correctly on x-axis
#the x values and
#the corresponding y-values
index_otc=df_otc.index
index_sop=df_sop.index
otc_plot=plt.scatter(index_otc,df_otc['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'], color='red')
sop_plot=plt.scatter(index_sop,df_sop['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'], color='blue')
plt.ylabel('# of distributed boxes')
plt.xticks(ln,(label) ,rotation=90)
plt.title('Best selling Drugs without prescription in the first semester of 2016')
plt.text(10, 4, 'Data Elaboration Ing. Andrea Ciufo. Data Source: Ministry of Health',
fontsize=15, color='gray',
ha='center', va='top', alpha=0.5)
plt.legend((sop_plot,otc_plot),('SOP','Over The Counter that could be advertised'),loc='upper right')
plt.rcParams["figure.figsize"] = (15,4)
plt.show()
#In this sectio we aggregate
#All the SOP and OTC drugs
#In order to visualize
#through a Bar Plot
#For time reason I did not well formatted Y axes ticks
#The are expressed in power of 10
sum_otc=df_otc.groupby(by='Modalità\nPrescrizione').sum()
sum_sop=df_sop.groupby(by='Modalità\nPrescrizione').sum()
print(sum_otc['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'])
print(sum_sop['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'])
sum_df_y=[sum_otc['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'].values, sum_sop['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'].values]
sum_df_x=['OTC','SOP']
sum_df = pd.DataFrame({'Modalità Prescrizione':sum_df_x, 'Quantità confezioni fornite alle farmacie ed agli esercizi commerciali':sum_df_y})
ind=np.arange(len(sum_df_y))
plt.bar(ind,sum_df['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'], width=0.2)
plt.rcParams["figure.figsize"] = (7,3)
plt.title('Best selling Drugs without prescription in the first semester of 2016')
plt.ylabel('# of boxes distributed')
plt.xticks(ind,(sum_df_x) ,rotation=90)
plt.text(0.5, 0.5, 'Data Elaboration Ing. Andrea Ciufo. Data Source: Ministry of Health',
fontsize=15, color='gray',
ha='center', va='top', alpha=0.5)
plt.show()