Using Python, solve the questions below.
- Build a machine learning model to predict the response. The model can be any algorithm of your choice. This model should be the baseline model (without any feature improvement technique).
- What is the accuracy of the model? Do you find any insights on the output?
In [1]:
import pandas as pd
In [3]:
# Load the dataset
df = pd.read_excel('PresCorp-Insurance-1.xlsx')
In [5]:
df
Out[5]:
| | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Date of Lead | Response |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 166 | Male | 25 | 1 | 30 | 1 | < 1 Year | No | 29396 | 152 | 22 | 2017-01-01 | 0 |
1 | 442 | Female | 67 | 1 | 29 | 0 | 1-2 Year | Yes | 43753 | 124 | 30 | 2017-01-01 | 1 |
2 | 2702 | Male | 21 | 1 | 8 | 0 | < 1 Year | Yes | 32604 | 152 | 161 | 2017-01-01 | 0 |
3 | 5961 | Male | 21 | 1 | 50 | 1 | < 1 Year | No | 34607 | 160 | 176 | 2017-01-01 | 0 |
4 | 7184 | Male | 49 | 1 | 17 | 0 | 1-2 Year | Yes | 2630 | 156 | 76 | 2017-01-01 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
149995 | 143373 | Male | 71 | 1 | 33 | 1 | 1-2 Year | No | 34558 | 26 | 204 | 2020-04-14 | 0 |
149996 | 144405 | Male | 69 | 1 | 3 | 1 | 1-2 Year | No | 36979 | 26 | 252 | 2020-04-14 | 0 |
149997 | 146143 | Male | 24 | 1 | 35 | 1 | < 1 Year | No | 31775 | 152 | 118 | 2020-04-14 | 0 |
149998 | 148595 | Male | 63 | 1 | 41 | 0 | 1-2 Year | Yes | 43313 | 26 | 78 | 2020-04-14 | 0 |
149999 | 149122 | Male | 33 | 1 | 24 | 1 | 1-2 Year | No | 38221 | 157 | 293 | 2020-04-14 | 0 |
150000 rows × 13 columns
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   id                    150000 non-null  int64
 1   Gender                150000 non-null  object
 2   Age                   150000 non-null  int64
 3   Driving_License       150000 non-null  int64
 4   Region_Code           150000 non-null  int64
 5   Previously_Insured    150000 non-null  int64
 6   Vehicle_Age           150000 non-null  object
 7   Vehicle_Damage        150000 non-null  object
 8   Annual_Premium        150000 non-null  int64
 9   Policy_Sales_Channel  150000 non-null  int64
 10  Vintage               150000 non-null  int64
 11  Date of Lead          150000 non-null  datetime64[ns]
 12  Response              150000 non-null  int64
dtypes: datetime64[ns](1), int64(9), object(3)
memory usage: 14.9+ MB
In [9]:
# Identify datetime columns
datetime_cols = df.select_dtypes(include=['datetime64[ns]', 'datetime64']).columns
In [11]:
datetime_cols
Out[11]:
Index(['Date of Lead'], dtype='object')
In [13]:
# Convert datetime columns to numerical features
for col in datetime_cols:
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df = df.drop(col, axis=1)
In [15]:
df.head()
Out[15]:
| | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response | Date of Lead_year | Date of Lead_month | Date of Lead_day |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 166 | Male | 25 | 1 | 30 | 1 | < 1 Year | No | 29396 | 152 | 22 | 0 | 2017 | 1 | 1 |
1 | 442 | Female | 67 | 1 | 29 | 0 | 1-2 Year | Yes | 43753 | 124 | 30 | 1 | 2017 | 1 | 1 |
2 | 2702 | Male | 21 | 1 | 8 | 0 | < 1 Year | Yes | 32604 | 152 | 161 | 0 | 2017 | 1 | 1 |
3 | 5961 | Male | 21 | 1 | 50 | 1 | < 1 Year | No | 34607 | 160 | 176 | 0 | 2017 | 1 | 1 |
4 | 7184 | Male | 49 | 1 | 17 | 0 | 1-2 Year | Yes | 2630 | 156 | 76 | 0 | 2017 | 1 | 1 |
In [17]:
# Identify object type columns
object_cols = df.select_dtypes(include=['object']).columns
In [19]:
object_cols
Out[19]:
Index(['Gender', 'Vehicle_Age', 'Vehicle_Damage'], dtype='object')
In [21]:
# Convert categorical variables to dummy/indicator variables (one-hot encoding)
df = pd.get_dummies(df, columns=object_cols, drop_first=True)
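The effect of `drop_first=True` is easiest to see on a small frame. The sketch below uses a hypothetical three-row frame `toy` (not part of the dataset) holding the three `Vehicle_Age` levels that appear in this notebook. pandas sorts the levels and drops the first one, so `1-2 Year` becomes the implicit reference category.

```python
import pandas as pd

# Hypothetical toy frame; the values mirror the Vehicle_Age levels seen in this notebook
toy = pd.DataFrame({"Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"]})

# drop_first=True removes the first level in sorted order ("1-2 Year"),
# which then acts as the reference category
encoded = pd.get_dummies(toy, columns=["Vehicle_Age"], drop_first=True)
print(list(encoded.columns))
```

A row with `Vehicle_Age == "1-2 Year"` is therefore encoded as `False` in both remaining indicator columns.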
In [23]:
df.head()
Out[23]:
| | id | Age | Driving_License | Region_Code | Previously_Insured | Annual_Premium | Policy_Sales_Channel | Vintage | Response | Date of Lead_year | Date of Lead_month | Date of Lead_day | Gender_Male | Vehicle_Age_< 1 Year | Vehicle_Age_> 2 Years | Vehicle_Damage_Yes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 166 | 25 | 1 | 30 | 1 | 29396 | 152 | 22 | 0 | 2017 | 1 | 1 | True | True | False | False |
1 | 442 | 67 | 1 | 29 | 0 | 43753 | 124 | 30 | 1 | 2017 | 1 | 1 | False | False | False | True |
2 | 2702 | 21 | 1 | 8 | 0 | 32604 | 152 | 161 | 0 | 2017 | 1 | 1 | True | True | False | True |
3 | 5961 | 21 | 1 | 50 | 1 | 34607 | 160 | 176 | 0 | 2017 | 1 | 1 | True | True | False | False |
4 | 7184 | 49 | 1 | 17 | 0 | 2630 | 156 | 76 | 0 | 2017 | 1 | 1 | True | False | False | True |
In [25]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 16 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   id                     150000 non-null  int64
 1   Age                    150000 non-null  int64
 2   Driving_License        150000 non-null  int64
 3   Region_Code            150000 non-null  int64
 4   Previously_Insured     150000 non-null  int64
 5   Annual_Premium         150000 non-null  int64
 6   Policy_Sales_Channel   150000 non-null  int64
 7   Vintage                150000 non-null  int64
 8   Response               150000 non-null  int64
 9   Date of Lead_year      150000 non-null  int32
 10  Date of Lead_month     150000 non-null  int32
 11  Date of Lead_day       150000 non-null  int32
 12  Gender_Male            150000 non-null  bool
 13  Vehicle_Age_< 1 Year   150000 non-null  bool
 14  Vehicle_Age_> 2 Years  150000 non-null  bool
 15  Vehicle_Damage_Yes     150000 non-null  bool
dtypes: bool(4), int32(3), int64(9)
memory usage: 12.6 MB
In [33]:
# Count missing values per column
df.isnull().sum()
Out[33]:
id                       0
Age                      0
Driving_License          0
Region_Code              0
Previously_Insured       0
Annual_Premium           0
Policy_Sales_Channel     0
Vintage                  0
Response                 0
Date of Lead_year        0
Date of Lead_month       0
Date of Lead_day         0
Gender_Male              0
Vehicle_Age_< 1 Year     0
Vehicle_Age_> 2 Years    0
Vehicle_Damage_Yes       0
dtype: int64
In [146]:
# Identifying dependent and independent variables
X = df.drop('Response', axis=1)
Y = df['Response']
In [148]:
# Splitting data into train and test
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
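With an imbalanced target, a plain random split can leave the two sets with slightly different positive rates. A minimal sketch of a stratified split follows, using the `stratify` parameter of `train_test_split`. The data here is synthetic and the 12% positive rate is an assumption chosen to be roughly in line with this dataset; the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 120 positives out of 1000 (an assumed 12% rate)
y_demo = np.array([1] * 120 + [0] * 880)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the positive rate nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```

Applied to this notebook, the same idea would amount to passing `stratify=Y` to the existing call.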
In [150]:
len(X_train),len(X_test),len(Y_train),len(Y_test)
Out[150]:
(120000, 30000, 120000, 30000)
In [152]:
X_train.head()
Out[152]:
| | id | Age | Driving_License | Region_Code | Previously_Insured | Annual_Premium | Policy_Sales_Channel | Vintage | Date of Lead_year | Date of Lead_month | Date of Lead_day | Gender_Male | Vehicle_Age_< 1 Year | Vehicle_Age_> 2 Years | Vehicle_Damage_Yes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
104025 | 116886 | 44 | 1 | 32 | 0 | 37831 | 124 | 160 | 2019 | 4 | 11 | True | False | True | True |
5415 | 59954 | 53 | 1 | 28 | 0 | 24163 | 26 | 96 | 2017 | 2 | 14 | True | False | False | True |
75612 | 34646 | 53 | 1 | 36 | 0 | 2630 | 124 | 275 | 2018 | 8 | 28 | True | False | False | True |
138169 | 100915 | 26 | 1 | 8 | 1 | 33186 | 152 | 84 | 2020 | 1 | 10 | True | True | False | False |
87184 | 76283 | 45 | 1 | 32 | 0 | 27533 | 124 | 204 | 2018 | 11 | 28 | True | False | False | True |
In [156]:
# Building the model
from sklearn.linear_model import LogisticRegression
# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)
C:\Users\Admin\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[156]:
LogisticRegression(max_iter=1000)
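The ConvergenceWarning above suggests scaling the data. One common way to do that, sketched here on synthetic data rather than on the insurance dataset, is to put a `StandardScaler` in front of the model with `make_pipeline`. The arrays `X_demo` and `y_demo` and the inflated first column are assumptions made purely for illustration. Whether scaling counts as a "feature improvement technique" for the purposes of the baseline is a judgment call, so treat this as an optional follow-up.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic a large-valued feature
X_demo, y_demo = make_classification(n_samples=2000, n_features=5, random_state=0)
X_demo[:, 0] *= 10_000

# Standardize the features first so lbfgs works on comparable scales
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# Number of lbfgs iterations the fitted model actually used
print(pipe[-1].n_iter_)
```

Because the scaler lives inside the pipeline, it is fitted on the training data only and reapplied automatically when calling `pipe.predict`.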
In [157]:
# Predict on the test data
y_pred = model.predict(X_test)
In [166]:
# Evaluate the model with a confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix
In [176]:
confusion_matrix(Y_test,y_pred)
Out[176]:
array([[26345,     0],
       [ 3655,     0]], dtype=int64)
In [159]:
# Calculate accuracy
accuracy = accuracy_score(Y_test, y_pred)
print(f'Accuracy of the baseline model: {accuracy:.2f}')
Accuracy of the baseline model: 0.88
In [160]:
accuracy
Out[160]:
0.8781666666666667
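On the question of insights: the confusion matrix above contains no predicted positives, so the model labels every test row as class 0 and the 0.88 accuracy is simply the share of class 0 in the test set. The sketch below rebuilds label arrays that are consistent with the reported confusion matrix (the arrays `y_true` and `y_hat` are reconstructions for illustration, not the original `Y_test` and `y_pred`) and computes a few metrics that expose this.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Rebuild labels consistent with the reported confusion matrix:
# 26345 true negatives, 3655 false negatives, no predicted positives
y_true = np.array([0] * 26345 + [1] * 3655)
y_hat = np.zeros_like(y_true)  # class 0 predicted for every row

acc = accuracy_score(y_true, y_hat)
majority_share = 1 - y_true.mean()                 # share of class 0 in the test set
recall_pos = recall_score(y_true, y_hat)           # fraction of actual 1s that were found
bal_acc = balanced_accuracy_score(y_true, y_hat)   # mean of per-class recall

print(f"accuracy={acc:.4f}, majority share={majority_share:.4f}")
print(f"recall for class 1={recall_pos:.2f}, balanced accuracy={bal_acc:.2f}")
```

Accuracy equals the majority share, recall for class 1 is zero, and balanced accuracy is 0.5, so this baseline is no better than always predicting "no response". For an imbalanced target like this one, recall, precision or balanced accuracy are more informative than plain accuracy.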