This notebook was created as part of the Data Science Challenge 2020 of the German Actuarial Association.
Our goal is to investigate drivers of limitations that keep people from working. These results are not only interesting for individuals, who may get an incentive to adjust their own behavior in order to reduce the likelihood of not being able to work, but also from the perspective of an insurance company selling occupational disability (OD) insurance, for two reasons. First, our results affect the insurance company's risk assessment before contract conclusion. Typically, insurance companies request a lot of information from the potential policyholder before selling products like OD insurance. Our results help insurance companies focus on the most relevant drivers in this regard. This enables them to shorten the risk assessment during the application process, improving the (potential) customer experience. Second, our results provide insights into promising preventive measures that are worth supporting by an insurance company.
In recent years, mental health problems have become an increasingly important cause of people being unable to work. We therefore pay particular attention to mental health problems and take into account various potential drivers such as educational level, BMI, smoking behavior or sleeping behavior, which are typically used in insurance companies' application processes or might be added to them.
Our analysis is based on publicly available data from the National Center for Health Statistics. In one of their programs they assess the health and nutritional status of adults and children in the United States. The survey examines a nationally representative sample of about 5,000 persons every second year; we use the results from 2005 to 2015.
In this notebook we show how to prepare such data and how to use it to make predictions. In our case we aim to predict the dependent variable "occupational disability" from independent variables such as BMI, smoking behavior or sleeping behavior. We train and evaluate different machine learning models (e.g., random forests or gradient tree boosting) and illustrate resampling techniques to improve the results of the models. Moreover, in this context, trustworthy artificial intelligence (AI) is particularly important for insurance companies. One main driver of trustworthy AI is transparency, i.e., the explainability of AI models. Hence, we also pay special attention to the explainability of our models.
Our main results show that in particular trouble sleeping, age and smoking are related to occupational disability. As "occupational disability" is an underrepresented event in the data set, special care in the data processing and modelling steps is necessary. Our results show that resampling techniques (over- or downsampling) are important in this case. Furthermore, we find that predictions by a logistic regression are not inferior to predictions by more sophisticated machine learning models such as random forests or gradient tree boosting.
The remainder of our notebook is structured as follows. Section 1 describes our data collection process, while Section 2 focuses on processing and visualizing the data, including data cleaning and one-hot encoding techniques. In Section 3 we normalize and split our data into training and test samples before we provide theoretical background on ratios to evaluate models in Section 4. Section 5 covers technical aspects and defines the functions used for predicting, evaluating and visualizing the model data in the following sections. In Section 6 we apply various AI models to our data and evaluate them. Using resampling techniques we aim to improve our results in Section 7, while Section 8 focuses on the explainability of the AI models. We give a final overview of the achieved results in Section 9.
import os
from time import time
import warnings
# to make the notebook more readable we will take Future Warnings out
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
# visuals
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score
from sklearn.metrics import mean_squared_error
from sklearn.utils import resample
# model explanation
import shap
import eli5
from eli5.sklearn import PermutationImportance
import lime
# python version
import platform
# this notebook has been created with the following Version of Python
print('This notebook has been created with Python Version: ' + platform.python_version())
Data source: National Center for Health Statistics (NHANES)
NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the United States.
The survey examines a nationally representative sample of about 5,000 persons every second year.
The data can be downloaded from https://wwwn.cdc.gov/nchs/nhanes/Default.aspx
We use the data from 2005 to 2015 with consistent questions throughout the years. In earlier years some data is missing.
We select the following chapters:
-BPQ: Blood Pressure & Cholesterol
-DEMO: Demographic Variables
-PFQ: Physical Functioning (containing the target variable: occupational disability)
-SLQ: Sleep Disorders
-SMQ: Smoking - Cigarette Use
-WHQ: Weight History
If you have downloaded the data from 2005 to 2015, the following code can be used to import all .XPT files and to merge and filter the relevant data fields. Make sure to provide the correct data path. To include the code, uncomment it and turn the Markdown cell into a code cell.
my_path = os.getcwd()
path = os.path.join(my_path, 'NHANES\\Daten\\')
year = ('2005','2007','2009','2011','2013','2015')
file_extension = ('D','E','F','G','H','I')
bpq = pd.DataFrame()
demo = pd.DataFrame()
pfq = pd.DataFrame()
slq = pd.DataFrame()
smq = pd.DataFrame()
whq = pd.DataFrame()
for i in range(0,len(year)):
bpq=bpq.append(pd.read_sas(path+year[i]+'\\BPQ_'+file_extension[i]+'.XPT'))
demo=demo.append(pd.read_sas(path+year[i]+'\\DEMO_'+file_extension[i]+'.XPT'))
pfq=pfq.append(pd.read_sas(path+year[i]+'\\PFQ_'+file_extension[i]+'.XPT'))
slq=slq.append(pd.read_sas(path+year[i]+'\\SLQ_'+file_extension[i]+'.XPT'))
smq=smq.append(pd.read_sas(path+year[i]+'\\SMQ_'+file_extension[i]+'.XPT'))
whq=whq.append(pd.read_sas(path+year[i]+'\\WHQ_'+file_extension[i]+'.XPT'))
Merge the tables on SEQN (individual number per person).
result = pfq.merge(bpq,how="left",on='SEQN').merge(demo,how="left",on='SEQN').merge(slq,how="left",on='SEQN').merge(smq,how="left",on='SEQN').merge(whq,how="left",on='SEQN')
In the next step we filter for relevant questions in each dataset.
We focus on questions that plausibly influence the probability of occupational disability. Another criterion is that a question relates to the past or is not directly influenced by whether the person is able to work right now.
For example, a question we decided to drop is "INDFMIN2" about the family income. From our perspective, a low family income in combination with occupational disability is rather a result than an indicator. In all such cases we decided not to use the questions.
data=result[['SEQN','PFQ049','RIDAGEYR','DMDEDUC2','DMDMARTL','DMDHHSIZ','WHD010','WHD020','WHD050','WHD110','WHD130','WHD120','WHD140','SMQ020','SMD030','SMQ040','SMQ050Q','SMQ050U','SMQ621','SMQ900','SLD012','SLD010H','SLQ050','SLQ120','BPQ020','BPQ030','BPD035','BPQ040A','BPQ050A','BPQ080','BPQ060','BPQ070','BPQ090D','BPQ100D']]
For convenience we provide the final result of the data collection steps in a .csv file that can simply be read in with pandas read_csv.
# Import data
data = pd.read_csv ('Daten.csv')
print('Number and features of imported values:', data.shape)
print('Number and features of imported values when question PFQ049 has the answer yes = 1 or no = 2:', data[data.PFQ049.isin((1,2))].shape)
Before we can do anything with the data we need to clean it up. To get an idea we first take a look at some sample data.
# Ten random data sets
data[data.PFQ049.isin((1,2))].sample (10)
# Missing values when question PFQ049 (OD) has the answer 1=yes or 2=no
data[data.PFQ049.isin((1,2))].apply(lambda x: sum(x.isnull().values), axis = 0)
Some questions with a lot of missing values are logically linked to each other. This will be discussed later.
# drop scarcely answered questions
data = data.drop('SMQ621', axis=1)
data = data.drop('WHD110', axis=1)
data = data.drop('WHD130', axis=1)
data = data.drop('BPD035', axis=1)
Some data is coded in 1=yes, 2=no. We replace 2 with 0 and delete the missing values.
# OD = 1 and not OD = 0
data = data[data.PFQ049.isin((1,2))]
data['PFQ049'] = data['PFQ049'].replace(2,0)
# smoked 100 cigarettes = 1, no = 0
data = data[data.SMQ020.isin((1,2))]
data['SMQ020'] = data['SMQ020'].replace(2,0)
# Ever told doctor had trouble sleeping?
data = data[data.SLQ050.isin((1,2))]
data['SLQ050'] = data['SLQ050'].replace(2,0)
# high blood pressure / cholesterol
data['BPQ020'] = data['BPQ020'].replace(2,0)
data['BPQ030'] = data['BPQ030'].replace(2,0)
data['BPQ040A'] = data['BPQ040A'].replace(2,0)
data['BPQ050A'] = data['BPQ050A'].replace(2,0)
data['BPQ060'] = data['BPQ060'].replace(2,0)
data['BPQ080'] = data['BPQ080'].replace(2,0)
data['BPQ090D'] = data['BPQ090D'].replace(2,0)
data['BPQ100D'] = data['BPQ100D'].replace(2,0)
# plot: people answering the question about occupational disability grouped by age
f, ax = plt.subplots(figsize=(30, 5))
sns.countplot(x='RIDAGEYR',hue='PFQ049',data=data)
# reducing the dataset to the target group of OD insurance: ages >= 25 and < 67
data = data[(data.RIDAGEYR < 67) & (data.RIDAGEYR >= 25)]
# drop sequence number
data = data.drop('SEQN', axis=1)
In the next step we calculate the BMI for each weight indication.
For the question WHD120 (weight at the age of 25), which has about 15% missing values, we decided to use the ratio of the weight at age 25 to the current weight (WHD020) to estimate the missing values.
#calculate weight coefficient
data = data[(data.WHD020 != 7777) & (data.WHD020 != 9999) & (data['WHD020'].notna())]# data without answers "Refused" 7777 or "Don't know" 9999 and NaN
data_WHD = data[(data.WHD120 != 7777) & (data.WHD120 != 9999) & (data['WHD120'].notna())]
weight_coeff = data_WHD["WHD120"]/data_WHD["WHD020"] # ratio: weight at age 25 / current weight
weight_coeff.describe()
Since the mean and the median are about the same, we use the mean to estimate the weight at the age of 25.
data.loc[:,("WHD120")] = data["WHD120"].fillna(weight_coeff.mean()*data["WHD020"])
# drop NAs
data = data.dropna(subset=['WHD010', 'WHD050', 'WHD120', 'WHD140'])
# drop "Refused" 7777 and "Don't know" 9999
data = data[(data.WHD010 != 7777) & (data.WHD010 != 9999)]
data = data[(data.WHD050 != 7777) & (data.WHD050 != 9999)]
data = data[(data.WHD120 != 7777) & (data.WHD120 != 9999)]
data = data[(data.WHD140 != 7777) & (data.WHD140 != 9999)]
# calculate BMI (weight in pounds / 2.205 -> kg, height in inches * 0.0254 -> m):
data['BMI_0']=(data['WHD020']/2.205)/((data['WHD010']*0.0254)**2) # current BMI
data['BMI_1']=(data['WHD050']/2.205)/((data['WHD010']*0.0254)**2) # BMI one year ago
data['BMI_25']=(data['WHD120']/2.205)/((data['WHD010']*0.0254)**2) # BMI by the age of 25
data['BMI_max']=(data['WHD140']/2.205)/((data['WHD010']*0.0254)**2) # maximum BMI
# drop weight and height columns
data = data.drop('WHD010', axis=1)
data = data.drop('WHD020', axis=1)
data = data.drop('WHD050', axis=1)
data = data.drop('WHD120', axis=1)
data = data.drop('WHD140', axis=1)
The next set of questions is about the smoking behavior.
Question SMQ020: Smoked at least 100 cigarettes in life?
# drop "Refused" 7 and "Don't know" 9
data = data[(data.SMQ020 != 7) & (data.SMQ020 != 9)]
Question SMD030: Age started smoking?
# plot: people answering the question about "starting age smoking"
f, ax = plt.subplots(figsize=(30, 5))
sns.countplot(x='SMD030',hue='PFQ049',data=data)
We face the problem that this question logically excludes non-smokers.
Our solution is to categorize the starting age by quartiles and create four categories.
A non-smoker has a "0" in every category and a smoker falls exactly in one.
# drop "Refused" 777 and "Don't know" 999
data = data[(data.SMD030 != 777) & (data.SMD030 != 999)]
# NaN to 0 (non-smokers)
data['SMD030'][data['SMQ020'] == 0].describe()
data.loc[data['SMQ020']==0,'SMD030'] = 0
# Quartiles
data['SMD030'][(data['SMD030']>0)].describe()
Quantil25 = data['SMD030'][data['SMD030']>0].quantile(.25)
Quantil50 = data['SMD030'][data['SMD030']>0].quantile(.5)
Quantil75 = data['SMD030'][data['SMD030']>0].quantile(.75)
data['Smoking_Age_25Q'] = data['SMQ020']
data.loc[data['SMD030']>Quantil25, 'Smoking_Age_25Q'] = 0 # [0,25]
data['Smoking_Age_50Q'] = data['SMQ020']
data.loc[data['SMD030']>Quantil50, 'Smoking_Age_50Q'] = 0 # (25,50]
data.loc[data['SMD030']<=Quantil25, 'Smoking_Age_50Q'] = 0
data['Smoking_Age_75Q'] = data['SMQ020']
data.loc[data['SMD030']>Quantil75, 'Smoking_Age_75Q'] = 0 # (50,75]
data.loc[data['SMD030']<=Quantil50, 'Smoking_Age_75Q'] = 0
data['Smoking_Age_100Q'] = data['SMQ020']
data.loc[data['SMD030']<=Quantil75, 'Smoking_Age_100Q'] = 0 # (75,100]
Question SMQ040: Are you currently smoking?
The goal is to recode the answers to yes = 1 and no = 0.
# non-smokers have missing values -> NaNs to 0
data.loc[data['SMQ020']==0,'SMQ040'] = 0
# drop 7 (Refused) and 9 (Don't know)
data = data[(data.SMQ040 != 7) & (data.SMQ040 != 9)]
# answer 3 ("Not at all") to 0
data['SMQ040'] = data['SMQ040'].replace(3,0)
# answer 2 ("Some days") to 1
data['SMQ040'] = data['SMQ040'].replace(2,1)
Question SMQ050Q: How long ago did you stop smoking?
Current smokers and non-smokers have missing values.
We set the upper limit to 25 years, because the youngest person in our sample is 25. The value for non-smokers is then set to 25 and for current smokers to 0.
# non-smokers --> 25
data.loc[data['SMQ020'] == 0, 'SMQ050Q'] = 25
# current smokers --> 0
data.loc[data['SMQ040'] == 1, 'SMQ050Q'] = 0
# drop 77777 (Refused) and 99999 (Don't know)
data = data[(data.SMQ050Q != 77777) & (data.SMQ050Q != 99999)]
# recode the values to years
data.loc[data['SMQ050U']== 1, 'SMQ050Q'] = round(data['SMQ050Q']/365) # SMQ050U = 1 --> days
data.loc[data['SMQ050U']== 2, 'SMQ050Q'] = round(data['SMQ050Q']/52) # SMQ050U = 2 --> weeks
data.loc[data['SMQ050U']== 3, 'SMQ050Q'] = round(data['SMQ050Q']/12) # SMQ050U = 3 --> months
# maximum 25 years
data.loc[data['SMQ050Q']> 25, 'SMQ050Q'] = 25
# plot SMQ050Q
f, ax = plt.subplots(figsize=(30, 5))
sns.countplot(x='SMQ050Q',data=data[(data['SMQ050Q']<25) & (data['SMQ050Q']>0)])
# drop SMQ050U
data = data.drop('SMQ050U', axis=1)
# drop SMD030
data = data.drop('SMD030', axis=1)
# drop SMQ900 (scarcely answered)
data = data.drop('SMQ900', axis=1)
The questions about blood pressure are all linked to each other.
Question BPQ020 (Has a doctor ever told you that you had high blood pressure?) is answered by everyone.
Question BPQ030: Told had high blood pressure 2+ times? When BPQ020 = 0 then replace NaN's for this question with 0.
Question BPQ040A: Taking prescription for hypertension? When BPQ020 = 0 then replace NaN's for this question with 0.
Question BPQ050A: Now taking prescribed medicine? When BPQ040A = 0 then replace NaN's for this question with 0.
data.loc[data['BPQ020']==0,'BPQ030'] = 0
data.loc[data['BPQ020']==0,'BPQ040A'] = 0
data.loc[data['BPQ040A']==0,'BPQ050A'] = 0
The questions about the cholesterol level are all linked to each other.
Question BPQ080: Has a doctor ever told you that you had a high cholesterol level? When BPQ060 = 0, replace NaN's for this question with 0.
Question BPQ060: Ever had blood cholesterol checked? When BPQ080 = 1, the cholesterol level must have been checked, so we set BPQ060 to 1.
Question BPQ070: When was blood cholesterol last checked? When BPQ060 = 0, replace NaN's for this question with 4 (5 years or more).
Question BPQ090D: Told to take prescription for cholesterol? When BPQ080 = 0 then replace NaN's for this question with 0.
Question BPQ100D: Now taking prescribed medicine? When BPQ090D = 0 then replace NaN's for this question with 0.
data.loc[data['BPQ080']==1,'BPQ060'] = 1
data.loc[data['BPQ060']==0,'BPQ080'] = 0
data.loc[data['BPQ060']==0,'BPQ070'] = 4
data.loc[data['BPQ080']==0,'BPQ090D'] = 0
data.loc[data['BPQ090D']==0,'BPQ100D'] = 0
# drop "Refused" 7 and "Don't know" 9 and NaN's
data = data[(data.BPQ020 != 7) & (data.BPQ020 != 9) & (data.BPQ020.isnull() == False)]
data = data[(data.BPQ030 != 7) & (data.BPQ030 != 9) & (data.BPQ030.isnull() == False)]
data = data[(data.BPQ040A != 7) & (data.BPQ040A != 9) & (data.BPQ040A.isnull() == False)]
data = data[(data.BPQ050A != 7) & (data.BPQ050A != 9) & (data.BPQ050A.isnull() == False)]
data = data[(data.BPQ080 != 7) & (data.BPQ080 != 9) & (data.BPQ080.isnull() == False)]
data = data[(data.BPQ060 != 7) & (data.BPQ060 != 9) & (data.BPQ060.isnull() == False)]
data = data[(data.BPQ070 != 7) & (data.BPQ070 != 9) & (data.BPQ070.isnull() == False)]
data = data[(data.BPQ090D != 7) & (data.BPQ090D != 9) & (data.BPQ090D.isnull() == False)]
data = data[(data.BPQ100D != 7) & (data.BPQ100D != 9) & (data.BPQ100D.isnull() == False)]
The next set of questions is about sleep disorders.
Question SLQ050: Ever told doctor you had trouble sleeping?
# SLQ050: drop "Refused" 7 and "Don't know" 9
data = data[(data.SLQ050 != 7) & (data.SLQ050 != 9)]
# drop SLQ120: scarcely answered
data = data.drop('SLQ120', axis=1)
Questions SLD012 and SLD010H: How many hours of sleep? The same question has different variable names over the years, so we combine the two variables and replace missing values with the median.
# combine SLD012 and SLD010H in the new variable SLD
data.loc[data['SLD012'].isnull() == True, 'SLD'] = data['SLD010H']
data.loc[data['SLD012'].isnull() == False, 'SLD'] = data['SLD012']
# 99 (Refused) to NaN
data['SLD'] = data['SLD'].replace(99, np.nan)
# replace NaN's with median
print('Median:',data['SLD'].median())
data['SLD']=data['SLD'].fillna(data['SLD'].median())
# drop SLD012 and SLD010H
data = data.drop('SLD012', axis=1)
data = data.drop('SLD010H', axis=1)
Finally, we drop observations with missing answers to the questions about Education Level and Marital Status.
# drop "Refused" 7 and "Don't know" 9
data = data[(data.DMDEDUC2 != 7) & (data.DMDEDUC2 != 9)]
# drop "Refused" 77 and "Don't know" 99
data = data[(data.DMDMARTL != 77) & (data.DMDMARTL != 99)]
To increase the readability of our data we rename the variables. You can find the description of the data in the NHANES Variable Search. For easy processing we put the variable code, a short explanation and a long explanation in a .csv file.
DataDescription = pd.read_csv('Description.csv')
DataDescription.head(20)
data = data.rename(columns=dict(zip(DataDescription["code"], DataDescription["ShortExplanation"])))
Education Level and Marital Status have multiple categories represented by integer values. Without adjustment, machine learning algorithms might interpret a natural ordering between these integers, which makes no sense here. Hence, we convert this information into binary variables using one-hot encoding to make sure we do not imply a relationship in the data that is not there.
Note that with respect to Education Level the value of the integer is related to the education level, i.e., a higher value represents a higher education level. However, the mean of Education Level 1 (Less than 9th grade) and Education Level 5 (College graduate or above) does not make much sense. For Marital Status it is even more obvious: for instance, 'Never married' does not equal five times 'Married'.
# One-hot encoding Education Level
Education = pd.get_dummies(data['EducationLevel'], prefix='EduLevel')
data = data.drop('EducationLevel', axis=1)
data = pd.concat([data, Education], axis=1)
EduLevel_1.0: Less than 9th grade
EduLevel_2.0: 9-11th grade (Includes 12th grade with no diploma)
EduLevel_3.0: High school graduate/GED or equivalent
EduLevel_4.0: Some college or AA degree
EduLevel_5.0: College graduate or above
# One-hot encoding Marital Status
MaritalStatus = pd.get_dummies(data['MaritalStatus'], prefix='MaritalStatus')
data = data.drop('MaritalStatus', axis = 1)
data = pd.concat([data, MaritalStatus], axis=1)
MaritalStatus_1.0: Married
MaritalStatus_2.0: Widowed
MaritalStatus_3.0: Divorced
MaritalStatus_4.0: Separated
MaritalStatus_5.0: Never married
MaritalStatus_6.0: Living with partner
# drop NaN's and infinite values
data = data.dropna()
data = data[np.isfinite(data).all(1)]
If several variables in the data set are correlated, they usually do not add additional information to the model. There might even be a cause-effect relationship that does not improve model quality or even disturbs the learning process. Therefore we analyse correlations in the data and drop certain variables.
# Pairplot: none binary variables
sns.set(style="ticks")
pairplot=['Age', 'BMI_max', 'PeopleInHousehold', 'YearsQuitSmoking']
pairplot.append('OD')
sns.pairplot(data[pairplot], hue="OD")
We notice some correlations. A higher age correlates with occupational disability (OD). You can also see that people who do not smoke, or who quit smoking a long time ago, are less likely to suffer from OD.
DataHeatmap=['OD','Age', 'Smoked100', 'DoctorSleep', 'DoctorHighBloodPressure', 'DoctorHighBloodPressure2', 'YearsQuitSmoking', 'MedicineCholesterol']
sns.heatmap(data[DataHeatmap].corr(), annot=True, cmap="YlGnBu");
The heatmap shows the correlation between different variables. We drop "DoctorHighBloodPressure2" (a doctor has told you twice that you have high blood pressure) because it correlates strongly with the similar question "DoctorHighBloodPressure" (a doctor has told you that you have high blood pressure).
data = data.drop('DoctorHighBloodPressure2', axis=1)
DataHeatmapOD=['OD','BMI_0', 'BMI_1', 'BMI_25', 'BMI_max']
sns.heatmap(data[DataHeatmapOD].corr(), annot=True, cmap="YlGnBu");
The different Body Mass Index variables are highly correlated, so we keep only the maximum (BMI_max) and drop the others.
data = data.drop('BMI_0', axis=1)
data = data.drop('BMI_1', axis=1)
data = data.drop('BMI_25',axis=1)
DataHeatmapSmoke = ['OD', 'Smoked100', 'SmokingNow', 'YearsQuitSmoking', 'Smoking_Age_25Q', 'Smoking_Age_50Q', 'Smoking_Age_75Q', 'Smoking_Age_100Q']
sns.heatmap(data[DataHeatmapSmoke].corr(), annot=True, cmap="YlGnBu");
Whether a person currently smokes is correlated with their smoking behaviour in earlier years. Moreover, whether a person smokes while being occupationally disabled is not information an insurance company can use in the application process. Therefore, we drop the "SmokingNow" variable.
data = data.drop('SmokingNow', axis=1)
It is often good practice to perform some type of scaling on numerical features. Applying a scaling to the data does not change the shape of each feature's distribution; however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, observing the data in its raw form no longer has the same original meaning, as shown below.
# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
# we single out the data which isn't binary
notbinary = ['Age', 'BMI_max', 'PeopleInHousehold', 'YearsQuitSmoking', ]
data_norm = pd.DataFrame(data = data.copy(deep=True))
data_norm[notbinary] = scaler.fit_transform(data[notbinary])
# Show an example of a record with scaling applied
display(data_norm.head(n = 5))
The prediction variable is separated from the actual input data. We use train_test_split from sklearn to split the data into 70% training data and 30% test data.
# Splitting the data
data_test = data_norm.copy(deep=True)
y = data_test['OD']
X = data_test.drop('OD', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
To evaluate the models described in the upcoming Section 6, we consider certain ratios that estimate the probabilities of certain events. The considered ratios are standard for the evaluation of classification tasks. We will see with a naive predictor that our data set is unbalanced, i.e., being OD is a rare event that has to be predicted. We consider the so-called F-beta score as an additional metric to handle the specifics of an unbalanced data set.
There are different ratios to evaluate the predictive power of the model.
In our case of a binary problem - you either suffer from OD or you do not - we have four possible outcomes for the classified data in our model:
True Positive (TP): You are suffering from Occupational Disability and the model classifies you as 'OD'.
False Positive (FP): You are not suffering from Occupational Disability but the model predicts you do.
True Negative (TN): You are not suffering from Occupational Disability and the model predicts you do not.
False Negative (FN): You are suffering from Occupational Disability but the model classifies you as 'not OD'.
Based on these outcomes we take into account four different ratios, defined below.
Accuracy:
Accuracy describes how much of the data is classified correctly by the model. Especially with imbalanced data like ours it is very easy to get a high accuracy, as we will see with the naive predictor below.
$$ Accuracy = \frac{\sum{TP} + \sum{TN}}{\sum{TP} + \sum{TN} + \sum{FP} + \sum{FN}}$$
Precision:
Precision describes how many of the predicted positives (predicted OD) are actual positives (OD). In our case we aim for a high precision. If the precision is low, we will reject too many customers in the application process or invest in preventive measures for customers who don't necessarily need them.
$$ Precision = \frac{\sum{TP}}{\sum{TP} + \sum{FP}}$$
Recall (Sensitivity):
Recall is the share of those actually being OD that we correctly classify as OD. In our case a high recall is very important, since customers being OD are very expensive for insurance companies. A high recall allows insurance companies to reject applicants with a high risk of OD or to invest in preventive measures for them in case they are already customers.
$$ Recall = \frac{\sum{TP}}{\sum{TP} + \sum{FN}}$$
Fall-Out (False Positive Rate):
Fall-out is the share of those actually not being OD that are mistakenly classified as OD.
$$ \text{Fall-Out} = \frac{\sum{FP}}{\sum{TN} + \sum{FP}}$$
As a naive predictor we use a very simple model: we predict 'not OD' for all our data points. This gives us a baseline to rate our predictions against.
TP = 0 # There are zero true positives predicted as all predictions are "not OD".
FP = 0 # There are zero false positives predicted as all predictions are "not OD".
TN = data['OD'].count() - np.sum(data['OD']) # The number of true negative predictions is the number of zeros of the prediction variable.
FN = np.sum(data['OD']) # The number of false negative predictions is the number of ones of the prediction variable.
# Calculate accuracy (precision is undefined and recall is zero for this naive predictor, as no positives are predicted)
accuracy = (TP + TN) / data['OD'].count()
# Print the results
print("Accuracy : {:.4f}".format(accuracy))
We get a very high accuracy of over 86.7% with this very naive model. Obviously, the accuracy of the naive predictor simply reflects that about 86.7% of the samples are not occupationally disabled. The occurrence of OD is a rare event of only 13.3% in the data set that we need to predict; the data is unbalanced in this sense.
So it is important to consider other ratios in order to evaluate the benefit of the models.
During the application process we aim to predict the positive cases (OD) in order to reduce the loss ratio. Hence, the recall of the model is particularly important, because it tells us which percentage of the OD cases we got right.
However, the insurance company rejects applicants if our model states that they have a high risk of OD. Rejecting applicants means less new business and a smaller customer base. Hence, insurance companies do not want to reject applicants who actually do not have a high risk of OD, which emphasizes the importance of precision: how many of the predicted positive cases are really positive?
There is, however, a ratio that combines the two measures we want to look at: the F-beta score.
F-beta score:
F-beta combines precision and recall:
$$ F_{\beta} = (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\left( \beta^2 \cdot Precision \right) + Recall} $$
β is chosen such that recall is considered β times as important as precision.
For our data we have to choose a fairly high β, because recall has a much higher impact on costs. In the following modelling process we choose β = 10.
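To illustrate the effect of β, here is a small toy example (invented labels, not from our data set) using the fbeta_score function imported above: with perfect precision but a recall of only 0.25, the score with β = 10 is driven almost entirely by the recall.
# toy labels: one of four actual positives is found, so precision = 1.0 and recall = 0.25
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
for beta in [1, 2, 10]:
    print("beta = {:2d}: F-beta = {:.3f}".format(beta, fbeta_score(y_true_toy, y_pred_toy, beta=beta)))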
This section contains the necessary functions for the prediction and evaluation process in the upcoming sections. The reader may skip this section and directly continue with the model training in Section 6.
Note that it would be convenient to outsource the full section to a .py-file that could be imported as a module.
def train_predict(model, X_train, y_train, X_test, y_test):
'''
inputs:
- model: the learning algorithm to be trained and predicted on
- X_train: features training set
- y_train: target labels (OD), training set
- X_test: features testing set
- y_test: target labels (OD), testing set
output:
- results: dictionary, containing all introduced model ratios for training and test set
'''
results={}
# Fit the learner to the training data
start = time() # Get start time
model = model.fit(X_train, y_train)
end = time() # Get end time
# Calculate the training time
results['time_train'] = end-start
# Get the predictions on the test set(X_test)
start = time() # Get start time
predictions_test = model.predict(X_test)
predictions_train = model.predict(X_train)
end = time() # Get end time
# Calculate the total prediction time
results['time_test'] = end-start
# Compute accuracy on the training set
results['acc_train'] = accuracy_score(y_train, predictions_train)
# Compute precision on the training set
results['pre_train'] = precision_score(y_train, predictions_train)
# Compute recall on the training set
results['rec_train'] = recall_score(y_train, predictions_train)
# Compute F-score on the training set
results['f_train'] = fbeta_score(y_train, predictions_train, beta=10)
# Compute accuracy on the test set
results['acc_test'] = accuracy_score(y_test, predictions_test)
# Compute precision on the test set
results['pre_test'] = precision_score(y_test, predictions_test)
# Compute recall on the test set
results['rec_test'] = recall_score(y_test, predictions_test)
# Compute F-score on the test set
results['f_test'] = fbeta_score(y_test, predictions_test, beta=10)
# Return the results
return results
def show_results(model, results):
'''
inputs:
- model: the learning algorithm to be trained and predicted on
- results: dictionary output of train_predict()
output:
Prints all metrics contained in results.
'''
print(" {}".format(model.__class__.__name__))
print(" Training Testing")
print("Prediction time : {:.4f} {:.4f}".format(results['time_train'],results['time_test']))
print("Accuracy : {:.4f} {:.4f}".format(results['acc_train'],results['acc_test']))
print("Precision : {:.4f} {:.4f}".format(results['pre_train'],results['pre_test']))
print("Recall : {:.4f} {:.4f}".format(results['rec_train'],results['rec_test']))
print("F-score : {:.4f} {:.4f}".format(results['f_train'],results['f_test']))
print("\n")
# plotting function for the confusion matrix
def plot_confusion_matrix(cmatrix):
'''
input:
- cmatrix: confusion matrix
output:
Visualisation of confusion matrix as heatmap.
'''
plt.figure(figsize=(6,6))
sns.set(font_scale=1.6)
ax = sns.heatmap(cmatrix, annot=True, fmt="0", linewidths=.5, square=True, cmap='Blues_r')
cmlabels = ['TN', 'FP',
'FN', 'TP']
for i,t in enumerate(ax.texts):
t.set_text(t.get_text() + "\n" + cmlabels[i])
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix', size=20)
def evaluate(models, allresults):
'''
inputs:
- models: list of models considered
- allresults: dictionary of the result dictionaries from train_predict() for model in models
output:
Visualisation of a comparison of the results of the models in models.
'''
# Create figure
count_models = len(allresults)
fig, ax = plt.subplots(1, 4, figsize = (20,5))
# Constants
bar_width = 3/count_models
colors = "bgrcmykw"
for j, metric in enumerate(['acc_test', 'pre_test','rec_test', 'f_test']):
i=0
for model in models:
ax[j].bar(i*bar_width+bar_width, allresults[model][metric], width = bar_width, color = colors[i])
i+=1
ax[j].set_xlim((-0.1, 3.5))
# Add unique y-labels
ax[0].set_ylabel("Accuracy Score")
ax[1].set_ylabel("Precision Score")
ax[2].set_ylabel("Recall Score")
ax[3].set_ylabel("F-score")
# Add titles
ax[0].set_title("Accuracy")
ax[1].set_title("Precision")
ax[2].set_title("Recall")
ax[3].set_title("F-score")
# Set y-limits for score panels
ax[1].set_ylim((0, 1))
ax[2].set_ylim((0, 1))
ax[3].set_ylim((0, 1))
# Create patches for the legend
patches = []
i=0
for model in models:
patches.append(mpatches.Patch(color = colors[i], label = model.__class__.__name__))
i+=1
plt.legend(handles = patches, loc = 'lower right', ncol = 1)
# Aesthetics
plt.suptitle("Performance Metrics", fontsize = 16, y = 1.10)
plt.tight_layout()
plt.show()
In the following we introduce the models that we focus on in the modelling process. We concentrate on the most interesting models in terms of explainability, performance, handling and runtime.
Random forest is a machine learning technique that combines several largely uncorrelated decision trees for classification purposes. Using supervised learning, several individual decision trees are trained in parallel, each on its own resampled version of the training set. To classify a specific case with a random forest, each individual decision tree is first evaluated, leading to an individual classification; the classification of the random forest is then the mode of the resulting classes of the individual trees.
clf_RandomForest = RandomForestClassifier(random_state=42)
allresults = {clf_RandomForest: train_predict(clf_RandomForest, X_train, y_train, X_test, y_test)}
show_results(clf_RandomForest, allresults[clf_RandomForest])
y_pred = clf_RandomForest.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm)
We can see that the model essentially memorizes the training data; the very good training results do not carry over to the test data.
If we assume that an insurance company would base its application process on this model and reject the customers with a predicted OD, this would lead to the following: a claims ratio of 688 / 6146 = 11.2 %, which is only slightly less than the claims ratio if all customers were accepted.
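These figures were read off the confusion matrix above. As a sketch (the helper application_stats below is hypothetical and not part of the original notebook), they can also be derived programmatically, assuming that applicants predicted as OD are rejected and that the accepted OD cases (false negatives) become claims.
# Hypothetical helper: derive rejection rate and claims ratio from a confusion matrix,
# assuming applicants predicted as 'OD' are rejected.
def application_stats(cmatrix):
    tn, fp, fn, tp = cmatrix.ravel()   # sklearn layout: [[TN, FP], [FN, TP]]
    accepted = tn + fn                 # predicted 'not OD' -> contracts concluded
    rejected = fp + tp                 # predicted 'OD' -> applications rejected
    rejection_rate = rejected / cmatrix.sum()
    claims_ratio = fn / accepted       # share of accepted customers who actually are OD
    return rejection_rate, claims_ratio
rejection_rate, claims_ratio = application_stats(cm)
print("Rejected: {:.1%} of applicants, claims ratio among accepted: {:.1%}".format(rejection_rate, claims_ratio))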
Similar to random forests, gradient tree boosting is a machine learning technique for classification purposes that uses a set of decision trees. However, in contrast to random forests, gradient tree boosting does not build individual fully grown decision trees independently in parallel. Gradient tree boosting aims to boost weak learners into a strong learner. Starting with a decision tree (initial strong learner) gradient tree boosting sequentially adds results of different shallow decision trees (weak learner) in order to stepwise improve the strong learner. In each iteration, the added weak learner is trained on the remaining errors of the current strong learner.
clf_GradBoost = GradientBoostingClassifier(random_state=42)
allresults.update({clf_GradBoost: train_predict(clf_GradBoost, X_train, y_train, X_test, y_test)})
show_results(clf_GradBoost, allresults[clf_GradBoost])
y_pred = clf_GradBoost.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm)
Here, the model performs only slightly better on the training set than on the test set, but the imbalanced data set still puts too much weight on the true negative predictions as the dominating event.
Stochastic gradient descent (SGD) is a stochastic approximation of gradient descent, i.e., a machine learning algorithm that iteratively optimizes a given objective function. The SGDClassifier used here fits a linear classifier with SGD training.
clf_SGDC = SGDClassifier(random_state=42)
allresults.update({clf_SGDC: train_predict(clf_SGDC, X_train, y_train, X_test, y_test)})
show_results(clf_SGDC, allresults[clf_SGDC])
y_pred = clf_SGDC.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm)
We also obtain similar results using Stochastic Gradient Descent.
Logistic regression is a machine learning model that describes the relationship between a binary dependent variable and (not necessarily binary) independent variables using a logistic predictor function. The unknown parameters of the predictor function are estimated from the data.
clf_LogisticRegression = linear_model.LogisticRegression(max_iter = 1000)
allresults.update({clf_LogisticRegression: train_predict(clf_LogisticRegression, X_train, y_train, X_test, y_test)})
show_results(clf_LogisticRegression, allresults[clf_LogisticRegression])
y_pred = clf_LogisticRegression.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm)
The logistic regression performs as well as the other models presented in Section 6. This suggests that the relationships in the data are mostly linear.
models = [clf_LogisticRegression, clf_RandomForest, clf_GradBoost, clf_SGDC]
model_name =['LogisticRegression','RandomForest','GradientBoosting','StochasticGradientDescent']
evaluate(models, allresults)
We can see that no matter which model we choose, accuracy is high while precision is poor and recall is even worse. The combined F-score with $\beta = 10$ reflects the same behaviour. Because of the unbalanced data set, the non-OD cases dominate the model training.
The idea of artificial neural networks is that the observed data is first entered into an input layer. This data is then passed on to a second layer consisting of so-called neurons. In the neurons, the data is processed with the help of so-called activation functions and sent from there to a third layer, also consisting of neurons. This process is repeated for all so-called hidden layers. Finally, the resulting classification is the output of the last layer.
The advantage of the modelling strategy of a neural network is that such a network is able to approximate highly nonlinear structures. However, the training of a neural network takes an extensive amount of time, as it is a stochastic optimization process adjusting a large number of weights.
As the comparison of the four previously described models already shows, the regression model performs as well as the other models. We therefore do not expect additional value from this kind of model, which is confirmed by our experiments in Appendix A.
Oversampling can be defined as adding more copies of the minority class. We will use the resample utility from scikit-learn to randomly replicate samples from the minority class.
Always split into train and test sets before trying oversampling techniques! Oversampling before splitting the data can put exactly the same observations into both the train and the test set. This allows the model to simply memorize specific data points, causing overfitting and poor generalization to the test data.
#data_over = data.copy(deep=True)
data_over = data_norm.copy(deep=True)
# Separate input features and target
y_over = data_over.OD
X_over = data_over.drop('OD', axis=1)
# setting up testing and training sets
X_over_train, X_over_test, y_over_train, y_over_test = train_test_split(X_over, y_over, test_size=0.3, random_state=42)
# concatenate our training data back together
X_over = pd.concat([X_over_train, y_over_train], axis=1)
# separate minority and majority classes
not_OD = X_over[X_over.OD==0]
OD = X_over[X_over.OD==1]
OD_upsampled = resample(OD,
replace=True, # sample with replacement
n_samples=len(not_OD), # match number in majority class
random_state=42) # reproducible results
# combine majority and upsampled minority
upsampled = pd.concat([not_OD, OD_upsampled])
# change the training data to the upsampled version
y_over_train = upsampled.OD
X_over_train = upsampled.drop('OD', axis=1)
fig, ax = plt.subplots(1, 4, figsize = (20,5))
plt.setp(ax, xticks=[0,0.5,1], xticklabels=['0', '50:100', '100:100'])
plt.setp(ax, yticks=[0,0.2,0.4,0.6,0.8,1], yticklabels=['0', '20%', '40%','60%','80%','100%'])
fig.text(0.5, 0, 'Class Proportions (minority:majority)', ha='center')
i=0
for model in models:
Accuracy = []
Recall = []
Fall_Out = []
minority_majority = []
for n in range(11):
OD_oversampled = resample(OD,
replace=True, # sample with replacement
n_samples=len(not_OD)-round((len(not_OD)-len(OD))*(10-n)/10), # step-wise grow the minority class towards the size of the majority class
random_state=42) # reproducible results
# combine majority and upsampled minority
oversampled = pd.concat([not_OD, OD_oversampled])
# change the training data to the upsampled version
y_over_train = oversampled.OD
X_over_train = oversampled.drop('OD', axis=1)
if model == clf_LogisticRegression:
model.fit(X_over_train, y_over_train)
cm=metrics.confusion_matrix(y_test, np.round(model.predict(X_test),0))
else:
model.fit(X_over_train, y_over_train)
cm=metrics.confusion_matrix(y_test, model.predict(X_test))
Accuracy.append((cm[0][0]+cm[1][1])/cm.sum())
Recall.append(cm[1][1]/cm.sum(axis=1)[1])
Fall_Out.append(cm[0][1]/cm.sum(axis=1)[0]) # fall-out = FP / (TN + FP)
minority_majority.append(len(OD_oversampled)/len(not_OD))
# Plot: Oversampling the minority class
ax[i].plot(minority_majority,Accuracy)
ax[i].plot(minority_majority,Fall_Out)
ax[i].plot(minority_majority,Recall)
ax[i].set_xlim((0, 1))
ax[i].set_ylim((0, 1))
ax[i].set_title(model_name[i])
i=i+1
plt.suptitle("Oversampling", y=1.05)
plt.legend(('Accuracy', 'Fall-Out', 'Recall'),loc='lower right')
# the data
allresults_over={} # nested dict
for model in models:
allresults_over[model] = train_predict(model, X_over_train, y_over_train, X_test, y_test)
evaluate(models, allresults_over)
Oversampling continuously improves two of the four models investigated in Section 6. In particular, the logistic regression model still performs as well as the gradient boosting.
The random forest model does not improve significantly with the oversampling strategy. This behaviour has been observed before in several publications (e.g., "C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling").
Stochastic gradient descent improves in recall and F-score on a fully oversampled training set, but at the cost of precision and overall accuracy.
Downsampling follows the idea of cutting down the majority class of the training set to the size of the minority class, again obtaining a balanced data set to train on. We will use the resample utility from scikit-learn to randomly discard samples from the majority class.
Again, we first split into training and test set before downsampling, as there is no need to cut down the overall data.
Note that with only about 14% OD-positive values, we have to discard about 83% of the not-OD data in the training set to get an equally distributed training set. The approach therefore requires a data base that is large enough so that the models can still be fitted on this much smaller training set.
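A quick check of this arithmetic, assuming the OD/not_OD training split defined in the oversampling section above:
# share of OD cases in the training set and the fraction of not-OD rows that has to be
# dropped to reach a 1:1 ratio (roughly 83% for a minority share of about 14%)
p = len(OD) / (len(OD) + len(not_OD))
fraction_removed = 1 - len(OD) / len(not_OD)
print("OD share: {:.1%}, not-OD rows removed for a balanced training set: {:.1%}".format(p, fraction_removed))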
fig, ax = plt.subplots(1, 4, figsize = (20,5))
plt.setp(ax, xticks=[0,0.5,1], xticklabels=['0', '50:100', '100:100'])
plt.setp(ax, yticks=[0,0.2,0.4,0.6,0.8,1], yticklabels=['0', '20%', '40%','60%','80%','100%'])
fig.text(0.5, 0, 'Class Proportions (minority:majority)', ha='center')
i=0
for model in models:
Accuracy = []
Recall = []
Fall_Out = []
minority_majority = []
for n in range(11):
not_OD_downsampled = resample(not_OD,
replace = False, # sample without replacement
n_samples = len(OD)+round((len(not_OD)-len(OD))*(10-n)/10),
random_state = 27) # reproducible results
downsampled = pd.concat([not_OD_downsampled, OD])
y_down_train = downsampled.OD
X_down_train = downsampled.drop('OD', axis=1)
model.fit(X_down_train, y_down_train)
cm=metrics.confusion_matrix(y_test, [1 if x >= 0.5 else 0 for x in model.predict(X_test)]) # binarization only matters for models returning continuous scores; the classifiers here already return 0/1
Accuracy.append((cm[0][0]+cm[1][1])/cm.sum())
Recall.append(cm[1][1]/cm.sum(axis=1)[1])
Fall_Out.append(cm[0][1]/cm.sum(axis=1)[0]) # fall-out = FP / (TN + FP)
minority_majority.append(len(OD)/len(not_OD_downsampled))
# Plot: Downsampling the Majority Class
ax[i].plot(minority_majority,Accuracy)
ax[i].plot(minority_majority,Fall_Out)
ax[i].plot(minority_majority,Recall)
ax[i].set_xlim((0, 1))
ax[i].set_ylim((0, 1))
ax[i].set_title(model_name[i])
i=i+1
plt.suptitle("Downsampling", y=1.05)
plt.legend(('Accuracy', 'Fall-Out', 'Recall'),loc='lower right')
# the data
allresults_down={} # nested dict
for model in models:
allresults_down[model] = train_predict(model, X_down_train, y_down_train, X_test, y_test)
evaluate(models, allresults_down)
With downsampling all models improve, even though the parameter optimization of stochastic gradient descent still struggles.
In particular, the random forest model now improves as well on the downsampled training set.
model = clf_GradBoost
X_down_test = X_over_test
y_down_test = y_over_test
results = train_predict(model, X_down_train, y_down_train, X_down_test, y_down_test)
show_results(model, results)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm)
Now we get the following: we can see the trade-off we already described. Based on this model we reject a lot of potential customers (33.2%), but we can lower the claims ratio significantly to 196 / 4319 = 4.5 %.
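For comparison, the same hypothetical application_stats helper sketched in Section 6 can be applied to this confusion matrix as well:
# same hypothetical helper as in Section 6, applied to the confusion matrix of the
# gradient boosting model trained on the downsampled data
rejection_rate, claims_ratio = application_stats(cm)
print("Rejected: {:.1%} of applicants, claims ratio among accepted: {:.1%}".format(rejection_rate, claims_ratio))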
The goal of this section is to find the features with the most impact on the prediction and to explain single predictions of the gradient boosting classifier.
regr = linear_model.LinearRegression()
LR=regr.fit(X_down_train, y_down_train)
# get importance
importance = regr.coef_
# plot feature importance
plt.figure(figsize=(15,5))
plt.bar([x for x in range(len(regr.coef_))], regr.coef_)
plt.xticks(np.arange(len(X_down_train.columns)), X_down_train.columns, rotation='vertical')
plt.show()
The outstanding features in terms of absolute value of the coefficients are:
It seems like the most important feature is BMI_max, but since the features are not equally distributed on the [0,1] interval, this interpretation can be misleading.
fig, ax = plt.subplots(1, 2, figsize = (20,5))
sns.distplot(X_down_train['Age'], kde=False, ax=ax[0])
sns.distplot(X_down_train['BMI_max'], kde=False, ax=ax[1])
The values of BMI_max on the training set are on average much lower than those of Age. Hence, in reality the feature BMI_max is not as important as its coefficient implies.
For a black-box estimator like gradient boosting we need a different approach to measure feature importance.
The "eli5" package provides a way to compute feature importances by measuring how score decreases when a feature is not available.
The idea is to remove a feature from the test dataset and compute score without using this feature. The problem here is, that the model needs the feature to calculate the score. So the workaround is to randomly shuffle values for one feature. This way the feature no longer contains useful information. The method is known as “permutation importance”.
# Linear Regression
perm = PermutationImportance(LR, random_state=0).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist(), top=30)
# Gradient Boosting
GB = clf_GradBoost.fit(X_down_train, y_down_train)
perm = PermutationImportance(GB, random_state=0).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist(), top=30)
The top 3 features for both LR and GB are DoctorSleep, Age and YearsQuitSmoking. These results match the high coefficients of the LR. BMI_max is weighted less, as expected.
A big problem for permutation importance are correlated features: the permutation importance can be low for all of them, because shuffling one of the features may not affect the result much, as the estimator still has access to the same information through the other features. This especially affects one-hot encoded features like "EduLevel".
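A possible workaround, sketched here as a hypothetical example (it is not part of the eli5 API used above), is to permute a whole group of one-hot columns jointly and measure the resulting drop in test accuracy:
# hypothetical grouped permutation: shuffle all EduLevel_* columns jointly so that the
# estimator cannot recover the education level from the remaining sibling columns
edu_cols = [c for c in X_test.columns if c.startswith('EduLevel')]
baseline = accuracy_score(y_test, GB.predict(X_test))
X_perm = X_test.copy()
X_perm[edu_cols] = X_perm[edu_cols].sample(frac=1, random_state=0).values  # shuffle the rows of the group together
group_importance = baseline - accuracy_score(y_test, GB.predict(X_perm))
print("Joint permutation importance of EduLevel_*: {:.4f}".format(group_importance))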
The next explanation approach is SHAP values.
The goal of SHAP is to explain a prediction by computing the contribution of each feature to it. The method computes Shapley values from coalitional game theory; SHAP values tell us how to fairly distribute the prediction among the features.
# SHAP values: LR
explainer_LR = shap.LinearExplainer(LR,X_down_train)
shap_values_LR = explainer_LR.shap_values(X_test)
shap.summary_plot(shap_values_LR, X_test, plot_type="bar")
# SHAP values: GB
explainer_GB = shap.TreeExplainer(GB,X_down_train)
shap_values_GB = explainer_GB.shap_values(X_test)
shap.summary_plot(shap_values_GB, X_test, plot_type="bar")
The SHAP values can show how much each predictor contributes, either positively or negatively, to the target variable.
# SHAP values: GB
shap.summary_plot(shap_values_GB, X_test)
The top 4 features for both LR and GB are DoctorSleep, Age, EduLevel_5.0 and YearsQuitSmoking, while BMI_max is ranked a little bit lower.
After a single prediction it is important to know which features and values drove the model's prediction.
For the LR the answer is the coefficient multiplied by the feature's value (see the sketch below).
In the case of GB the answer is not that easy. One solution is the LIME approach (Local Interpretable Model-Agnostic Explanations).
A LIME explanation is a local linear approximation of the model's behaviour: while the model may be very complex globally, it is easier to approximate it around a single observation with a linear model.
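Before the LIME example for GB, here is a minimal sketch (hypothetical, not part of the original notebook) of such a local explanation for the linear model fitted above: the contribution of each feature is its coefficient times the feature value of the chosen observation.
# per-feature contributions of the linear model for a single (scaled) test observation
instance = X_test.iloc[0]                                    # any single test observation
contributions = pd.Series(LR.coef_, index=X_test.columns) * instance
# sort by absolute contribution and show the nine strongest drivers (analogous to num_features=9 below)
top = contributions.reindex(contributions.abs().sort_values(ascending=False).index)
print(top.head(9))
# sanity check: intercept + sum of contributions equals the model prediction
print(LR.intercept_ + contributions.sum(), LR.predict(instance.values.reshape(1, -1))[0])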
# LIME explanation for a single observation: GB
predict_proba_gb = lambda x: GB.predict_proba(x).astype(float)
explainer = lime.lime_tabular.LimeTabularExplainer(X_over_train.values, feature_names=X_over_train.columns, class_names=['No OD', 'OD'], kernel_width=5)
choosen_instance = X_over_train.loc[[53865]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_proba_gb,num_features=9)
exp.show_in_notebook(show_all=False)
In this specific example, GB decides for OD based on an OD prediction probability of 0.81, which exceeds the probability of no OD (0.19). The decision for OD is particularly driven by the values of the variables DoctorSleep, EduLevel_5.0 and YearsQuitSmoking.
We investigated drivers for being in the state of occupational disability. We observed that the influence of the main drivers is modeled well both by a logistic regression model and by a gradient boosting approach. In Section 8 we identified the maximum BMI, age and sleeping problems as important drivers of becoming occupationally disabled. On the other hand, in particular a high educational level and a high number of years since quitting smoking (or being a lifelong non-smoker) have a positive influence on not becoming occupationally disabled.
Note that non-smokers are included in the "YearsQuitSmoking" variable with a default of 25 years. Further, having sleeping problems and therefore being under medical treatment could also be an effect of the occupational disability rather than a cause.
Still, the investigation gives hints on how to adapt the application process of insurance companies. First, the questionnaire used in the application process can be optimized to focus on the most relevant drivers, improving the customer journey. Second, preventive measures can be targeted at customers with a higher risk of becoming occupationally disabled.
For completeness we add the modelling with neural networks in this appendix, in addition to the models described in Section 6. As the relationships in the data are captured well by a logistic model, the networks add no additional value compared to the other investigated models.
In addition, comparing the prediction times, we can see that, as expected, neural networks take much more time in the model fitting process. Further, note that we do not include hyperparameter tuning, i.e., optimization of the number of layers, activation functions or batch sizes, but use a fairly standard model structure and parameters in the fitting process.
Please uncomment the following code and switch from markdown to code.
import keras
def train_predict_NN(model, X_train, y_train, X_test, y_test, BS = 0, EP = 0):
'''
inputs:
- model: the learning algorithm to be trained and predicted on
- X_train: features training set
- y_train: target labels (OD), training set
- X_test: features testing set
- y_test: target labels (OD), testing set
- BS: batch_size (in case model is a neural network)
- EP: number of epochs (in case model is a neural network)
output:
- results: dictionary, containing all introduced model ratios for training and test set
'''
results={}
# Fit the learner to the training data
start = time() # Get start time
if BS == 0:
model = model.fit(X_train, y_train)
else:
model_history = model.fit(X_train, y_train, batch_size=BS, epochs=EP, validation_data=(X_test, y_test), verbose = 0)
plt.plot(model_history.history['accuracy'])
plt.plot(model_history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
#plt.savefig('Regression_model_loss.png')
plt.show()
# Plot training & validation loss values
plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
#plt.savefig('Regression_model_loss.png')
plt.show()
end = time() # Get end time
# Calculate the training time
results['time_train'] = end-start
# Get the predictions on the test set(X_test)
start = time() # Get start time
predictions_test = np.round(model.predict(X_test))
predictions_train = np.round(model.predict(X_train))
end = time() # Get end time
# Calculate the total prediction time
results['time_test'] = end-start
#Compute accuracy on the training set
results['acc_train'] = accuracy_score(y_train, predictions_train)
# Compute precision on the training set
results['pre_train'] = precision_score(y_train, predictions_train)
# Compute recall on the training set
results['rec_train'] = recall_score(y_train, predictions_train)
# Compute F-score on the training set
results['f_train'] = fbeta_score(y_train, predictions_train, beta=10)
# Compute accuracy on the test set
results['acc_test'] = accuracy_score(y_test, predictions_test)
# Compute precision on the test set
results['pre_test'] = precision_score(y_test, predictions_test)
# Compute recall on the test set
results['rec_test'] = recall_score(y_test, predictions_test)
# Compute F-score on the test set
results['f_test'] = fbeta_score(y_test, predictions_test, beta=10)
return results
def NN(NoOfFeatures, NoOfReturns, Layers, LossFct, MetricFct):
model_in = keras.layers.Input(shape = (NoOfFeatures,))
model = keras.layers.Dense(200, input_shape = (NoOfFeatures,), kernel_initializer = 'normal', activation = "softmax")(model_in)
model = keras.layers.Dropout(0.4, noise_shape=None, seed=None)(model)
for k in range(0,Layers-1):
model = keras.layers.Dense(200, input_shape = (NoOfFeatures,), kernel_initializer = 'normal', activation = "softmax")(model)
model = keras.layers.Dropout(0.4, noise_shape=None, seed=None)(model)
model = keras.layers.Dense(NoOfReturns, activation = 'sigmoid')(model)
model = keras.models.Model(model_in, model)
model.compile(optimizer = 'adam', loss = LossFct, metrics = [MetricFct])
return model
We are investigating a four layer neural network with binary crossentropy as loss function and standard binary accuracy.
NoOfFeatures = X_train.shape[1]
NoOfReturns = 1
Layers = 4
LossFct = 'binary_crossentropy'
MetricFct = 'accuracy'
model_bc = NN(NoOfFeatures, NoOfReturns, Layers, LossFct, MetricFct)
NNresults = train_predict_NN(model_bc, X_train, y_train, X_test, y_test, 60, 100)
show_results(model_bc, NNresults)
cm3 = metrics.confusion_matrix(y_test, np.round(model_bc.predict(X_test)))
model_bc_score = model_bc.evaluate(X_test, y_test)
plot_confusion_matrix(cm3)
NNresults_over = train_predict_NN(model_bc, X_over_train, y_over_train, X_test, y_test, 60, 100)
show_results(model_bc, NNresults_over)
cm3 = metrics.confusion_matrix(y_test, np.round(model_bc.predict(X_test)))
model_bc_score = model_bc.evaluate(X_test, y_test)
plot_confusion_matrix(cm3)
NNresults_down = train_predict_NN(model_bc, X_down_train, y_down_train, X_test, y_test, 60, 100)
show_results(model_bc, NNresults_down)
cm3 = metrics.confusion_matrix(y_test, np.round(model_bc.predict(X_test)))
model_bc_score = model_bc.evaluate(X_test, y_test)
plot_confusion_matrix(cm3)