Predicting Occupational Disability

This notebook was created as part of the Data Science Challenge 2020 of the German Actuarial Association.

0. Introduction

Our goal is to investigate drivers that lead to limitations keeping people from working. These results are not only interesting for individuals that may get an incentive to adjust their own behavior in order to reduce the likelihood of not being able to work but they are also interesting from the perspective of an insurance company selling occupational disability insurance (OD) for two reasons. First, our results affect the insurance company’s risk assessment before contract conclusion. Typically, insurance companies request a lot of information from the potential policyholder before selling products like OD insurance. Our results support insurance companies to focus on the most relevant drivers in this regard. This enables insurance companies to shorten their risk assessment during the application process increasing (potential) customer experience. Second, our results provide insights regarding promising preventive measures that are worth to be supported by an insurance company.

In the last years mental health problems are becoming increasingly important as cause to keep people from working. For this purpose, we particularly focus on mental health problems and take into account various potential drivers such as educational level, BMI, smoking behavior or sleeping behavior, which are typically used in insurance companies’ application processes or might be added to them.

Our analysis is based on publicly available data from the National Center for Health Statistics. In one of their programes they assess the health and nutritional status of adults and children in the United States. Their survey examines a nationally representative sample of about 5,000 persons every second year, where we use their results from 2005 to 2015.

In this notebook we show how to prepare such data and based on this how to use this data in order to be able to make predictions. In our case we aim to predict the dependent variable "occupational disability" by certain independent variables such as BMI, smoking behavior or sleeping behavior. We train and evaluate different machine learning models (e.g., random forests or gradient tree boosting) and illustrate resampling techniques in order to improve the results of the models. Moreover, in this context, trustworthy artificial intelligence (AI) is particularly important for insurance companies. One main driver of trustworthy AI is transparency meaning explainability of AI models. Hence, we also pay special attention to the explainability of our models.

Our main results show that in particular trouble sleeping, age and smoking are related to occupational disability. As "occupational disability" is an underrepresented event in the data set, special data preparation in the data processing and modelling steps is necessary. Our results show that resampling techniques (over- or downsampling) are in this case important. Furthermore, our results yield that in our case predictions by a logistic regression are not inferior to predictions by more sophisticated machine learning models such as random forests or gradient tree boosting.

The remainder of our notebook is structured as follows. Section 1 describes our data collection process, while Section 2 focuses on processing and visualizing the data including data cleaning and one-hot encoding techniques. In Section 3 we normalize and split our data in training und test sample before we provide theoretical background on ratios to evaluate models in Section 4. Section 5 contains techniqual aspects and defines the functions used for predicting, evaluating and visualizing the model data in the following sections. In Section 6 we apply various AI models on our data and evaluate them. Using resampling techniques we aim to improve our results in Section 7, while Section 8 focuses on the explainability of the AI models. We give a final overview on the achieved results again in Section 9.

In [1]:
import os

from time import time

import warnings
# to make the notebook more readable we will take Future Warnings out
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np

# visuals
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn import metrics 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score
from sklearn.metrics import mean_squared_error

from sklearn.utils import resample

# model explanation
import shap 
import eli5
from eli5.sklearn import PermutationImportance
import lime

# python version
import platform
In [2]:
# this notebook has been created with the following Version of Python
print('This notebook has been created with Python Version: ' + platform.python_version())
This notebook has been created with Python Version: 3.8.3

1. Data Collection

Data source: National Center for Health Statistics (NHANES)
NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey examines a nationally representative sample of about 5,000 persons every second year.

The data can be downloaded from https://wwwn.cdc.gov/nchs/nhanes/Default.aspx

We use the data from 2005 to 2015 with consistent questions throughout the years. In earlier years some data is missing.
We select the following chapters:
-BPQ: Blood Pressure & Cholesterol
-DEMO: Demographic Variables
-PFQ: Physical Functioning (containing the target variable: occupational disability)
-SLQ: Sleep Disorders
-SMQ: Smoking - Cigarette Use
-WHQ: Weight History

1.1 Original Data

If downloaded the data from 2005 to 2015, the following code can be used to import all .XPT-files and merge as well as filter the relevant data fields. Make sure to provide the correct data path. To include the code, uncomment and turn Markdown to Code.

my_path = os.getcwd()
path = os.path.join(my_path, 'NHANES\\Daten\\')
year = ('2005','2007','2009','2011','2013','2015')
file_extension = ('D','E','F','G','H','I')
bpq = pd.DataFrame()
demo = pd.DataFrame()
pfq = pd.DataFrame()
slq = pd.DataFrame()
smq = pd.DataFrame()
whq = pd.DataFrame()

for i in range(0,len(year)):
        bpq=bpq.append(pd.read_sas(path+year[i]+'\\BPQ_'+file_extension[i]+'.XPT'))
        demo=demo.append(pd.read_sas(path+year[i]+'\\DEMO_'+file_extension[i]+'.XPT'))
        pfq=pfq.append(pd.read_sas(path+year[i]+'\\PFQ_'+file_extension[i]+'.XPT'))
        slq=slq.append(pd.read_sas(path+year[i]+'\\SLQ_'+file_extension[i]+'.XPT'))
        smq=smq.append(pd.read_sas(path+year[i]+'\\SMQ_'+file_extension[i]+'.XPT'))
        whq=whq.append(pd.read_sas(path+year[i]+'\\WHQ_'+file_extension[i]+'.XPT'))

Merge the tables on SEQN (individual number per person).

result = pfq.merge(bpq,how="left",on='SEQN').merge(demo,how="left",on='SEQN').merge(slq,how="left",on='SEQN').merge(smq,how="left",on='SEQN').merge(whq,how="left",on='SEQN')

In the next step we filter for relevant questions in each dataset.

We focus on questions that are plausible to influence the probability for occupational disability. Another point is, that the question relates to the past or is not directly influenced by the fact, weather the person is able to work right now.

For example a question we decided to drop is "INDFMIN2" about the family income. From our perspective, a low family income in combination with occupational disability is rather a result and not an indicator. In any such cases we decided to not use the questions.

data=result[['SEQN','PFQ049','RIDAGEYR','DMDEDUC2','DMDMARTL','DMDHHSIZ','WHD010','WHD020','WHD050','WHD110','WHD130','WHD120','WHD140','SMQ020','SMD030','SMQ040','SMQ050Q','SMQ050U','SMQ621','SMQ900','SLD012','SLD010H','SLQ050','SLQ120','BPQ020','BPQ030','BPD035','BPQ040A','BPQ050A','BPQ080','BPQ060','BPQ070','BPQ090D','BPQ100D']]

1.2 Prepared .csv-file

For convenience we provide the final result of the data collection steps in a .csv-file to be simply read in by pandas read_csv.

In [3]:
# Import data
data = pd.read_csv ('Daten.csv')

1.3 Data Overview

In [4]:
print('Number and features of imported values:', data.shape)
Number and features of imported values: (55075, 34)
In [5]:
print('Number and features of imported values when question PFQ049 has the answer yes = 1 or no = 2:', data[data.PFQ049.isin((1,2))].shape)
Number and features of imported values when question PFQ049 has the answer yes = 1 or no = 2: (34163, 34)

2. Data Processing and Visualisation

2.1 Cleaning the data

Before we can do anything with the data we need to clean it up. To get an idea we first take a look at some sample data.

In [6]:
# Ten random data sets
data[data.PFQ049.isin((1,2))].sample (10)
Out[6]:
SEQN PFQ049 RIDAGEYR DMDEDUC2 DMDMARTL DMDHHSIZ WHD010 WHD020 WHD050 WHD110 ... BPQ020 BPQ030 BPD035 BPQ040A BPQ050A BPQ080 BPQ060 BPQ070 BPQ090D BPQ100D
1705 32931.0 2.0 38.0 1.0 1.0 3.0 NaN 247.0 250.0 220.0 ... 2.0 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN
52096 90429.0 2.0 35.0 5.0 1.0 4.0 69.0 142.0 142.0 NaN ... 2.0 NaN NaN NaN NaN 1.0 NaN 1.0 2.0 NaN
11658 43407.0 2.0 55.0 2.0 3.0 1.0 59.0 125.0 126.0 130.0 ... 1.0 1.0 54.0 1.0 1.0 1.0 1.0 1.0 2.0 NaN
5316 36732.0 2.0 72.0 4.0 1.0 2.0 63.0 118.0 118.0 120.0 ... 1.0 1.0 NaN 1.0 1.0 2.0 1.0 1.0 NaN NaN
16641 48643.0 2.0 79.0 4.0 2.0 1.0 60.0 169.0 176.0 165.0 ... 1.0 1.0 70.0 1.0 1.0 2.0 1.0 1.0 NaN NaN
8112 39676.0 2.0 67.0 5.0 2.0 2.0 62.0 195.0 210.0 224.0 ... 2.0 NaN NaN NaN NaN 2.0 1.0 1.0 NaN NaN
5609 37048.0 1.0 52.0 2.0 5.0 3.0 65.0 170.0 215.0 215.0 ... 1.0 1.0 NaN 1.0 1.0 2.0 1.0 1.0 NaN NaN
28084 61661.0 2.0 63.0 4.0 1.0 3.0 72.0 235.0 235.0 230.0 ... 1.0 1.0 61.0 1.0 1.0 2.0 1.0 1.0 NaN NaN
19610 51762.0 2.0 22.0 3.0 6.0 4.0 62.0 135.0 120.0 NaN ... 2.0 NaN NaN NaN NaN 2.0 1.0 1.0 NaN NaN
24290 57227.0 2.0 27.0 1.0 1.0 5.0 64.0 142.0 138.0 NaN ... 2.0 NaN NaN NaN NaN 2.0 1.0 1.0 NaN NaN

10 rows × 34 columns

In [7]:
# Missing values when question PFQ049 (OD) has the answer 1=yes or 2=no
data[data.PFQ049.isin((1,2))].apply(lambda x: sum(x.isnull().values), axis = 0)
Out[7]:
SEQN            0
PFQ049          0
RIDAGEYR        0
DMDEDUC2        0
DMDMARTL        0
DMDHHSIZ        0
WHD010        244
WHD020        217
WHD050        195
WHD110       9683
WHD130      17665
WHD120       4552
WHD140        124
SMQ020          0
SMD030      18869
SMQ040      18869
SMQ050Q     25993
SMQ050U     26136
SMQ621      34163
SMQ900      28446
SLD012      28478
SLD010H      5733
SLQ050          0
SLQ120      17540
BPQ020          0
BPQ030      22061
BPD035      23677
BPQ040A     22061
BPQ050A     23628
BPQ080       5183
BPQ060       5772
BPQ070       9302
BPQ090D     16077
BPQ100D     26331
dtype: int64

Some questions with a lot of missing values are logically linked to each other. This will be discussed later.

In [8]:
# drop scarcely answered questions 
data = data.drop('SMQ621', axis=1)
data = data.drop('WHD110', axis=1)
data = data.drop('WHD130', axis=1)
data = data.drop('BPD035', axis=1)

Some data is coded in 1=yes, 2=no. We replace 2 with 0 and delete the missing values.

In [9]:
# OD = 1 and not OD = 0
data = data[data.PFQ049.isin((1,2))]
data['PFQ049'] = data['PFQ049'].replace(2,0)

# smoked 100 cigarettes = 1, no = 0 
data = data[data.SMQ020.isin((1,2))]
data['SMQ020'] = data['SMQ020'].replace(2,0)

# Ever told doctor had trouble sleeping?
data = data[data.SLQ050.isin((1,2))]
data['SLQ050'] = data['SLQ050'].replace(2,0)

# high blood pressure / cholesterol
data['BPQ020'] = data['BPQ020'].replace(2,0)
data['BPQ030'] = data['BPQ030'].replace(2,0)
data['BPQ040A'] = data['BPQ040A'].replace(2,0)
data['BPQ050A'] = data['BPQ050A'].replace(2,0)
data['BPQ060'] = data['BPQ060'].replace(2,0)
data['BPQ080'] = data['BPQ080'].replace(2,0)
data['BPQ090D'] = data['BPQ090D'].replace(2,0)
data['BPQ100D'] = data['BPQ100D'].replace(2,0)
In [10]:
# plot: people answering the question about occupational disability grouped by age
f, ax = plt.subplots(figsize=(30, 5))
sns.countplot(x='RIDAGEYR',hue='PFQ049',data=data)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c1d42d7550>
In [11]:
# reducing the dataset to the target group of OD insurance: ages >= 25 and < 67
data = data[(data.RIDAGEYR < 67) & (data.RIDAGEYR >= 25)]
In [12]:
# drop sequence number 
data = data.drop('SEQN', axis=1)

In the next step we calcualte the BMI for each weight indication.
For the question WHD020 (weight by age 25), with about 15% missing values, we decided to use the ratio of the weight at 25 and the current weight to estimate the missing values.

In [13]:
#calculate weight coefficient
data = data[(data.WHD020 != 7777) & (data.WHD020 != 9999) & (data['WHD020'].notna())]# data without answers "Refused" 7777 or "Don't know" 9999 and NaN
data_WHD = data[(data.WHD120 != 7777) & (data.WHD120 != 9999) & (data['WHD120'].notna())]
weight_coeff = data_WHD["WHD120"]/data_WHD["WHD020"] # Anteil Weight_Age_25/CurrentWeight
weight_coeff.describe()
Out[13]:
count    21427.000000
mean         0.865188
std          0.154699
min          0.288462
25%          0.767442
50%          0.866667
75%          0.954545
max          2.753333
dtype: float64

Since the mean and the median is about the same value, we use the mean to estimate the weight by the age of 25.

In [14]:
data.loc[:,("WHD120")] = data["WHD120"].fillna(weight_coeff.mean()*data["WHD020"])
In [15]:
# drop NAs
data = data.dropna(subset=['WHD010', 'WHD050', 'WHD120', 'WHD140'])

# drop "Refused" 7777 and "Don't know" 9999
data = data[(data.WHD010 != 7777) & (data.WHD010 != 9999)]
data = data[(data.WHD050 != 7777) & (data.WHD050 != 9999)]
data = data[(data.WHD120 != 7777) & (data.WHD120 != 9999)]
data = data[(data.WHD140 != 7777) & (data.WHD140 != 9999)]
In [16]:
# calculate BMI:
data['BMI_0']=(data['WHD020']/2.205)/((data['WHD010']*0.0254)**2)   # current BMI
data['BMI_1']=(data['WHD050']/2.205)/((data['WHD010']*0.0254)**2)   # BMI one year ago
data['BMI_25']=(data['WHD120']/2.205)/((data['WHD010']*0.0254)**2)  # BMI by the age of 25
data['BMI_max']=(data['WHD140']/2.205)/((data['WHD010']*0.0254)**2) # maximum BMI

# drop weight and hight
data = data.drop('WHD010', axis=1)
data = data.drop('WHD020', axis=1)
data = data.drop('WHD050', axis=1)
data = data.drop('WHD120', axis=1)
data = data.drop('WHD140', axis=1)

The next set of questions is about the smoking behavior.
Question SMQ020: Smoked at least 100 cigarettes in life?

In [17]:
# drop "Refused" 7 and "Don't know" 9
data = data[(data.SMQ020 != 7) & (data.SMQ020 != 9)]

Question SMD030: Age started smoking?

In [18]:
# plot: people answering the question about "starting age smoking"
f, ax = plt.subplots(figsize=(30, 5))
sns.countplot(x='SMD030',hue='PFQ049',data=data)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c1d509eac0>

We face the problem, that this question logically excludes non-smokers.
Our solution is to categorize the starting age by quartiles and create four categories.
A non-smoker has a "0" in every category and a smoker falls exactly in one.

In [19]:
# drop "Refused" 777 and "Don't know" 999
data = data[(data.SMD030 != 777) & (data.SMD030 != 999)]

# NaN to 0: (none smokers)
data['SMD030'][data['SMQ020'] == 0].describe()
data.loc[data['SMQ020']==0,'SMD030'] = 0

# Quartiles
data['SMD030'][(data['SMD030']>0)].describe()
Out[19]:
count    1.005200e+04
mean     1.718156e+01
std      6.130375e+00
min      5.397605e-79
25%      1.500000e+01
50%      1.700000e+01
75%      1.900000e+01
max      5.800000e+01
Name: SMD030, dtype: float64
In [20]:
Quantil25 = data['SMD030'][data['SMD030']>0].quantile(.25)
Quantil50 = data['SMD030'][data['SMD030']>0].quantile(.5)
Quantil75 = data['SMD030'][data['SMD030']>0].quantile(.75)

data['Smoking_Age_25Q'] = data['SMQ020']
data.loc[data['SMD030']>Quantil25, 'Smoking_Age_25Q'] = 0 # [0,25]
data['Smoking_Age_50Q'] = data['SMQ020']
data.loc[data['SMD030']>Quantil50, 'Smoking_Age_50Q'] = 0 # (25,50]
data.loc[data['SMD030']<=Quantil25, 'Smoking_Age_50Q'] = 0
data['Smoking_Age_75Q'] = data['SMQ020']
data.loc[data['SMD030']>Quantil75, 'Smoking_Age_75Q'] = 0 # (50,75]
data.loc[data['SMD030']<=Quantil50, 'Smoking_Age_75Q'] = 0
data['Smoking_Age_100Q'] = data['SMQ020']
data.loc[data['SMD030']<=Quantil75, 'Smoking_Age_100Q'] = 0 # (75,100]

Question SMQ040: Are you currenty smoking?
The goal ist to recode the answers to yes = 1 and no = 0.

In [21]:
# none smokers have missing values -> NaNs to 0
data.loc[data['SMQ020']==0,'SMQ040'] = 0

# drop 7 (Refused) and 9 (Don't know)
data = data[(data.SMQ040 != 7) & (data.SMQ040 != 9)]

# answer 3 ("Not at all") to 0
data['SMQ040'] = data['SMQ040'].replace(3,0)

# answer 2 ("Some days") to 1
data['SMQ040'] = data['SMQ040'].replace(2,1)

Question SMQ050Q: Since when did u stop smoking? (years)
Current-smokers and non-smokers have missing values.
We decide to set the upper limit to 25, because the youngest person is 25. The value for non-smokers will then be set to 25 and for current-smokers to 0.

In [22]:
# none smokers --> 25
data.loc[data['SMQ020'] == 0, 'SMQ050Q'] = 25
# currently-smokers --> 0
data.loc[data['SMQ040'] == 1, 'SMQ050Q'] = 0

# drop 77777 (Refused) and 99999 (Don't know)
data = data[(data.SMQ050Q != 77777) & (data.SMQ050Q != 99999)]

# recode the values to years
data.loc[data['SMQ050U']== 1, 'SMQ050Q'] = round(data['SMQ050Q']/365) # SMQ050U = 1 --> days
data.loc[data['SMQ050U']== 2, 'SMQ050Q'] = round(data['SMQ050Q']/52) # SMQ050U = 2 --> weeks
data.loc[data['SMQ050U']== 3, 'SMQ050Q'] = round(data['SMQ050Q']/12) # SMQ050U = 1 --> month

# maximum 25 years
data.loc[data['SMQ050Q']> 25, 'SMQ050Q'] = 25

# plot SMQ050Q
f, ax = plt.subplots(figsize=(30, 5))
sns.countplot(x='SMQ050Q',data=data[(data['SMQ050Q']<25) & (data['SMQ050Q']>0)])

# drop SMQ050U
data = data.drop('SMQ050U', axis=1)

# drop SMD030
data = data.drop('SMD030', axis=1)

# drop SMQ900 (scarcely answered)
data = data.drop('SMQ900', axis=1)

The set of questions about blood pressure are all linked to each other.
Question BPQ020: Doctor ever told you, you had high blood pressure? is answered by everyone.
Question BPQ030: Told had high blood pressure 2+ times? When BPQ020 = 0 then replace NaN's for this question with 0.
Question BPQ040A: Taking prescription for hypertension? When BPQ020 = 0 then replace NaN's for this question with 0.
Question BPQ050A: Now taking prescribed medicine? When BPQ040A = 0 then replace NaN's for this question with 0.

In [23]:
data.loc[data['BPQ020']==0,'BPQ030'] = 0
data.loc[data['BPQ020']==0,'BPQ040A'] = 0
data.loc[data['BPQ040A']==0,'BPQ050A'] = 0

The set of questions about cholesterol level are all linked to each other.
Question BPQ080: Doctor ever told you, you had high cholesterol level? When BPQ060 = 0 then replace NaN's for this question with 0.
Question BPQ060: Ever had blood cholesterol checked? When BPQ080 = 0 then replace NaN's for this question with 0.
Question BPQ070: When was blood cholesterol last checked? When BPQ080 = 0 then replace NaN's for this question with 4 (5 years or more).
Question BPQ090D: Told to take prescription for cholesterol? When BPQ080 = 0 then replace NaN's for this question with 0.
Question BPQ100D: Now taking prescribed medicine? When BPQ090D = 0 then replace NaN's for this question with 0.

In [24]:
data.loc[data['BPQ080']==1,'BPQ060'] = 1
data.loc[data['BPQ060']==0,'BPQ080'] = 0
data.loc[data['BPQ060']==0,'BPQ070'] = 4
data.loc[data['BPQ080']==0,'BPQ090D'] = 0
data.loc[data['BPQ090D']==0,'BPQ100D'] = 0
In [25]:
# drop "Refused" 7 and "Don't know" 9 and NaN's
data = data[(data.BPQ020 != 7) & (data.BPQ020 != 9) & (data.BPQ020.isnull() == False)]
data = data[(data.BPQ030 != 7) & (data.BPQ030 != 9) & (data.BPQ030.isnull() == False)]
data = data[(data.BPQ040A != 7) & (data.BPQ040A != 9) & (data.BPQ040A.isnull() == False)]
data = data[(data.BPQ050A != 7) & (data.BPQ050A != 9) & (data.BPQ050A.isnull() == False)]
data = data[(data.BPQ080 != 7) & (data.BPQ080 != 9) & (data.BPQ080.isnull() == False)]
data = data[(data.BPQ060 != 7) & (data.BPQ060 != 9) & (data.BPQ060.isnull() == False)]
data = data[(data.BPQ070 != 7) & (data.BPQ070 != 9) & (data.BPQ070.isnull() == False)]
data = data[(data.BPQ090D != 7) & (data.BPQ090D != 9) & (data.BPQ090D.isnull() == False)]
data = data[(data.BPQ100D != 7) & (data.BPQ100D != 9) & (data.BPQ100D.isnull() == False)]

The next set of questions is about sleep disorders.
Question SLQ050: Ever told doctor you had trouble sleeping?

In [26]:
# SLQ050: drop "Refused" 7 and "Don't know" 9
data = data[(data.SLQ050 != 7) & (data.SLQ050 != 9)]

# drop SLQ120: scarcely answered
data = data.drop('SLQ120', axis=1)

Question SLD012 and SLD010H: Sleep hours? This question has different variable names over the years. We replace missing values with the median.

In [27]:
# combine SLD012 and SLD010H in the new variable SLD
data.loc[data['SLD012'].isnull() == True, 'SLD'] = data['SLD012'] 
data.loc[data['SLD012'].isnull() == False, 'SLD'] = data['SLD012'] 

# 99 (Refused) to NaN
data['SLD'] = data['SLD'].replace(99, np.nan)

# replace NaN's with median
print('Median:',data['SLD'].median())
data['SLD']=data['SLD'].fillna(data['SLD'].median())

# drop SLD012 and SLD010H
data = data.drop('SLD012', axis=1)
data = data.drop('SLD010H', axis=1)
Median: 7.5

Finally we drop the missing answers to the questions about Education Level and Marital Status

In [28]:
# drop "Refused" 7 and "Don't know" 9
data = data[(data.DMDEDUC2 != 7) & (data.DMDEDUC2 != 9)]
# drop "Refused" 77 and "Don't know" 99
data = data[(data.DMDMARTL != 77) & (data.DMDMARTL != 99)]

2.2 Renaming the categories

To increase the readability of our data we re-code the answers. You can find the description of the data here: NHanes Search Variables. For easy processing we put the code, a short explanation and a long explanation in a csv.

In [29]:
DataDescription = pd.read_csv('Description.csv')
DataDescription.head(20)
Out[29]:
code ShortExplanation LongExplanation
0 SEQN Number Respondent sequence number
1 PFQ049 OD Limitations keeping you from working
2 RIDAGEYR Age Age at Screening Adjudicated - Recode
3 DMDEDUC2 EducationLevel Education Level - Adults 20+
4 INDFMIN2 FamilyIncome Annual Family Income
5 WHD010 CurrentHeight Current self-reported height (inches)
6 WHD020 CurrentWeight Current self-reported weight (pounds)
7 WHD050 Weight_1 Self-reported weight - 1 yr ago (pounds)
8 WHD110 Weight_10 Self-reported weight-10 yrs ago (pounds)
9 WHD120 Weight_25 Self-reported weight - age 25 (pounds)
10 WHD130 Height_25 Self-reported height - age 25 (inches)
11 WHD140 Weight_max Self-reported greatest weight (pounds)
12 SMQ020 Smoked100 Smoked at least 100 cigarettes in life
13 SMD030 AgeStartSmoking Age started smoking cigarets regularly
14 SMQ040 SmokingNow Do you now smoke cigarettes
15 SMQ050Q YearsQuitSmoking How long since quit smoking cigarettes
16 SMQ621 CigarettesLife Cigarettes smoked in entire life
17 SMQ900 E_Cigarette Ever used an e-cigarette?
18 SLD Sleep hours How much sleep do you get (hours)?
19 SLQ300 SleepTime Usual sleep time on weekdays or workdays
In [30]:
data = data.rename(columns=dict(zip(DataDescription["code"], DataDescription["ShortExplanation"])))

2.3 One-hot encoding variables with multiple categories

Education Level and Marital Status have multiple categories represented by integer values. Without adjustment, machine learning algorithms might interprete a natural ordered relationship between these integers, which makes no sense. Hence, we convert these information in binary variables using one-hot encoding to make sure we do not implicate some kind of relationship between the data that is not true.

Note that with respect to Education Level the value of the integer is related to the education level, i.e. a higher value represents a higher education level. However, the mean between Education Level 1 (Less than 9th grade) and Education Level 5 (College graduate or above) does not make much sense. With Marital Status it is even more obvious. For instance, 'Never married' doesn't equal five times 'Married'.

In [31]:
# One-hot encoding Education Level
Education = pd.get_dummies(data['EducationLevel'], prefix='EduLevel')
data = data.drop('EducationLevel', axis=1)
data = pd.concat([data, Education], axis=1)

EduLevel_1.0: Less than 9th grade
EduLevel_2.0: 9-11th grade (Includes 12th grade with no diploma)
EduLevel_3.0: High school graduate/GED or equivalent
EduLevel_4.0: Some college or AA degree
EduLevel_5.0: College graduate or above

In [32]:
# One-hot encoding Marital Status
MaritalStatus = pd.get_dummies(data['MaritalStatus'], prefix='MaritalStatus')
data = data.drop('MaritalStatus', axis = 1)
data = pd.concat([data, MaritalStatus], axis=1)

MaritalStatus_1.0: Married
MaritalStatus_2.0: Widowed
MaritalStatus_3.0: Divorced
MaritalStatus_4.0: Separated
MaritalStatus_5.0: Never married
MaritalStatus_6.0: Living with partner

In [33]:
# drop infs
data.dropna()
data = data[np.isfinite(data).all(1)]

2.4 Check for correlation in the data

If several variables correlate in the data set, they usually do not add additional information to the model. It might even be a cause-effect relationship that will not improve model quality or disturb the learning process. Therefore we analyse correlations in the data and drop certain variables.

In [34]:
# Pairplot: none binary variables
sns.set(style="ticks")
pairplot=['Age', 'BMI_max', 'PeopleInHousehold', 'YearsQuitSmoking']
pairplot.append('OD')
sns.pairplot(data[pairplot], hue="OD")
Out[34]:
<seaborn.axisgrid.PairGrid at 0x1c1d4c89730>

We notice some correlation. A higher age correlates with Occupational Disability (OD). You can also see that if you do not smoke or quit smoking a long time ago you are less likely to suffer from OD.

In [35]:
DataHeatmap=['OD','Age', 'Smoked100', 'DoctorSleep', 'DoctorHighBloodPressure', 'DoctorHighBloodPressure2', 'YearsQuitSmoking', 'MedicineCholesterol']
sns.heatmap(data[DataHeatmap].corr(), annot=True, cmap="YlGnBu");

The heatmap shows us correlation between different variables. We drop "DoctorHighBloodPressure2 - a doctor has told you that you have high blood pressure twice" because it correlates to the similar question "DoctorHighBloodPressure - a doctor has told you that you have high blood pressure".

In [36]:
data = data.drop('DoctorHighBloodPressure2', axis=1)
In [37]:
DataHeatmapOD=['OD','BMI_0', 'BMI_1', 'BMI_25', 'BMI_max']
sns.heatmap(data[DataHeatmapOD].corr(), annot=True, cmap="YlGnBu");

We drop the (highly correlated) categories for Body Mass Index and decide to only keep the Maximum.

In [38]:
data = data.drop('BMI_0', axis=1)
data = data.drop('BMI_1', axis=1)
data = data.drop('BMI_25',axis=1)
In [39]:
DataHeatmapSmoke = ['OD', 'Smoked100', 'SmokingNow', 'YearsQuitSmoking', 'Smoking_Age_25Q', 'Smoking_Age_50Q', 'Smoking_Age_75Q', 'Smoking_Age_100Q']
sns.heatmap(data[DataHeatmapSmoke].corr(), annot=True, cmap="YlGnBu");

The information if a person currently smokes is correlated with its smoking behaviour in earlier years. Further, if a person smokes in OD state is not an information an insurance company is able to use in the application process. Therefore, we drop the "SmokingNow" variable.

In [40]:
data = data.drop('SmokingNow', axis=1)

3. Normalizing and Splitting the data

3.1 Normalizing Numerical Features

It is often good practice to perform some type of scaling on numerical features. Applying a scaling to the data does not change the shape of each feature's distribution (such as above); however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, observing the data in its raw form will no longer have the same original meaning, as exampled below.

In [41]:
# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)

# we single out the data which isn't binary
notbinary = ['Age', 'BMI_max', 'PeopleInHousehold', 'YearsQuitSmoking', ]
data_norm = pd.DataFrame(data = data.copy(deep=True))
data_norm[notbinary] = scaler.fit_transform(data[notbinary])

# Show an example of a record with scaling applied
display(data_norm.head(n = 5))
OD Age PeopleInHousehold Smoked100 YearsQuitSmoking DoctorSleep DoctorHighBloodPressure PrescriptionHighBloodPressure MedicineHighBloodPressure DoctorHighCholesterol ... EduLevel_2.0 EduLevel_3.0 EduLevel_4.0 EduLevel_5.0 MaritalStatus_1.0 MaritalStatus_2.0 MaritalStatus_3.0 MaritalStatus_4.0 MaritalStatus_5.0 MaritalStatus_6.0
3 0.0 0.463415 0.500000 0.0 1.0 0.0 1.0 1.0 1.0 0.0 ... 0 0 1 0 1 0 0 0 0 0
7 0.0 0.390244 0.000000 0.0 1.0 1.0 0.0 0.0 0.0 0.0 ... 0 0 1 0 0 0 0 0 1 0
22 0.0 0.829268 0.166667 1.0 1.0 1.0 1.0 1.0 0.0 0.0 ... 0 1 0 0 1 0 0 0 0 0
23 0.0 0.048780 0.666667 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0 1 0 0 1 0 0 0 0 0
24 1.0 0.463415 0.166667 1.0 0.2 1.0 1.0 1.0 1.0 1.0 ... 0 1 0 0 0 0 0 0 1 0

5 rows × 31 columns

3.2 Train / Test Split

The prediction variable gets separated from the actual input data. We use train_test_split from sklearn to split the data into 70% of training data and 30% of test data.

In [42]:
# Splitting the data
data_test = data_norm.copy(deep=True)

y = data_test['OD']
X = data_test.drop('OD', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [43]:
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
Training set has 15088 samples.
Testing set has 6467 samples.

4. Modell Evaluation

To evalute the models described in the upcoming Section 6, we will consider certain ratios giving us the estimation of probabilites for certain events. The considered ratios are standard for the evaluation of classification tasks. We will see with a naive predictor that our data set is unbalanced, i.e., being OD is a rare event that has to be predicted. We consider the so-called f-beta score as additional estimator to handle the specialities of an unbalanced data set.

4.1 Ratios

There are different ratios to evaluate the predictive power of the model.

In our case of a binary problem - you either suffer from OD or you do not - we have four possible outcomes for the classified data in our model:

True Positive (TP): You are suffering from Occupational Disability and the model classifies you as 'OD'.

False Positive (FP): You are not suffering from Occupational Disability but the model predicts you do.

True Negative (TN): You are not suffering from Occupational Disability and the model predicts you do not.

False Negative (FN): You are suffering from Occupational Disability but the model classifies you as 'not OD'.

Based on these outcomes we take into account for different ratios as defined below.

Confusion matrix based on (Wikipedia)

Accuracy:

Accuracy describes how much of the data is classified correctly by the model. Especially with imbalanced date like we have in our model it is very easy to get a high accuracy as we will see in the naive predictor below.

$$ Accuracy = \frac{\sum{TP} + \sum{TN}}{\sum{TP} + \sum{TN} + \sum{FP} + \sum{FN}}$$

Precision:

Precision describes how many of the predicted positive (predicted OD) are actual positive (OD). In our case we aim for a high precision. If the precision is low, we will reject to many customers in the application process or invest in preventive measures for customers who don't necessarily need them.

$$ Precision = \frac{\sum{TP}}{\sum{TP} + \sum{FP}}$$

Recall (Sensitivity):

Recall is defined by the ratio of those that we classified OD and the actual number of those being OD. In our case a high recall is very important, since customers being OD are very expensive for insurance companies. A high recall allows insurance companies to reject applicants with a high risk of OD or to invest in preventive measures for them in case they are already customer.

$$ Recall = \frac{\sum{TP}}{\sum{TP} + \sum{FN}}$$

Fall-Out (False Positive Rate):

Fall-out is defined by the ratio of those being mistakenly classified OD and the actual number of those being not OD.

$$ Fall-Out = \frac{\sum{FP}}{\sum{TN} + \sum{FP}}$$

4.2 The Naive Predictor

As a naive predictor we use a very easy model: We predict 'not OD' for all our data points. This gives us a baseline to rate our predictions against.

In [44]:
TP = 0 # There are zero true positives predicted as all predictions are "not OD".
FP = 0 # There are zero false positives predicted as all predictions are "not OD".

TN = data['OD'].count() - np.sum(data['OD']) # The number of true negative predictions is the number of zeros of the prediction variable.
FN = np.sum(data['OD']) # The number of false negative preditions is the number of ones of the prediction variable.

# Calculate accuracy, precision and recall
accuracy = (TP + TN) / data['OD'].count()

# Print the results
print("Accuracy        : {:.4f}".format(accuracy))
Accuracy        : 0.8669

We can get a very high accuracy of over 86.7% with this very naive model. Obviously the accuracy of the naive predictor describes that about 86.7% of the samples are not occupationally disabled. This reflects that the occurance of OD is a rare event of only 13.3% in the data set that we need to predict. The data is unbalanced in this sense.

So it is important to consider other ratios in order to evaluate the benefit of the models.

During the application process we aim to predict the positive cases (OD), in order to reduce the loss ratio. Hence, the recall of the model is particularly important, because it tells us which percentage of OD we got right.

However, the insurance company rejects applicants if our model states that they have a high risk of OD. Rejecting applicants means less new business and a smaller customer base. Hence, insurance companies are not interested in rejecting applicants that actually have not a high risk of OD emphasizing the importance of the Precision - how many of the predicted positive cases are really positive?

However, there is also a ratio that combines the two measures we want to look at - the f-beta score.

F-beta score:

F-beta combines precision and recall:

$$ F_{\beta} = (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\left( \beta^2 \cdot Precision \right) + Recall} $$

β is chosen such that recall is considered β times as important as precision.

For our data we have to choose a pretty high β, because recall has a much higher impact on costs. In the following model process we decide for a β of 10.

5. Functions for Predicting, Evaluating and Visualising the Model Data

This Section contains necessary functions for the prediction and evaluation process in the upcoming sections. For the reader it is possible to skip this Section and directly continue with the model training in Section 6.

Note that it would be convenient to outsource the full section to a .py-file that could be imported as a module.

In [45]:
def train_predict(model, X_train, y_train, X_test, y_test): 
    '''
    inputs:
       - model: the learning algorithm to be trained and predicted on
       - X_train: features training set
       - y_train: income training set
       - X_test: features testing set
       - y_test: income testing set
       
    output:
        - results: dictionary, containing all introduced model ratios for training and test set
    '''
    results={}
    # Fit the learner to the training data
    start = time() # Get start time
    model = model.fit(X_train, y_train)
    end = time() # Get end time
    
    # Calculate the training time
    results['time_train'] = end-start
        
    # Get the predictions on the test set(X_test)
    start = time() # Get start time
    predictions_test = model.predict(X_test)
    predictions_train = model.predict(X_train)
    end = time() # Get end time
    
    # Calculate the total prediction time
    results['time_test'] = end-start
    # Compute accuracy on the training set
    results['acc_train'] = accuracy_score(y_train, predictions_train)
    # Compute precision on the training set 
    results['pre_train'] = precision_score(y_train, predictions_train)
    # Compute recall on the training set 
    results['rec_train'] = recall_score(y_train, predictions_train)
    # Compute F-score on the training set 
    results['f_train'] = fbeta_score(y_train, predictions_train, beta=10)
    # Compute accuracy on the test set 
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    # Compute precision on the test set 
    results['pre_test'] = precision_score(y_test, predictions_test)
    # Compute recall on the test set 
    results['rec_test'] = recall_score(y_test, predictions_test)
    # Compute F-score on the test set 
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=10)
    
    # Return the results
    return results
In [46]:
def show_results(model, results):
    '''
    inputs:
        - model: the learning algorithm to be trained and predicted on
        - results: dictionary output of train_predict()
    output:
    Prints all estimators contained in results.
    '''
    print("                  {}".format(model.__class__.__name__))
    print("                  Training   Testing")
    print("Prediction time : {:.4f}     {:.4f}".format(results['time_train'],results['time_test']))
    print("Accuracy        : {:.4f}     {:.4f}".format(results['acc_train'],results['acc_test']))
    print("Precision       : {:.4f}     {:.4f}".format(results['pre_train'],results['pre_test']))
    print("Recall          : {:.4f}     {:.4f}".format(results['rec_train'],results['rec_test']))
    print("F-score         : {:.4f}     {:.4f}".format(results['f_train'],results['f_test']))
    print("\n")
In [47]:
# plotting function for the confusion matrix
def plot_confusion_matrix(cmatrix):
    '''
    input:
        - cmatrix: confusion matrix
    output:
    Visualisation of confusion matrix as heatmap.
    '''
    plt.figure(figsize=(6,6))
    sns.set(font_scale=1.6)
    ax = sns.heatmap(cmatrix, annot=True, fmt="0", linewidths=.5, square=True, cmap='Blues_r')
    cmlabels = ['TN', 'FP',
              'FN', 'TP']
    for i,t in enumerate(ax.texts):
        t.set_text(t.get_text() + "\n" + cmlabels[i])
    plt.title('Confusion Matrix', size=15)
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')
    all_sample_title = 'Confusion Matrix'
    plt.title(all_sample_title, size=20)
In [48]:
def evaluate(models, allresults):
    '''
    inputs:
        - models: list of models considered
        - allresults: dictionary of the result dictionaries from train_predict() for model in models
    output:
    Visualisation of a comparison of the results of the models in models.
    '''
    # Create figure
    count_models = len(allresults)
    fig, ax = plt.subplots(1, 4, figsize = (20,5))

    # Constants
    bar_width = 3/count_models
    colors = "bgrcmykw"
        
    for j, metric in enumerate(['acc_test', 'pre_test','rec_test', 'f_test']):
        i=0
        for model in models:
            ax[j].bar(i*bar_width+bar_width, allresults[model][metric], width = bar_width, color = colors[i])
            i+=1
        ax[j].set_xlim((-0.1, 3.5))
    
    # Add unique y-labels
    ax[0].set_ylabel("Accuracy Score")
    ax[1].set_ylabel("Precision Score")
    ax[2].set_ylabel("Recall Score")
    ax[3].set_ylabel("F-score")
        
    # Add titles
    ax[0].set_title("Accuracy")
    ax[1].set_title("Precision")
    ax[2].set_title("Recall")
    ax[3].set_title("F-score")
    
    # Set y-limits for score panels
    ax[1].set_ylim((0, 1))
    ax[2].set_ylim((0, 1))
    ax[3].set_ylim((0, 1))
    
     # Create patches for the legend
    patches = []
    i=0
    for model in models:
        patches.append(mpatches.Patch(color = colors[i], label = model.__class__.__name__))
        i+=1
    plt.legend(handles = patches, loc = 'lower right', ncol = 1)
    
    # Aesthetics
    plt.suptitle("Performance Metrics", fontsize = 16, y = 1.10)
    plt.tight_layout()
    plt.show()

6. Models and Training

In the following we introduce the models that we focus on in the modelling process. We focus on the most interesting models in terms of explainability, performance, handeling and runtime.

6.1 Random Forest

Random forest is a machine learning technique that combines several uncorrelated decision trees for classification purposes. Applying supervised learning techniques, several individual decision trees are trained in parallel on each of the respective training sets. In order to classify a specific case using a random forest, first, each individual decision tree is evaluated leading to an individual classification. Second, the classification of the random forest is the mode of the resulting classes of the individual decision trees.

In [49]:
clf_RandomForest = RandomForestClassifier(random_state=42)
allresults = {clf_RandomForest: train_predict(clf_RandomForest, X_train, y_train, X_test, y_test)}
In [50]:
show_results(clf_RandomForest, allresults[clf_RandomForest])
y_pred = clf_RandomForest.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm)
                  RandomForestClassifier
                  Training   Testing
Prediction time : 1.8009     0.4483
Accuracy        : 1.0000     0.8704
Precision       : 1.0000     0.5050
Recall          : 1.0000     0.1819
F-score         : 1.0000     0.1831


We can see that the model learns the training data by heart and it is not possible to transfer the very good training results to the test data.

If we assume that an insurance company would base its application process on this model and reject the customers with predicted OD this would lead to:

  • 6467 applicants
  • 303 of those would be rejected (4.7 %)
  • 6164 would be insured
  • 688 would become OD

So we have a claims ratio of 688 / 6146 = 11.2 % which is only slightly less than the claims ratio if all customers would be accepted.

6.2 Gradient Tree Boosting

Similar to random forests, gradient tree boosting is a machine learning technique for classification purposes that uses a set of decision trees. However, in contrast to random forests, gradient tree boosting does not build individual fully grown decision trees independently in parallel. Gradient tree boosting aims to boost weak learners into a strong learner. Starting with a decision tree (initial strong learner) gradient tree boosting sequentially adds results of different shallow decision trees (weak learner) in order to stepwise improve the strong learner. In each iteration, the added weak learner is trained on the remaining errors of the current strong learner.

In [51]:
clf_GradBoost = GradientBoostingClassifier(random_state=42)
allresults.update({clf_GradBoost: train_predict(clf_GradBoost, X_train, y_train, X_test, y_test)})
In [52]:
show_results(clf_GradBoost, allresults[clf_GradBoost])
y_pred = clf_GradBoost.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm)
                  GradientBoostingClassifier
                  Training   Testing
Prediction time : 2.3908     0.0560
Accuracy        : 0.8849     0.8758
Precision       : 0.6897     0.5633
Recall          : 0.2608     0.2010
F-score         : 0.2625     0.2022


Here, the model performs slightly better on the training as on the test set but the imbalanced data set still puts an overweight on the true negative predictions as dominating event.

6.3 Stochastic Gradient Descent

Stochastic Gradient Descent is kind of a stochastic approximation of Gradient Descent. It is a machine learning algorithm that iteratively aims to optimize a specific objective function.

In [53]:
clf_SGDC = SGDClassifier(random_state=42)
allresults.update({clf_SGDC: train_predict(clf_SGDC, X_train, y_train, X_test, y_test)})
In [54]:
show_results(clf_SGDC, allresults[clf_SGDC])
y_pred = clf_SGDC.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm)
                  SGDClassifier
                  Training   Testing
Prediction time : 0.1649     0.0080
Accuracy        : 0.8739     0.8771
Precision       : 0.6731     0.6533
Recall          : 0.1208     0.1165
F-score         : 0.1218     0.1175


We also obtain similar results using Stochastic Gradient Descent.

6.4 Logistic Regression

A logistic regression is also a machine learning model that models the relationship of a binary dependent variable and independent variables (not necessarily binary) using a logistic predictor function. The unknown parameters of the predictor function are estimated from the data.

In [55]:
clf_LogisticRegression = linear_model.LogisticRegression(max_iter = 1000)
allresults.update({clf_LogisticRegression: train_predict(clf_LogisticRegression, X_train, y_train, X_test, y_test)})
In [56]:
show_results(clf_LogisticRegression, allresults[clf_LogisticRegression])
y_pred = clf_LogisticRegression.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm)
                  LogisticRegression
                  Training   Testing
Prediction time : 0.5151     0.0000
Accuracy        : 0.8763     0.8768
Precision       : 0.6081     0.5724
Recall          : 0.2234     0.2069
F-score         : 0.2248     0.2082


The logistic regression performs as good as the other models presented in Section 6. It is most likely that we have a mostly linear behaviour in the data relationships.

6.5 Comparison of Model Results

In [57]:
models = [clf_LogisticRegression, clf_RandomForest, clf_GradBoost, clf_SGDC]
model_name =['LogisticRegression','RandomForest','GradientBoosting','StochasticGradientDescent']
evaluate(models, allresults)

We can see that no matter what model we choose, while accuracy is high, precision is not good and recall is even worse. The combined estimator f-score with $\beta = 10$ does reflect the same model training behaviour. Because of the unbalanced data set the number of non-OD-cases is dominating the model training.

6.6 Neural Networks

The idea of artificial neural networks is that the observed data is first entered into an input layer. These data is then passed on to a second layer consisting of so-called neurons. In the neurons, the data is processed with the help of so-called activation functions and sent from there to a third layer, also consisting of neurons. This process is then repeated for all so-called hidden layers. Finally, the resulting classification is the output of the last layer.

The advantage of the modelling strategie of a neural network is that such a network is able to approximate highly nonlinear structures. Further, the training of a neural network takes an extensive amount of time as it is a stochastic optimization process adjusting the variables.

As we can already see in the comparison of the four previously described models that the regression model does perform as good as the other models. So we will not expect an additional value of this kind of model what is validated in our experiments in Appendix A.

7 Resampling Techniques

7.1 Oversample minority class

Oversampling can be defined as adding more copies of the minority class. We will use the resampling module from Scikit-Learn to randomly replicate samples from the minority class.

Always split into test and train sets before trying oversampling techniques! Oversampling before splitting the data can allow the exact same observations to be present in both the test and train sets. This can allow our model to simply memorize specific data points and cause overfitting and poor generalization to the test data.

For more information see here.

In [58]:
#data_over = data.copy(deep=True)
data_over = data_norm.copy(deep=True)

# Separate input features and target
y_over = data_over.OD
X_over = data_over.drop('OD', axis=1)

# setting up testing and training sets
X_over_train, X_over_test, y_over_train, y_over_test = train_test_split(X_over, y_over, test_size=0.3, random_state=42)

# concatenate our training data back together
X_over = pd.concat([X_over_train, y_over_train], axis=1)

# separate minority and majority classes
not_OD = X_over[X_over.OD==0]
OD = X_over[X_over.OD==1]
In [59]:
OD_upsampled = resample(OD,
               replace=True, # sample with replacement
               n_samples=len(not_OD), # match number in majority class
               random_state=42) # reproducible results

# combine majority and upsampled minority
upsampled = pd.concat([not_OD, OD_upsampled])

# change the training data to the upsampled version
y_over_train = upsampled.OD
X_over_train = upsampled.drop('OD', axis=1)
In [60]:
fig, ax = plt.subplots(1, 4, figsize = (20,5))
plt.setp(ax, xticks=[0,0.5,1], xticklabels=['0', '50:100', '100:100'])
plt.setp(ax, yticks=[0,0.2,0.4,0.6,0.8,1], yticklabels=['0', '20%', '40%','60%','80%','100%'])
fig.text(0.5, 0, 'Class Proportions (minority:majority)', ha='center')

i=0

for model in models:

    Accuracy = []
    Recall = []
    Fall_Out = []
    minority_majority = []

    
    for n in range(11):
        OD_oversampled = resample(OD,
                            replace=True, # sample with replacement
                            n_samples=len(not_OD)-round((len(not_OD)-len(OD))*(10-n)/10), # match number in majority class
                            random_state=42) # reproducible results

        # combine majority and upsampled minority
        oversampled = pd.concat([not_OD, OD_oversampled])

        # change the training data to the upsampled version
        y_over_train = oversampled.OD
        X_over_train = oversampled.drop('OD', axis=1)
        
        if model == clf_LogisticRegression:
            model.fit(X_over_train, y_over_train)
            cm=metrics.confusion_matrix(y_test, np.round(model.predict(X_test),0))
        else:
            model.fit(X_over_train, y_over_train)
            cm=metrics.confusion_matrix(y_test, model.predict(X_test))
                                    
        Accuracy.append((cm[0][0]+cm[1][1])/cm.sum())
        Recall.append(cm[1][1]/cm.sum(axis=1)[1])
        Fall_Out.append(cm[0][0]/cm.sum(axis=1)[0])
        minority_majority.append(len(OD_oversampled)/len(not_OD))
    
    # Plot: Oversampling the minority class

    ax[i].plot(minority_majority,Accuracy)
    ax[i].plot(minority_majority,Fall_Out)
    ax[i].plot(minority_majority,Recall)
    ax[i].set_xlim((0, 1))
    ax[i].set_ylim((0, 1))
    ax[i].set_title(model_name[i])
    
    i=i+1

plt.suptitle("Oversampling", y=1.05)
plt.legend(('Accuracy', 'Fall-Out', 'Recall'),loc='lower right')
Out[60]:
<matplotlib.legend.Legend at 0x1c1db831df0>
In [61]:
# the data
allresults_over={} # nested dict
for model in models:
    allresults_over[model] = train_predict(model, X_over_train, y_over_train, X_test, y_test)

evaluate(models, allresults_over)

Oversampling improves continously two of the four investigated models of Section 6. Especially the logistic regression model is still performing as good as the Gradient Boosting.

The random forest model does not improve significantly by the oversampling strategy. This behaviour was observed before in several publications (i.e., cf. C4.5,ClassImbalance,and CostSensitivity:Why Under-SamplingbeatsOver-Sampling).

Stochastic Gradient Descent improves in recall and f-score on a fully oversampled training set but on the tradeoff of precision and overall accuracy.

7.2 Downsampling

Downsampling follows the idea of cutting down the majority class of the training set to the size of the minority class to get again a balanced data set to train with. We will use the resampling module from Scikit-Learn to randomly throw off samples from the majority class.

Again first split into training and test set before downsampling as there is no need in cutting down the overall data.

Note that if there are only 14% OD positive values, we have to cut down about 83% of the not-OD data in the training set to get an equaly distributed training set. Therefore, the approach requires a data base that is huge enough so that the models can still be fitted on this much smaller data base.

In [62]:
fig, ax = plt.subplots(1, 4, figsize = (20,5))
plt.setp(ax, xticks=[0,0.5,1], xticklabels=['0', '50:100', '100:100'])
plt.setp(ax, yticks=[0,0.2,0.4,0.6,0.8,1], yticklabels=['0', '20%', '40%','60%','80%','100%'])
fig.text(0.5, 0, 'Class Proportions (minority:majority)', ha='center')

i=0

for model in models:

    Accuracy = []
    Recall = []
    Fall_Out = []
    minority_majority = []

    for n in range(11):
        not_OD_downsampled = resample(not_OD,
                                  replace = False, # sample without replacement
                                  n_samples = len(OD)+round((len(not_OD)-len(OD))*(10-n)/10),
                                  random_state = 27) # reproducible results
        downsampled = pd.concat([not_OD_downsampled, OD])
        y_down_train = downsampled.OD
        X_down_train = downsampled.drop('OD', axis=1)
        
        model.fit(X_down_train, y_down_train)
        cm=metrics.confusion_matrix(y_test, [1 if x >= 0.5 else 0 for x in model.predict(X_test)]) # Binarization of model.predict only effects the linear Regression
                                    
        Accuracy.append((cm[0][0]+cm[1][1])/cm.sum())
        Recall.append(cm[1][1]/cm.sum(axis=1)[1])
        Fall_Out.append(cm[0][0]/cm.sum(axis=1)[0])
        minority_majority.append(len(OD)/len(not_OD_downsampled))
    
    # Plot: Downsampling the Majority Class

    ax[i].plot(minority_majority,Accuracy)
    ax[i].plot(minority_majority,Fall_Out)
    ax[i].plot(minority_majority,Recall)
    ax[i].set_xlim((0, 1))
    ax[i].set_ylim((0, 1))
    ax[i].set_title(model_name[i])
    
    i=i+1

plt.suptitle("Downsampling", y=1.05)
plt.legend(('Accuracy', 'Fall-Out', 'Recall'),loc='lower right')
Out[62]:
<matplotlib.legend.Legend at 0x1c1d653ee20>
In [63]:
# the data
allresults_down={} # nested dict
for model in models:
    allresults_down[model] = train_predict(model, X_down_train, y_down_train, X_test, y_test)
    
evaluate(models, allresults_down)

With downsampling all models improve eventhough in the parameter optimization process of Stochastic Gradient Descent the optimization procedure still struggles in the performance.

In particular on the downsampled training set the random forest model is improving as well.

In [64]:
model = clf_GradBoost
X_down_test = X_over_test
y_down_test = y_over_test 
results = train_predict(model, X_down_train, y_down_train, X_down_test, y_down_test)
In [65]:
show_results(model, results)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm)
                  GradientBoostingClassifier
                  Training   Testing
Prediction time : 0.6895     0.0469
Accuracy        : 0.7860     0.7373
Precision       : 0.7824     0.3003
Recall          : 0.7924     0.7669
F-score         : 0.7923     0.7553


Now we get the following:

  • 6467 applicants
  • 2148 of those would be rejected (33.2 %)
  • 4319 would be insured
  • 196 would become OD

We can see the trade-off we already described: Based on this model we reject a lot of potential customers (33.2%), but we can lower the claims ratio significantly to 196 / 4319 = 4.5 %.

8. Model explanation

The goal of this chapter is to find the features with the most impact on the prediction and to explain single predictions for the Gradient Boosting Classifier.

Linear regression coefficients

In [66]:
regr = linear_model.LinearRegression()
LR=regr.fit(X_down_train, y_down_train)

# get importance
importance = regr.coef_

# plot feature importance
plt.figure(figsize=(15,5))
plt.bar([x for x in range(len(regr.coef_))], regr.coef_)
plt.xticks(np.arange(30), X_down_train.columns, rotation='vertical')
plt.show()

The outstanding features in terms of absolute value of the coefficients are:

  • BMI_max
  • Age
  • DoctorSleep
  • EduLevel_5.0
  • YearsQuitSmoking

It seems like the most important feature is BMI_max, but since the features are not equally distributed on the [0,1] intervall, the interpretation can be misleading.

In [67]:
fig, ax = plt.subplots(1, 2, figsize = (20,5))
sns.distplot(X_down_train['Age'], kde=False, ax=ax[0])
sns.distplot(X_down_train['BMI_max'], kde=False, ax=ax[1])
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c1dc32c790>

The values on the training dataset for BMI_max are on avarage a lot lower compared to age. Hence, in reality the feature BMI_max is not as important as the coefficient implies.

Permutation Importance (eli5)

For a black-box estimator like Gradient Boosting we need a different approch for measuring feature importance.

The "eli5" package provides a way to compute feature importances by measuring how score decreases when a feature is not available.
The idea is to remove a feature from the test dataset and compute score without using this feature. The problem here is, that the model needs the feature to calculate the score. So the workaround is to randomly shuffle values for one feature. This way the feature no longer contains useful information. The method is known as “permutation importance”.

In [68]:
# Linear Regression
perm = PermutationImportance(LR, random_state=0).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist(), top=30)
Out[68]:
Weight Feature
0.0907 ± 0.0171 DoctorSleep
0.0374 ± 0.0081 Age
0.0269 ± 0.0076 YearsQuitSmoking
0.0139 ± 0.0011 DoctorHighCholesterol
0.0109 ± 0.0018 MedicineHighBloodPressure
0.0099 ± 0.0025 CholestoralChecked
0.0085 ± 0.0006 Smoked100
0.0066 ± 0.0016 Sleep hours
0.0063 ± 0.0029 MaritalStatus_5.0
0.0063 ± 0.0018 EduLevel_4.0
0.0049 ± 0.0045 MaritalStatus_1.0
0.0046 ± 0.0066 EduLevel_5.0
0.0043 ± 0.0006 Smoking_Age_75Q
0.0041 ± 0.0006 Smoking_Age_100Q
0.0037 ± 0.0031 EduLevel_1.0
0.0033 ± 0.0020 MaritalStatus_6.0
0.0001 ± 0.0068 BMI_max
0.0001 ± 0.0002 MaritalStatus_4.0
-0.0004 ± 0.0001 Smoking_Age_50Q
-0.0010 ± 0.0007 MaritalStatus_2.0
-0.0011 ± 0.0012 EduLevel_3.0
-0.0015 ± 0.0007 PeopleInHousehold
-0.0023 ± 0.0002 MedicineCholesterol
-0.0031 ± 0.0014 WhenCholestoralChecked
-0.0041 ± 0.0009 MaritalStatus_3.0
-0.0047 ± 0.0023 Smoking_Age_25Q
-0.0053 ± 0.0027 EduLevel_2.0
-0.0056 ± 0.0040 PrescriptionCholesterol
-0.0131 ± 0.0012 PrescriptionHighBloodPressure
-0.0169 ± 0.0073 DoctorHighBloodPressure
In [69]:
# Gradient Boosting
GB = clf_GradBoost.fit(X_down_train, y_down_train)
perm = PermutationImportance(GB, random_state=0).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist(), top=30)
Out[69]:
Weight Feature
0.0305 ± 0.0058 DoctorSleep
0.0121 ± 0.0054 Age
0.0057 ± 0.0042 YearsQuitSmoking
0.0055 ± 0.0018 Sleep hours
0.0054 ± 0.0035 MaritalStatus_1.0
0.0040 ± 0.0050 BMI_max
0.0029 ± 0.0029 MaritalStatus_5.0
0.0023 ± 0.0009 MedicineHighBloodPressure
0.0016 ± 0.0016 PeopleInHousehold
0.0012 ± 0.0010 EduLevel_4.0
0.0007 ± 0.0016 Smoking_Age_25Q
0.0006 ± 0.0033 EduLevel_5.0
0.0004 ± 0.0022 DoctorHighCholesterol
0.0002 ± 0.0003 MaritalStatus_2.0
0.0002 ± 0.0007 Smoked100
0.0001 ± 0.0006 Smoking_Age_75Q
0 ± 0.0000 MaritalStatus_4.0
-0.0001 ± 0.0004 MaritalStatus_3.0
-0.0001 ± 0.0005 Smoking_Age_100Q
-0.0001 ± 0.0042 DoctorHighBloodPressure
-0.0003 ± 0.0003 Smoking_Age_50Q
-0.0003 ± 0.0009 EduLevel_3.0
-0.0005 ± 0.0008 CholestoralChecked
-0.0006 ± 0.0025 MaritalStatus_6.0
-0.0006 ± 0.0025 MedicineCholesterol
-0.0007 ± 0.0019 WhenCholestoralChecked
-0.0012 ± 0.0011 EduLevel_2.0
-0.0013 ± 0.0020 EduLevel_1.0
-0.0024 ± 0.0019 PrescriptionHighBloodPressure
-0.0039 ± 0.0009 PrescriptionCholesterol

The top 3 features for both LR and GB are DoctorSleep, Age and YearsQuitSmoking. These result match the high coefficents of the LR. BMI_max is weighted less as expected.
A big problem for Permutation Importance are correlated features. Permutation importance can be low for all of these features, because dropping one of the features may not affect the result, as estimator still has an access to the same information from other features. This affects especially one-hot encoded features like "EduLevel".

Shap Values

The next explanation approach are SHAP values.
The goal of SHAP is to explain the prediction of observation by computing the contribution of each feature to the prediction. This method computes Shapley values from coalitional game theory. SHAP values tell us how to fairly distribute the prediction among the features.

In [70]:
# SHAP values: LR
explainer_LR = shap.LinearExplainer(LR,X_down_train)
shap_values_LR = explainer_LR.shap_values(X_test)
shap.summary_plot(shap_values_LR, X_test, plot_type="bar")
In [71]:
# SHAP values: GB
explainer_GB = shap.TreeExplainer(GB,X_down_train)
shap_values_GB = explainer_GB.shap_values(X_test)
shap.summary_plot(shap_values_GB, X_test, plot_type="bar")
Passing 4056 background samples may lead to slow runtimes. Consider using shap.sample(data, 100) to create a smaller background data set.
100%|===================| 6455/6467 [07:36<00:00]        

The SHAP values can show how much each predictor contributes, either positively or negatively, to the target variable.

In [72]:
# SHAP values: GB
shap.summary_plot(shap_values_GB, X_test)

The top 4 features for both LR and GB are DoctorSleep, Age, EduLevel_5.0 and YearsQuitSmoking, while BMI_max is ranked a little bit lower.

Lime explanations

After a single prediction it is important to know, what features and values did drive the models prediction.
For the LR the answer is the coefficient multiplied by the features value.
In case of GB the answer is not that easy. A solution is the lime-approach (Local Interpretable Model-Agnostic Explanations).
A lime explanation is a local linear approximation of the model's behaviour. While the model may be very complex globally, it is easier to approximate it around the observation with an linear model.

In [73]:
# LIME explanation for a single observation: GB
predict_proba_gb = lambda x: GB.predict_proba(x).astype(float)
explainer = lime.lime_tabular.LimeTabularExplainer(X_over_train.values, feature_names=X_over_train.columns, class_names=['No OD', 'OD'], kernel_width=5)

choosen_instance =  X_over_train.loc[[53865]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_proba_gb,num_features=9)
exp.show_in_notebook(show_all=False)

In this specific example, GB decides for OD based on a OD prediction probability of 0.81 percent that exceeds the probability for no OD (0.19 percent). The decision for OD is particularly due to the values of the variables DoctorSleep, EduLevel_5.0 and YearsQuitSmoking.

9 Conclusion

We investigated drivers for being in the state of occupational disability. We observed that the influence of the main drivers is modeled well by a logistic regression model as well as using a gradient boosting approach. In Section 8 we identified the highest BMI, the age and sleeping problems as important drivers for becoming occupationally disabled. On the other hand, in particular a high educational level and a high number of years having quit smoking or being a full non-smoker have a positive influence on not becoming occupationally disabled.

Note that especially non-smokers are included in the "YearsQuitSmoking" variable with 25 years as a default. Further, having sleeping problems and being therewith under medical treatment could also be an effect of the occupational disability.

Still the investigation gives hints on how to adapt the application process in insurance companies. It is possible to optimize the questionary used in the application process to focus on the most relevant drivers for an improved customer journey in the application process. Second, preventive measurements might be concerned on customers with a higher risk of becoming occupationally disabled.

Appendix A

Neural Networks

For completion we add the modeling with neural networks in this Appendix in addition to the models described in Section 6. As the relationships in the data are mostly logistic the networks put no additional value in the approximation process in comparison to the other investigated models.

In addition, by comparison of the prediciton times, we can see that as expected neural networks take much more time in the model fitting process. Further, note that we do not include a hyperparameter tuning, i.e., optimization of the number of layers, activation functions or batch sizes but we use a fairly standard model structure and parameters in the fitting process.

Please uncomment the following code and switch from markdown to code.

A.1 Imports and Methods for Neural Networks

import keras
def train_predict_NN(model, X_train, y_train, X_test, y_test, BS = 0, EP = 0): 
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: income training set
       - X_test: features testing set
       - y_test: income testing set
       - BS: batch_size (in case model is a Neural Network)
       - EP: number of epochs (in case model is a Neural Networ)
    '''
    results={}
    # Fit the learner to the training data
    start = time() # Get start time
    if BS == 0:
        model = model.fit(X_train, y_train)
    else:

        model_history = model.fit(X_train, y_train, batch_size=BS, epochs=EP, validation_data=(X_test, y_test), verbose = 0) 
        plt.plot(model_history.history['accuracy'])
        plt.plot(model_history.history['val_accuracy'])
        plt.title('Model Accuracy')
        plt.ylabel('Accuracy')
        plt.xlabel('Epoch')
        plt.legend(['Train', 'Test'], loc='upper left')
        #plt.savefig('Regression_model_loss.png')
        plt.show()

        # Plot training & validation loss values
        plt.plot(model_history.history['loss'])
        plt.plot(model_history.history['val_loss'])
        plt.title('Model Loss')
        plt.ylabel('Loss')
        plt.xlabel('Epoch')
        plt.legend(['Train', 'Test'], loc='upper left')
        #plt.savefig('Regression_model_loss.png')
        plt.show()
    end = time() # Get end time

    # Calculate the training time
    results['time_train'] = end-start

    # Get the predictions on the test set(X_test)
    start = time() # Get start time
    predictions_test = np.round(model.predict(X_test))
    predictions_train = np.round(model.predict(X_train))    
    end = time() # Get end time

    # Calculate the total prediction time
    results['time_test'] = end-start
    #Compute accuracy on the training set
    results['acc_train'] = accuracy_score(y_train, predictions_train)
    # Compute precision on the training set 
    results['pre_train'] = precision_score(y_train, predictions_train)
    # Compute recall on the training set 
    results['rec_train'] = recall_score(y_train, predictions_train)
    # Compute F-score on the training set 
    results['f_train'] = fbeta_score(y_train, predictions_train, beta=10)
    # Compute accuracy on the test set 
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    # Compute precision on the test set 
    results['pre_test'] = precision_score(y_test, predictions_test)
    # Compute recall on the test set 
    results['rec_test'] = recall_score(y_test, predictions_test)
    # Compute F-score on the test set 
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=10)
    return results
def NN(NoOfFeatures, NoOfReturns, Layers, LossFct, MetricFct):
    model_in = keras.layers.Input(shape = (NoOfFeatures,))
    model = keras.layers.Dense(200, input_shape = (NoOfFeatures,), kernel_initializer = 'normal', activation = "softmax")(model_in)
    model = keras.layers.Dropout(0.4, noise_shape=None, seed=None)(model)
    for k in range(0,Layers-1):
        model = keras.layers.Dense(200, input_shape = (NoOfFeatures,), kernel_initializer = 'normal', activation = "softmax")(model)
        model = keras.layers.Dropout(0.4, noise_shape=None, seed=None)(model)

    model = keras.layers.Dense(NoOfReturns, activation = 'sigmoid')(model)
    model = keras.models.Model(model_in, model)
    model.compile(optimizer = 'adam', loss = LossFct, metrics = [MetricFct])
    #Model3.compile(optimizer = 'adam', loss = 'mse', metrics = ['accuracy'])#mean_squared_error
    #Model3.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])#mean_squared_error
    return model

A.2 Neural Networks Training

We are investigating a four layer neural network with binary crossentropy as loss function and standard binary accuracy.

NoOfFeatures = X_train.shape[1]
NoOfReturns = 1
Layers = 4
LossFct = 'binary_crossentropy'
MetricFct = 'accuracy' # 
model_bc = NN(NoOfFeatures, NoOfReturns, Layers, LossFct, MetricFct)

NNresults = train_predict_NN(model_bc, X_train, y_train, X_test, y_test, 60, 100)

show_results(model_bc, NNresults)
cm3 = metrics.confusion_matrix(y_test, np.round(model_bc.predict(X_test)))
model_bc_score = model_bc.evaluate(X_test, y_test)
plot_confusion_matrix(cm3)

NNresults_over = train_predict_NN(model_bc, X_over_train, y_over_train, X_test, y_test, 60, 100)

show_results(model_bc, NNresults_over)
cm3 = metrics.confusion_matrix(y_test, np.round(model_bc.predict(X_test)))
model_bc_score = model_bc.evaluate(X_test, y_test)
plot_confusion_matrix(cm3)

NNresults_down = train_predict_NN(model_bc, X_down_train, y_down_train, X_test, y_test, 60, 100)

show_results(model_bc, NNresults_down)
cm3 = metrics.confusion_matrix(y_test, np.round(model_bc.predict(X_test)))
model_bc_score = model_bc.evaluate(X_test, y_test)
plot_confusion_matrix(cm3)
In [ ]: