IBM Employee Attrition Prediction
Introduction
Source: https://github.com/ghvn7777/kaggle/blob/master/ibm_employee/predict_ibm_attrition.ipynb
Keeping employees happy and satisfied with the company is an age-old challenge. If you invest heavily in an employee who then leaves, you end up spending even more time and money hiring a replacement. In the spirit of Kaggle, let's build a predictive model on IBM's dataset to predict employee attrition at IBM.
This notebook covers the following:
- Exploratory Data Analysis: in this section we explore how the features in the dataset are distributed, how they relate to each other, and visualise them
- Feature Engineering and Categorical Encoding: perform some feature engineering and encode our categorical features as numerical variables
- Implementing Machine Learning models: implement a Random Forest and a Gradient Boosting model, then look at the feature importances reported by these models
Let's Go.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Import statements required for Plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from imblearn.over_sampling import SMOTE
import xgboost
# Import and suppress warnings
import warnings
warnings.filterwarnings('ignore')
1. Exploratory Data Analysis
Let's load the dataset with Pandas and take a quick look at the first few rows, paying particular attention to the Attrition column.
attrition = pd.read_csv('./inputs/WA_Fn-UseC_-HR-Employee-Attrition.csv')
attrition.head()
Looking at the dataset, our target column is Attrition.
The data is a mix of categorical and numerical columns; the non-numerical categories will be encoded into numbers later on. For now let's explore the dataset, starting with a quick integrity check for null or infinite values.
Data quality checks
We can use the isnull() method to check for missing values:
#Looking for NaN
attrition.isnull().any()
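The check above covers missing values; for the infinite values mentioned earlier, here is a minimal sketch applying NumPy's isinf to the numeric columns only (select_dtypes and isinf are standard Pandas/NumPy calls):
# Check the numeric columns for infinite values as well
np.isinf(attrition.select_dtypes(include=[np.number])).any()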
Distribution of the dataset
One of the first steps is usually to explore how the features in the dataset are distributed. To do this, we call the kdeplot() function from the Seaborn plotting library and generate bivariate plots as follows:
# Plotting the KDEplots
f, axes = plt.subplots(3, 3, figsize=(10, 10), sharex=False, sharey=False)
# Defining our colormap scheme
# s was originally meant to drive the palette start values, but in the end they were specified by hand (0.333...)
#s = np.linspace(0, 3, 10) # 10 evenly spaced numbers in the interval [0, 3]
# Create a sequential palette; light is the intensity of the lightest colour in the palette (1 is the lightest),
# and as_cmap=True returns a matplotlib colormap
cmap = sns.cubehelix_palette(start=0.0, light=1, as_cmap=True)
# Generate and plot
x = attrition['Age'].values
y = attrition['TotalWorkingYears'].values
# Plot a univariate or bivariate kernel density estimate; shade=True fills the contours when the data is bivariate
# cut=5 extends the evaluation grid by 5 bandwidths (bw, another kdeplot parameter that controls how closely the estimate fits the data) beyond the extreme data points
# the larger cut is, the wider the plotted range and the more compressed the data appear
# the ax argument specifies which axes to draw on; the current axes are used by default
sns.kdeplot(x, y, cmap=cmap, shade=True, cut=5, ax=axes[0,0])
axes[0,0].set( title = 'Age against Total working years')
cmap = sns.cubehelix_palette(start=0.333333333333, light=1, as_cmap=True)
# Generate and plot
x = attrition['Age'].values
y = attrition['DailyRate'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[0,1])
axes[0,1].set( title = 'Age against Daily Rate')
cmap = sns.cubehelix_palette(start=0.666666666667, light=1, as_cmap=True)
# Generate and plot
x = attrition['YearsInCurrentRole'].values
y = attrition['Age'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[0,2])
axes[0,2].set( title = 'Years in role against Age')
cmap = sns.cubehelix_palette(start=1.0, light=1, as_cmap=True)
# Generate and plot
x = attrition['DailyRate'].values
y = attrition['DistanceFromHome'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[1,0])
axes[1,0].set( title = 'Daily Rate against DistancefromHome')
cmap = sns.cubehelix_palette(start=1.333333333333, light=1, as_cmap=True)
# Generate and plot
x = attrition['DailyRate'].values
y = attrition['JobSatisfaction'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[1,1])
axes[1,1].set( title = 'Daily Rate against Job satisfaction')
cmap = sns.cubehelix_palette(start=1.666666666667, light=1, as_cmap=True)
# Generate and plot
x = attrition['YearsAtCompany'].values
y = attrition['JobSatisfaction'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[1,2])
axes[1,2].set( title = 'Years at company against Job satisfaction')
cmap = sns.cubehelix_palette(start=2.0, light=1, as_cmap=True)
# Generate and plot
x = attrition['YearsAtCompany'].values
y = attrition['DailyRate'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[2,0])
axes[2,0].set( title = 'Years at company against Daily Rate')
cmap = sns.cubehelix_palette(start=2.333333333333, light=1, as_cmap=True)
# Generate and plot
x = attrition['RelationshipSatisfaction'].values
y = attrition['YearsWithCurrManager'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[2,1])
axes[2,1].set( title = 'Relationship Satisfaction vs years with manager')
cmap = sns.cubehelix_palette(start=2.666666666667, light=1, as_cmap=True)
# Generate and plot
x = attrition['WorkLifeBalance'].values
y = attrition['JobSatisfaction'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[2,2])
axes[2,2].set( title = 'WorklifeBalance against Satisfaction')
f.tight_layout()
# Define a dictionary for the target mapping
target_map = {'Yes':1, 'No':0}
# Use the pandas apply method to numerically encode our attrition target variable
attrition["Attrition_numerical"] = attrition["Attrition"].apply(lambda x: target_map[x])
attrition
Correlation of Features
The next exploratory tool is the correlation matrix. Plotting a correlation matrix gives a good overview of how the features relate to one another. In a Pandas dataframe, the corr() method computes the Pearson correlation coefficient (a statistic that measures the degree of linear correlation between two variables) for every pair of numerical columns.
Here I use the Heatmap() function from the Plotly library to plot the Pearson correlation matrix:
# creating a list of only numerical values
numerical = [u'Age', u'DailyRate', u'DistanceFromHome', u'Education', u'EmployeeNumber', u'EnvironmentSatisfaction',
u'HourlyRate', u'JobInvolvement', u'JobLevel', u'JobSatisfaction',
u'MonthlyIncome', u'MonthlyRate', u'NumCompaniesWorked',
u'PercentSalaryHike', u'PerformanceRating', u'RelationshipSatisfaction',
u'StockOptionLevel', u'TotalWorkingYears',
u'TrainingTimesLastYear', u'WorkLifeBalance', u'YearsAtCompany',
u'YearsInCurrentRole', u'YearsSinceLastPromotion',
u'YearsWithCurrManager']
data = [
go.Heatmap(
z= attrition[numerical].astype(float).corr().values, # Generating the Pearson correlation
x=attrition[numerical].columns.values,
y=attrition[numerical].columns.values,
colorscale='Viridis',
reversescale = False, # whether to reverse the colour scale
opacity = 1.0 # opacity
)
]
layout = go.Layout(
title='Pearson Correlation of numerical features',
xaxis = dict(ticks='', nticks=36),
yaxis = dict(ticks='' ),
width = 900, height = 700,
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='labelled-heatmap')
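Rather than reading values off the heatmap, here is a small sketch that ranks the most strongly correlated feature pairs, using only the numerical list defined above and Pandas/NumPy calls already imported:
# Keep the upper triangle of the correlation matrix so each pair appears once,
# then rank the pairs by the absolute value of their correlation
corr = attrition[numerical].corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
upper.unstack().dropna().abs().sort_values(ascending=False).head(10)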
# Refining our list of numerical variables
numerical = [u'Age', u'DailyRate', u'JobSatisfaction',
u'MonthlyIncome', u'PerformanceRating',
u'WorkLifeBalance', u'YearsAtCompany', u'Attrition_numerical']
#g = sns.pairplot(attrition[numerical], hue='Attrition_numerical', palette='seismic', diag_kind = 'kde',diag_kws=dict(shade=True))
#g.set(xticklabels=[])
2. Feature Engineering & Categorical Encoding
Having done a brief exploration of the dataset, we now move on to feature engineering and the numerical encoding of the categorical columns. Feature engineering, simply put, is creating new features and relationships from the features we already have, and it is a very important step.
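As a concrete, purely illustrative example of a derived feature, one could combine two existing columns into a ratio; the tenure_ratio name below is hypothetical and is not used anywhere else in this notebook:
# Hypothetical derived feature: fraction of the total career spent at the current company
# (adding 1 to the denominator avoids division by zero)
tenure_ratio = attrition['YearsAtCompany'] / (attrition['TotalWorkingYears'] + 1)
tenure_ratio.head(3)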
To begin, we use the dtype attribute to separate the numerical columns from the categorical ones:
attrition
# Drop the Attrition_numerical column from attrition dataset first - Don't want to include that
attrition = attrition.drop(['Attrition_numerical'], axis=1)
# Empty list to store columns with categorical data
categorical = []
for col, value in attrition.items():
if value.dtype == 'object':
categorical.append(col)
# Store the numerical columns in a list numerical
print(categorical)
numerical = attrition.columns.difference(categorical)
numerical
Having identified which features contain categorical data, we can encode them numerically using the Pandas get_dummies() method.
# Store the categorical data in a dataframe called attrition_cat
attrition_cat = attrition[categorical] # extract the non-numerical (categorical) columns
attrition_cat = attrition_cat.drop(['Attrition'], axis=1) # Dropping the target column
print(attrition_cat)
Apply the get_dummies() method to do the encoding automatically; the following code lets us conveniently inspect the encoded result:
attrition_cat = pd.get_dummies(attrition_cat)
attrition_cat.head(3)
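A quick optional check of how much wider the encoding made the data, using only standard Python and the objects defined above:
# Compare the number of categorical columns (excluding the target) with the number of dummy columns produced
print(len(categorical) - 1, 'categorical columns expanded into', attrition_cat.shape[1], 'dummy columns')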
Extract the numerical columns:
# Store the numerical features to a dataframe attrition_num
attrition_num = attrition[numerical]
Having encoded the non-numerical variables and extracted the numerical ones, we now concatenate them into the final training data:
# Concat the two dataframes together columnwise
attrition_final = pd.concat([attrition_num, attrition_cat], axis=1)
Target variable
Finally we need the target variable, given by the Attrition column. We encode it numerically, with 1 for Yes and 0 for No:
# Define a dictionary for the target mapping
target_map = {'Yes':1, 'No':0}
# Use the pandas apply method to numerically encode our attrition target variable
target = attrition["Attrition"].apply(lambda x: target_map[x])
target.head(3)
However, if we check the counts of Yes and No we find that the data is heavily imbalanced:
data = [go.Bar(
x=attrition["Attrition"].value_counts().index.values,
y= attrition["Attrition"].value_counts().values
)]
py.iplot(data, filename='basic-bar')
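The exact proportions make the skew explicit; value_counts with normalize=True (standard Pandas) returns the class fractions:
# Fraction of employees in each attrition class
attrition['Attrition'].value_counts(normalize=True)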
# Import the train_test_split method
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
# Split data into train and test sets as well as for validation and testing
train, test, target_train, target_val = train_test_split(attrition_final, target, train_size= 0.75,random_state=0);
#train, test, target_train, target_val = StratifiedShuffleSplit(attrition_final, target, random_state=0);
SMOTE to oversample due to the skewness in target
Since we have noted the imbalance in the target variable, let's oversample the minority class with SMOTE from the imblearn package:
oversampler=SMOTE(random_state=0)
smote_train, smote_target = oversampler.fit_resample(train, target_train)
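To confirm what the oversampling did, here is a quick check of the class counts before and after SMOTE, assuming target_train and smote_target from the cells above:
# Class distribution before and after oversampling
print(pd.Series(target_train).value_counts())
print(pd.Series(smote_target).value_counts())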
seed = 0 # We set our random seed to zero for reproducibility
# Random Forest parameters
rf_params = {
'n_jobs': -1,
'n_estimators': 800,
'warm_start': True,
'max_depth': 9,
'min_samples_leaf': 2,
'max_features' : 'sqrt',
'random_state' : seed,
'verbose': 0
}
We can use scikit-learn's RandomForestClassifier() to initialise the Random Forest, passing in the parameters:
rf = RandomForestClassifier(**rf_params)
Let's start training:
rf.fit(smote_train, smote_target)
print("Fitting of Random Forest as finished")
Now we can make predictions on the test data:
rf_predictions = rf.predict(test)
print("Predictions finished")
Score the predictions:
accuracy_score(target_val, rf_predictions)
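Given the class imbalance noted earlier, accuracy alone can be misleading; here is a short sketch of some additional scikit-learn metrics on the same predictions:
from sklearn.metrics import classification_report, confusion_matrix
# Per-class precision/recall and the confusion matrix give a fuller picture than accuracy
print(classification_report(target_val, rf_predictions, target_names=['No', 'Yes']))
print(confusion_matrix(target_val, rf_predictions))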
# Scatter plot
trace = go.Scatter(
y = rf.feature_importances_,
x = attrition_final.columns.values,
mode='markers',
marker=dict(
sizemode = 'diameter',
sizeref = 1,
size = 13,
#size= rf.feature_importances_,
#color = np.random.randn(500), #set color equal to a variable
color = rf.feature_importances_,
colorscale='Portland',
showscale=True
),
text = attrition_final.columns.values
)
data = [trace]
layout= go.Layout(
autosize= True,
title= 'Random Forest Feature Importance',
hovermode= 'closest',
xaxis= dict(
ticklen= 5,
showgrid=False,
zeroline=False,
showline=False
),
yaxis=dict(
title= 'Feature Importance',
showgrid=False,
zeroline=False,
ticklen= 5,
gridwidth= 2
),
showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')
Most important RF features: OverTime, Marital Status
The plot above shows which features matter most: the algorithm ranks the OverTime feature highest, followed by marital status.
I don't know which factors matter most to you, but for me overtime certainly affects how satisfied I am with my job, so perhaps we shouldn't be surprised that the classifier, having learned the target, ranks overtime as the most important feature.
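To read the ranking off directly instead of from the scatter plot, here is a small sketch listing the ten most important features, using rf and attrition_final as defined above:
# Pair every column with its importance score and show the ten largest
feat_importances = pd.Series(rf.feature_importances_, index=attrition_final.columns)
feat_importances.sort_values(ascending=False).head(10)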
Visualising Tree Diagram with Graphviz
Let's visualise a tree. We can fit a single DecisionTreeClassifier on our features and use the export_graphviz() function to render it as a png image:
from sklearn import tree
from IPython.display import Image as PImage
from subprocess import check_call
from PIL import Image, ImageDraw, ImageFont
import re
decision_tree = tree.DecisionTreeClassifier(max_depth = 4)
decision_tree.fit(train, target_train)
# Predicting results for test dataset
y_pred = decision_tree.predict(test)
# Export our trained model as a .dot file
with open("tree1.dot", 'w') as f:
f = tree.export_graphviz(decision_tree,
out_file=f,
max_depth = 4,
impurity = False,
feature_names = attrition_final.columns.values,
class_names = ['No', 'Yes'],
rounded = True,
filled= True )
#Convert .dot to .png to allow display in web notebook
check_call(['dot','-Tpng','tree1.dot','-o','tree1.png'])
# Annotating chart with PIL
img = Image.open("tree1.png")
draw = ImageDraw.Draw(img)
img.save('sample-out.png')
PImage("sample-out.png")
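The single tree's predictions (y_pred above) were never scored; a quick check gives a baseline to compare against the ensemble models, with target_val being the same hold-out target used earlier:
# Accuracy of the single depth-4 decision tree on the test split
accuracy_score(target_val, y_pred)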
B. Gradient Boosted Classifier
Gradient boosting is an ensemble technique which, much like the Random Forest, combines a collection of weak tree learners into a single strong learner. The technique involves defining some method (algorithm) for minimising a loss function; as the name suggests, that method is gradient descent, stepping in the direction that reduces the value of the loss function.
Using the Gradient Boosted classifier in sklearn is very simple and takes only a few lines of code. We first set up the classifier's parameters:
Initialising Gradient Boosting Parameters
Generally speaking, there are a few key parameters when setting up a gradient boosting classifier: the number of estimators, the maximum depth of the model, and the minimum number of samples per leaf.
# Gradient Boosting Parameters
gb_params ={
'n_estimators': 500,
'learning_rate' : 0.2,
'max_depth': 11,
'min_samples_leaf': 2,
'subsample': 1,
'max_features' : 'sqrt',
'random_state' : seed,
'verbose': 0
}
With the parameters defined, we can train, predict and score:
gb = GradientBoostingClassifier(**gb_params)
# Fit the model to our SMOTEd train and target
gb.fit(smote_train, smote_target)
# Get our predictions
gb_predictions = gb.predict(test)
print("Predictions have finished")
accuracy_score(target_val, gb_predictions)
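For a side-by-side view of the two ensembles on the same hold-out split, here is a short sketch using the predictions computed above:
# Compare the two classifiers on the same test data
print('Random Forest accuracy    : {:.4f}'.format(accuracy_score(target_val, rf_predictions)))
print('Gradient Boosting accuracy: {:.4f}'.format(accuracy_score(target_val, gb_predictions)))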
Feature Ranking via the Gradient Boosting Model
Let's look at which features are most important to the Gradient Boosting model:
# Scatter plot
trace = go.Scatter(
y = gb.feature_importances_,
x = attrition_final.columns.values,
mode='markers',
marker=dict(
sizemode = 'diameter',
sizeref = 1,
size = 13,
#size= rf.feature_importances_,
#color = np.random.randn(500), #set color equal to a variable
color = gb.feature_importances_,
colorscale='Portland',
showscale=True
),
text = attrition_final.columns.values
)
data = [trace]
layout= go.Layout(
autosize= True,
title= 'Gradient Boosting Model Feature Importance',
hovermode= 'closest',
xaxis= dict(
ticklen= 5,
showgrid=False,
zeroline=False,
showline=False
),
yaxis=dict(
title= 'Feature Importance',
showgrid=False,
zeroline=False,
ticklen= 5,
gridwidth= 2
),
showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter')