10. Example - Xgboost, a state-of-the-art ensemble method for structured data#

Xgboost is one of the latest implementations of gradient boosting, with many optimisations compared to earlier gradient boosting libraries.
https://xgboost.readthedocs.io/en/latest/

In this example, we are predicting abnormal Tobin’s Q using board characteristics.

10.1. Libraries#

Pandas and Numpy for data processing.
Scipy and Statsmodels for statistical analysis.
Matplotlib for plotting.
Xgboost for the gradient boosting model and Shap for interpreting it.

import xgboost
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import shap
import scipy.stats as ss
import statsmodels.api as sm
plt.xkcd()

A function to check the normality of a dataset. It draws a histogram with a fitted normal curve and calculates the Kolmogorov-Smirnov test statistic for the data.

# Function to check normality: draws a histogram with a fitted normal curve
# and prints the Kolmogorov-Smirnov test statistic and p-value
def check_normality(data,bins=50):
    # Drop missing values
    mask = ~np.isnan(data)
    temp1 = data[mask]
    # Evaluate the fitted normal pdf between the 1st and 99th percentiles
    x = np.linspace(np.percentile(temp1,1),np.percentile(temp1,99),100)
    plt.hist(temp1, density=True,bins=bins)
    plt.plot(x,ss.norm.pdf(x,np.mean(temp1),np.std(temp1)))
    print("Kolmogorov-Smirnov: " + str(sm.stats.diagnostic.kstest_normal(temp1)))

Data preparation, which includes:

  • Loading data from a csv-file

  • Removing missing y-values

  • Loading the abnormal Tobin’s Q values into y_df

  • Loading the board characteristic variables to the x_df dataframe

  • Winsorizing the x data, meaning that the most extreme values are replaced with the 1st and 99th percentile values (a quick check of the effect is shown after the code below).

# Prepare data.
errors_df = pd.read_csv("AbnormalQ.csv",delimiter=";")
errors_df = errors_df.rename(columns = {'BOARD_GENDER_DIVERSITY_P' : 'BOARD_GENDER_DIVERSITY','BOARD_MEMBER_COMPENSATIO' : 'BOARD_MEMBER_COMPENSATION'})
# Remove missing abQ values
errors_df = errors_df.loc[~errors_df['ABN_TOBIN_POIKKILEIK'].isna()]
y_df = errors_df['ABN_TOBIN_POIKKILEIK']
x_df = errors_df[['BOARD_GENDER_DIVERSITY',
       'BOARD_MEETING_ATTENDANCE', 'BOARD_MEMBER_AFFILIATION',
       'BOARD_MEMBER_COMPENSATION', 'BOARD_SIZE',
       'CEO_BOARD_MEMBER', 'CHAIRMAN_IS_EX_CEO',
       'INDEPENDENT_BOARD_MEMBER',
       'NUMBER_OF_BOARD_MEETINGS',
       'AVERAGE_BOARD_TENURE']]
# Winsorize x-data
x_df_wins = x_df.clip(lower=x_df.quantile(0.01), upper=x_df.quantile(0.99), axis = 1)
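As a quick check of the winsorization, the raw and clipped extremes can be compared; a small sketch using the dataframes defined above:

# Compare raw and winsorized maxima; the clipped maxima equal the 99th percentiles
pd.DataFrame({'raw max': x_df.max(), 'winsorized max': x_df_wins.max()}).round(2)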

Calculate the descriptive statistics of the x variables (board characteristics).

x_df.describe().transpose()
                             count        mean          std      min         25%         50%         75%            max
BOARD_GENDER_DIVERSITY      3856.0       17.90        10.66     0.00       11.11       18.18       25.00          75.00
BOARD_MEETING_ATTENDANCE    3691.0       78.92         8.83     5.00       75.00       75.00       75.00         100.00
BOARD_MEMBER_AFFILIATION    3856.0        1.00         0.63     0.00        0.58        0.92        1.33           8.89
BOARD_MEMBER_COMPENSATION   3799.0  2942600.00  31414240.00   388.00  1570763.00  2093063.00  2788399.50  1929824000.00
BOARD_SIZE                  3856.0        9.92         2.10     1.00        9.00       10.00       11.00          17.00
CEO_BOARD_MEMBER            3838.0        0.98         0.13     0.00        1.00        1.00        1.00           1.00
CHAIRMAN_IS_EX_CEO          3857.0        0.62         0.48     0.00        0.00        1.00        1.00           1.00
INDEPENDENT_BOARD_MEMBER    3850.0       82.20         9.62    33.33       77.78       84.62       90.00         100.00
NUMBER_OF_BOARD_MEETINGS    3767.0        7.92         3.54     1.00        6.00        7.00        9.00          43.00
AVERAGE_BOARD_TENURE        3814.0        9.60         3.67     0.07        7.23        9.15       11.58          28.86

An example of the check_normality function, applied to the abnormal Tobin’s Q values (y_df).

check_normality(y_df)
Kolmogorov-Smirnov: (0.04520799160648903, 0.0009999999999998899)
[Figure: histogram of the abnormal Tobin’s Q values with a fitted normal curve]

Convert the data to an xgboost DMatrix, which is a more computationally efficient data structure.

dtrain = xgboost.DMatrix(x_df_wins, label=y_df, nthread = -1)

The most difficult part of using xgboost for prediction is the tuning of hyperparameters. There is no good theory to guide parameter optimisation, and it is more dark magic than science. You can use grid-search approaches (a small sketch follows the parameter code below), but they need A LOT of computing power. The parameters that we fine-tune are:

  • num_boost_round: The number of decision trees.

  • max_depth: The depth of the trees.

  • eta: The learning rate, i.e. the weight given to each added tree.

  • subsample: The fraction of observations that is randomly sampled for each boosting round.

  • colsample_bytree: The fraction of features that is randomly sampled for each tree (sampling can also be done by level and by node).

  • min_child_weight: A regularisation parameter. For squared-error regression it corresponds to the minimum number of observations required in a leaf.

  • gamma: A regularisation parameter: the minimum loss reduction required to make a further split.

There are many other tunable parameters in the model, but we leave them at their default values.

m_depth = 5
eta = 0.02
ssample = 0.8
col_tree = 0.8
m_child_w = 1
gam = 0.1
param = {'max_depth': m_depth, 'eta': eta, 'subsample': ssample, 'colsample_bytree': col_tree, 'min_child_weight' : m_child_w, 'gamma' : gam}
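As noted above, a grid search over these hyperparameters is possible but heavy. Below is a minimal sketch of what such a search could look like with xgboost's built-in cross-validation; the grid values are arbitrary examples, not recommendations.

# Illustrative sketch of a very small grid search over two hyperparameters.
# A realistic grid would be much larger and correspondingly slower to evaluate.
results = []
for depth in [3, 5, 7]:
    for lr in [0.01, 0.02, 0.05]:
        cand = dict(param, max_depth=depth, eta=lr)
        cv_res = xgboost.cv(cand, dtrain, num_boost_round=300, nfold=5, seed=10)
        results.append((depth, lr, cv_res['test-rmse-mean'].min()))
# Candidate with the lowest cross-validated RMSE
print(min(results, key=lambda r: r[2]))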

The xgboost.cv function can be used to search for the optimal number of trees in the model. We look for the number of trees that achieves the lowest root mean squared error (RMSE). In this case, the optimal number of trees appears to be around 600.

temp = xgboost.cv(param,dtrain,num_boost_round=900,nfold=5,seed=10)
plt.plot(temp['test-rmse-mean'][200:900])
plt.show()
_images/db50f82135feaa8ada4b2780a1b8bc546cea8be858489f7307f809d365ef358a.png
b_rounds = 600
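Instead of reading the optimum off the plot, the best number of rounds can also be picked directly from the cross-validation results; a small sketch using the temp dataframe computed above:

# Round with the lowest mean test RMSE (idxmin returns the 0-based round index)
best_round = int(temp['test-rmse-mean'].idxmin()) + 1
print(best_round, round(temp['test-rmse-mean'].min(), 4))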

The model is trained using the train function. The number of trees was set to 600 based on the cross-validation results.

bst = xgboost.train(param,dtrain,num_boost_round=b_rounds)
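The trained booster can be used for prediction with its predict method; a quick in-sample sanity check of the fit, using the objects defined above (in-sample error is optimistic compared to the cross-validated RMSE):

# In-sample predictions and RMSE for the trained model
preds = bst.predict(dtrain)
rmse = np.sqrt(np.mean((y_df.values - preds) ** 2))
print('In-sample RMSE:', round(rmse, 4))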

In the following, we analyse the results of the model using methods of explainable AI.

Histogram of the split points in the trees for the variable board gender diversity. Notice that this information cannot be used to decide what the optimal percentage is. The model just tries to separate companies according to their market performance, and this is not necessarily achieved by splitting the data at the “optimal” feature values.

# Histogram of split points
split_bars = bst.get_split_value_histogram(feature='BOARD_GENDER_DIVERSITY',bins = 40)
plt.bar(split_bars['SplitValue'],split_bars['Count'])
plt.xlabel('Gender diversity')
plt.ylabel('Count')
plt.show()
[Figure: histogram of split points for gender diversity]

The count-weighted average of the split points for each feature.

# Count-weighted average of the split points for each feature
aver_split_points = []
for feat in bst.feature_names:
    split_bars = bst.get_split_value_histogram(feature=feat,as_pandas=False)
    # Columns of split_bars: split value, count
    temp = sum(split_bars[:,0]*split_bars[:,1])/(sum(split_bars[:,1]))
    aver_split_points.append(temp)
aver_df = pd.DataFrame()
aver_df['averages'] = aver_split_points
aver_df.index = bst.feature_names
aver_df.round(2)
averages
BOARD_GENDER_DIVERSITY 20.40
BOARD_MEETING_ATTENDANCE 102.42
BOARD_MEMBER_AFFILIATION 0.98
BOARD_MEMBER_COMPENSATION 2289342.12
BOARD_SIZE 9.28
CEO_BOARD_MEMBER 1.28
CHAIRMAN_IS_EX_CEO 1.00
INDEPENDENT_BOARD_MEMBER 77.93
NUMBER_OF_BOARD_MEETINGS 10.33
AVERAGE_BOARD_TENURE 9.58

Basic metrics that can be used to measure feature importance are weight, gain and cover. These metrics have many issues; for example, the weight metric undervalues binary features. Below is their explanation from the xgboost documentation:

  • ‘weight’: the number of times a feature is used to split the data across all trees.

  • ‘gain’: the average gain across all splits the feature is used in.

  • ‘cover’: the average coverage across all splits the feature is used in.

xgboost.plot_importance(bst,importance_type='weight',xlabel='weight',show_values=False,grid=False)
xgboost.plot_importance(bst,importance_type='gain',xlabel='gain',show_values=False,grid=False)
xgboost.plot_importance(bst,importance_type='cover',xlabel='cover',show_values=False,grid=False)
plt.show()
[Figures: feature importance ranked by weight, gain and cover]
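The same importance scores can also be read out as numbers instead of plots; a small sketch using the booster's get_score method:

# Importance scores as a dataframe, one column per importance type
imp = pd.DataFrame({t: bst.get_score(importance_type=t) for t in ['weight', 'gain', 'cover']})
imp.sort_values('gain', ascending=False).round(2)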

10.1.1. SHAP analysis#

Because the basic importance metrics have many weaknesses, the machine learning community is currently doing a lot of research on better ways to analyse machine learning models. One recent innovation is SHAP values (slundberg/shap).

j=0
shap.initjs()

Below, shap values are calculated for the model.

explainerXGB = shap.TreeExplainer(bst)
shap_values_XGB = explainerXGB.shap_values(x_df_wins,y_df,check_additivity = False)

A summary plot of the most important features. Of the board characteristics, the most important explainers of market performance are the number of board meetings, board member compensation, average board tenure and board member affiliation. In this example, the plotted values are the mean absolute SHAP values, i.e. the average absolute effect of each feature on the company’s abnormal Tobin’s Q.

shap.summary_plot(shap_values_XGB,x_df_wins,plot_type='bar')
[Figure: SHAP summary bar plot of mean absolute SHAP values]

Present the mean absolute SHAP values as a dataframe.

shaps = np.mean(abs(shap_values_XGB), axis = 0)
names = bst.feature_names
apu_df = pd.DataFrame()
apu_df['names'] = names
apu_df['shaps'] = shaps
apu_df
names shaps
0 BOARD_GENDER_DIVERSITY 0.014892
1 BOARD_MEETING_ATTENDANCE 0.007698
2 BOARD_MEMBER_AFFILIATION 0.020881
3 BOARD_MEMBER_COMPENSATION 0.021699
4 BOARD_SIZE 0.007537
5 CEO_BOARD_MEMBER 0.001149
6 CHAIRMAN_IS_EX_CEO 0.007380
7 INDEPENDENT_BOARD_MEMBER 0.012343
8 NUMBER_OF_BOARD_MEETINGS 0.021496
9 AVERAGE_BOARD_TENURE 0.019675

The above SHAP plot contains no information about the nature of the association. Is it positive or negative? The number of board meetings has one of the largest effects, but does it increase or decrease abnormal Tobin’s Q? This is estimated below: the features are standardised and multiplied element-wise with the SHAP values, and the average of this product indicates the direction of the association.

# Standardise the features to zero mean and unit variance
stand_feats = (x_df_wins-x_df_wins.mean(axis = 0))/x_df_wins.std(axis = 0)
# Element-wise product of standardised features and SHAP values;
# a positive average indicates a positive association and vice versa
std_feat_times_shaps = np.multiply(shap_values_XGB,stand_feats)
dir_metric = np.mean(std_feat_times_shaps, axis = 0)
plt.barh(range(10),dir_metric,tick_label = bst.feature_names)
plt.show()
[Figure: bar chart of the direction metric for each feature]

A more detailed analysis of the nature of the association can be achieved with scatter plots of the SHAP values. Below are dependence plots for board gender diversity (feature 0) and the number of board meetings (feature 8).

shap.dependence_plot(0,shap_values_XGB,x_df)
shap.dependence_plot(8,shap_values_XGB,x_df)
[Figures: SHAP dependence plots for BOARD_GENDER_DIVERSITY and NUMBER_OF_BOARD_MEETINGS]

Scatter plots for all the variables.

fig, axs = plt.subplots(5,2,figsize=(10,16),squeeze=True)
ind = 0
for ax,feat in zip(axs.flat,x_df_wins.columns):
    ax.scatter(x_df_wins[feat],shap_values_XGB[:,ind],s=1)
    ax.set_ylim([-0.15,0.15])
    ax.set_title(feat)
    ind+=1
plt.subplots_adjust(hspace=0.4)
plt.show()
[Figure: SHAP value scatter plots for all ten board variables]

Below is an analysis where two subgroups are compared to each other. The subgroups are formed using the extreme values of the features.
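As an illustration of the idea, one simple way to compare such subgroups is to contrast the mean SHAP values of companies in the bottom and top deciles of a feature; the feature and decile cutoffs in the sketch below are arbitrary example choices.

# Illustrative sketch: mean SHAP values for the bottom and top decile of one feature
feat = 'NUMBER_OF_BOARD_MEETINGS'
col = x_df_wins.columns.get_loc(feat)
low = x_df_wins[feat] <= x_df_wins[feat].quantile(0.10)
high = x_df_wins[feat] >= x_df_wins[feat].quantile(0.90)
print('Mean SHAP, low group: ', round(shap_values_XGB[low.values, col].mean(), 4))
print('Mean SHAP, high group:', round(shap_values_XGB[high.values, col].mean(), 4))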