15. Deep learning example from Accounting/Finance#

The following is a simple example of deep learning that uses accounting/finance data. It demonstrates how to apply a deep learning model to traditional structured data. However, it also shows that deep learning is usually not the best option for structured data with relatively small datasets (<100k observations); deep learning models perform better with large unstructured datasets.

import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

The data has a little under 20k observations. The variables are different financial ratios and board characteristics of S&P 1500 companies.

compu_df = pd.read_csv('_data.txt',delimiter='\t')
compu_df
GVKEY datadate fyear cusip conm act at bkvlps capx ceq ... ind_chairman_is_ex_ceo ind_independent_board_members ind_strictly_independent_board_members ind_board_member_affiliations ind_non_executive_board_members ind_board_gender_diversity_percent ind_board_specific_skills_percent ind_executive_members_gender_diversity_percent ind_average_board_tenure ind_board_member_compensation
0 21542 20081231 2008 000360206 AAON INC 80.118 140.743 5.6088 9.610 96.522 ... 0.0 87.500 57.140 1.425 90.91 13.395 50.61 7.735 8.270 1650394.0
1 21542 20091231 2009 000360206 AAON INC 96.240 156.211 6.8544 9.774 117.999 ... 0.0 87.500 50.000 1.180 90.00 12.915 60.00 6.670 8.705 1590889.5
2 21542 20101231 2010 000360206 AAON INC 91.748 160.277 7.0725 17.470 116.739 ... 0.0 84.620 51.925 1.000 90.00 11.110 58.33 9.090 8.780 1801674.0
3 21542 20111231 2011 000360206 AAON INC 84.387 178.981 4.9762 35.914 122.504 ... 0.0 86.670 50.000 1.090 90.00 11.110 57.14 10.000 9.180 1847006.5
4 21542 20121231 2012 000360206 AAON INC 91.546 193.493 5.6341 14.147 138.136 ... 0.0 87.500 50.000 1.190 90.00 14.290 54.55 9.090 9.170 1810953.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19632 28191 20151231 2015 V7780T103 ROYAL CARIBBEAN GROUP 837.022 20921.855 36.9876 1613.340 8063.039 ... 1.0 84.620 50.000 0.905 83.33 16.670 53.85 12.500 8.880 1744895.0
19633 28191 20161231 2016 V7780T103 ROYAL CARIBBEAN GROUP 748.305 22310.324 42.5054 2494.363 9121.412 ... 1.0 83.330 50.000 0.880 83.33 16.670 53.85 14.290 9.340 1737800.0
19634 28191 20171231 2017 V7780T103 ROYAL CARIBBEAN GROUP 843.028 22296.317 50.1659 564.138 10702.303 ... 1.0 83.330 50.000 0.885 83.33 20.000 57.14 13.395 9.105 1793588.5
19635 28191 20181231 2018 V7780T103 ROYAL CARIBBEAN GROUP 1242.044 27698.270 53.1319 3660.028 11105.461 ... 1.0 85.710 50.000 0.745 81.82 22.220 58.33 14.290 9.180 1858984.0
19636 28191 20191231 2019 V7780T103 ROYAL CARIBBEAN GROUP 1162.628 30320.284 58.2557 3024.663 12163.846 ... 1.0 85.165 48.075 0.800 80.00 25.000 60.00 15.190 9.195 1884002.5

19637 rows × 102 columns

We use only variables with few missing values, because we want to keep as many observations as possible; the benefits of deep learning are seen in large datasets. The variables are different financial ratios (Compustat). Many of the included variables are highly correlated, but this should not be a serious issue for neural networks.

variables = ['at', 'bkvlps','capx', 'ceq', 'csho', 'cstk', 'dlc', 'dltt', 'dvc', 'ebit',
       'ibc', 'icapt', 'lt', 'ni', 'pstk', 'pstkl','pstkrv', 're', 'sale', 'seq', 'costat', 'prcc_c',
       'prcc_f', 'sic', 'mkvalt','tobin', 'yld', 'age', 'tridx', 'mb',
       'cap_int', 'lvg', 'roa', 'roe', 'roi']
compu_df[variables].isna().sum()
at           3
bkvlps      34
capx        53
ceq          3
csho        11
cstk        13
dlc          9
dltt        81
dvc         77
ebit        15
ibc         17
icapt        4
lt          49
ni           7
pstk         4
pstkl       37
pstkrv      19
re          12
sale         7
seq          3
costat       0
prcc_c       8
prcc_f       6
sic          0
mkvalt      13
tobin       13
yld         85
age          0
tridx      172
mb          36
cap_int     53
lvg         86
roa          7
roe         16
roi          8
dtype: int64
The variables are winsorized at the 1st and 99th percentiles to limit the influence of outliers.

compu_df[variables] = compu_df[variables].clip(lower=compu_df[variables].quantile(0.01),
                                               upper=compu_df[variables].quantile(0.99),axis=1) # cap each variable at its 1st/99th percentile
compu_df['current_roa'] = compu_df['roa'] # preserve the unlagged ROA as the prediction target

Everything else is lagged by one year within each company.

compu_df[variables] = compu_df.groupby(['conm']).shift()[variables] # shift() lags the values by one row within each company
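To sanity-check the lag, we can inspect one company's rows: the predictors should now hold last year's values, while fyear and current_roa stay in place (AAON INC is simply the first firm in the data above).

# The first row per company should now be NaN, the rest lagged by one year
compu_df[compu_df['conm'] == 'AAON INC'][['fyear', 'at', 'current_roa']].head()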

We have to drop all missing values, because otherwise we cannot optimize the network with the gradient descent algorithm.

I add the industry, SP500 dummy and year variables separately, as I do not want to winsorize or lag them.

compu_df[variables + ['fyear','ind','sp500','current_roa']] = compu_df[variables + ['fyear','ind','sp500','current_roa']].dropna()

Note that assigning the result of dropna() back like this does not actually remove any rows: pandas aligns the result on the index, so rows with any missing value get NaN in all of the listed columns, including fyear. They are removed for good when we filter on fyear below.
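A more direct alternative would be to drop the incomplete rows outright; a sketch (not run here, so the output below still shows the NaN-masked rows):

# Alternative: remove incomplete rows immediately instead of NaN-masking them
# compu_df = compu_df.dropna(subset=variables + ['fyear','ind','sp500','current_roa'])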
compu_df[variables + ['fyear','current_roa']].head(30)
at bkvlps capx ceq csho cstk dlc dltt dvc ebit ... age tridx mb cap_int lvg roa roe roi fyear current_roa
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 140.743 5.6088 9.610 96.522 17.209 0.071 2.992 0.000 5.621 43.388 ... 204.0 1.070373 3.722721 0.068280 0.030998 0.203129 0.079563 0.296192 2009.0 0.177459
2 156.211 6.8544 9.774 117.999 17.215 0.071 0.076 0.000 6.201 43.754 ... 216.0 1.017457 2.843429 0.062569 0.000644 0.177459 0.082621 0.234926 2010.0 0.136601
3 160.277 7.0725 17.470 116.739 16.506 0.068 0.000 0.000 6.067 32.715 ... 228.0 1.494462 3.988689 0.108999 0.000000 0.136601 0.047020 0.187547 2011.0 0.078142
4 178.981 4.9762 35.914 122.504 24.618 0.098 4.575 0.000 5.935 23.971 ... 240.0 1.646141 4.117600 0.200658 0.037346 0.078142 0.027727 0.114168 2012.0 0.141860
5 193.493 5.6341 14.147 138.136 24.518 0.098 0.000 0.000 8.840 44.238 ... 252.0 1.706581 3.704230 0.073114 0.000000 0.141860 0.053644 0.198710 2013.0 0.174277
6 215.444 4.4702 9.041 164.106 36.711 0.147 0.000 0.000 7.428 55.803 ... 264.0 3.949485 7.147331 0.041965 0.000000 0.174277 0.032012 0.228797 2014.0 0.189424
7 233.117 3.2208 16.127 174.059 54.042 0.216 0.000 0.000 9.656 71.563 ... 276.0 4.185799 6.951689 0.069180 0.000000 0.189424 0.036494 0.253696 2015.0 0.196381
8 232.854 3.3750 20.967 178.918 53.012 0.212 0.000 0.000 11.857 71.695 ... 288.0 4.381591 6.880000 0.090044 0.000000 0.196381 0.037149 0.255581 2016.0 0.208069
9 256.530 3.9106 26.604 205.898 52.651 0.211 0.000 0.000 12.676 79.574 ... 300.0 6.286182 8.451389 0.103707 0.000000 0.208069 0.030674 0.259235 2017.0 0.183631
10 296.780 4.5252 41.713 237.226 52.423 0.210 0.000 0.000 13.653 74.148 ... 312.0 7.030023 8.110139 0.140552 0.000000 0.183631 0.028326 0.229730 2018.0 0.138132
11 308.197 4.7604 37.268 247.499 51.991 0.208 0.000 0.000 16.717 57.678 ... 324.0 6.776646 7.364927 0.120923 0.000000 0.138132 0.023355 0.172009 2019.0 0.144608
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 1377.511 16.8937 27.535 656.895 38.884 44.201 63.600 392.984 0.000 125.529 ... 444.0 0.386537 1.089755 0.019989 0.695064 0.057096 0.109870 0.074914 2009.0 0.029731
14 1501.042 18.9167 28.855 746.906 39.484 44.870 100.833 336.191 0.000 95.415 ... 456.0 0.518011 1.214800 0.019223 0.585112 0.029731 0.049185 0.041225 2010.0 0.040984
15 1703.727 21.0112 124.879 835.845 39.781 44.986 114.075 329.802 2.983 137.016 ... 468.0 0.695924 1.307398 0.073298 0.531052 0.040984 0.063897 0.059932 2011.0 0.030844
16 2195.653 21.4697 91.218 864.649 40.273 44.849 122.865 669.489 12.081 142.360 ... 480.0 0.322468 0.892886 0.041545 0.916388 0.030844 0.087720 0.044105 2012.0 0.025738
17 2136.900 23.3254 37.600 918.600 39.382 44.700 86.400 622.200 11.900 136.600 ... 492.0 0.546782 0.800844 0.017596 0.771391 0.025738 0.074763 0.035675 2013.0 0.033144
18 2199.500 25.2654 26.500 999.500 39.560 44.700 69.700 564.300 11.800 142.600 ... 504.0 0.669914 1.108631 0.012048 0.634317 0.033144 0.065790 0.046581 2014.0 0.006733
19 1515.000 23.8574 46.300 845.100 35.423 44.900 69.000 85.000 11.900 -8.600 ... 516.0 0.823134 1.164419 0.030561 0.182227 0.006733 0.010365 0.010967 2015.0 0.033077
20 1442.100 25.0847 88.400 865.800 34.515 44.900 12.000 136.100 10.400 66.100 ... 528.0 0.688919 1.048049 0.061299 0.171056 0.033077 0.052568 0.047610 2016.0 0.037564
21 1504.100 26.6112 33.600 914.200 34.354 45.200 2.000 155.300 10.200 77.200 ... 540.0 0.995869 1.241958 0.022339 0.172063 0.037564 0.049762 0.052828 2017.0 0.010232
22 1524.700 26.9703 22.000 936.300 34.716 45.300 0.000 177.200 10.300 86.000 ... 552.0 1.283369 1.456788 0.014429 0.189256 0.010232 0.011437 0.014010 2018.0 0.004943
23 1517.200 26.0406 17.400 905.900 34.788 45.300 0.000 141.700 10.500 110.700 ... 564.0 0.870483 1.433915 0.011468 0.156419 0.004943 0.005774 0.007159 2019.0 0.002116
24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 1549.913 12.6376 34.063 644.051 50.963 0.581 0.000 230.000 25.271 114.309 ... 468.0 0.820994 1.507406 0.021977 0.357115 0.029314 0.046798 0.051981 2009.0 0.035692
26 1521.153 13.2923 18.582 687.050 51.688 0.517 0.000 172.500 26.727 82.506 ... 480.0 0.973093 1.554283 0.012216 0.251073 0.035692 0.050842 0.063164 2010.0 0.041404
27 1548.670 14.0406 23.942 739.025 52.635 0.526 0.000 140.500 28.152 118.882 ... 492.0 1.198713 1.873139 0.015460 0.190115 0.041404 0.046320 0.072904 2011.0 0.036446
28 1879.598 14.9230 22.124 795.886 53.333 0.533 0.000 300.000 29.744 124.148 ... 504.0 1.101268 1.381760 0.011771 0.376938 0.036446 0.062292 0.062510 2012.0 0.033480
29 1869.251 15.6340 28.052 850.398 54.394 0.544 0.000 215.000 31.309 111.711 ... 516.0 1.064461 1.276065 0.015007 0.252823 0.033480 0.057671 0.058740 2013.0 0.034399

30 rows × 37 columns

I remove the first year (2008), because it has no observations due to the lag procedure. The same filter also drops the rows whose fyear was set to NaN above.

compu_df = compu_df[compu_df['fyear'] > 2008.]

We try to predict current ROA with the last year’s variable values.

y_df = compu_df['current_roa']
x_df = compu_df[variables + ['fyear','ind','sp500']]

Train/test split

from sklearn.model_selection import train_test_split

TensorFlow does not like Pandas dataframes, so I convert them to NumPy arrays.

# Split data into training and test sets
x_train, x_test , y_train, y_test = train_test_split(x_df.values, y_df.values, test_size=0.20, random_state=1)
type(x_train)
numpy.ndarray
len(x_train), len(x_test)
(14017, 3505)

Let’s check that there are no missing values any more.

compu_df[variables+['current_roa','fyear']].isna().sum()
at             0
bkvlps         0
capx           0
ceq            0
csho           0
cstk           0
dlc            0
dltt           0
dvc            0
ebit           0
ibc            0
icapt          0
lt             0
ni             0
pstk           0
pstkl          0
pstkrv         0
re             0
sale           0
seq            0
costat         0
prcc_c         0
prcc_f         0
sic            0
mkvalt         0
tobin          0
yld            0
age            0
tridx          0
mb             0
cap_int        0
lvg            0
roa            0
roe            0
roi            0
current_roa    0
fyear          0
dtype: int64

15.1. Densely connected network#

Let’s build a traditional densely connected neural network. We could also use recurrent or LSTM networks, but we would need to reorganize the data in that case, as the sketch below illustrates.
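For an LSTM, the inputs would have to be three-dimensional (samples × timesteps × features) instead of the flat samples × features matrix used here. A minimal sketch of such a model, where the five-year window and the layer size are illustrative assumptions:

# Hypothetical LSTM setup: one sample = five consecutive firm-years of the 38 predictors
n_timesteps, n_features = 5, 38
lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(n_timesteps, n_features)), # expects 3-D input
    tf.keras.layers.Dense(1)]) # linear output for the ROA regression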


One way to define a neural network with Keras is a single Sequential command that takes the layers as a list. The densely connected layers use ReLU as the activation function. The last layer has one neuron, because we want a single value as the output (current ROA), and no activation function, because we want a linear output. There is also a dropout layer to counter overfitting. For the first layer, we need to define the shape of our input (here 38 features).

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu',input_shape = (38,)), # 35 lagged variables + fyear, ind and sp500
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.1), # randomly zeroes 10% of the units during training
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1)]) # single linear output: the predicted current ROA
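Another way is the Keras functional API, which chains layer calls instead of listing them; an equivalent sketch of the same architecture:

inputs = tf.keras.Input(shape=(38,))
x = tf.keras.layers.Dense(128, activation='relu')(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)
x = tf.keras.layers.Dropout(0.1)(x)
x = tf.keras.layers.Dense(32, activation='relu')(x)
x = tf.keras.layers.Dense(16, activation='relu')(x)
outputs = tf.keras.layers.Dense(1)(x)
functional_model = tf.keras.Model(inputs, outputs) # identical to the Sequential model above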

You can check your model with the summary() function. The model has 15,873 parameters.

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 128)               4992      
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_3 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 17        
=================================================================
Total params: 15,873
Trainable params: 15,873
Non-trainable params: 0
_________________________________________________________________
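The parameter counts in the summary follow directly from the layer sizes: a dense layer has inputs × units weights plus one bias per unit (the dropout layer adds none). A quick check:

# Dense-layer parameters = inputs * units + units (biases)
layer_sizes = [38, 128, 64, 32, 16, 1]
params = [m * n + n for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
print(params)       # [4992, 8256, 2080, 528, 17]
print(sum(params))  # 15873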

With compile(), we build our neural network to be ready for use. For regression problems, MSE is the standard loss function. We measure performance with the mean absolute error (MAE), because it is easier to interpret than MSE.

model.compile(loss='mse',metrics=['mae'])
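Because compile() is called without an optimizer argument, Keras uses its default, RMSprop. An explicit optimizer could be passed instead; a sketch of the alternative (not used for the results below):

# Alternative: compile with an explicit optimizer and learning rate
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
#               loss='mse', metrics=['mae'])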

Next, we feed the training data to our model and train it using back-propagation. Everything is automatic, so we do not need to worry about the details. True performance has to be evaluated later with the test data; during training, validation_split=0.1 holds out 10% of the training sample to monitor overfitting. We save information about the training process to history. The model is trained with batches of 1,024 observations, and one epoch is one round through all the training data (the sketch after the next cell works out the batch count).

history = model.fit(x_train,y_train,epochs=150,batch_size=1024,validation_split=0.1,verbose=False)
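With 14,017 training observations and validation_split=0.1, Keras holds out the last 10% of the rows for validation and computes gradients on the rest. A quick check of the batch count per epoch:

import math
n_train = int(14017 * 0.9)        # observations used for gradient updates
print(n_train)                    # 12615
print(math.ceil(n_train / 1024))  # 13 batches per epoch (the last batch is partial)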

The following code plots the progress of training. The comments explain the individual commands.

plt.style.use('bmh') # Select the 'bmh' plot style.
burnout = 25 # Skip the first 25 epochs, where the loss changes a lot, to keep the plots readable.
epochs = range(1, len(history.history['val_mae']) + 1) # Correct x-axis values (epochs)
plt.plot(epochs[burnout:], history.history['val_mae'][burnout:], 'r--',label='Validation MAE') # Plot epochs vs. validation MAE
plt.plot(epochs[burnout:], history.history['mae'][burnout:], 'b--',label='Train MAE') # Plot epochs vs. training MAE
plt.legend()
plt.title('MAE') # Add title
plt.figure() # Start a new figure. Without this command, MAE and loss would be drawn to the same plot.
plt.plot(epochs[burnout:], history.history['val_loss'][burnout:], 'r--',label='Validation loss') # Plot epochs vs. validation loss
plt.plot(epochs[burnout:], history.history['loss'][burnout:], 'b--',label='Train loss')
plt.legend()
plt.title('Loss') # Add title
plt.show() # Show everything
[Two figures: training vs. validation MAE, and training vs. validation loss, by epoch.]

Evaluate() can be used to evaluate the model with the test data. The mean absolute error with the test data is 0.052.

test_loss,test_acc = model.evaluate(x_test,y_test)
110/110 [==============================] - 0s 248us/step - loss: 0.0066 - mae: 0.0521
test_acc
0.052059270441532135
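Beyond the aggregate error, predict() returns the network's estimate for each observation, so individual predictions can be compared against the true values; a minimal sketch:

# Compare the first few predicted and actual current ROA values
preds = model.predict(x_test)
print(preds[:5].flatten())  # predicted
print(y_test[:5])           # actual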

Let’s compare the performance to a linear model.

import sklearn.linear_model as sk_lm

We define our LinearRegression object. (Note that this reuses the name model, overwriting the neural network.)

model = sk_lm.LinearRegression()

fit() can be used to fit the model to the training data.

model.fit(x_train,y_train)
LinearRegression()

The coef_ attribute contains the coefficient of each variable, and intercept_ the intercept of the linear regression model.

model.coef_
array([ 5.31468126e-08, -1.05923685e-04, -3.53267974e-06, -5.74187084e-07,
        4.89268554e-06,  3.28549174e-08, -1.02011246e-08, -4.32534321e-07,
        3.03267883e-07,  7.33851624e-06,  5.42509020e-07,  5.14662474e-08,
       -1.07960714e-07, -8.41952205e-06, -1.13453398e-06, -1.67169301e-05,
        1.25340601e-05, -1.00876074e-07,  8.42772444e-08,  3.58063483e-07,
        7.40836707e-03,  6.03105032e-04, -4.75728895e-04,  1.11622895e-06,
       -2.38159257e-08,  1.54004100e-02,  9.43484048e-02,  1.94168470e-05,
        2.44164138e-03,  3.21901086e-04, -3.24353490e-02, -1.33602510e-03,
        4.25739547e-01, -3.32303633e-02,  3.77840060e-02, -2.76142065e-03,
       -2.97726456e-04,  2.84737219e-03])
model.intercept_
5.530386213093688

score() can be used to measure the coefficient of determination (R²) of the trained model: how much of the variation in the predicted variable our variables explain.

model.score(x_test,y_test)
0.42032935471032107
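The same number can be computed by hand from the definition R² = 1 − SSres/SStot, which makes explicit what score() measures:

# R-squared = 1 - residual sum of squares / total sum of squares
pred = model.predict(x_test)
ss_res = ((y_test - pred) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)  # matches model.score(x_test, y_test)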

Mean absolute error.

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,model.predict(x_test))
0.03547756137451534

As expected, the linear model performs better for this data (test MAE 0.035 vs. 0.052 for the network).