15. Deep learning example from Accounting/Finance#

The following is a simple example of deep learning that uses accounting/finance data. It demonstrates how to apply a deep learning model to traditional structured data. However, it also shows that deep learning is usually not the best option for structured data with relatively small datasets (<100k observations); deep learning models perform better with large unstructured datasets.

import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

The data has a little under 20k observations. The variables are different financial ratios and board characteristics of S&P 1500 companies.

compu_df = pd.read_csv('_data.txt',delimiter='\t')
compu_df
GVKEY datadate fyear cusip conm act at bkvlps capx ceq ... ind_chairman_is_ex_ceo ind_independent_board_members ind_strictly_independent_board_members ind_board_member_affiliations ind_non_executive_board_members ind_board_gender_diversity_percent ind_board_specific_skills_percent ind_executive_members_gender_diversity_percent ind_average_board_tenure ind_board_member_compensation
0 21542 20081231 2008 000360206 AAON INC 80.118 140.743 5.6088 9.610 96.522 ... 0.0 87.500 57.140 1.425 90.91 13.395 50.61 7.735 8.270 1650394.0
1 21542 20091231 2009 000360206 AAON INC 96.240 156.211 6.8544 9.774 117.999 ... 0.0 87.500 50.000 1.180 90.00 12.915 60.00 6.670 8.705 1590889.5
2 21542 20101231 2010 000360206 AAON INC 91.748 160.277 7.0725 17.470 116.739 ... 0.0 84.620 51.925 1.000 90.00 11.110 58.33 9.090 8.780 1801674.0
3 21542 20111231 2011 000360206 AAON INC 84.387 178.981 4.9762 35.914 122.504 ... 0.0 86.670 50.000 1.090 90.00 11.110 57.14 10.000 9.180 1847006.5
4 21542 20121231 2012 000360206 AAON INC 91.546 193.493 5.6341 14.147 138.136 ... 0.0 87.500 50.000 1.190 90.00 14.290 54.55 9.090 9.170 1810953.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19632 28191 20151231 2015 V7780T103 ROYAL CARIBBEAN GROUP 837.022 20921.855 36.9876 1613.340 8063.039 ... 1.0 84.620 50.000 0.905 83.33 16.670 53.85 12.500 8.880 1744895.0
19633 28191 20161231 2016 V7780T103 ROYAL CARIBBEAN GROUP 748.305 22310.324 42.5054 2494.363 9121.412 ... 1.0 83.330 50.000 0.880 83.33 16.670 53.85 14.290 9.340 1737800.0
19634 28191 20171231 2017 V7780T103 ROYAL CARIBBEAN GROUP 843.028 22296.317 50.1659 564.138 10702.303 ... 1.0 83.330 50.000 0.885 83.33 20.000 57.14 13.395 9.105 1793588.5
19635 28191 20181231 2018 V7780T103 ROYAL CARIBBEAN GROUP 1242.044 27698.270 53.1319 3660.028 11105.461 ... 1.0 85.710 50.000 0.745 81.82 22.220 58.33 14.290 9.180 1858984.0
19636 28191 20191231 2019 V7780T103 ROYAL CARIBBEAN GROUP 1162.628 30320.284 58.2557 3024.663 12163.846 ... 1.0 85.165 48.075 0.800 80.00 25.000 60.00 15.190 9.195 1884002.5

19637 rows × 102 columns

We use only variables with few missing values, because we want to keep as many observations as possible; the benefits of deep learning are seen in large datasets. The variables are different financial ratios (Compustat). Many of the included variables are highly correlated, but this should not be a serious issue for neural networks.

variables = ['at', 'bkvlps','capx', 'ceq', 'csho', 'cstk', 'dlc', 'dltt', 'dvc', 'ebit',
       'ibc', 'icapt', 'lt', 'ni', 'pstk', 'pstkl','pstkrv', 're', 'sale', 'seq', 'costat', 'prcc_c',
       'prcc_f', 'sic', 'mkvalt','tobin', 'yld', 'age', 'tridx', 'mb',
       'cap_int', 'lvg', 'roa', 'roe', 'roi']
compu_df[variables].isna().sum()
at           3
bkvlps      34
capx        53
ceq          3
csho        11
cstk        13
dlc          9
dltt        81
dvc         77
ebit        15
ibc         17
icapt        4
lt          49
ni           7
pstk         4
pstkl       37
pstkrv      19
re          12
sale         7
seq          3
costat       0
prcc_c       8
prcc_f       6
sic          0
mkvalt      13
tobin       13
yld         85
age          0
tridx      172
mb          36
cap_int     53
lvg         86
roa          7
roe         16
roi          8
dtype: int64
The variables are winsorized at the 1st and 99th percentiles to limit the influence of outliers.

compu_df[variables] = compu_df[variables].clip(lower=compu_df[variables].quantile(0.01),
                                               upper=compu_df[variables].quantile(0.99),axis=1) # cap each variable at its 1st/99th percentile
compu_df['current_roa'] = compu_df['roa'] # preserve the unlagged ROA as the prediction target

Everything else is lagged by one year within each company.

compu_df[variables] = compu_df.groupby(['conm']).shift()[variables] # shift() lags the values by one row within each company
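To sanity-check the lag, we can inspect one company's rows: the predictors should now hold last year's values, while fyear and current_roa stay in place (AAON INC is simply the first firm in the data above).

# The first row per company should now be NaN, the rest lagged by one year
compu_df[compu_df['conm'] == 'AAON INC'][['fyear', 'at', 'current_roa']].head()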

We have to drop all missing values, because otherwise we cannot optimize the network with the gradient descent algorithm.

I add the industry, SP500 dummy and year variables separately, as I do not want to winsorize or lag them.

compu_df[variables + ['fyear','ind','sp500','current_roa']] = compu_df[variables + ['fyear','ind','sp500','current_roa']].dropna()

Note that assigning the result of dropna() back like this does not actually remove any rows: pandas aligns the result on the index, so rows with any missing value get NaN in all of the listed columns, including fyear. They are removed for good when we filter on fyear below.
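A more direct alternative would be to drop the incomplete rows outright; a sketch (not run here, so the output below still shows the NaN-masked rows):

# Alternative: remove incomplete rows immediately instead of NaN-masking them
# compu_df = compu_df.dropna(subset=variables + ['fyear','ind','sp500','current_roa'])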
compu_df[variables + ['fyear','current_roa']].head(30)
at bkvlps capx ceq csho cstk dlc dltt dvc ebit ... age tridx mb cap_int lvg roa roe roi fyear current_roa
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 140.743 5.6088 9.610 96.522 17.209 0.071 2.992 0.000 5.621 43.388 ... 204.0 1.070373 3.722721 0.068280 0.030998 0.203129 0.079563 0.296192 2009.0 0.177459
2 156.211 6.8544 9.774 117.999 17.215 0.071 0.076 0.000 6.201 43.754 ... 216.0 1.017457 2.843429 0.062569 0.000644 0.177459 0.082621 0.234926 2010.0 0.136601
3 160.277 7.0725 17.470 116.739 16.506 0.068 0.000 0.000 6.067 32.715 ... 228.0 1.494462 3.988689 0.108999 0.000000 0.136601 0.047020 0.187547 2011.0 0.078142
4 178.981 4.9762 35.914 122.504 24.618 0.098 4.575 0.000 5.935 23.971 ... 240.0 1.646141 4.117600 0.200658 0.037346 0.078142 0.027727 0.114168 2012.0 0.141860
5 193.493 5.6341 14.147 138.136 24.518 0.098 0.000 0.000 8.840 44.238 ... 252.0 1.706581 3.704230 0.073114 0.000000 0.141860 0.053644 0.198710 2013.0 0.174277
6 215.444 4.4702 9.041 164.106 36.711 0.147 0.000 0.000 7.428 55.803 ... 264.0 3.949485 7.147331 0.041965 0.000000 0.174277 0.032012 0.228797 2014.0 0.189424
7 233.117 3.2208 16.127 174.059 54.042 0.216 0.000 0.000 9.656 71.563 ... 276.0 4.185799 6.951689 0.069180 0.000000 0.189424 0.036494 0.253696 2015.0 0.196381
8 232.854 3.3750 20.967 178.918 53.012 0.212 0.000 0.000 11.857 71.695 ... 288.0 4.381591 6.880000 0.090044 0.000000 0.196381 0.037149 0.255581 2016.0 0.208069
9 256.530 3.9106 26.604 205.898 52.651 0.211 0.000 0.000 12.676 79.574 ... 300.0 6.286182 8.451389 0.103707 0.000000 0.208069 0.030674 0.259235 2017.0 0.183631
10 296.780 4.5252 41.713 237.226 52.423 0.210 0.000 0.000 13.653 74.148 ... 312.0 7.030023 8.110139 0.140552 0.000000 0.183631 0.028326 0.229730 2018.0 0.138132
11 308.197 4.7604 37.268 247.499 51.991 0.208 0.000 0.000 16.717 57.678 ... 324.0 6.776646 7.364927 0.120923 0.000000 0.138132 0.023355 0.172009 2019.0 0.144608
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 1377.511 16.8937 27.535 656.895 38.884 44.201 63.600 392.984 0.000 125.529 ... 444.0 0.386537 1.089755 0.019989 0.695064 0.057096 0.109870 0.074914 2009.0 0.029731
14 1501.042 18.9167 28.855 746.906 39.484 44.870 100.833 336.191 0.000 95.415 ... 456.0 0.518011 1.214800 0.019223 0.585112 0.029731 0.049185 0.041225 2010.0 0.040984
15 1703.727 21.0112 124.879 835.845 39.781 44.986 114.075 329.802 2.983 137.016 ... 468.0 0.695924 1.307398 0.073298 0.531052 0.040984 0.063897 0.059932 2011.0 0.030844
16 2195.653 21.4697 91.218 864.649 40.273 44.849 122.865 669.489 12.081 142.360 ... 480.0 0.322468 0.892886 0.041545 0.916388 0.030844 0.087720 0.044105 2012.0 0.025738
17 2136.900 23.3254 37.600 918.600 39.382 44.700 86.400 622.200 11.900 136.600 ... 492.0 0.546782 0.800844 0.017596 0.771391 0.025738 0.074763 0.035675 2013.0 0.033144
18 2199.500 25.2654 26.500 999.500 39.560 44.700 69.700 564.300 11.800 142.600 ... 504.0 0.669914 1.108631 0.012048 0.634317 0.033144 0.065790 0.046581 2014.0 0.006733
19 1515.000 23.8574 46.300 845.100 35.423 44.900 69.000 85.000 11.900 -8.600 ... 516.0 0.823134 1.164419 0.030561 0.182227 0.006733 0.010365 0.010967 2015.0 0.033077
20 1442.100 25.0847 88.400 865.800 34.515 44.900 12.000 136.100 10.400 66.100 ... 528.0 0.688919 1.048049 0.061299 0.171056 0.033077 0.052568 0.047610 2016.0 0.037564
21 1504.100 26.6112 33.600 914.200 34.354 45.200 2.000 155.300 10.200 77.200 ... 540.0 0.995869 1.241958 0.022339 0.172063 0.037564 0.049762 0.052828 2017.0 0.010232
22 1524.700 26.9703 22.000 936.300 34.716 45.300 0.000 177.200 10.300 86.000 ... 552.0 1.283369 1.456788 0.014429 0.189256 0.010232 0.011437 0.014010 2018.0 0.004943
23 1517.200 26.0406 17.400 905.900 34.788 45.300 0.000 141.700 10.500 110.700 ... 564.0 0.870483 1.433915 0.011468 0.156419 0.004943 0.005774 0.007159 2019.0 0.002116
24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 1549.913 12.6376 34.063 644.051 50.963 0.581 0.000 230.000 25.271 114.309 ... 468.0 0.820994 1.507406 0.021977 0.357115 0.029314 0.046798 0.051981 2009.0 0.035692
26 1521.153 13.2923 18.582 687.050 51.688 0.517 0.000 172.500 26.727 82.506 ... 480.0 0.973093 1.554283 0.012216 0.251073 0.035692 0.050842 0.063164 2010.0 0.041404
27 1548.670 14.0406 23.942 739.025 52.635 0.526 0.000 140.500 28.152 118.882 ... 492.0 1.198713 1.873139 0.015460 0.190115 0.041404 0.046320 0.072904 2011.0 0.036446
28 1879.598 14.9230 22.124 795.886 53.333 0.533 0.000 300.000 29.744 124.148 ... 504.0 1.101268 1.381760 0.011771 0.376938 0.036446 0.062292 0.062510 2012.0 0.033480
29 1869.251 15.6340 28.052 850.398 54.394 0.544 0.000 215.000 31.309 111.711 ... 516.0 1.064461 1.276065 0.015007 0.252823 0.033480 0.057671 0.058740 2013.0 0.034399

30 rows × 37 columns

I remove the first year (2008), because it has no observations due to the lag procedure. The same filter also drops the rows whose fyear was set to NaN above.

compu_df = compu_df[compu_df['fyear'] > 2008.]

We try to predict current ROA with the last year’s variable values.

y_df = compu_df['current_roa']
x_df = compu_df[variables + ['fyear','ind','sp500']]

Train/test split

from sklearn.model_selection import train_test_split

TensorFlow does not like Pandas dataframes, so I convert them to NumPy arrays.

# Split data into training and test sets
x_train, x_test , y_train, y_test = train_test_split(x_df.values, y_df.values, test_size=0.20, random_state=1)
type(x_train)
numpy.ndarray
len(x_train), len(x_test)
(14017, 3505)

Let’s check that there are no missing values any more.

compu_df[variables+['current_roa','fyear']].isna().sum()
at             0
bkvlps         0
capx           0
ceq            0
csho           0
cstk           0
dlc            0
dltt           0
dvc            0
ebit           0
ibc            0
icapt          0
lt             0
ni             0
pstk           0
pstkl          0
pstkrv         0
re             0
sale           0
seq            0
costat         0
prcc_c         0
prcc_f         0
sic            0
mkvalt         0
tobin          0
yld            0
age            0
tridx          0
mb             0
cap_int        0
lvg            0
roa            0
roe            0
roi            0
current_roa    0
fyear          0
dtype: int64

15.1. Densely connected network#

Let’s build a traditional densely connected neural network. We could also use recurrent or LSTM networks, but we would need to reorganize the data in that case, as the sketch below illustrates.
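For an LSTM, the inputs would have to be three-dimensional (samples × timesteps × features) instead of the flat samples × features matrix used here. A minimal sketch of such a model, where the five-year window and the layer size are illustrative assumptions:

# Hypothetical LSTM setup: one sample = five consecutive firm-years of the 38 predictors
n_timesteps, n_features = 5, 38
lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(n_timesteps, n_features)), # expects 3-D input
    tf.keras.layers.Dense(1)]) # linear output for the ROA regression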


One way to define a neural network with Keras is a single Sequential command that takes the layers as a list. The densely connected layers use ReLU as the activation function. The last layer has one neuron, because we want a single value as the output (current ROA), and no activation function, because we want a linear output. There is also a dropout layer to counter overfitting. For the first layer, we need to define the shape of our input (here 38 features).

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu',input_shape = (38,)), # 35 lagged variables + fyear, ind and sp500
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.1), # randomly zeroes 10% of the units during training
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1)]) # single linear output: the predicted current ROA
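Another way is the Keras functional API, which chains layer calls instead of listing them; an equivalent sketch of the same architecture:

inputs = tf.keras.Input(shape=(38,))
x = tf.keras.layers.Dense(128, activation='relu')(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)
x = tf.keras.layers.Dropout(0.1)(x)
x = tf.keras.layers.Dense(32, activation='relu')(x)
x = tf.keras.layers.Dense(16, activation='relu')(x)
outputs = tf.keras.layers.Dense(1)(x)
functional_model = tf.keras.Model(inputs, outputs) # identical to the Sequential model above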

You can check your model with the summary() function. The model has 15,873 parameters.

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 128)               4992      
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_3 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 17        
=================================================================
Total params: 15,873
Trainable params: 15,873
Non-trainable params: 0
_________________________________________________________________
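The parameter counts in the summary follow directly from the layer sizes: a dense layer has inputs × units weights plus one bias per unit (the dropout layer adds none). A quick check:

# Dense-layer parameters = inputs * units + units (biases)
layer_sizes = [38, 128, 64, 32, 16, 1]
params = [m * n + n for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
print(params)       # [4992, 8256, 2080, 528, 17]
print(sum(params))  # 15873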

With compile(), we build our neural network to be ready for use. For regression problems, MSE is the standard loss function. We measure performance with the mean absolute error (MAE), because it is easier to interpret than MSE.

model.compile(loss='mse',metrics=['mae'])
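Because compile() is called without an optimizer argument, Keras uses its default, RMSprop. An explicit optimizer could be passed instead; a sketch of the alternative (not used for the results below):

# Alternative: compile with an explicit optimizer and learning rate
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
#               loss='mse', metrics=['mae'])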

Next, we feed the training data to our model and train it using back-propagation. Everything is automatic, so we do not need to worry about the details. True performance has to be evaluated later with the test data; during training, validation_split=0.1 holds out 10% of the training sample to monitor overfitting. We save information about the training process to history. The model is trained with batches of 1,024 observations, and one epoch is one round through all the training data (the sketch after the next cell works out the batch count).

history = model.fit(x_train,y_train,epochs=150,batch_size=1024,validation_split=0.1,verbose=False)
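With 14,017 training observations and validation_split=0.1, Keras holds out the last 10% of the rows for validation and computes gradients on the rest. A quick check of the batch count per epoch:

import math
n_train = int(14017 * 0.9)        # observations used for gradient updates
print(n_train)                    # 12615
print(math.ceil(n_train / 1024))  # 13 batches per epoch (the last batch is partial)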

The following code plots the progress of training. The comments explain the individual commands.

plt.style.use('bmh') # Select the 'bmh' plot style.
burnout = 25 # Skip the first 25 epochs, where the loss changes a lot, to keep the plots readable.
epochs = range(1, len(history.history['val_mae']) + 1) # Correct x-axis values (epochs)
plt.plot(epochs[burnout:], history.history['val_mae'][burnout:], 'r--',label='Validation MAE') # Plot epochs vs. validation MAE
plt.plot(epochs[burnout:], history.history['mae'][burnout:], 'b--',label='Train MAE') # Plot epochs vs. training MAE
plt.legend()
plt.title('MAE') # Add title
plt.figure() # Start a new figure. Without this command, MAE and loss would be drawn to the same plot.
plt.plot(epochs[burnout:], history.history['val_loss'][burnout:], 'r--',label='Validation loss') # Plot epochs vs. validation loss
plt.plot(epochs[burnout:], history.history['loss'][burnout:], 'b--',label='Train loss')
plt.legend()
plt.title('Loss') # Add title
plt.show() # Show everything
[Two figures: training vs. validation MAE, and training vs. validation loss, by epoch.]

Evaluate() can be used to evaluate the model with the test data. The mean absolute error with the test data is 0.052.

test_loss,test_acc = model.evaluate(x_test,y_test)
110/110 [==============================] - 0s 248us/step - loss: 0.0066 - mae: 0.0521
test_acc
0.052059270441532135
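Beyond the aggregate error, predict() returns the network's estimate for each observation, so individual predictions can be compared against the true values; a minimal sketch:

# Compare the first few predicted and actual current ROA values
preds = model.predict(x_test)
print(preds[:5].flatten())  # predicted
print(y_test[:5])           # actual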

Let’s compare the performance to a linear model.

import sklearn.linear_model as sk_lm

We define our LinearRegression object. (Note that this reuses the name model, overwriting the neural network.)

model = sk_lm.LinearRegression()

fit() can be used to fit the model to the training data.

model.fit(x_train,y_train)
LinearRegression()

The coef_ attribute contains the coefficient of each variable, and intercept_ the intercept of the linear regression model.

model.coef_
array([ 5.31468126e-08, -1.05923685e-04, -3.53267974e-06, -5.74187084e-07,
        4.89268554e-06,  3.28549174e-08, -1.02011246e-08, -4.32534321e-07,
        3.03267883e-07,  7.33851624e-06,  5.42509020e-07,  5.14662474e-08,
       -1.07960714e-07, -8.41952205e-06, -1.13453398e-06, -1.67169301e-05,
        1.25340601e-05, -1.00876074e-07,  8.42772444e-08,  3.58063483e-07,
        7.40836707e-03,  6.03105032e-04, -4.75728895e-04,  1.11622895e-06,
       -2.38159257e-08,  1.54004100e-02,  9.43484048e-02,  1.94168470e-05,
        2.44164138e-03,  3.21901086e-04, -3.24353490e-02, -1.33602510e-03,
        4.25739547e-01, -3.32303633e-02,  3.77840060e-02, -2.76142065e-03,
       -2.97726456e-04,  2.84737219e-03])
model.intercept_
5.530386213093688

score() can be used to measure the coefficient of determination (R²) of the trained model: how much of the variation in the predicted variable our variables explain.

model.score(x_test,y_test)
0.42032935471032107
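The same number can be computed by hand from the definition R² = 1 − SSres/SStot, which makes explicit what score() measures:

# R-squared = 1 - residual sum of squares / total sum of squares
pred = model.predict(x_test)
ss_res = ((y_test - pred) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)  # matches model.score(x_test, y_test)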

Mean absolute error.

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,model.predict(x_test))
0.03547756137451534

As expected, the linear model performs better for this data (test MAE 0.035 vs. 0.052 for the network).