5. Connecting with accounting databases#

5.1. Power BI#

The easiest way to exploit the Python ecosystem in Power BI is to enable Python scripting inside Power BI. For that, you need to have Python installed on your system.

In Power BI, check from Options - Global - Python scripting that the correct folders are set for your Python installation and IDE. Power BI should detect the folders automatically, but it is good to check.

Furthermore, you need to have at least Pandas, Matplotlib and NumPy installed.

You can run Python scripts inside Power BI using Get Data - Other - Python script. However, it is a good habit to check in your own Python environment that the script works before running it in Power BI.

There are a few limitations in the connection between Power BI and Python:

  • If you want to import data, it must be represented as a Pandas dataframe.

  • The maximum run time of a script is 30 minutes.

  • You must use full directory paths in your code, not relative paths.

  • Nested tables are not supported.

Otherwise, implementing Python in Power BI is very similar to doing the analysis purely in Python. The good thing is, of course, that you have the tools of both Python and Power BI at your disposal.
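For example, a minimal script for Get Data - Other - Python script could look like the following sketch. Power BI picks up every Pandas dataframe that the script defines; the file path and column names here are just illustrative.

import pandas as pd

# Power BI imports every dataframe defined in the script.
# Note the full directory path (relative paths are not supported).
sales = pd.read_csv('C:/data/sales.csv')  # hypothetical file
sales['TotalPrice'] = sales['Quantity'] * sales['UnitPrice']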

5.2. MySQL, SAP and others#

There are good libraries for connecting to MySQL, for example, MySQLdb: pypi.org/project/MySQL-python/. If you want, you can use your own MySQL database to go through the examples instead of SQLite.
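A minimal connection sketch with MySQLdb could look like this; the host, credentials and database name are hypothetical, so replace them with your own settings.

import MySQLdb

# Hypothetical connection settings - replace with your own.
db = MySQLdb.connect(host='localhost', user='analyst',
                     passwd='secret', db='northwind')
cursor = db.cursor()
cursor.execute('SELECT COUNT(*) FROM Orders')
print(cursor.fetchone())
db.close()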

The SAP HANA client is used to connect Python to an SAP database. Here are the instructions on how to connect Python to SAP: developers.sap.com/tutorials/hana-clients-python.html. The setup is quite involved, and we are not doing that in this course.

5.3. SQLite#

In the following, we will analyse our example database purely in Python. For that, we use an example company database that is available here: github.com/jpwhite3/northwind-SQLite3

[Image: Northwind]

This demo database was originally published for Microsoft Access 2000. We analyse it using SQLite. However, keep in mind that you can repeat the following analysis by connecting to many other databases, like SAP. In the following, we use some basic SQL statements. However, this course is not about SQL, so we do not go deeper in that direction.

The sqlite3 module is included in Python’s standard library, so you do not need to install any additional libraries. Let’s start by importing the library.

import sqlite3

We can create a connection to the example database with connect(). You need to have the Northwind_large.sqlite file in your working folder for the following command to work.

connection = sqlite3.connect('Northwind_large.sqlite')
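Note that connect() silently creates a new, empty database if the file does not exist, so a typo in the file name would only show up later as empty query results. A small sketch to guard against that:

import os

db_file = 'Northwind_large.sqlite'
# connect() would create an empty database if the file were missing,
# so check first that it actually exists in the working folder.
assert os.path.exists(db_file), db_file + ' not found'
connection = sqlite3.connect(db_file)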

cursor() returns a cursor for the connection. A cursor object has many useful methods for executing SQL statements.

cursor = connection.cursor()

The tables of the database can be listed with the following commands. execute() runs an SQL statement, and fetchall() collects all the rows from the result.

cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('Employees',), ('Categories',), ('Customers',), ('Shippers',), ('Suppliers',), ('Orders',), ('Products',), ('OrderDetails',), ('CustomerCustomerDemo',), ('CustomerDemographics',), ('Region',), ('Territories',), ('EmployeeTerritories',)]

We check the fields of the Employees, OrderDetails, Orders and Products tables. The outputs are messy lists of tuples. The name of a field is always the second item in a tuple.

cursor.execute("PRAGMA table_info(Employees)")
print(cursor.fetchall())
[(0, 'Id', 'INTEGER', 0, None, 1), (1, 'LastName', 'VARCHAR(8000)', 0, None, 0), (2, 'FirstName', 'VARCHAR(8000)', 0, None, 0), (3, 'Title', 'VARCHAR(8000)', 0, None, 0), (4, 'TitleOfCourtesy', 'VARCHAR(8000)', 0, None, 0), (5, 'BirthDate', 'VARCHAR(8000)', 0, None, 0), (6, 'HireDate', 'VARCHAR(8000)', 0, None, 0), (7, 'Address', 'VARCHAR(8000)', 0, None, 0), (8, 'City', 'VARCHAR(8000)', 0, None, 0), (9, 'Region', 'VARCHAR(8000)', 0, None, 0), (10, 'PostalCode', 'VARCHAR(8000)', 0, None, 0), (11, 'Country', 'VARCHAR(8000)', 0, None, 0), (12, 'HomePhone', 'VARCHAR(8000)', 0, None, 0), (13, 'Extension', 'VARCHAR(8000)', 0, None, 0), (14, 'Photo', 'BLOB', 0, None, 0), (15, 'Notes', 'VARCHAR(8000)', 0, None, 0), (16, 'ReportsTo', 'INTEGER', 0, None, 0), (17, 'PhotoPath', 'VARCHAR(8000)', 0, None, 0)]
cursor.execute("PRAGMA table_info(OrderDetails)")
print(cursor.fetchall())
[(0, 'Id', 'VARCHAR(8000)', 0, None, 1), (1, 'OrderId', 'INTEGER', 1, None, 0), (2, 'ProductId', 'INTEGER', 1, None, 0), (3, 'UnitPrice', 'DECIMAL', 1, None, 0), (4, 'Quantity', 'INTEGER', 1, None, 0), (5, 'Discount', 'DOUBLE', 1, None, 0)]
cursor.execute("PRAGMA table_info(Orders)")
print(cursor.fetchall())
[(0, 'Id', 'INTEGER', 0, None, 1), (1, 'CustomerId', 'VARCHAR(8000)', 0, None, 0), (2, 'EmployeeId', 'INTEGER', 1, None, 0), (3, 'OrderDate', 'VARCHAR(8000)', 0, None, 0), (4, 'RequiredDate', 'VARCHAR(8000)', 0, None, 0), (5, 'ShippedDate', 'VARCHAR(8000)', 0, None, 0), (6, 'ShipVia', 'INTEGER', 0, None, 0), (7, 'Freight', 'DECIMAL', 1, None, 0), (8, 'ShipName', 'VARCHAR(8000)', 0, None, 0), (9, 'ShipAddress', 'VARCHAR(8000)', 0, None, 0), (10, 'ShipCity', 'VARCHAR(8000)', 0, None, 0), (11, 'ShipRegion', 'VARCHAR(8000)', 0, None, 0), (12, 'ShipPostalCode', 'VARCHAR(8000)', 0, None, 0), (13, 'ShipCountry', 'VARCHAR(8000)', 0, None, 0)]
cursor.execute("PRAGMA table_info(Products)")
print(cursor.fetchall())
[(0, 'Id', 'INTEGER', 0, None, 1), (1, 'ProductName', 'VARCHAR(8000)', 0, None, 0), (2, 'SupplierId', 'INTEGER', 1, None, 0), (3, 'CategoryId', 'INTEGER', 1, None, 0), (4, 'QuantityPerUnit', 'VARCHAR(8000)', 0, None, 0), (5, 'UnitPrice', 'DECIMAL', 1, None, 0), (6, 'UnitsInStock', 'INTEGER', 1, None, 0), (7, 'UnitsOnOrder', 'INTEGER', 1, None, 0), (8, 'ReorderLevel', 'INTEGER', 1, None, 0), (9, 'Discontinued', 'INTEGER', 1, None, 0)]
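Because the field name is always the second item of each tuple, we can extract a clean list of column names with a list comprehension, for example:

cursor.execute("PRAGMA table_info(Products)")
print([row[1] for row in cursor.fetchall()])
['Id', 'ProductName', 'SupplierId', 'CategoryId', 'QuantityPerUnit', 'UnitPrice', 'UnitsInStock', 'UnitsOnOrder', 'ReorderLevel', 'Discontinued']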

Pandas has a very convenient function, read_sql_query, to load the results of SQL queries into dataframes. Let’s start by loading Pandas.

import pandas as pd

SQL queries are a whole new world, and we use only the essentials. The following code picks LastName from the Employees table, UnitPrice and Quantity from OrderDetails, OrderDate and ShipCountry from Orders, CategoryName from Categories, and ProductName from Products. The next part of the code is important: the JOIN clauses connect the data in the different tables in the correct way. Notice how we give our sqlite3 connection object as a parameter to the function.

query_df = pd.read_sql_query("""SELECT Employees.LastName, OrderDetails.UnitPrice, 
OrderDetails.Quantity, Orders.OrderDate, Orders.ShipCountry, Categories.CategoryName, Products.ProductName
FROM OrderDetails
JOIN Orders ON Orders.Id=OrderDetails.OrderID
JOIN Employees ON Employees.Id=Orders.EmployeeId
JOIN Products ON Products.ID=OrderDetails.ProductId
JOIN Categories ON Categories.ID=Products.CategoryID""", connection)
query_df
LastName UnitPrice Quantity OrderDate ShipCountry CategoryName ProductName
0 Buchanan 14.00 12 2012-07-04 France Dairy Products Queso Cabrales
1 Buchanan 9.80 10 2012-07-04 France Grains/Cereals Singaporean Hokkien Fried Mee
2 Buchanan 34.80 5 2012-07-04 France Dairy Products Mozzarella di Giovanni
3 Suyama 18.60 9 2012-07-05 Germany Produce Tofu
4 Suyama 42.40 40 2012-07-05 Germany Produce Manjimup Dried Apples
... ... ... ... ... ... ... ...
621878 Davolio 21.00 20 2013-08-31 02:59:28 Venezuela Grains/Cereals Gustaf's Knäckebröd
621879 Davolio 13.00 11 2013-08-31 02:59:28 Venezuela Condiments Original Frankfurter grüne Soße
621880 Davolio 39.00 45 2013-08-31 02:59:28 Venezuela Meat/Poultry Alice Mutton
621881 Davolio 25.00 7 2013-08-31 02:59:28 Venezuela Condiments Grandma's Boysenberry Spread
621882 Davolio 21.05 27 2013-08-31 02:59:28 Venezuela Condiments Louisiana Fiery Hot Pepper Sauce

621883 rows × 7 columns

Now that we have everything neatly in a Pandas dataframe, we can do many kinds of analyses. The other chapters focus more on the Pandas functionality, but let’s try a few things here.

For example, to analyse trends, we can change OrderDate to a datetime object with to_datetime().

query_df['OrderDate'] = pd.to_datetime(query_df['OrderDate'])
query_df
LastName UnitPrice Quantity OrderDate ShipCountry CategoryName ProductName
0 Buchanan 14.00 12 2012-07-04 00:00:00 France Dairy Products Queso Cabrales
1 Buchanan 9.80 10 2012-07-04 00:00:00 France Grains/Cereals Singaporean Hokkien Fried Mee
2 Buchanan 34.80 5 2012-07-04 00:00:00 France Dairy Products Mozzarella di Giovanni
3 Suyama 18.60 9 2012-07-05 00:00:00 Germany Produce Tofu
4 Suyama 42.40 40 2012-07-05 00:00:00 Germany Produce Manjimup Dried Apples
... ... ... ... ... ... ... ...
621878 Davolio 21.00 20 2013-08-31 02:59:28 Venezuela Grains/Cereals Gustaf's Knäckebröd
621879 Davolio 13.00 11 2013-08-31 02:59:28 Venezuela Condiments Original Frankfurter grüne Soße
621880 Davolio 39.00 45 2013-08-31 02:59:28 Venezuela Meat/Poultry Alice Mutton
621881 Davolio 25.00 7 2013-08-31 02:59:28 Venezuela Condiments Grandma's Boysenberry Spread
621882 Davolio 21.05 27 2013-08-31 02:59:28 Venezuela Condiments Louisiana Fiery Hot Pepper Sauce

621883 rows × 7 columns

Next, we can set our datetime column as the index.

query_df.index = query_df['OrderDate']
query_df
LastName UnitPrice Quantity OrderDate ShipCountry CategoryName ProductName
OrderDate
2012-07-04 00:00:00 Buchanan 14.00 12 2012-07-04 00:00:00 France Dairy Products Queso Cabrales
2012-07-04 00:00:00 Buchanan 9.80 10 2012-07-04 00:00:00 France Grains/Cereals Singaporean Hokkien Fried Mee
2012-07-04 00:00:00 Buchanan 34.80 5 2012-07-04 00:00:00 France Dairy Products Mozzarella di Giovanni
2012-07-05 00:00:00 Suyama 18.60 9 2012-07-05 00:00:00 Germany Produce Tofu
2012-07-05 00:00:00 Suyama 42.40 40 2012-07-05 00:00:00 Germany Produce Manjimup Dried Apples
... ... ... ... ... ... ... ...
2013-08-31 02:59:28 Davolio 21.00 20 2013-08-31 02:59:28 Venezuela Grains/Cereals Gustaf's Knäckebröd
2013-08-31 02:59:28 Davolio 13.00 11 2013-08-31 02:59:28 Venezuela Condiments Original Frankfurter grüne Soße
2013-08-31 02:59:28 Davolio 39.00 45 2013-08-31 02:59:28 Venezuela Meat/Poultry Alice Mutton
2013-08-31 02:59:28 Davolio 25.00 7 2013-08-31 02:59:28 Venezuela Condiments Grandma's Boysenberry Spread
2013-08-31 02:59:28 Davolio 21.05 27 2013-08-31 02:59:28 Venezuela Condiments Louisiana Fiery Hot Pepper Sauce

621883 rows × 7 columns
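As a side note, a more idiomatic way to achieve the same (a sketch, not used in the original material) is set_index(); drop=False keeps OrderDate also as a column, matching the result above.

query_df = query_df.set_index('OrderDate', drop=False)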

We still need to sort the index.

query_df.sort_index(inplace=True)

Let’s calculate the total number of order lines for the different product categories.

query_df['CategoryName'].value_counts().plot.bar()
<AxesSubplot:>
[Figure: bar chart of the number of order lines by product category]

Next, let’s check which countries the company sells to the most. The following command is quite long! First, it groups the values by ShipCountry, then counts them, sorts them in descending order by Quantity, and finally selects only the Quantity column.

query_df.groupby('ShipCountry').count().sort_values('Quantity',ascending=False)['Quantity']
ShipCountry
USA            89059
Germany        85923
France         67314
Brazil         60663
UK             50162
Mexico         34979
Venezuela      29128
Spain          26270
Argentina      21233
Italy          19952
Canada         19652
Finland        14661
Belgium        14505
Sweden         14147
Switzerland    14049
Denmark        13764
Austria        13669
Portugal       12684
Poland          6756
Ireland         6688
Norway          6625
Name: Quantity, dtype: int64
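By the way, because this is just a row count per country, value_counts() gives the same numbers with much less typing (assuming no missing values in Quantity; the schema above marks it NOT NULL):

query_df['ShipCountry'].value_counts()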

A nice thing in Python (and Pandas) is that we can turn the previous result into a bar chart just by adding plot.bar() to the end of the command.

Let’s first load Matplotlib to make our plots prettier.

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
query_df.groupby('ShipCountry').count().sort_values('Quantity',ascending=False)['Quantity'].plot.bar(figsize=(10,5))
plt.show()
[Figure: bar chart of the number of order lines by ShipCountry]

With pivot tables, we can do a 2-way grouping.

query_df.pivot_table(index='LastName',columns='ShipCountry',values='Quantity',aggfunc=sum)
ShipCountry Argentina Austria Belgium Brazil Canada Denmark Finland France Germany Ireland ... Mexico Norway Poland Portugal Spain Sweden Switzerland UK USA Venezuela
LastName
Buchanan 68252 40687 45468 182595 55898 47414 40935 169200 265031 21996 ... 103888 18487 19077 34787 76052 46805 39866 123721 263998 95810
Callahan 50977 46506 21741 201153 58506 42306 27732 192488 237532 16133 ... 86354 12045 19728 31522 73022 40745 34171 155124 250614 62389
Davolio 66186 44332 47841 150727 50957 42275 47406 186479 229137 16471 ... 105282 9071 22855 31563 61587 33431 38221 149707 258013 73040
Dodsworth 48936 42499 45385 173881 58663 29649 38434 218987 239419 23881 ... 91115 16886 19259 43262 69528 36657 50513 146315 207208 105907
Fuller 61848 25092 46255 155750 55919 29790 30799 184553 228026 20975 ... 119410 22544 21392 30644 69067 31062 31412 130891 250988 86679
King 45559 39419 41950 166678 45888 31635 56539 194100 258696 15336 ... 94927 35009 16933 28930 62777 46751 43769 162153 240532 74124
Leverling 64409 38191 35325 188315 74915 36741 42688 204843 243401 22365 ... 82019 21029 13554 39832 82488 46028 44096 146039 305553 66031
Peacock 62869 28476 52329 171980 56217 46333 50386 172114 244482 12059 ... 89564 18977 12986 44000 92110 40524 44052 132325 252295 90487
Suyama 69697 44687 35595 148480 45064 43631 39998 193725 246849 22518 ... 123194 14638 24995 37539 82311 38743 31109 134042 246389 87347

9 rows × 21 columns

With Quantity and UnitPrice, we can calculate the total price of each order line. With *, Python/Pandas makes the multiplication element-wise.

query_df['TotalPrice'] = query_df['Quantity']*query_df['UnitPrice']
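A side note: OrderDetails also has a Discount field. It is not included in our query above, but if you wanted net sales, you could add OrderDetails.Discount to the SELECT list and then compute, for example:

# Hypothetical net price, assuming Discount has been added to the query.
query_df['NetPrice'] = query_df['Quantity'] * query_df['UnitPrice'] * (1 - query_df['Discount'])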

There is too much information for one plot, so let’s resample the data before plotting (‘M’ means monthly).

query_df['TotalPrice'].resample('M').sum().plot(figsize=(10,5),style='r--')
<AxesSubplot:xlabel='OrderDate'>
[Figure: line plot of monthly total sales]

Let’s plot how the sales of the different salespersons have progressed. The for loop in the command goes through all the salespersons and draws each one’s performance into the same chart (‘Q’ resamples to quarterly frequency). With set, we can pick the unique values from the LastName column.

plt.figure(figsize=(15,10))
for name in set(query_df['LastName']):
    plt.plot(query_df['TotalPrice'][query_df['LastName'] == name].resample('Q').sum())
plt.legend(set(query_df['LastName']))
plt.show()
[Figure: line plot of quarterly sales per salesperson]
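The same chart can be drawn without an explicit loop by grouping on both the quarter and the salesperson and then unstacking; this is a sketch of an alternative, not the course’s original approach:

# Group by quarter (from the datetime index) and by salesperson,
# then unstack LastName into columns and plot all lines at once.
query_df.groupby([pd.Grouper(freq='Q'), 'LastName'])['TotalPrice'].sum().unstack().plot(figsize=(15,10))
plt.show()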

We can also use bar charts. Here are the sales of the different salespersons and product categories. We first do a two-level grouping, sum the values within those groups, and pick TotalPrice. Adding plot.barh() to the end turns the two-level grouping table into a bar chart.

query_df.groupby(['LastName','CategoryName']).sum()['TotalPrice'].plot.barh(figsize=(5,15))
plt.show()
[Figure: horizontal bar chart of sales by salesperson and product category]

We can also use percentage values in tables. (I admit, the following command is a mess!) It divides the values of a LastName/CategoryName pivot table by the row sums of that table. Then, it multiplies these numbers by one hundred. style.format is used to decrease the number of decimals to two ({:.2f}) and to add a % sign to the end.

(query_df.pivot_table(values = 'TotalPrice', index = 'LastName',
                      columns = 'CategoryName').divide(query_df.pivot_table(values = 'TotalPrice',
                    index = 'LastName', columns = 'CategoryName').sum(axis=1),axis=0)*100).style.format('{:.2f} %')
CategoryName Beverages Condiments Confections Dairy Products Grains/Cereals Meat/Poultry Produce Seafood
LastName
Buchanan 15.70 % 9.51 % 10.41 % 11.92 % 8.35 % 22.39 % 13.14 % 8.58 %
Callahan 15.71 % 9.52 % 10.41 % 11.84 % 8.18 % 22.45 % 13.44 % 8.45 %
Davolio 15.45 % 9.46 % 10.37 % 11.89 % 8.34 % 22.36 % 13.70 % 8.43 %
Dodsworth 15.65 % 9.64 % 10.32 % 11.91 % 8.38 % 22.36 % 13.06 % 8.67 %
Fuller 15.73 % 9.45 % 10.39 % 11.87 % 8.39 % 22.35 % 13.28 % 8.53 %
King 15.59 % 9.59 % 10.56 % 11.94 % 8.36 % 21.95 % 13.49 % 8.53 %
Leverling 15.69 % 9.58 % 10.42 % 11.98 % 8.37 % 22.31 % 13.08 % 8.58 %
Peacock 15.41 % 9.50 % 10.44 % 11.83 % 8.33 % 22.27 % 13.61 % 8.60 %
Suyama 15.63 % 9.52 % 10.36 % 11.86 % 8.34 % 22.40 % 13.38 % 8.52 %
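The same shares can be computed much more compactly with pd.crosstab and its normalize argument; this is a sketch of an alternative, not the original command (aggfunc='mean' mirrors the default aggregation of pivot_table above):

# normalize='index' divides each row by its row sum,
# and the percent format replaces the manual *100.
pd.crosstab(query_df['LastName'], query_df['CategoryName'],
            values=query_df['TotalPrice'], aggfunc='mean',
            normalize='index').style.format('{:.2%}')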

5.4. Other sources of accounting data#

Google Dataset Search is an excellent source of datasets, including interesting accounting datasets: datasetsearch.research.google.com/

Quandl is another interesting source of data. They have some free datasets, but you need to register to get an API key before you can download any data. Quandl offers a library that you can use to download datasets directly in Python: www.quandl.com/
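A minimal sketch of the quandl library; the dataset code and the key placeholder are just illustrative.

import quandl

quandl.ApiConfig.api_key = 'YOUR_API_KEY'  # replace with your own key
# 'FRED/GDP' is an example dataset code; browse quandl.com for others.
gdp = quandl.get('FRED/GDP')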

5.4.1. Pandas Datareader#

Pandas Datareader is a library that can be used to download external datasets into Pandas dataframes: pandas-datareader.readthedocs.io/en/latest/

Currently, the following sources are supported in Pandas Datareader:

  • Tiingo

  • IEX

  • Alpha Vantage

  • Enigma

  • EconDB

  • Quandl

  • St.Louis FED (FRED)

  • Kenneth French’s data library

  • World Bank

  • OECD

  • Eurostat

  • Thrift Savings Plan

  • Nasdaq Trader symbol definitions

  • Stooq

  • MOEX

  • Naver Finance

For most of these, free registration is required to get an API key.

You need to install Pandas Datareader first. It is included in Anaconda, and it can also be installed with pip using the command pip install pandas-datareader.

Let’s import the library.

import pandas_datareader as pdr

Let’s use EconDB (www.econdb.com/) in our example. In the following code, we load the quarterly values of Finland’s gross domestic product from the year 1999 to the most recent value.

data = pdr.data.DataReader('ticker=RGDPFI','econdb',start=1999)

It returns a Pandas dataframe, so we can apply all the Pandas functions to it.

data.plot(figsize=(10,5))
plt.show()
[Figure: line plot of Finland’s quarterly GDP]

The Fama/French data library is also very interesting for accounting research: mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

import pandas_datareader.famafrench as fama

There are 297 datasets available.

len(fama.get_available_datasets())
297

Let’s load industry portfolio return data.

data2 = pdr.data.DataReader('10_Industry_Portfolios', 'famafrench')

This time, the function returns a dictionary. The items of the dictionary are dataframes with different data. The DESCR key can be used to get information about the data.

type(data2)
dict
data2['DESCR']
'10 Industry Portfolios\n----------------------\n\nThis file was created by CMPT_IND_RETS using the 202009 CRSP database. It contains value- and equal-weighted returns for 10 industry portfolios. The portfolios are constructed at the end of June. The annual returns are from January to December. Missing data are indicated by -99.99 or -999. Copyright 2020 Kenneth R. French\n\n  0 : Average Value Weighted Returns -- Monthly (59 rows x 10 cols)\n  1 : Average Equal Weighted Returns -- Monthly (59 rows x 10 cols)\n  2 : Average Value Weighted Returns -- Annual (5 rows x 10 cols)\n  3 : Average Equal Weighted Returns -- Annual (5 rows x 10 cols)\n  4 : Number of Firms in Portfolios (59 rows x 10 cols)\n  5 : Average Firm Size (59 rows x 10 cols)\n  6 : Sum of BE / Sum of ME (6 rows x 10 cols)\n  7 : Value-Weighted Average of BE/ME (6 rows x 10 cols)'
data2[0].plot(figsize=(10,8))
<AxesSubplot:xlabel='Date'>
[Figure: line plot of monthly value-weighted returns for the 10 industry portfolios]
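As a final sketch, the DESCR text above indicates that the returns are expressed in percent, so we could compound them into cumulative returns like this:

# Returns are in percent, so divide by 100 before compounding.
(1 + data2[0] / 100).cumprod().plot(figsize=(10, 8))
plt.show()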