6. Visualisations with Python

Python has many visualisation libraries. In this chapter, we explore Matplotlib and Seaborn, but first we look at the somewhat limited plotting capabilities of Pandas.

6.1. Pandas

import pandas as pd

Although Pandas is mainly a data handling library, it has some functions for plotting too.

The Series and DataFrame objects have a plot() method for basic plotting. By default, it draws a line plot.

pe_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/PE.csv',index_col=0)

pe_df['PE_ratio'] = pe_df['price']/pe_df['earnings']

pe_df.index = pd.date_range('1800', '1932', freq='Y')

The evolution of P/E-ratio from 1800 to 1931.

pe_df['PE_ratio'].plot()
<AxesSubplot:>
_images/8c0cc7071fb0b5ffa3f1a91c97bd5145441f8b306aa90a2e806843b2348188c2.png

Pandas hands the actual drawing over to Matplotlib, so most Matplotlib parameters also work with Pandas plots.
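As a minimal sketch of this, the following passes ordinary Matplotlib line properties to plot() and then keeps working with the returned Matplotlib axes object. The series `demo` is made-up illustration data, not the P/E series used above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import numpy as np
import pandas as pd

# Illustrative data only (an assumption, not part of the chapter's datasets)
demo = pd.Series(np.linspace(10.0, 20.0, 50), name="demo")

# color, linestyle and linewidth are regular Matplotlib line properties
ax = demo.plot(color="darkred", linestyle="--", linewidth=1.5, title="Demo series")

# plot() returns a Matplotlib Axes, so its methods can be used afterwards
ax.set_ylabel("Value")
line = ax.lines[0]
```

Because the return value is a regular axes object, everything shown later in the Matplotlib section can also be applied to a plot that was started from Pandas.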

Pandas also has methods for histograms and KDE (kernel density estimate) plots.

pe_df['PE_ratio'].hist()
<AxesSubplot:>
_images/d2bdaa65900883d4b5a2c490196be1bc533f63818add289b09ad9e9024a87f1d.png
pe_df['PE_ratio'].plot.density()
<AxesSubplot:ylabel='Density'>
_images/43c51b3571c2566892f86714e02d54a06183220f0a52cbd2e5def48e212e5558.png

Pandas has bar plots. Notice how they are invoked using plot.bar().

know_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/politicalKnowledge.csv',index_col=0)
know_df.set_index('country',inplace=True)

Political knowledge in the US and Europe (high school background).

know_df[['DomesticKnowledge.hs','InternationalKnowledge.hs']].plot.bar()
<AxesSubplot:xlabel='country'>
_images/6aaa4e67f2b931dca1f034b1b7473c65c001f95a946d2ebb8e9af961b987925d.png

And horizontal bars. Political knowledge in the US and Europe (college background).

know_df[['DomesticKnowledge.c','InternationalKnowledge.c']].plot.barh(figsize=(10,6))
<AxesSubplot:ylabel='country'>
_images/513fc2f0fa99008b9d568affba99c620bb3041e3f33e5764447442d1e8aced92.png

With stacked=True, grouped bars become stacked bars. Political knowledge in the US and Europe (high school/some college/college background stacked together).

know_df[['DomesticKnowledge.hs', 'DomesticKnowledge.sc', 'DomesticKnowledge.c',
       'InternationalKnowledge.hs', 'InternationalKnowledge.sc',
       'InternationalKnowledge.c']].plot.bar(stacked=True,figsize=(10,10))
<AxesSubplot:xlabel='country'>
_images/0d418fbdac074e4fe09b1bb6422bafdb6eaad76f33601c593740d483d6f95c29.png
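To see what stacked=True changes, here is a small sketch with a hypothetical three-row dataframe (`scores` and its values are invented for illustration). In the stacked version, each bar of the second column starts at the top of the corresponding bar of the first column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import numpy as np
import pandas as pd

# Hypothetical scores, used only for illustration
scores = pd.DataFrame(
    {"domestic": [3.0, 1.0, 2.0], "international": [2.0, 4.0, 1.0]},
    index=["A", "B", "C"],
)

ax_grouped = scores.plot.bar()              # bars side by side
ax_stacked = scores.plot.bar(stacked=True)  # bars on top of each other

# Lower edge of every bar of the second column in both versions
grouped_bottoms = [bar.get_y() for bar in ax_grouped.containers[1]]
stacked_bottoms = [bar.get_y() for bar in ax_stacked.containers[1]]
```

In the grouped plot all bars start from zero, whereas in the stacked plot the lower edges of the second column equal the values of the first column.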

6.2. Matplotlib

Next, we go through the basic features of Matplotlib. One important thing to remember is that Jupyter resets most of the figure settings after each cell is evaluated, so the commands that build a figure need to be executed in a single cell.

Functions for basic plots are in the matplotlib.pyplot module.

import matplotlib.pyplot as plt

You can use many different styles for your plots. The full list of styles is here:

plt.style.available
['Solarize_Light2',
 '_classic_test_patch',
 'bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn',
 'seaborn-bright',
 'seaborn-colorblind',
 'seaborn-dark',
 'seaborn-dark-palette',
 'seaborn-darkgrid',
 'seaborn-deep',
 'seaborn-muted',
 'seaborn-notebook',
 'seaborn-paper',
 'seaborn-pastel',
 'seaborn-poster',
 'seaborn-talk',
 'seaborn-ticks',
 'seaborn-white',
 'seaborn-whitegrid',
 'tableau-colorblind10']

I usually find the bmh style the most appealing. Fivethirtyeight and ggplot are also good.

plt.style.use('bmh')

plot() draws sequence-like objects as a line plot.

import numpy as np
values = np.random.uniform(-1.,1.,size=(100))
csum_values=np.cumsum(values)
plt.plot(csum_values)
plt.show()
_images/52e6437ea0550b2073f60d744278c2767526e7af624dc7260793c19bb4a46e04.png

Although plots will show in Jupyter notebooks without plt.show(), adding it removes the useless textual output above the plots.

As I mentioned earlier, everything in Python is an object, and Matplotlib figures are no exception. The following command creates an empty figure (a template) without any plots.

template = plt.figure()
<Figure size 432x288 with 0 Axes>

We can add plots to the template using add_subplot(). The plots are added as axes objects.

template = plt.figure()
ax1 = template.add_subplot(2,2,1)
ax2 = template.add_subplot(2,2,2)
ax3 = template.add_subplot(2,2,3)
ax4 = template.add_subplot(2,2,4)
plt.plot(csum_values)
plt.show()
_images/0cb2db2b0e33a2959908f41b42dc8f008b8eec63f4d9636cb623357e9f7d4433.png

By default, Matplotlib draws on the axes object that was added last. You can draw on a specific axes object by calling the plotting method on that object.

template = plt.figure(figsize=(12,8))
ax1 = template.add_subplot(2,2,1)
ax2 = template.add_subplot(2,2,2)
ax3 = template.add_subplot(2,2,3)
ax4 = template.add_subplot(2,2,4)
ax1.plot(csum_values,'r--')
ax2.hist(csum_values,bins=20,color='r')
ax3.plot(np.diff(csum_values),linestyle='--')
ax4.hist(np.diff(csum_values),bins=20)
plt.show()
_images/c7b01f7f8eec11bfdf0a089bda2d5365dc76b7331437c58afbff61bf808af2f4.png
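The difference between drawing on the "current" axes and on a named axes object can be checked directly. The sketch below uses two subplots with hypothetical names `left` and `right` and counts the lines that end up on each of them.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

fig = plt.figure()
left = fig.add_subplot(1, 2, 1)
right = fig.add_subplot(1, 2, 2)

# plt.plot() draws on the current axes, which is the one added last
plt.plot([1, 2, 3])
current_is_right = plt.gca() is right
lines_after_pyplot_call = (len(left.lines), len(right.lines))

# Calling plot() on an axes object targets that axes explicitly
left.plot([3, 2, 1])
lines_at_the_end = (len(left.lines), len(right.lines))
```

Calling the methods of the axes objects is usually the safer choice in figures with several subplots, because the result does not depend on the order in which the axes were created.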

As you can see from the example above, colors, linestyles, etc. are defined using string codes.

plt.plot(csum_values[20:40],'ro--')
plt.show()
_images/0a538550d83df6d1f41f275d1b6d3386b1c6258479a2a2d300fb8fcacea914e0.png

The string codes are convenient once you know them, but learning them takes a little time. The Matplotlib documentation at matplotlib.org has the full list of the string codes. You can draw the figure above using a more explicit syntax.

plt.plot(csum_values[20:40], color='r',marker='o',linestyle='dashed')
plt.show()
_images/0a538550d83df6d1f41f275d1b6d3386b1c6258479a2a2d300fb8fcacea914e0.png
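A quick way to convince yourself that the two notations are equivalent is to compare the properties of the resulting line objects. This is only a sketch with made-up data; the helper `style()` is a hypothetical name introduced for the comparison.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

data = [1, 3, 2, 4]
fig, ax = plt.subplots()

# ax.plot() returns a list of lines; each call here creates exactly one
(short,) = ax.plot(data, "ro--")
(explicit,) = ax.plot(data, color="r", marker="o", linestyle="dashed")


def style(line):
    """Collect the color, marker and linestyle of a line."""
    return line.get_color(), line.get_marker(), line.get_linestyle()


same_style = style(short) == style(explicit)
```

The format string 'ro--' is read as color 'r', marker 'o' and linestyle '--', and Matplotlib stores the linestyle name 'dashed' in the same short form.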

By default, adjacent points in a line plot are connected with straight lines (linear interpolation). With drawstyle, you can draw a step function instead, where the value stays constant between the steps.

plt.plot(csum_values[30:40], color='r')
plt.plot(csum_values[30:40], color='darkblue',marker='o',linestyle='dashed',drawstyle='steps-mid')
plt.show()
_images/70f5b2f95ae355f23f4dcadb12da2e2d3f3e8be8790fb65858c625e1d16d1b79.png

Let’s load a more interesting dataset.

import pandas as pd
comp_df = pd.read_csv('stat_data.csv',index_col=0)
# winsorizing numerical variables
numer = comp_df.columns[2:6]
comp_df[numer] = comp_df[numer].clip(lower=comp_df[numer].quantile(0.025),
                                     upper=comp_df[numer].quantile(0.975),axis=1)
comp_df
|  | NAME | DIV. YIELD | ROE (%) | R&D/SALES (%) | CoGS/SALES - 5 Y (%) | SG%A/SALES 5Y (%) | ACCOUNTING STANDARD | Accounting Controversies | Basis of EPS data | INDUSTRY GROUP | IBES COUNTRY CODE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | APPLE | 0.71 | 55.92 | 4.95 | 56.6400 | 6.53 | US standards (GAAP) | N | NaN | 4030.0 | US |
| 1 | SAUDI ARABIAN OIL | 0.21 | 32.25 | NaN | NaN | NaN | IFRS | N | EPS | 5880.0 | FW |
| 2 | MICROSOFT | 1.07 | 40.14 | 13.59 | 26.4600 | 19.56 | US standards (GAAP) | N | NaN | 4080.0 | US |
| 3 | AMAZON.COM | 0.00 | 21.95 | 12.29 | 56.8200 | 21.28 | US standards (GAAP) | N | NaN | 7091.0 | US |
| 4 | FACEBOOK CLASS A | 0.00 | 19.96 | 21.00 | 7.0600 | 20.42 | US standards (GAAP) | N | NaN | 8580.0 | US |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | BHP GROUP | 4.23 | 16.61 | NaN | 45.5000 | NaN | IFRS | N | IFRS | 5210.0 | EX |
| 296 | CITIC SECURITIES 'A' | 1.67 | 7.77 | NaN | 10.9100 | 27.44 | IFRS | N | EPS | 4395.0 | FC |
| 297 | EDWARDS LIFESCIENCES | 0.00 | 28.73 | 16.09 | 23.3300 | 30.22 | US standards (GAAP) | N | NaN | 3440.0 | US |
| 298 | GREE ELECT.APP. 'A' | 2.25 | 24.52 | NaN | 66.3500 | 14.94 | Local standards | N | EPS | 3720.0 | FC |
| 299 | HOUSING DEVELOPMENT FINANCE CORPORATION | 1.08 | 18.17 | NaN | 4.7435 | 23.90 | Local standards | N | EPS | 4390.0 | FI |

300 rows × 11 columns
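The winsorizing step in the cell above may deserve a closer look. The sketch below repeats the same clip() pattern on a small made-up dataframe (`df` with columns `a` and `b` is an assumption for illustration), so that the effect is easy to verify: every value below the 2.5% quantile or above the 97.5% quantile of its own column is replaced by that quantile.

```python
import numpy as np
import pandas as pd

# Made-up data: column "a" runs from 0 to 100, column "b" from 0 to 200
df = pd.DataFrame({
    "a": np.arange(101, dtype=float),
    "b": 2 * np.arange(101, dtype=float),
})

# One lower and one upper bound per column
lower = df.quantile(0.025)
upper = df.quantile(0.975)

# axis=1 matches the bounds to the columns by name
winsorized = df.clip(lower=lower, upper=upper, axis=1)

extremes = winsorized.agg(["min", "max"])
```

Here the extremes of column `a` become roughly 2.5 and 97.5, and those of column `b` roughly 5 and 195. The number of rows does not change, because winsorizing replaces extreme values instead of dropping them.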

We can add all the axes objects at once using plt.subplots(). The following for loop iterates over two sequences in parallel: one goes through the axes objects and the other through specific columns of the dataframe. zip() pairs them up.

fig, axes = plt.subplots(2,2,figsize=(10,10))
for ax, column in zip(axes.flat,comp_df.columns[2:6]):
    ax.hist(comp_df[column],bins=15)
    ax.set_title(column)
_images/e091ca9d085000d2a16a8cf00df8de12a2a92b0cd5d760145052fbe5e0107a01.png
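One detail of this pattern is that zip() stops at the shorter sequence. The sketch below uses a hypothetical dataframe `demo_df` with only three columns on a 2x2 grid, so the fourth axes object is never touched by the loop.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data with three columns (an assumption for illustration)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x", "y", "z"])

# axes is a 2x2 NumPy array; axes.flat walks through it row by row
fig, axes = plt.subplots(2, 2, figsize=(8, 8))
for ax, column in zip(axes.flat, demo_df.columns):
    ax.hist(demo_df[column], bins=15)
    ax.set_title(column)

titles = [ax.get_title() for ax in axes.flat]
bars_per_axes = [len(ax.patches) for ax in axes.flat]
```

The last subplot stays empty. If that is not wanted, it can be hidden with axes[1, 1].set_visible(False).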

Matplotlib has a huge collection of different plot types.

The boxplot function does not handle missing values, so we drop them with dropna() first.

plt.figure(figsize=(6,8))
plt.boxplot(comp_df['ROE (%)'].dropna(),notch=True)
plt.show()
_images/c325bfe465fdad6ab9cd87a66bea296c47a74e8a7621e874eea4eb405e24ad74.png
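The following sketch shows the same dropna() step on a tiny made-up series and reads the median back from the object that boxplot() returns. The data values are assumptions chosen so that the median is easy to check by hand.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up values with one missing observation
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

fig, ax = plt.subplots()
# Drop the missing value before handing the data to Matplotlib
result = ax.boxplot(values.dropna().to_numpy())

# boxplot() returns a dictionary of the drawn artists
median_y = result["medians"][0].get_ydata()[0]
```

The returned dictionary also contains the boxes, whiskers, caps and fliers, which is useful if you want to restyle individual parts of the plot afterwards.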

The horizontal bar plot function is barh(). The corresponding function for a vertical bar plot is bar().

plt.barh(comp_df['NAME'][0:10],comp_df['CoGS/SALES - 5 Y (%)'][0:10])
plt.show()
_images/6ef2878681c21ce6aba75251e32d2179a5e9d4db7834ae0df9529ba10ed74702.png

We already saw the normal histogram function, hist(). There is also a function for two-dimensional histograms.

plt.hist2d(np.random.normal(size=2000),np.random.normal(size=2000),bins=20)
plt.show()
_images/82bc7a7478d4bc8906e33727848052150881a36f4476065100134830344f4bb1.png

A lineplot with log-log scales. With legend(), you can add a legend to your plots.

euro_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/EuStockMarkets.csv',index_col=0)
plt.loglog(euro_df['DAX'],label='DAX')
plt.loglog(euro_df['SMI'],label='SMI')
plt.legend()
plt.show()
_images/034e98d7a3d674298160fafaf2ef99ed9e0f08815d180a16182d6d6b11364322.png
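The same combination of loglog() and legend() also works through an axes object. In this sketch, two made-up power functions are drawn, and the legend entries come from the label arguments.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 101)

fig, ax = plt.subplots()
# Power functions appear as straight lines on log-log axes
ax.loglog(x, x**2, label="quadratic")
ax.loglog(x, x**3, label="cubic")
ax.legend()  # collects the label= arguments given above

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
scales = (ax.get_xscale(), ax.get_yscale())
```

Lines without a label argument are left out of the legend.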

There is also a function for pie charts. Let's bin the R&D/SALES variable into four categories to get meaningful data for a pie chart.

categs = pd.cut(comp_df['R&D/SALES (%)'],4)

With title(), you can add a title.

plt.pie(categs.value_counts(),labels=categs.value_counts().index)
plt.title('R&D/SALES (%)')
plt.show()
_images/03a0ec5434e8c45018b3f561e168cb9af9cdc2e4cd2d8323be5d6d78d643c177.png
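To make the binning step more concrete, the sketch below applies pd.cut() to a handful of made-up values (an assumption for illustration). With an integer as the second argument, pd.cut() splits the observed range into that many equal-width intervals, and value_counts() then counts the observations in each interval.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up values, including one missing observation
values = pd.Series([0.5, 1.0, 4.0, 6.0, 9.0, 11.0, 15.0, 19.5, np.nan])

# Four equal-width bins over the observed range
categs = pd.cut(values, 4)

# Count per bin, ordered by bin instead of by frequency
counts = categs.value_counts().sort_index()

fig, ax = plt.subplots()
ax.pie(counts, labels=[str(interval) for interval in counts.index])
ax.set_title("Binned values")
```

Missing values do not belong to any bin, so they are left out of the counts and of the pie chart.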

A polar plot.

plt.polar(np.linspace(0,2*np.pi,100),np.sin(2*np.linspace(0,2*np.pi,100)))
[<matplotlib.lines.Line2D at 0x165153836d0>]
_images/1840613647310357d59ebb41af389687502af4912654a3c7f55ad47d29352d02.png
euro_df['DAX'].pct_change()
1            NaN
2      -0.009283
3      -0.004412
4       0.009044
5      -0.001777
          ...   
1856   -0.024631
1857   -0.031985
1858    0.019138
1859   -0.005924
1860    0.022164
Name: DAX, Length: 1860, dtype: float64

Power Spectral Density can be used to recognise cycles in a time series.

plt.psd(euro_df['DAX'].pct_change()[1:])
plt.show()
_images/3212dcfaa2cc266aea203f29a4fe82e7acdac5533c9155918c9417e074a88538.png
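Stock returns rarely show clear cycles, so the idea is easier to see with an artificial series. The sketch below builds a noisy sine wave with a known cycle of eight observations (all parameters are assumptions for illustration) and checks that the power spectral density peaks at the corresponding frequency.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1024)
cycle_freq = 0.125  # one full cycle every 8 observations

# Sine wave with a known cycle plus a little noise
signal = np.sin(2 * np.pi * cycle_freq * t) + 0.1 * rng.normal(size=t.size)

fig, ax = plt.subplots()
# psd() draws the spectrum and returns the power values and their frequencies
power, freqs = ax.psd(signal, NFFT=256, Fs=1)

peak_freq = freqs[np.argmax(power)]
```

With Fs=1, the frequencies are expressed in cycles per observation, so a peak at 0.125 corresponds to a cycle length of 1/0.125 = 8 observations.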

Matplotlib allows you to customise almost everything in its plots. We already saw how to add a title and a legend to plots. We can also add axis titles and customise ticks/tickmarks.

fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(comp_df['R&D/SALES (%)'])
plt.show()
_images/96e14b35c204fc77babdf71a6f3617dd1e00007033810d493a564f38a6c069ef.png

Let’s add axis titles to the histogram, and customise its x-ticks. The number of ticks and tick labels must be equal.

fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(comp_df['R&D/SALES (%)'])
ax.set_xticks([0,10,20])
ax.set_xticklabels(['0 %', '10 %', '20 %'])
ax.set_xlabel('R&D / SALES')
ax.set_ylabel('Freq.')
plt.show()
_images/e537399f29781f8be67791ce775a20f7624501ab3803bf42ae0e380e104745e2.png
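Writing the tick labels by hand makes it easy to break the rule that ticks and labels must match. One way to avoid that, sketched below with made-up data, is to build the labels from the list of tick positions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Made-up percentages between 0 and 25
rng = np.random.default_rng(1)
shares = rng.uniform(0, 25, size=200)

fig, ax = plt.subplots()
ax.hist(shares)

# Deriving the labels from the positions keeps the two lists the same length
ticks = [0, 10, 20]
ax.set_xticks(ticks)
ax.set_xticklabels([f"{tick} %" for tick in ticks])
ax.set_xlabel("Share")
ax.set_ylabel("Freq.")

shown_labels = [label.get_text() for label in ax.get_xticklabels()]
```

If the tick positions are changed later, the labels follow automatically.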

A very convenient feature in Matplotlib is annotation. With annotate(), we can highlight important points in figures: xy is the point that the arrow points to, and xytext is the position of the text.

from datetime import datetime

euro_df.index = pd.date_range('1991-01-01', '1998-02-16', freq='B')
plt.figure(figsize=(10,8))
plt.plot(euro_df)
plt.annotate('The IT boom starts',xy=(datetime(1996,1,1),4000),
             xytext=(datetime(1994,6,1),5000),arrowprops=dict(facecolor='red'))
plt.show()
_images/2e381938eb95a36e6e02bbc2dc122ac04800a0e1b80bd2eebd78acd3b1d8228f.png
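The roles of xy and xytext are perhaps easiest to see with plain numeric axes. The following sketch annotates a made-up straight line; the coordinates and the text are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Made-up series: the values 1, 2, ..., 50
y = np.arange(1, 51)

fig, ax = plt.subplots()
ax.plot(y)

# xy is the point being highlighted, xytext is where the text is placed
note = ax.annotate(
    "Level 30 reached",
    xy=(29, 30),
    xytext=(5, 40),
    arrowprops=dict(facecolor="red"),
)
```

Both coordinates are given in data units by default, which is why the example with dates above uses datetime objects for the x-coordinates.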

When you are happy with your figure, you can save it to a file using savefig(). The function has many convenient parameters, which are described here: matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html?highlight=savefig.
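A minimal sketch of savefig() with a few of its parameters follows. To keep the example free of side effects, the figure is written to an in-memory buffer; in practice you would pass a file name instead, for example 'figure.png' (a hypothetical name).

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# dpi sets the resolution, bbox_inches="tight" trims the empty margins
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150, bbox_inches="tight")

png_bytes = buffer.getvalue()
```

When a file name is given, the format argument can be left out, because Matplotlib infers the file type from the extension.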

If you want to change the default settings of Matplotlib, you can use rc(). This changes the settings for all cells, not just the cell you are executing. For example, to change the default line settings, you can write:

plt.rc('lines', linewidth=1, linestyle='dashed')
plt.plot(euro_df.iloc[0:100])
plt.show()
_images/379e387c5e86fb2cb4d7336a375d60e92b1bef263cfec90c5b0407863f40fb84.png

You can restore default settings with rcdefaults().

plt.rcdefaults()
plt.style.use('bmh')
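The settings changed by rc() are stored in the dictionary-like object plt.rcParams, so the effect of rc() and rcdefaults() can be inspected directly. The following is a small sketch of that round trip.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

plt.rcdefaults()  # start from Matplotlib's built-in defaults

# Change two line settings for everything drawn from now on
plt.rc("lines", linewidth=1, linestyle="dashed")
changed = (plt.rcParams["lines.linewidth"], plt.rcParams["lines.linestyle"])

# Go back to the built-in defaults
plt.rcdefaults()
restored = (plt.rcParams["lines.linewidth"], plt.rcParams["lines.linestyle"])
```

Note that rcdefaults() also discards the settings of a style such as bmh, so the style has to be applied again afterwards if you want to keep using it.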

6.3. Seaborn

Seaborn is a plotting library built on top of Matplotlib. It is especially designed for statistical plotting. Seaborn has very beautiful plots by default, and you get publication-quality plots from it without exhaustive parameter tuning.

import seaborn as sns

The bar plot in Seaborn automatically adds error bars to the plot. By default, each bar shows the group mean and the error bar its 95% confidence interval.

euro_ret = euro_df.pct_change()

euro_ret.drop(datetime(1991,1,1),inplace=True)

unstacked = euro_ret.unstack().to_frame()

unstacked.rename({0:'Return'},axis=1,inplace=True)
sns.barplot(x=unstacked['Return'],y=unstacked.index.get_level_values(0))
<AxesSubplot:xlabel='Return'>
_images/259424b3c59f1150f79684422fad54261127174adf4203b56915e0b6de4365c0.png
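The reshaping done before the barplot() call can be followed more easily with a tiny example. The sketch below uses made-up prices for two indices (the numbers are assumptions for illustration) and shows how unstack() turns a wide dataframe into a long one, with the index name as the first level of the row index.

```python
import numpy as np
import pandas as pd

# Made-up prices for two indices on four business days
prices = pd.DataFrame(
    {"DAX": [100.0, 101.0, 99.0, 102.0], "SMI": [50.0, 50.5, 51.0, 50.0]},
    index=pd.date_range("1991-01-01", periods=4, freq="B"),
)

# Daily returns; the first row is NaN and is dropped
returns = prices.pct_change().iloc[1:]

# Wide to long: one row per (index name, date) pair
long_returns = returns.unstack().to_frame(name="Return")
index_names = long_returns.index.get_level_values(0)

# The group means that the bars of a bar plot would show
mean_by_index = long_returns.groupby(level=0)["Return"].mean()
```

The long format has one column of values and one label per observation, which is the layout that Seaborn's categorical plots expect for their x and y arguments.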

Seaborn has efficient tools for histograms and KDEs.

sns.displot(euro_ret['DAX'],kind='hist')
<seaborn.axisgrid.FacetGrid at 0x15323bb6730>
_images/d22e187d95d203c2a52dc665877330318aad81be8a6ebfeba7a4436a5aec0b99.png
sns.displot(euro_ret['DAX'],kind='kde')
<seaborn.axisgrid.FacetGrid at 0x1532357deb0>
_images/c155c199814d023fe32cf04b84667ce19f8f303ac391fa25144018e54f9b0d42.png

regplot() draws a scatter plot together with a fitted linear regression line. We need to reset the Matplotlib settings first, because regplot() does not work correctly with the bmh style.

plt.rcdefaults()
sns.set_theme(color_codes=True)

sns.regplot(x='DIV. YIELD',y='ROE (%)',data=comp_df)
<AxesSubplot:xlabel='DIV. YIELD', ylabel='ROE (%)'>
_images/f5c4d19cd7ac0d4e0156976adbe5828c626e46dcd65ac0cfb3bf2034981476f0.png
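For readers who want to see the numbers behind such a line, here is a rough NumPy sketch of an ordinary least-squares fit of the kind regplot() draws by default. It is not Seaborn's own code, and the dataframe `demo` is made up so that the result is easy to check: the points lie exactly on the line y = 1 + 2x.

```python
import numpy as np
import pandas as pd

# Made-up data on the line y = 1 + 2x, with one missing x value
demo = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, np.nan, 5.0],
    "y": [1.0, 3.0, 5.0, 7.0, 9.0, 11.0],
})

# Rows with missing values cannot be used in the fit
clean = demo.dropna()

# A polynomial of degree 1 is a straight line: y = intercept + slope * x
slope, intercept = np.polyfit(clean["x"], clean["y"], deg=1)
fitted = intercept + slope * clean["x"]
```

With real data the points scatter around the line, and the shaded band in the Seaborn figure indicates the uncertainty of the fitted line.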

Pairplot is an excellent tool for quick exploratory analysis. It combines pairwise scatter plots and histograms of the numerical variables in a single figure.

sns.pairplot(comp_df)
<seaborn.axisgrid.PairGrid at 0x1532197e790>
_images/cc19f9cedde1ddedb033c2fafeed81a6854ee4d578006d76fb7c9191042bc775.png

With catplot, you can categorise data in multiple ways.

sns.catplot(x='ACCOUNTING STANDARD',y='DIV. YIELD',
               hue='Accounting Controversies',col='Basis of EPS data',kind='bar',data=comp_df)
plt.show()
_images/ae4055ac12c9721c3fcd0936cb7084ef48ec17f0c83444ad914b2b4efa3f26d9.png

There is also the boxplot option in catplot.

sns.catplot(x='ROE (%)',y='ACCOUNTING STANDARD',kind='box',data=comp_df,aspect=2)
<seaborn.axisgrid.FacetGrid at 0x1532a30fdf0>
_images/c592176d00a0fb04f7b1b9aa5c97ede87e6d90bd9681c45e80bd5c083fc4b7ac.png

There are numerous other visualisation libraries in Python. Important ones that we have not covered here include at least Plotly, Bokeh and Altair, and new libraries are introduced all the time.