4. Pandas Data preprocessing#

You might be wondering why we are talking so much about data processing and manipulation in this book. “Where are all the fancy ML methods?” you might ask. Unfortunately, most of the time in data science goes to data pre-processing. It is often reported that 80 % of a data scientist’s time goes to data preparation: cleaning, transforming, rearranging and creating suitable features (feature engineering). So, to be a successful data scientist, you need to know how to play with data. Luckily, we have Pandas at our disposal, which is one of the most powerful data manipulation tools available.

Let’s look at some of the most common preprocessing situations we encounter when doing data science.

4.1. Basic methods#

Missing data is probably the most common issue with data.

import pandas as pd
companies_df = pd.read_csv('emissions.csv',index_col='NAME',delimiter=';')
companies_df
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 NaN
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
ALPHABET A 50294471.00 32.22 50.0
TESLA 17315009.00 NaN NaN
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA NaN 7.10 25.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
NETFLIX 3665870.00 NaN NaN

dropna() can be used to filter out missing data. The how parameter defines whether all or any of the values in a row/column need to be NaN for it to be dropped.

companies_df.dropna(how='all')
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 NaN
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
ALPHABET A 50294471.00 32.22 50.0
TESLA 17315009.00 NaN NaN
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA NaN 7.10 25.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
NETFLIX 3665870.00 NaN NaN
companies_df.dropna(how='any')
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
ALPHABET A 50294471.00 32.22 50.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0

You can again use the axis parameter to drop columns instead of rows. However, it is not meaningful here because every column has NaN values (how='any' would return an empty table) and no column is entirely NaN (how='all' would return the original dataframe).

With thresh you can require a minimum number of non-NaN values for a row (or column) to be kept.

companies_df.dropna(thresh=2)
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 NaN
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
ALPHABET A 50294471.00 32.22 50.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA NaN 7.10 25.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0

You can use fillna to fill NaN values with other values.

companies_df.fillna(0)
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 0.0
SAUDI ARABIAN OIL 3220550.15 175.59 0.0
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
ALIBABA GROUP HOLDING ADR 1:8 0.00 0.00 0.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 0.0
ALPHABET A 50294471.00 32.22 50.0
TESLA 17315009.00 0.00 0.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA 0.00 7.10 25.0
SAMSUNG ELECTRONICS 0.00 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 0.00 0.0
NETFLIX 3665870.00 0.00 0.0

Using a dict, you can define different fill values for different columns. A very common choice is to use the mean value of each column.

companies_df.fillna({'Board member compensation':1000000, 'Total CO2 To Revenues':10,
       'Emission Reduction Target %':0})
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 0.0
SAUDI ARABIAN OIL 3220550.15 175.59 0.0
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
ALIBABA GROUP HOLDING ADR 1:8 1000000.00 10.00 0.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 0.0
ALPHABET A 50294471.00 32.22 50.0
TESLA 17315009.00 10.00 0.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA 1000000.00 7.10 25.0
SAMSUNG ELECTRONICS 1000000.00 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 10.00 0.0
NETFLIX 3665870.00 10.00 0.0
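
As mentioned above, a very common choice is to fill each column with its own mean. A minimal sketch of that, assuming the same companies_df as above:

# Fill the NaN values of every column with that column's mean
companies_df.fillna(companies_df.mean())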

Interpolation methods that were available for reindexing can be used with fillna (in recent pandas versions you can also call the bfill() and ffill() methods directly).

companies_df.fillna(method = 'bfill')
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 75.0
SAUDI ARABIAN OIL 3220550.15 175.59 75.0
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
ALIBABA GROUP HOLDING ADR 1:8 2291182.00 25.69 75.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 50.0
ALPHABET A 50294471.00 32.22 50.0
TESLA 17315009.00 245.17 18.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA 4460719.00 7.10 25.0
SAMSUNG ELECTRONICS 4460719.00 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
NETFLIX 3665870.00 NaN NaN

duplicated() can be used to detect duplicated rows. It returns a boolean Series that marks every row which is identical to an earlier row.

More about combining dataframes later. Here we just build a new dataframe that contains companies_df twice (append() works in older pandas; in pandas 2.0 and later you would use pd.concat() instead). Then we sort the index so that every company appears twice in the new dataframe.

new_df = companies_df.append(companies_df)
new_df.sort_index(inplace=True)
new_df
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
ALPHABET A 50294471.00 32.22 50.0
ALPHABET A 50294471.00 32.22 50.0
AMAZON.COM 6567669.00 40.14 50.0
AMAZON.COM 6567669.00 40.14 50.0
APPLE 2832025.00 3.13 NaN
APPLE 2832025.00 3.13 NaN
FACEBOOK CLASS A 2291182.00 25.69 75.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
MICROSOFT 4336000.00 29.16 75.0
MICROSOFT 4336000.00 29.16 75.0
NESTLE 'N' 12906002.11 58.52 35.0
NESTLE 'N' 12906002.11 58.52 35.0
NETFLIX 3665870.00 NaN NaN
NETFLIX 3665870.00 NaN NaN
NVIDIA NaN 7.10 25.0
NVIDIA NaN 7.10 25.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
TENCENT HOLDINGS 2681758.80 15.81 NaN
TESLA 17315009.00 NaN NaN
TESLA 17315009.00 NaN NaN
new_df.duplicated()
NAME
ALIBABA GROUP HOLDING ADR 1:8    False
ALIBABA GROUP HOLDING ADR 1:8     True
ALPHABET A                       False
ALPHABET A                        True
AMAZON.COM                       False
AMAZON.COM                        True
APPLE                            False
APPLE                             True
FACEBOOK CLASS A                 False
FACEBOOK CLASS A                  True
JP MORGAN CHASE & CO.            False
JP MORGAN CHASE & CO.             True
KWEICHOW MOUTAI 'A'              False
KWEICHOW MOUTAI 'A'               True
MICROSOFT                        False
MICROSOFT                         True
NESTLE 'N'                       False
NESTLE 'N'                        True
NETFLIX                          False
NETFLIX                           True
NVIDIA                           False
NVIDIA                            True
SAMSUNG ELECTRONICS              False
SAMSUNG ELECTRONICS               True
SAUDI ARABIAN OIL                False
SAUDI ARABIAN OIL                 True
TAIWAN SEMICON.MNFG.             False
TAIWAN SEMICON.MNFG.              True
TENCENT HOLDINGS                 False
TENCENT HOLDINGS                  True
TESLA                            False
TESLA                             True
dtype: bool

We can remove duplicated rows using drop_duplicates().

new_df.drop_duplicates()
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
ALPHABET A 50294471.00 32.22 50.0
AMAZON.COM 6567669.00 40.14 50.0
APPLE 2832025.00 3.13 NaN
FACEBOOK CLASS A 2291182.00 25.69 75.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
MICROSOFT 4336000.00 29.16 75.0
NESTLE 'N' 12906002.11 58.52 35.0
NETFLIX 3665870.00 NaN NaN
NVIDIA NaN 7.10 25.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
TESLA 17315009.00 NaN NaN
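
For illustration, drop_duplicates() also accepts a subset of columns to compare and a keep parameter that controls which of the duplicates is retained. A minimal sketch, assuming the same new_df as above:

# Compare rows only on one column and keep the last occurrence instead of the first
new_df.drop_duplicates(subset=['Total CO2 To Revenues'], keep='last')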

It is easy to apply a function transformation to the rows, columns or individual cells of a dataframe. The map() method can be used for that.

companies_df['Compensation ($ millions)'] = companies_df['Board member compensation'].map(
    lambda x: "{:.2f}".format(x/1000000)+" M$")
companies_df
Board member compensation Total CO2 To Revenues Emission Reduction Target % Compensation ($ millions)
NAME
APPLE 2832025.00 3.13 NaN 2.83 M$
SAUDI ARABIAN OIL 3220550.15 175.59 NaN 3.22 M$
MICROSOFT 4336000.00 29.16 75.0 4.34 M$
AMAZON.COM 6567669.00 40.14 50.0 6.57 M$
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN nan M$
FACEBOOK CLASS A 2291182.00 25.69 75.0 2.29 M$
TENCENT HOLDINGS 2681758.80 15.81 NaN 2.68 M$
ALPHABET A 50294471.00 32.22 50.0 50.29 M$
TESLA 17315009.00 NaN NaN 17.32 M$
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0 12.58 M$
NESTLE 'N' 12906002.11 58.52 35.0 12.91 M$
NVIDIA NaN 7.10 25.0 nan M$
SAMSUNG ELECTRONICS NaN 69.06 70.0 nan M$
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0 4.46 M$
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN 0.04 M$
NETFLIX 3665870.00 NaN NaN 3.67 M$

map() can also be used to transform the index. The following command reverses the company names.

companies_df.index.map(lambda x : x[::-1])
Index(['ELPPA', 'LIO NAIBARA IDUAS', 'TFOSORCIM', 'MOC.NOZAMA',
       '8:1 RDA GNIDLOH PUORG ABABILA', 'A SSALC KOOBECAF', 'SGNIDLOH TNECNET',
       'A TEBAHPLA', 'ALSET', '.GFNM.NOCIMES NAWIAT', ''N' ELTSEN', 'AIDIVN',
       'SCINORTCELE GNUSMAS', '.OC & ESAHC NAGROM PJ', ''A' IATUOM WOHCIEWK',
       'XILFTEN'],
      dtype='object', name='NAME')

Of course, you can also use rename(). Using a dictionary, rename() can also be used to change only some of the labels.

companies_df.rename(index=str.title)
Board member compensation Total CO2 To Revenues Emission Reduction Target % Compensation ($ millions)
NAME
Apple 2832025.00 3.13 NaN 2.83 M$
Saudi Arabian Oil 3220550.15 175.59 NaN 3.22 M$
Microsoft 4336000.00 29.16 75.0 4.34 M$
Amazon.Com 6567669.00 40.14 50.0 6.57 M$
Alibaba Group Holding Adr 1:8 NaN NaN NaN nan M$
Facebook Class A 2291182.00 25.69 75.0 2.29 M$
Tencent Holdings 2681758.80 15.81 NaN 2.68 M$
Alphabet A 50294471.00 32.22 50.0 50.29 M$
Tesla 17315009.00 NaN NaN 17.32 M$
Taiwan Semicon.Mnfg. 12584165.55 245.17 18.0 12.58 M$
Nestle 'N' 12906002.11 58.52 35.0 12.91 M$
Nvidia NaN 7.10 25.0 nan M$
Samsung Electronics NaN 69.06 70.0 nan M$
Jp Morgan Chase & Co. 4460719.00 6.69 50.0 4.46 M$
Kweichow Moutai 'A' 40219.20 NaN NaN 0.04 M$
Netflix 3665870.00 NaN NaN 3.67 M$
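
As noted above, rename() with a dictionary changes only the labels you list. A minimal sketch (the new labels are hypothetical):

# Rename only two row labels; everything else is left untouched
companies_df.rename(index={'FACEBOOK CLASS A': 'META PLATFORMS A', 'ALPHABET A': 'GOOGLE A'})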

replace() can be used to replace any values, not just NaN values. You can also pass a dict or a list if you want to replace multiple values.

import numpy as np
companies_df.replace(np.nan,-999) # Pandas NaN is np.nan
Board member compensation Total CO2 To Revenues Emission Reduction Target % Compensation ($ millions)
NAME
APPLE 2832025.00 3.13 -999.0 2.83 M$
SAUDI ARABIAN OIL 3220550.15 175.59 -999.0 3.22 M$
MICROSOFT 4336000.00 29.16 75.0 4.34 M$
AMAZON.COM 6567669.00 40.14 50.0 6.57 M$
ALIBABA GROUP HOLDING ADR 1:8 -999.00 -999.00 -999.0 nan M$
FACEBOOK CLASS A 2291182.00 25.69 75.0 2.29 M$
TENCENT HOLDINGS 2681758.80 15.81 -999.0 2.68 M$
ALPHABET A 50294471.00 32.22 50.0 50.29 M$
TESLA 17315009.00 -999.00 -999.0 17.32 M$
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0 12.58 M$
NESTLE 'N' 12906002.11 58.52 35.0 12.91 M$
NVIDIA -999.00 7.10 25.0 nan M$
SAMSUNG ELECTRONICS -999.00 69.06 70.0 nan M$
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0 4.46 M$
KWEICHOW MOUTAI 'A' 40219.20 -999.00 -999.0 0.04 M$
NETFLIX 3665870.00 -999.00 -999.0 3.67 M$
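
As mentioned, replace() also accepts a list of old values or a dict that maps old values to new ones. A minimal sketch, assuming the same companies_df:

# Replace several values at once with a single value...
companies_df.replace([np.nan, 75.0], -999)
# ...or give each old value its own replacement with a dict
companies_df.replace({np.nan: -999, 75.0: 80.0})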

Dividing data into bins is a very important feature in Pandas. You can use pd.cut() (notice that it is not a dataframe method) to create a categorical object. pd.value_counts() can be used to count the number of observations in each bin. With the labels parameter you can define names for the different bins.

compensation_bins = [10000,100000,1000000,10000000,100000000]
comp_categ = pd.cut(companies_df['Board member compensation'],compensation_bins)
comp_categ
NAME
APPLE                              (1000000.0, 10000000.0]
SAUDI ARABIAN OIL                  (1000000.0, 10000000.0]
MICROSOFT                          (1000000.0, 10000000.0]
AMAZON.COM                         (1000000.0, 10000000.0]
ALIBABA GROUP HOLDING ADR 1:8                          NaN
FACEBOOK CLASS A                   (1000000.0, 10000000.0]
TENCENT HOLDINGS                   (1000000.0, 10000000.0]
ALPHABET A                       (10000000.0, 100000000.0]
TESLA                            (10000000.0, 100000000.0]
TAIWAN SEMICON.MNFG.             (10000000.0, 100000000.0]
NESTLE 'N'                       (10000000.0, 100000000.0]
NVIDIA                                                 NaN
SAMSUNG ELECTRONICS                                    NaN
JP MORGAN CHASE & CO.              (1000000.0, 10000000.0]
KWEICHOW MOUTAI 'A'                    (10000.0, 100000.0]
NETFLIX                            (1000000.0, 10000000.0]
Name: Board member compensation, dtype: category
Categories (4, interval[int64]): [(10000, 100000] < (100000, 1000000] < (1000000, 10000000] < (10000000, 100000000]]
comp_categ.values.categories
IntervalIndex([(10000, 100000], (100000, 1000000], (1000000, 10000000], (10000000, 100000000]],
              closed='right',
              dtype='interval[int64]')
comp_categ.values.codes
array([ 2,  2,  2,  2, -1,  2,  2,  3,  3,  3,  3, -1, -1,  2,  0,  2],
      dtype=int8)
pd.value_counts(comp_categ)
(1000000, 10000000]      8
(10000000, 100000000]    4
(10000, 100000]          1
(100000, 1000000]        0
Name: Board member compensation, dtype: int64
pd.cut(companies_df['Board member compensation'],
                    compensation_bins,labels = ['Poor board members','Rich board members',
                                                'Very rich board members','Insanely rich board members'])
NAME
APPLE                                Very rich board members
SAUDI ARABIAN OIL                    Very rich board members
MICROSOFT                            Very rich board members
AMAZON.COM                           Very rich board members
ALIBABA GROUP HOLDING ADR 1:8                            NaN
FACEBOOK CLASS A                     Very rich board members
TENCENT HOLDINGS                     Very rich board members
ALPHABET A                       Insanely rich board members
TESLA                            Insanely rich board members
TAIWAN SEMICON.MNFG.             Insanely rich board members
NESTLE 'N'                       Insanely rich board members
NVIDIA                                                   NaN
SAMSUNG ELECTRONICS                                      NaN
JP MORGAN CHASE & CO.                Very rich board members
KWEICHOW MOUTAI 'A'                       Poor board members
NETFLIX                              Very rich board members
Name: Board member compensation, dtype: category
Categories (4, object): ['Poor board members' < 'Rich board members' < 'Very rich board members' < 'Insanely rich board members']

If you pass a number to the bins parameter, it will return that many equal-length bins.

pd.cut(companies_df['Total CO2 To Revenues'],4,precision=1)
NAME
APPLE                               (2.9, 63.6]
SAUDI ARABIAN OIL                (124.2, 184.7]
MICROSOFT                           (2.9, 63.6]
AMAZON.COM                          (2.9, 63.6]
ALIBABA GROUP HOLDING ADR 1:8               NaN
FACEBOOK CLASS A                    (2.9, 63.6]
TENCENT HOLDINGS                    (2.9, 63.6]
ALPHABET A                          (2.9, 63.6]
TESLA                                       NaN
TAIWAN SEMICON.MNFG.             (184.7, 245.2]
NESTLE 'N'                          (2.9, 63.6]
NVIDIA                              (2.9, 63.6]
SAMSUNG ELECTRONICS               (63.6, 124.2]
JP MORGAN CHASE & CO.               (2.9, 63.6]
KWEICHOW MOUTAI 'A'                         NaN
NETFLIX                                     NaN
Name: Total CO2 To Revenues, dtype: category
Categories (4, interval[float64]): [(2.9, 63.6] < (63.6, 124.2] < (124.2, 184.7] < (184.7, 245.2]]

For many purposes, qcut() is more useful as it bins the data based on sample quantiles. Therefore, every bin has approximately the same number of observations. You can also pass specific quantiles as a list to the function.

pd.qcut(companies_df['Total CO2 To Revenues'],4)
NAME
APPLE                             (3.129, 13.632]
SAUDI ARABIAN OIL                (61.155, 245.17]
MICROSOFT                         (13.632, 30.69]
AMAZON.COM                        (30.69, 61.155]
ALIBABA GROUP HOLDING ADR 1:8                 NaN
FACEBOOK CLASS A                  (13.632, 30.69]
TENCENT HOLDINGS                  (13.632, 30.69]
ALPHABET A                        (30.69, 61.155]
TESLA                                         NaN
TAIWAN SEMICON.MNFG.             (61.155, 245.17]
NESTLE 'N'                        (30.69, 61.155]
NVIDIA                            (3.129, 13.632]
SAMSUNG ELECTRONICS              (61.155, 245.17]
JP MORGAN CHASE & CO.             (3.129, 13.632]
KWEICHOW MOUTAI 'A'                           NaN
NETFLIX                                       NaN
Name: Total CO2 To Revenues, dtype: category
Categories (4, interval[float64]): [(3.129, 13.632] < (13.632, 30.69] < (30.69, 61.155] < (61.155, 245.17]]
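
As mentioned above, you can also pass specific quantiles as a list to qcut(). A minimal sketch that splits the variable at the 10 %, 50 % and 90 % quantiles:

# The cut points are quantiles between 0 and 1 (the list must start at 0 and end at 1)
pd.qcut(companies_df['Total CO2 To Revenues'], [0, 0.1, 0.5, 0.9, 1.0])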

Detecting and filtering outliers is easy with boolean indexing.

companies_df[companies_df['Board member compensation'] < 10000000]
Board member compensation Total CO2 To Revenues Emission Reduction Target % Compensation ($ millions)
NAME
APPLE 2832025.00 3.13 NaN 2.83 M$
SAUDI ARABIAN OIL 3220550.15 175.59 NaN 3.22 M$
MICROSOFT 4336000.00 29.16 75.0 4.34 M$
AMAZON.COM 6567669.00 40.14 50.0 6.57 M$
FACEBOOK CLASS A 2291182.00 25.69 75.0 2.29 M$
TENCENT HOLDINGS 2681758.80 15.81 NaN 2.68 M$
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0 4.46 M$
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN 0.04 M$
NETFLIX 3665870.00 NaN NaN 3.67 M$

Filtering is also possible with the summary statistics of variables. The following command picks up cases where the values of all variables deviate from the mean by less than two standard deviations.

companies_df.drop('Compensation ($ millions)',axis=1,inplace=True)
companies_df[(np.abs(companies_df-companies_df.mean()) < 2*companies_df.std(axis=0)).all(1)]
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
NESTLE 'N' 12906002.11 58.52 35.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0

Winsorising is another common procedure in practical econometrics. In that method, the most extreme values are moved to specific quantiles, usually the 1 % and 99 % quantiles. It is easiest to implement with the clip() method. Notice how the following command winsorises all the variables at once (here at the 10 % and 90 % quantiles).

companies_df.clip(lower = companies_df.quantile(0.1), upper = companies_df.quantile(0.9),axis=1)
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2.832025e+06 6.731 NaN
SAUDI ARABIAN OIL 3.220550e+06 164.937 NaN
MICROSOFT 4.336000e+06 29.160 75.0
AMAZON.COM 6.567669e+06 40.140 50.0
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
FACEBOOK CLASS A 2.369297e+06 25.690 75.0
TENCENT HOLDINGS 2.681759e+06 15.810 NaN
ALPHABET A 1.643321e+07 32.220 50.0
TESLA 1.643321e+07 NaN NaN
TAIWAN SEMICON.MNFG. 1.258417e+07 164.937 23.6
NESTLE 'N' 1.290600e+07 58.520 35.0
NVIDIA NaN 7.100 25.0
SAMSUNG ELECTRONICS NaN 69.060 70.0
JP MORGAN CHASE & CO. 4.460719e+06 6.731 50.0
KWEICHOW MOUTAI 'A' 2.369297e+06 NaN NaN
NETFLIX 3.665870e+06 NaN NaN
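
As noted, winsorisation is usually done at the 1 % and 99 % quantiles. With only 16 companies the effect would be negligible here, but as a sketch the usual call looks like this:

# Winsorise every column at the 1 % and 99 % quantiles
companies_df.clip(lower=companies_df.quantile(0.01), upper=companies_df.quantile(0.99), axis=1)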

Random sampling from data is easy. You can use the sample() method for that. It is also possible to sample with replacement, which is needed in many numerical statistical methods, such as bootstrapping.

companies_df.sample(n=10,replace=True)
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
TENCENT HOLDINGS 2681758.80 15.81 NaN
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
NVIDIA NaN 7.10 25.0
NETFLIX 3665870.00 NaN NaN
NETFLIX 3665870.00 NaN NaN
MICROSOFT 4336000.00 29.16 75.0
ALPHABET A 50294471.00 32.22 50.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
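
You can also sample a fraction of the rows and fix the random seed so that the sample is reproducible. A minimal sketch:

# Sample half of the rows without replacement; random_state makes the result reproducible
companies_df.sample(frac=0.5, random_state=42)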

If you want to randomise the order of rows in a dataframe, you can use Numpy’s random.permutation().

permut_index = np.random.permutation(len(companies_df))
companies_df.iloc[permut_index]
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
SAMSUNG ELECTRONICS NaN 69.06 70.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
ALPHABET A 50294471.00 32.22 50.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
MICROSOFT 4336000.00 29.16 75.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
APPLE 2832025.00 3.13 NaN
NETFLIX 3665870.00 NaN NaN
NESTLE 'N' 12906002.11 58.52 35.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
NVIDIA NaN 7.10 25.0
AMAZON.COM 6567669.00 40.14 50.0
TESLA 17315009.00 NaN NaN

Very often in econometrics, you need to transform your categorical variables into a collection of dummy variables. It is easily done in Pandas using the get_dummies() function.

pd.get_dummies(companies_df['Emission Reduction Target %'],prefix='Emiss_')
Emiss__18.0 Emiss__25.0 Emiss__35.0 Emiss__50.0 Emiss__70.0 Emiss__75.0
NAME
APPLE 0 0 0 0 0 0
SAUDI ARABIAN OIL 0 0 0 0 0 0
MICROSOFT 0 0 0 0 0 1
AMAZON.COM 0 0 0 1 0 0
ALIBABA GROUP HOLDING ADR 1:8 0 0 0 0 0 0
FACEBOOK CLASS A 0 0 0 0 0 1
TENCENT HOLDINGS 0 0 0 0 0 0
ALPHABET A 0 0 0 1 0 0
TESLA 0 0 0 0 0 0
TAIWAN SEMICON.MNFG. 1 0 0 0 0 0
NESTLE 'N' 0 0 1 0 0 0
NVIDIA 0 1 0 0 0 0
SAMSUNG ELECTRONICS 0 0 0 0 1 0
JP MORGAN CHASE & CO. 0 0 0 1 0 0
KWEICHOW MOUTAI 'A' 0 0 0 0 0 0
NETFLIX 0 0 0 0 0 0

Pandas has efficient methods to manipulate strings in a dataframe. Due to this, Pandas is very popular among researchers who need to use string data in their analysis. This also makes it a very important tool for accounting data analysis. All the built-in methods of Python string objects can be used to manipulate strings in a dataframe. We already discussed string methods in Chapter 1.

Regular expressions can also be used to manipulate string dataframes. Regular expressions are a very broad topic, and they take time to master. Let’s look at some very simple examples. If we want to split a sentence into a list of words, repeated whitespace makes the process difficult with the standard string methods. However, it is very easy with regular expressions.

import re

The regex for one or more whitespace characters is \s+.

splitter = re.compile(r'\s+')
splitter.split("This    is   a    sample        text.")
['This', 'is', 'a', 'sample', 'text.']

Regular expressions are a powerful but complex tool. You can search for email addresses in a text with the following regex pattern.

reg_pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
email_search = re.compile(reg_pattern, flags = re.IGNORECASE)
example = """The general format of an email address is local-part@domain,
and a specific example is jsmith@example.com. An address consists of two parts. 
The part before the @ symbol (local part) identifies the name of a mailbox. 
This is often the username of the recipient, e.g., jsmith. 
The part after the @ symbol (domain) is a domain name that represents 
the administrative realm for the mail box, e.g., a company's domain name, example.com."""
email_search.findall(example)
['jsmith@example.com']

We will not go into regular expressions further at this point, but we will see some applications of them in the following chapters. If you want to learn more about regular expressions, here is a good website: www.regular-expressions.info/index.html

If your string data contains missing values, the standard string methods will usually not work. Then you need to use the vectorised string methods of Pandas, available through the .str accessor.
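
A minimal sketch of this behaviour with a small, made-up Series:

names = pd.Series(['apple', np.nan, 'nvidia'])
# The vectorised .str methods simply return NaN for the missing value instead of raising an error
names.str.upper()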

You can slice strings normally.

companies_df.index.str[0:5]
Index(['APPLE', 'SAUDI', 'MICRO', 'AMAZO', 'ALIBA', 'FACEB', 'TENCE', 'ALPHA',
       'TESLA', 'TAIWA', 'NESTL', 'NVIDI', 'SAMSU', 'JP MO', 'KWEIC', 'NETFL'],
      dtype='object', name='NAME')

And you can use regular expressions, for example, to search strings.

companies_df.index.str.findall('am', flags = re.IGNORECASE)
Index([[], [], [], ['AM'], [], [], [], [], [], [], [], [], ['AM'], [], [], []], dtype='object', name='NAME')

Here is a good introduction to Pandas string methods: pandas.pydata.org/pandas-docs/stable/user_guide/text.html. If you scroll down, there is a full list of methods.

4.2. Advanced methods#

Let’s look at some more advanced Pandas methods next. We need a new dataset for that.

electricity_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/AER/USAirlines.csv',index_col=0)
electricity_df['firm'].replace({1:'A',2:'B',3:'C',4:'D',5:'E',6:'F'},inplace=True)

The dataset contains annual cost data for six US airline firms.

electricity_df
firm year output cost price load
1 A 1970 0.952757 1140640 106650 0.534487
2 A 1971 0.986757 1215690 110307 0.532328
3 A 1972 1.091980 1309570 110574 0.547736
4 A 1973 1.175780 1511530 121974 0.540846
5 A 1974 1.160170 1676730 196606 0.591167
... ... ... ... ... ... ...
86 F 1980 0.112640 381478 874818 0.517766
87 F 1981 0.154154 506969 1013170 0.580049
88 F 1982 0.186461 633388 930477 0.556024
89 F 1983 0.246847 804388 851676 0.537791
90 F 1984 0.304013 1009500 819476 0.525775

90 rows × 6 columns

There are many tools to deal with data that has more than two dimensions. The basic idea is to use hierarchical indexing in a two dimensional dataframe. For example, we can index the data by firm-years using set_index().

electricity_df.set_index(['firm','year'],inplace=True)
electricity_df
output cost price load
firm year
A 1970 0.952757 1140640 106650 0.534487
1971 0.986757 1215690 110307 0.532328
1972 1.091980 1309570 110574 0.547736
1973 1.175780 1511530 121974 0.540846
1974 1.160170 1676730 196606 0.591167
... ... ... ... ... ...
F 1980 0.112640 381478 874818 0.517766
1981 0.154154 506969 1013170 0.580049
1982 0.186461 633388 930477 0.556024
1983 0.246847 804388 851676 0.537791
1984 0.304013 1009500 819476 0.525775

90 rows × 4 columns

Now the rows are sorted so that all the years of a certain company are adjacent. With sort_index() you can order the rows by year instead.

electricity_df.sort_index(level=1)
output cost price load
firm year
A 1970 0.952757 1140640 106650 0.534487
B 1970 0.520635 569292 103795 0.490851
C 1970 0.262424 286298 118788 0.524334
D 1970 0.086393 145167 114987 0.432066
E 1970 0.051028 91361 118222 0.442875
... ... ... ... ... ...
B 1984 1.389740 4209390 821361 0.528775
C 1984 0.493317 1170470 844079 0.577078
D 1984 0.421411 1436970 831374 0.585525
E 1984 0.213279 610257 844622 0.635545
F 1984 0.304013 1009500 819476 0.525775

90 rows × 4 columns

Two-level indexing enables an easy way to pick subgroups from the data. For a series, you can just use the standard indexing style of Python.

electricity_df['output']['A']
year
1970    0.952757
1971    0.986757
1972    1.091980
1973    1.175780
1974    1.160170
1975    1.173760
1976    1.290510
1977    1.390670
1978    1.612730
1979    1.825440
1980    1.546040
1981    1.527900
1982    1.660200
1983    1.822310
1984    1.936460
Name: output, dtype: float64
electricity_df['output'][:,1975]
firm
A    1.173760
B    0.852892
C    0.367517
D    0.164272
E    0.073961
F    0.052462
Name: output, dtype: float64

With multi-index dataframes, you have to be a little bit more careful, because the subgroups are dataframes themselves. For example, you need to use loc to pick up the subgroups.

electricity_df.loc['A']
output cost price load
year
1970 0.952757 1140640 106650 0.534487
1971 0.986757 1215690 110307 0.532328
1972 1.091980 1309570 110574 0.547736
1973 1.175780 1511530 121974 0.540846
1974 1.160170 1676730 196606 0.591167
1975 1.173760 1823740 265609 0.575417
1976 1.290510 2022890 263451 0.594495
1977 1.390670 2314760 316411 0.597409
1978 1.612730 2639160 384110 0.638522
1979 1.825440 3247620 569251 0.676287
1980 1.546040 3787750 871636 0.605735
1981 1.527900 3867750 997239 0.614360
1982 1.660200 3996020 938002 0.633366
1983 1.822310 4282880 859572 0.650117
1984 1.936460 4748320 823411 0.625603
electricity_df.loc['B'].loc[1970]
output         0.520635
cost      569292.000000
price     103795.000000
load           0.490851
Name: 1970, dtype: float64

If you want to pick the values of a certain year, you can change the order of the index levels using swaplevel() (an alternative using xs() is sketched after the example).

electricity_df.swaplevel('firm','year').loc[1975]
output cost price load
firm
A 1.173760 1823740 265609 0.575417
B 0.852892 1358100 281704 0.558133
C 0.367517 510412 278721 0.607270
D 0.164272 373941 263148 0.532723
E 0.073961 156228 277930 0.556181
F 0.052462 133161 307923 0.495361
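
Alternatively, you can pick a cross-section from any index level directly with xs(), without swapping the levels first. A minimal sketch:

# Pick all firms for the year 1975 straight from the 'year' level
electricity_df.xs(1975, level='year')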

You can easily calculate descriptive statistics at multiple levels. Most of the statistics functions in Pandas have included a level parameter for that (but see the note about newer pandas versions after the examples).

electricity_df.sum()
output    4.904952e+01
cost      1.010271e+08
price     4.245147e+07
load      5.044141e+01
dtype: float64
electricity_df.sum(level=0)
output cost price load
firm
A 21.153464 39585050 6934803 8.957875
B 15.250419 31918255 6994301 8.206419
C 6.185664 10847838 7145233 8.768038
D 3.188515 9571326 7100522 8.215159
E 1.707548 4399139 7171182 8.497289
F 1.563908 4705537 7105430 7.796634
electricity_df.sum(level=1)
output cost price load
year
1970 1.910919 2301736 679554 2.873152
1971 1.976987 2506118 701726 2.920993
1972 2.295394 2858803 705968 3.141480
1973 2.579854 3408925 747634 3.146692
1974 2.647980 3941906 1280945 3.381160
1975 2.684864 4355582 1675035 3.325085
1976 2.936119 4848733 1832425 3.364455
1977 3.173497 5548301 2106566 3.402352
1978 3.731864 6454748 2310525 3.707459
1979 4.190756 8020356 3347542 3.740366
1980 3.853406 9661270 5148943 3.481546
1981 3.800075 10634533 6046129 3.513746
1982 4.016458 11167822 5675129 3.481910
1983 4.493125 12133405 5209027 3.482717
1984 4.758220 13184907 4984323 3.478301
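
Note that in pandas 2.0 and later the level parameter has been removed from the aggregation methods; the same result is obtained by grouping on an index level. A minimal sketch:

# The same firm-level sums as above, written with groupby instead of sum(level=0)
electricity_df.groupby(level=0).sum()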

If we want to remove the multi-index, we can use reset_index(). With the level parameter, you can decide which levels of the multi-index are removed.

electricity_df.reset_index(level=1)
year output cost price load
firm
A 1970 0.952757 1140640 106650 0.534487
A 1971 0.986757 1215690 110307 0.532328
A 1972 1.091980 1309570 110574 0.547736
A 1973 1.175780 1511530 121974 0.540846
A 1974 1.160170 1676730 196606 0.591167
... ... ... ... ... ...
F 1980 0.112640 381478 874818 0.517766
F 1981 0.154154 506969 1013170 0.580049
F 1982 0.186461 633388 930477 0.556024
F 1983 0.246847 804388 851676 0.537791
F 1984 0.304013 1009500 819476 0.525775

90 rows × 5 columns

electricity_df.reset_index(inplace=True)

Summary statistics can also be calculated using the groupby method.

electricity_df.groupby('firm').mean()
output cost price load
firm
A 1.410231 2.639003e+06 462320.200000 0.597192
B 1.016695 2.127884e+06 466286.733333 0.547095
C 0.412378 7.231892e+05 476348.866667 0.584536
D 0.212568 6.380884e+05 473368.133333 0.547677
E 0.113837 2.932759e+05 478078.800000 0.566486
F 0.104261 3.137025e+05 473695.333333 0.519776
electricity_df.groupby(['firm','year']).mean() # The mean of single values
output cost price load
firm year
A 1970 0.952757 1140640 106650 0.534487
1971 0.986757 1215690 110307 0.532328
1972 1.091980 1309570 110574 0.547736
1973 1.175780 1511530 121974 0.540846
1974 1.160170 1676730 196606 0.591167
... ... ... ... ... ...
F 1980 0.112640 381478 874818 0.517766
1981 0.154154 506969 1013170 0.580049
1982 0.186461 633388 930477 0.556024
1983 0.246847 804388 851676 0.537791
1984 0.304013 1009500 819476 0.525775

90 rows × 4 columns

stack and unstack can be used to reshape hierarchical index dataframes.

electricity_df.reset_index(inplace=True)
electricity_df
firm year output cost price load
0 A 1970 0.952757 1140640 106650 0.534487
1 A 1971 0.986757 1215690 110307 0.532328
2 A 1972 1.091980 1309570 110574 0.547736
3 A 1973 1.175780 1511530 121974 0.540846
4 A 1974 1.160170 1676730 196606 0.591167
... ... ... ... ... ... ...
85 F 1980 0.112640 381478 874818 0.517766
86 F 1981 0.154154 506969 1013170 0.580049
87 F 1982 0.186461 633388 930477 0.556024
88 F 1983 0.246847 804388 851676 0.537791
89 F 1984 0.304013 1009500 819476 0.525775

90 rows × 6 columns

Stack turns a dataframe into a series.

data_series= electricity_df.stack()
data_series
0   firm             A
    year          1970
    output    0.952757
    cost       1140640
    price       106650
                ...   
89  year          1984
    output    0.304013
    cost       1009500
    price       819476
    load      0.525775
Length: 540, dtype: object

unstack() can be used to rearrange the data back into a dataframe.

data_series.unstack()
firm year output cost price load
0 A 1970 0.952757 1140640 106650 0.534487
1 A 1971 0.986757 1215690 110307 0.532328
2 A 1972 1.09198 1309570 110574 0.547736
3 A 1973 1.17578 1511530 121974 0.540846
4 A 1974 1.16017 1676730 196606 0.591167
... ... ... ... ... ... ...
85 F 1980 0.11264 381478 874818 0.517766
86 F 1981 0.154154 506969 1013170 0.580049
87 F 1982 0.186461 633388 930477 0.556024
88 F 1983 0.246847 804388 851676 0.537791
89 F 1984 0.304013 1009500 819476 0.525775

90 rows × 6 columns

Yet another tool to reorganise data is pivot_table(). More about it later.

4.3. Merging datasets#

Merging dataframes is often difficult for Pandas beginners, and it can be a hassle. The usual cause of difficulties is forgetting the importance of the index in Pandas data types. Merging dataframes is not about gluing tables together; the merging is done according to key values or indices.

gasoline_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Gasoline.csv',index_col=0)
gasoline_df.set_index('country',inplace=True)
gasoline_df
year lgaspcar lincomep lrpmg lcarpcap
country
AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840
AUSTRIA 1961 4.100989 -6.426006 -0.351328 -9.608622
AUSTRIA 1962 4.073177 -6.407308 -0.379518 -9.457257
AUSTRIA 1963 4.059509 -6.370679 -0.414251 -9.343155
AUSTRIA 1964 4.037689 -6.322247 -0.445335 -9.237739
... ... ... ... ... ...
U.S.A. 1974 4.798626 -5.328694 -1.231467 -7.617558
U.S.A. 1975 4.804932 -5.346190 -1.200377 -7.607010
U.S.A. 1976 4.814891 -5.297946 -1.154682 -7.574748
U.S.A. 1977 4.811032 -5.256606 -1.175910 -7.553458
U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176

342 rows × 5 columns

Let’s split the data.

gaso2_df = gasoline_df[['lrpmg','lcarpcap']]
gasoline_df.drop(['lrpmg','lcarpcap'],axis=1,inplace=True)
gasoline_df
year lgaspcar lincomep
country
AUSTRIA 1960 4.173244 -6.474277
AUSTRIA 1961 4.100989 -6.426006
AUSTRIA 1962 4.073177 -6.407308
AUSTRIA 1963 4.059509 -6.370679
AUSTRIA 1964 4.037689 -6.322247
... ... ... ...
U.S.A. 1974 4.798626 -5.328694
U.S.A. 1975 4.804932 -5.346190
U.S.A. 1976 4.814891 -5.297946
U.S.A. 1977 4.811032 -5.256606
U.S.A. 1978 4.818454 -5.221232

342 rows × 3 columns

gaso2_df
lrpmg lcarpcap
country
AUSTRIA -0.334548 -9.766840
AUSTRIA -0.351328 -9.608622
AUSTRIA -0.379518 -9.457257
AUSTRIA -0.414251 -9.343155
AUSTRIA -0.445335 -9.237739
... ... ...
U.S.A. -1.231467 -7.617558
U.S.A. -1.200377 -7.607010
U.S.A. -1.154682 -7.574748
U.S.A. -1.175910 -7.553458
U.S.A. -1.212062 -7.536176

342 rows × 2 columns

merge() combines the dataframes according to the key values, here the country index.

pd.merge(gasoline_df,gaso2_df,on='country')
year lgaspcar lincomep lrpmg lcarpcap
country
AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840
AUSTRIA 1960 4.173244 -6.474277 -0.351328 -9.608622
AUSTRIA 1960 4.173244 -6.474277 -0.379518 -9.457257
AUSTRIA 1960 4.173244 -6.474277 -0.414251 -9.343155
AUSTRIA 1960 4.173244 -6.474277 -0.445335 -9.237739
... ... ... ... ... ...
U.S.A. 1978 4.818454 -5.221232 -1.231467 -7.617558
U.S.A. 1978 4.818454 -5.221232 -1.200377 -7.607010
U.S.A. 1978 4.818454 -5.221232 -1.154682 -7.574748
U.S.A. 1978 4.818454 -5.221232 -1.175910 -7.553458
U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176

6498 rows × 5 columns

You could also do this by using both indexes as merge keys with left_index and right_index.

pd.merge(gasoline_df,gaso2_df,left_index=True,right_index=True)
year lgaspcar lincomep lrpmg lcarpcap
country
AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840
AUSTRIA 1960 4.173244 -6.474277 -0.351328 -9.608622
AUSTRIA 1960 4.173244 -6.474277 -0.379518 -9.457257
AUSTRIA 1960 4.173244 -6.474277 -0.414251 -9.343155
AUSTRIA 1960 4.173244 -6.474277 -0.445335 -9.237739
... ... ... ... ... ...
U.S.A. 1978 4.818454 -5.221232 -1.231467 -7.617558
U.S.A. 1978 4.818454 -5.221232 -1.200377 -7.607010
U.S.A. 1978 4.818454 -5.221232 -1.154682 -7.574748
U.S.A. 1978 4.818454 -5.221232 -1.175910 -7.553458
U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176

6498 rows × 5 columns

You can also use join. It uses the index values for merging by default.

gasoline_df.join(gaso2_df)
year lgaspcar lincomep lrpmg lcarpcap
country
AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840
AUSTRIA 1960 4.173244 -6.474277 -0.351328 -9.608622
AUSTRIA 1960 4.173244 -6.474277 -0.379518 -9.457257
AUSTRIA 1960 4.173244 -6.474277 -0.414251 -9.343155
AUSTRIA 1960 4.173244 -6.474277 -0.445335 -9.237739
... ... ... ... ... ...
U.S.A. 1978 4.818454 -5.221232 -1.231467 -7.617558
U.S.A. 1978 4.818454 -5.221232 -1.200377 -7.607010
U.S.A. 1978 4.818454 -5.221232 -1.154682 -7.574748
U.S.A. 1978 4.818454 -5.221232 -1.175910 -7.553458
U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176

6498 rows × 5 columns

concat() is a more general tool for combining data. It is easiest to understand how it works if we use very simple datasets. Let’s create three simple Pandas Series.

x1 = pd.Series([20,30,40], index=['a','b','c'])
x2 = pd.Series([50,60], index=['e','f'])
x3 = pd.Series([70,80,90], index=['h','i','j'])

If we just feed them to concat, they are joined together as a longer Series.

pd.concat([x1,x2,x3])
a    20
b    30
c    40
e    50
f    60
h    70
i    80
j    90
dtype: int64

If you change the axis, the result will be a dataframe, because there are no overlapping index values.

pd.concat([x1,x2,x3],axis=1)
0 1 2
a 20.0 NaN NaN
b 30.0 NaN NaN
c 40.0 NaN NaN
e NaN 50.0 NaN
f NaN 60.0 NaN
h NaN NaN 70.0
i NaN NaN 80.0
j NaN NaN 90.0
x4 = pd.concat([x1,x2])
pd.concat([x4,x2],axis=1)
0 1
a 20 NaN
b 30 NaN
c 40 NaN
e 50 50.0
f 60 60.0

If you just want to keep the intersection of the series, you can use join='inner'.

pd.concat([x4,x2],axis=1,join='inner')
0 1
e 50 50
f 60 60
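
concat() can also build a hierarchical index for the result with the keys parameter, which makes it easy to see which original object each value came from. A minimal sketch with the same series:

# The keys become the outer level of a two-level index
pd.concat([x1, x2, x3], keys=['first', 'second', 'third'])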

4.4. Data aggregation#

Data aggregation is one of the most important steps in data preprocessing. Pandas has many tools for it.

groupby is an important tool in aggregation. Let’s load a new dataset for this.

import pandas as pd
comp_returns_df = pd.read_csv('comp_returns.csv',index_col=0)
comp_returns_df
NAME ISO COUNTRY CODE RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES - ACCOUNTING STANDARDS FOLLOWED
0 APPLE US 55.92 119.40 4.95 US standards (GAAP)
1 SAUDI ARABIAN OIL SA 32.25 16.96 NaN IFRS
2 MICROSOFT US 40.14 61.56 13.59 US standards (GAAP)
3 AMAZON.COM US 21.95 83.31 12.29 US standards (GAAP)
4 FACEBOOK CLASS A US 19.96 0.47 21.00 US standards (GAAP)
5 TENCENT HOLDINGS HK 24.68 53.74 7.72 IFRS
6 ALPHABET A US 18.12 2.26 15.71 US standards (GAAP)
7 TESLA US -14.94 202.77 10.73 US standards (GAAP)
8 WALMART US 20.22 72.95 NaN US standards (GAAP)
9 JOHNSON & JOHNSON US 25.36 46.57 13.28 US standards (GAAP)
10 TAIWAN SEMICON.MNFG. TW 20.94 11.89 8.08 IFRS
11 NESTLE 'N' CH 23.05 71.42 1.88 IFRS
12 PROCTER & GAMBLE US 27.78 76.10 2.79 US standards (GAAP)
13 VISA 'A' US 40.34 57.25 NaN US standards (GAAP)
14 MASTERCARD US 143.83 144.70 NaN US standards (GAAP)
15 NVIDIA US 25.95 16.31 22.49 US standards (GAAP)
16 3I GROUP GB 2.73 7.67 NaN IFRS
17 JP MORGAN CHASE & CO. US 14.91 220.24 NaN US standards (GAAP)
18 HOME DEPOT US NaN -1010.37 NaN US standards (GAAP)
19 KWEICHOW MOUTAI 'A' CN 33.12 9.14 0.85 Local standards
20 SAMSUNG ELECTRONICS KR 8.69 7.22 7.39 IFRS
21 UNITEDHEALTH GROUP US 25.40 71.33 NaN US standards (GAAP)
22 ROCHE HOLDING CH 44.71 47.58 19.38 IFRS
23 VERIZON COMMUNICATIONS US 33.64 181.59 NaN US standards (GAAP)
24 LVMH FR 20.82 68.97 0.28 IFRS
25 NETFLIX US 29.12 204.38 8.73 US standards (GAAP)
26 ADOBE (NAS) US 29.67 39.30 17.15 US standards (GAAP)
27 TOYOTA MOTOR JP 10.45 102.45 3.66 US standards (GAAP)
28 PAYPAL HOLDINGS US 15.24 29.48 8.49 US standards (GAAP)
29 SALESFORCE.COM US 0.51 9.04 14.76 US standards (GAAP)

Now we can group the data by the accounting standards variable and calculate the mean. Notice how Pandas automatically drops the string variables (the name and the country code), because you cannot calculate a mean from strings:

comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').mean()
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
ACCOUNTING STANDARDS FOLLOWED
IFRS 22.23375 35.68125 7.455000
Local standards 33.12000 9.14000 0.850000
US standards (GAAP) 29.17850 34.81381 12.115714
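
Note that depending on your pandas version the string columns may not be dropped silently; in pandas 2.0 and later you have to restrict the calculation to numeric columns yourself. A minimal sketch:

# Calculate the group means only for the numeric columns
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').mean(numeric_only=True)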

You can make a two-level grouping by adding a list of columns to groupby.

comp_returns_df.groupby(['ISO COUNTRY CODE','ACCOUNTING STANDARDS FOLLOWED']).mean()
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
ISO COUNTRY CODE ACCOUNTING STANDARDS FOLLOWED
CH IFRS 33.880000 59.500 10.630000
CN Local standards 33.120000 9.140 0.850000
FR IFRS 20.820000 68.970 0.280000
GB IFRS 2.730000 7.670 NaN
HK IFRS 24.680000 53.740 7.720000
JP US standards (GAAP) 10.450000 102.450 3.660000
KR IFRS 8.690000 7.220 7.390000
SA IFRS 32.250000 16.960 NaN
TW IFRS 20.940000 11.890 8.080000
US US standards (GAAP) 30.164211 31.432 12.766154

There are many other groupby methods, like size(). The full list of groupby methods is here: pandas.pydata.org/pandas-docs/stable/reference/groupby.html

comp_returns_df.groupby('ISO COUNTRY CODE').size()
ISO COUNTRY CODE
CH     2
CN     1
FR     1
GB     1
HK     1
JP     1
KR     1
SA     1
TW     1
US    20
dtype: int64

If we just use the groupby method without a following method, it will return a Pandas groupby object.

comp_returns_df.groupby('ISO COUNTRY CODE')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000160F39C3B80>

This object can be used for iteration.

for name,group in comp_returns_df.groupby('ISO COUNTRY CODE'):
    print(name)
    print(len(group))
CH
2
CN
1
FR
1
GB
1
HK
1
JP
1
KR
1
SA
1
TW
1
US
20

You could also group the data by columns using axis=1. It does not make much sense with this data, so we skip that example.

You can easily pick up just one column from a groupby object.

comp_returns_df.groupby('ISO COUNTRY CODE')['RETURN ON EQUITY - TOTAL (%)'].mean()
ISO COUNTRY CODE
CH    33.880000
CN    33.120000
FR    20.820000
GB     2.730000
HK    24.680000
JP    10.450000
KR     8.690000
SA    32.250000
TW    20.940000
US    30.164211
Name: RETURN ON EQUITY - TOTAL (%), dtype: float64

You could also group data by passing a dict, a function, etc. We will not go into these advanced methods here.

4.4.1. Data aggregation#

Now that we know the basics of groupby, we can analyse in more detail how it can be used to aggregate data. The built-in methods of groupby objects include count, sum, mean, median, std, var, min, max, prod, first and last. These are optimised for groupby objects, but many other methods work too. You can even define your own functions with groupby.agg().

def mean_median(arr):
    return arr.mean()-arr.median()
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').agg(mean_median)
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
ACCOUNTING STANDARDS FOLLOWED
IFRS 0.23875 3.41125 -0.100000
Local standards 0.00000 0.00000 0.000000
US standards (GAAP) 3.79850 -36.51619 -0.669286
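
agg() also accepts a list of functions (built-in names or your own), in which case every function is calculated for the selected column. A minimal sketch:

# Mean, standard deviation and the custom statistic for one variable in every group
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED')['RETURN ON EQUITY - TOTAL (%)'].agg(['mean', 'std', mean_median])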

Basically, any Pandas method that performs some kind of aggregation will work with groupby. For example, quantile() can be used:

comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').quantile(0.9)
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
ACCOUNTING STANDARDS FOLLOWED
IFRS 35.988 69.705 13.730
Local standards 33.120 9.140 0.850
US standards (GAAP) 41.898 202.770 19.845

Even though describe is not an aggregating function, it will also work.

comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').describe()
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
ACCOUNTING STANDARDS FOLLOWED
IFRS 8.0 22.23375 12.981632 2.73 17.7875 21.995 26.5725 44.71 8.0 35.68125 ... 57.5475 71.42 6.0 7.455000 6.711661 0.28 3.2575 7.555 7.9900 19.38
Local standards 1.0 33.12000 NaN 33.12 33.1200 33.120 33.1200 33.12 1.0 9.14000 ... 9.1400 9.14 1.0 0.850000 NaN 0.85 0.8500 0.850 0.8500 0.85
US standards (GAAP) 20.0 29.17850 30.798945 -14.94 17.4000 25.380 30.6625 143.83 21.0 34.81381 ... 119.4000 220.24 14.0 12.115714 6.023770 2.79 8.5500 12.785 15.4725 22.49

3 rows × 24 columns

You can make the previous table more readable by unstacking it.

comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').describe().unstack()
                                       ACCOUNTING STANDARDS FOLLOWED
RETURN ON EQUITY - TOTAL (%)    count  IFRS                              8.00000
                                       Local standards                   1.00000
                                       US standards (GAAP)              20.00000
                                mean   IFRS                             22.23375
                                       Local standards                  33.12000
                                                                          ...   
RESEARCH & DEVELOPMENT/SALES -  75%    Local standards                   0.85000
                                       US standards (GAAP)              15.47250
                                max    IFRS                             19.38000
                                       Local standards                   0.85000
                                       US standards (GAAP)              22.49000
Length: 72, dtype: float64

Apply() is the most versatile method to use with groupby objects.

def medalists(df,var = 'RETURN ON EQUITY - TOTAL (%)'):
    return df.sort_values(by=var)[-3:]
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').apply(medalists)
NAME ISO COUNTRY CODE RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES - ACCOUNTING STANDARDS FOLLOWED
ACCOUNTING STANDARDS FOLLOWED
IFRS 5 TENCENT HOLDINGS HK 24.68 53.74 7.72 IFRS
1 SAUDI ARABIAN OIL SA 32.25 16.96 NaN IFRS
22 ROCHE HOLDING CH 44.71 47.58 19.38 IFRS
Local standards 19 KWEICHOW MOUTAI 'A' CN 33.12 9.14 0.85 Local standards
US standards (GAAP) 0 APPLE US 55.92 119.40 4.95 US standards (GAAP)
14 MASTERCARD US 143.83 144.70 NaN US standards (GAAP)
18 HOME DEPOT US NaN -1010.37 NaN US standards (GAAP)

It is also possible to use the bins created with cut or qcut as a grouping criterion in groupby.

ROE_quartiles = pd.qcut(comp_returns_df['RETURN ON EQUITY - TOTAL (%)'],4)
comp_returns_df.groupby(ROE_quartiles).size()
RETURN ON EQUITY - TOTAL (%)
(-14.940999999999999, 18.12]    8
(18.12, 24.68]                  7
(24.68, 32.25]                  7
(32.25, 143.83]                 7
dtype: int64
comp_returns_df.groupby(ROE_quartiles).mean()
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
RETURN ON EQUITY - TOTAL (%)
(-14.940999999999999, 18.12] 6.963750 72.641250 10.123333
(18.12, 24.68] 21.660000 51.821429 8.541667
(24.68, 32.25] 27.932857 67.278571 12.888000
(32.25, 143.83] 55.957143 88.745714 9.692500

Pivot tables are another option to organise data with Pandas. They are very popular in spreadsheet software, like Excel. A pivot table is very similar to the cross-tabulations we use in statistics. Pandas has a specific function for pivot tables.

Let’s load a new dataset.

large_df = pd.read_csv('large_table.csv',delimiter=';')
large_df.drop('Type',inplace=True,axis=1)
large_df
NAME ISO COUNTRY CODE RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES - INDUSTRY GROUP ACCOUNTING STANDARDS FOLLOWED CSR Sustainability External Audit Audit Committee CSR Sustainability Committee
0 APPLE US 55.92 119.40 4.95 34 US standards (GAAP) Y Y Y
1 SAUDI ARABIAN OIL SA 32.25 16.96 NaN 97 IFRS NaN Y Y
2 MICROSOFT US 40.14 61.56 13.59 58 US standards (GAAP) NaN Y Y
3 AMAZON.COM US 21.95 83.31 12.29 87 US standards (GAAP) Y Y Y
4 FACEBOOK CLASS A US 19.96 0.47 21.00 151 US standards (GAAP) Y Y Y
... ... ... ... ... ... ... ... ... ... ...
100 CHINA MERCHANTS BANK 'A' CN 16.84 301.52 NaN 102 IFRS Y Y Y
101 ENEL IT 7.00 202.46 NaN 169 IFRS Y Y Y
102 JD COM ADR 1:2 CN 17.28 12.28 2.18 87 US standards (GAAP) NaN Y N
103 KEYENCE JP 11.84 0.00 2.76 214 Local standards NaN N N
104 HSBC HOLDINGS GB 3.71 328.78 NaN 102 IFRS Y Y Y

105 rows × 10 columns

The default pivot_table aggregation type is mean.

large_df.columns
Index(['NAME', 'ISO COUNTRY CODE', 'RETURN ON EQUITY - TOTAL (%)',
       'TOTAL DEBT % COMMON EQUITY', 'RESEARCH & DEVELOPMENT/SALES -',
       'INDUSTRY GROUP', 'ACCOUNTING STANDARDS FOLLOWED',
       'CSR Sustainability External Audit', 'Audit Committee',
       'CSR Sustainability Committee'],
      dtype='object')
large_df.pivot_table(values='RETURN ON EQUITY - TOTAL (%)',
                     index='CSR Sustainability External Audit',columns='ACCOUNTING STANDARDS FOLLOWED')
ACCOUNTING STANDARDS FOLLOWED IFRS Local standards US standards (GAAP)
CSR Sustainability External Audit
N NaN NaN 22.380000
Y 18.968333 23.315 46.280556

You can use a hierarchical index and add margins (the “All” row and column) to the table with margins=True.

large_df.pivot_table(values='RETURN ON EQUITY - TOTAL (%)',
                     index=['CSR Sustainability External Audit','Audit Committee'],
                     columns='ACCOUNTING STANDARDS FOLLOWED',margins=True)
ACCOUNTING STANDARDS FOLLOWED IFRS Local standards US standards (GAAP) All
CSR Sustainability External Audit Audit Committee
N Y NaN NaN 22.380000 22.380000
Y N 24.660000 NaN 10.450000 17.555000
Y 18.720870 23.315 47.304286 35.547667
All 18.968333 23.315 44.442051 34.386308

You can also use hierarchical columns.

large_df.pivot_table(values='RESEARCH & DEVELOPMENT/SALES -',
                     columns=['CSR Sustainability Committee','Audit Committee'],
                     index='ISO COUNTRY CODE',margins=True)
CSR Sustainability Committee N Y All
Audit Committee N Y N Y
ISO COUNTRY CODE
CA NaN 19.890000 NaN NaN 19.890000
CH NaN NaN NaN 12.720000 12.720000
CN 0.850 2.180000 0.280000 NaN 1.103333
DE NaN NaN NaN 10.315000 10.315000
DK NaN NaN NaN 11.850000 11.850000
FR NaN NaN NaN 4.902500 4.902500
GB NaN NaN NaN 18.015000 18.015000
HK NaN 15.420000 NaN 7.720000 11.570000
IN NaN NaN NaN 0.220000 0.220000
JP 2.760 NaN 2.505000 1.890000 2.415000
KR NaN NaN NaN 7.390000 7.390000
NL NaN NaN NaN 14.710000 14.710000
TW NaN NaN NaN 8.080000 8.080000
US NaN 13.755000 NaN 10.564242 11.055128
All 1.805 13.335556 1.763333 9.906346 9.758333

Cross-tabs are a special case of pivot tables, where the values are frequencies.

large_df.pivot_table(columns='CSR Sustainability Committee',index='ISO COUNTRY CODE',
                     aggfunc='count',margins=True)
ACCOUNTING STANDARDS FOLLOWED Audit Committee CSR Sustainability External Audit INDUSTRY GROUP ... NAME RESEARCH & DEVELOPMENT/SALES - RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY
CSR Sustainability Committee N Y All N Y All N Y All N ... All N Y All N Y All N Y All
ISO COUNTRY CODE
CA 1.0 NaN NaN 1.0 NaN NaN 0.0 NaN NaN 1.0 ... NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN
CH NaN 3.0 3.0 NaN 3.0 3.0 NaN 3.0 3.0 NaN ... 3.0 NaN 3.0 3.0 NaN 3.0 3.0 NaN 3.0 3.0
CN 3.0 6.0 NaN 3.0 6.0 NaN 1.0 4.0 NaN 3.0 ... NaN 2.0 1.0 NaN 3.0 6.0 NaN 3.0 6.0 NaN
DE NaN 2.0 2.0 NaN 2.0 2.0 NaN 2.0 2.0 NaN ... 2.0 NaN 2.0 2.0 NaN 2.0 2.0 NaN 2.0 2.0
DK NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN ... 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0
FR NaN 4.0 4.0 NaN 4.0 4.0 NaN 4.0 4.0 NaN ... 4.0 NaN 4.0 4.0 NaN 4.0 4.0 NaN 4.0 4.0
GB NaN 3.0 2.0 NaN 3.0 2.0 NaN 3.0 2.0 NaN ... 2.0 NaN 2.0 2.0 NaN 3.0 2.0 NaN 3.0 2.0
HK 1.0 3.0 NaN 1.0 3.0 NaN 0.0 2.0 NaN 1.0 ... NaN 1.0 1.0 NaN 1.0 3.0 NaN 1.0 3.0 NaN
IN NaN 2.0 2.0 NaN 2.0 2.0 NaN 2.0 2.0 NaN ... 2.0 NaN 2.0 2.0 NaN 2.0 2.0 NaN 2.0 2.0
IT NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN ... NaN NaN 0.0 NaN NaN 1.0 NaN NaN 1.0 NaN
JP 1.0 3.0 2.0 1.0 3.0 2.0 0.0 2.0 2.0 1.0 ... 2.0 1.0 3.0 2.0 1.0 3.0 2.0 1.0 3.0 2.0
KR NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN ... 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0
NL 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 ... 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
SA NaN 1.0 NaN NaN 1.0 NaN NaN 0.0 NaN NaN ... NaN NaN 0.0 NaN NaN 1.0 NaN NaN 1.0 NaN
TW NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN ... 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0
US 11.0 55.0 23.0 11.0 55.0 23.0 4.0 34.0 23.0 11.0 ... 23.0 6.0 33.0 23.0 11.0 51.0 23.0 11.0 55.0 23.0
All 2.0 40.0 42.0 2.0 40.0 42.0 2.0 40.0 42.0 2.0 ... 42.0 2.0 40.0 42.0 2.0 40.0 42.0 2.0 40.0 42.0

17 rows × 24 columns

As you can see, it will return counts for all variables. Therefore it is often more convenient to use pd.crosstab().

pd.crosstab(large_df['CSR Sustainability External Audit'],large_df['ISO COUNTRY CODE'],margins=True)
ISO COUNTRY CODE CH CN DE DK FR GB HK IN IT JP KR NL TW US All
CSR Sustainability External Audit
N 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3
Y 3 5 2 1 4 3 2 2 1 2 1 1 1 35 63
All 3 5 2 1 4 3 2 2 1 2 1 1 1 38 66