4. Pandas Data preprocessing#

You might be wondering why we are talking so much about data processing and manipulation in this book. “Where are all the fancy ML methods?” you might ask. Unfortunately, most of the time in data science goes to data pre-processing. It is often reported that 80 % of a data scientist’s time goes to data preparation: cleaning, transforming, rearranging and creating suitable features (feature engineering). So, to be a successful data scientist, you need to know how to play with data. Luckily, we have Pandas at our disposal, which is one of the most powerful data manipulation tools available.

Let’s look at some of the most common preprocessing situations we encounter when doing data science.

4.1. Basic methods#

Missing data is probably the most common issue with data.

import pandas as pd
companies_df = pd.read_csv('emissions.csv',index_col='NAME',delimiter=';')
companies_df
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 NaN
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
ALPHABET A 50294471.00 32.22 50.0
TESLA 17315009.00 NaN NaN
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA NaN 7.10 25.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
NETFLIX 3665870.00 NaN NaN

dropna() can be used to filter out missing data. The how parameter defines whether all or any of the values in a row/column need to be NaN for it to be dropped.

companies_df.dropna(how='all')
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 NaN
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
ALPHABET A 50294471.00 32.22 50.0
TESLA 17315009.00 NaN NaN
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA NaN 7.10 25.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
NETFLIX 3665870.00 NaN NaN
companies_df.dropna(how='any')
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
ALPHABET A 50294471.00 32.22 50.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0

You can again use the axis parameter to drop columns instead of rows. However, it is not meaningful here because every column has NaN values (how='any' would return an empty table) and no column is entirely NaN (how='all' would return the original dataframe).

With thresh you can require a minimum number of non-NaN values for a row (or column) to be kept.

companies_df.dropna(thresh=2)
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 NaN
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
ALPHABET A 50294471.00 32.22 50.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA NaN 7.10 25.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0

You can use fillna to fill NaN values with other values.

companies_df.fillna(0)
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 0.0
SAUDI ARABIAN OIL 3220550.15 175.59 0.0
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
ALIBABA GROUP HOLDING ADR 1:8 0.00 0.00 0.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 0.0
ALPHABET A 50294471.00 32.22 50.0
TESLA 17315009.00 0.00 0.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA 0.00 7.10 25.0
SAMSUNG ELECTRONICS 0.00 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 0.00 0.0
NETFLIX 3665870.00 0.00 0.0

Using a dict, you can define different fill values for different columns. A very common choice is to use the mean value of each column.

companies_df.fillna({'Board member compensation':1000000, 'Total CO2 To Revenues':10,
       'Emission Reduction Target %':0})
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 0.0
SAUDI ARABIAN OIL 3220550.15 175.59 0.0
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
ALIBABA GROUP HOLDING ADR 1:8 1000000.00 10.00 0.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 0.0
ALPHABET A 50294471.00 32.22 50.0
TESLA 17315009.00 10.00 0.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA 1000000.00 7.10 25.0
SAMSUNG ELECTRONICS 1000000.00 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 10.00 0.0
NETFLIX 3665870.00 10.00 0.0
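
As mentioned above, a very common choice is to fill each column with its own mean. A minimal sketch of that, assuming the same companies_df as above:

# Fill the NaN values of every column with that column's mean
companies_df.fillna(companies_df.mean())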

Interpolation methods that were available for reindexing can be used with fillna (in recent pandas versions you can also call the bfill() and ffill() methods directly).

companies_df.fillna(method = 'bfill')
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2832025.00 3.13 75.0
SAUDI ARABIAN OIL 3220550.15 175.59 75.0
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
ALIBABA GROUP HOLDING ADR 1:8 2291182.00 25.69 75.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 50.0
ALPHABET A 50294471.00 32.22 50.0
TESLA 17315009.00 245.17 18.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
NESTLE 'N' 12906002.11 58.52 35.0
NVIDIA 4460719.00 7.10 25.0
SAMSUNG ELECTRONICS 4460719.00 69.06 70.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
NETFLIX 3665870.00 NaN NaN

duplicated() can be used to detect duplicated rows. It returns a boolean Series that marks every row which is identical to an earlier row.

More about combining dataframes later. Here we just build a new dataframe that contains companies_df twice (append() works in older pandas; in pandas 2.0 and later you would use pd.concat() instead). Then we sort the index so that every company appears twice in the new dataframe.

new_df = companies_df.append(companies_df)
new_df.sort_index(inplace=True)
new_df
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
ALPHABET A 50294471.00 32.22 50.0
ALPHABET A 50294471.00 32.22 50.0
AMAZON.COM 6567669.00 40.14 50.0
AMAZON.COM 6567669.00 40.14 50.0
APPLE 2832025.00 3.13 NaN
APPLE 2832025.00 3.13 NaN
FACEBOOK CLASS A 2291182.00 25.69 75.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
MICROSOFT 4336000.00 29.16 75.0
MICROSOFT 4336000.00 29.16 75.0
NESTLE 'N' 12906002.11 58.52 35.0
NESTLE 'N' 12906002.11 58.52 35.0
NETFLIX 3665870.00 NaN NaN
NETFLIX 3665870.00 NaN NaN
NVIDIA NaN 7.10 25.0
NVIDIA NaN 7.10 25.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
TENCENT HOLDINGS 2681758.80 15.81 NaN
TESLA 17315009.00 NaN NaN
TESLA 17315009.00 NaN NaN
new_df.duplicated()
NAME
ALIBABA GROUP HOLDING ADR 1:8    False
ALIBABA GROUP HOLDING ADR 1:8     True
ALPHABET A                       False
ALPHABET A                        True
AMAZON.COM                       False
AMAZON.COM                        True
APPLE                            False
APPLE                             True
FACEBOOK CLASS A                 False
FACEBOOK CLASS A                  True
JP MORGAN CHASE & CO.            False
JP MORGAN CHASE & CO.             True
KWEICHOW MOUTAI 'A'              False
KWEICHOW MOUTAI 'A'               True
MICROSOFT                        False
MICROSOFT                         True
NESTLE 'N'                       False
NESTLE 'N'                        True
NETFLIX                          False
NETFLIX                           True
NVIDIA                           False
NVIDIA                            True
SAMSUNG ELECTRONICS              False
SAMSUNG ELECTRONICS               True
SAUDI ARABIAN OIL                False
SAUDI ARABIAN OIL                 True
TAIWAN SEMICON.MNFG.             False
TAIWAN SEMICON.MNFG.              True
TENCENT HOLDINGS                 False
TENCENT HOLDINGS                  True
TESLA                            False
TESLA                             True
dtype: bool

We can remove duplicated rows using drop_duplicates().

new_df.drop_duplicates()
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
ALPHABET A 50294471.00 32.22 50.0
AMAZON.COM 6567669.00 40.14 50.0
APPLE 2832025.00 3.13 NaN
FACEBOOK CLASS A 2291182.00 25.69 75.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
MICROSOFT 4336000.00 29.16 75.0
NESTLE 'N' 12906002.11 58.52 35.0
NETFLIX 3665870.00 NaN NaN
NVIDIA NaN 7.10 25.0
SAMSUNG ELECTRONICS NaN 69.06 70.0
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
TESLA 17315009.00 NaN NaN
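
For illustration, drop_duplicates() also accepts a subset of columns to compare and a keep parameter that controls which of the duplicates is retained. A minimal sketch, assuming the same new_df as above:

# Compare rows only on one column and keep the last occurrence instead of the first
new_df.drop_duplicates(subset=['Total CO2 To Revenues'], keep='last')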

It is easy to apply a function transformation to the rows, columns or individual cells of a dataframe. The map() method can be used for that.

companies_df['Compensation ($ millions)'] = companies_df['Board member compensation'].map(
    lambda x: "{:.2f}".format(x/1000000)+" M$")
companies_df
Board member compensation Total CO2 To Revenues Emission Reduction Target % Compensation ($ millions)
NAME
APPLE 2832025.00 3.13 NaN 2.83 M$
SAUDI ARABIAN OIL 3220550.15 175.59 NaN 3.22 M$
MICROSOFT 4336000.00 29.16 75.0 4.34 M$
AMAZON.COM 6567669.00 40.14 50.0 6.57 M$
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN nan M$
FACEBOOK CLASS A 2291182.00 25.69 75.0 2.29 M$
TENCENT HOLDINGS 2681758.80 15.81 NaN 2.68 M$
ALPHABET A 50294471.00 32.22 50.0 50.29 M$
TESLA 17315009.00 NaN NaN 17.32 M$
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0 12.58 M$
NESTLE 'N' 12906002.11 58.52 35.0 12.91 M$
NVIDIA NaN 7.10 25.0 nan M$
SAMSUNG ELECTRONICS NaN 69.06 70.0 nan M$
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0 4.46 M$
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN 0.04 M$
NETFLIX 3665870.00 NaN NaN 3.67 M$

map() can also be used to transform the index. The following command reverses the company names.

companies_df.index.map(lambda x : x[::-1])
Index(['ELPPA', 'LIO NAIBARA IDUAS', 'TFOSORCIM', 'MOC.NOZAMA',
       '8:1 RDA GNIDLOH PUORG ABABILA', 'A SSALC KOOBECAF', 'SGNIDLOH TNECNET',
       'A TEBAHPLA', 'ALSET', '.GFNM.NOCIMES NAWIAT', ''N' ELTSEN', 'AIDIVN',
       'SCINORTCELE GNUSMAS', '.OC & ESAHC NAGROM PJ', ''A' IATUOM WOHCIEWK',
       'XILFTEN'],
      dtype='object', name='NAME')

Of course, you can also use rename(). Using a dictionary, rename() can also be used to change only some of the labels.

companies_df.rename(index=str.title)
Board member compensation Total CO2 To Revenues Emission Reduction Target % Compensation ($ millions)
NAME
Apple 2832025.00 3.13 NaN 2.83 M$
Saudi Arabian Oil 3220550.15 175.59 NaN 3.22 M$
Microsoft 4336000.00 29.16 75.0 4.34 M$
Amazon.Com 6567669.00 40.14 50.0 6.57 M$
Alibaba Group Holding Adr 1:8 NaN NaN NaN nan M$
Facebook Class A 2291182.00 25.69 75.0 2.29 M$
Tencent Holdings 2681758.80 15.81 NaN 2.68 M$
Alphabet A 50294471.00 32.22 50.0 50.29 M$
Tesla 17315009.00 NaN NaN 17.32 M$
Taiwan Semicon.Mnfg. 12584165.55 245.17 18.0 12.58 M$
Nestle 'N' 12906002.11 58.52 35.0 12.91 M$
Nvidia NaN 7.10 25.0 nan M$
Samsung Electronics NaN 69.06 70.0 nan M$
Jp Morgan Chase & Co. 4460719.00 6.69 50.0 4.46 M$
Kweichow Moutai 'A' 40219.20 NaN NaN 0.04 M$
Netflix 3665870.00 NaN NaN 3.67 M$
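
As noted above, rename() with a dictionary changes only the labels you list. A minimal sketch (the new labels are hypothetical):

# Rename only two row labels; everything else is left untouched
companies_df.rename(index={'FACEBOOK CLASS A': 'META PLATFORMS A', 'ALPHABET A': 'GOOGLE A'})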

replace() can be used to replace any values, not just NaN values. You can also pass a dict or a list if you want to replace multiple values.

import numpy as np
companies_df.replace(np.nan,-999) # Pandas NaN is np.nan
Board member compensation Total CO2 To Revenues Emission Reduction Target % Compensation ($ millions)
NAME
APPLE 2832025.00 3.13 -999.0 2.83 M$
SAUDI ARABIAN OIL 3220550.15 175.59 -999.0 3.22 M$
MICROSOFT 4336000.00 29.16 75.0 4.34 M$
AMAZON.COM 6567669.00 40.14 50.0 6.57 M$
ALIBABA GROUP HOLDING ADR 1:8 -999.00 -999.00 -999.0 nan M$
FACEBOOK CLASS A 2291182.00 25.69 75.0 2.29 M$
TENCENT HOLDINGS 2681758.80 15.81 -999.0 2.68 M$
ALPHABET A 50294471.00 32.22 50.0 50.29 M$
TESLA 17315009.00 -999.00 -999.0 17.32 M$
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0 12.58 M$
NESTLE 'N' 12906002.11 58.52 35.0 12.91 M$
NVIDIA -999.00 7.10 25.0 nan M$
SAMSUNG ELECTRONICS -999.00 69.06 70.0 nan M$
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0 4.46 M$
KWEICHOW MOUTAI 'A' 40219.20 -999.00 -999.0 0.04 M$
NETFLIX 3665870.00 -999.00 -999.0 3.67 M$
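
As mentioned, replace() also accepts a list of old values or a dict that maps old values to new ones. A minimal sketch, assuming the same companies_df:

# Replace several values at once with a single value...
companies_df.replace([np.nan, 75.0], -999)
# ...or give each old value its own replacement with a dict
companies_df.replace({np.nan: -999, 75.0: 80.0})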

Dividing data into bins is a very important feature in Pandas. You can use pd.cut() (notice that it is not a dataframe method) to create a categorical object. pd.value_counts() can be used to count the number of observations in each bin. With the labels parameter you can define names for the different bins.

compensation_bins = [10000,100000,1000000,10000000,100000000]
comp_categ = pd.cut(companies_df['Board member compensation'],compensation_bins)
comp_categ
NAME
APPLE                              (1000000.0, 10000000.0]
SAUDI ARABIAN OIL                  (1000000.0, 10000000.0]
MICROSOFT                          (1000000.0, 10000000.0]
AMAZON.COM                         (1000000.0, 10000000.0]
ALIBABA GROUP HOLDING ADR 1:8                          NaN
FACEBOOK CLASS A                   (1000000.0, 10000000.0]
TENCENT HOLDINGS                   (1000000.0, 10000000.0]
ALPHABET A                       (10000000.0, 100000000.0]
TESLA                            (10000000.0, 100000000.0]
TAIWAN SEMICON.MNFG.             (10000000.0, 100000000.0]
NESTLE 'N'                       (10000000.0, 100000000.0]
NVIDIA                                                 NaN
SAMSUNG ELECTRONICS                                    NaN
JP MORGAN CHASE & CO.              (1000000.0, 10000000.0]
KWEICHOW MOUTAI 'A'                    (10000.0, 100000.0]
NETFLIX                            (1000000.0, 10000000.0]
Name: Board member compensation, dtype: category
Categories (4, interval[int64]): [(10000, 100000] < (100000, 1000000] < (1000000, 10000000] < (10000000, 100000000]]
comp_categ.values.categories
IntervalIndex([(10000, 100000], (100000, 1000000], (1000000, 10000000], (10000000, 100000000]],
              closed='right',
              dtype='interval[int64]')
comp_categ.values.codes
array([ 2,  2,  2,  2, -1,  2,  2,  3,  3,  3,  3, -1, -1,  2,  0,  2],
      dtype=int8)
pd.value_counts(comp_categ)
(1000000, 10000000]      8
(10000000, 100000000]    4
(10000, 100000]          1
(100000, 1000000]        0
Name: Board member compensation, dtype: int64
pd.cut(companies_df['Board member compensation'],
                    compensation_bins,labels = ['Poor board members','Rich board members',
                                                'Very rich board members','Insanely rich board members'])
NAME
APPLE                                Very rich board members
SAUDI ARABIAN OIL                    Very rich board members
MICROSOFT                            Very rich board members
AMAZON.COM                           Very rich board members
ALIBABA GROUP HOLDING ADR 1:8                            NaN
FACEBOOK CLASS A                     Very rich board members
TENCENT HOLDINGS                     Very rich board members
ALPHABET A                       Insanely rich board members
TESLA                            Insanely rich board members
TAIWAN SEMICON.MNFG.             Insanely rich board members
NESTLE 'N'                       Insanely rich board members
NVIDIA                                                   NaN
SAMSUNG ELECTRONICS                                      NaN
JP MORGAN CHASE & CO.                Very rich board members
KWEICHOW MOUTAI 'A'                       Poor board members
NETFLIX                              Very rich board members
Name: Board member compensation, dtype: category
Categories (4, object): ['Poor board members' < 'Rich board members' < 'Very rich board members' < 'Insanely rich board members']

If you pass a number to the bins parameter, it will return that many equal-length bins.

pd.cut(companies_df['Total CO2 To Revenues'],4,precision=1)
NAME
APPLE                               (2.9, 63.6]
SAUDI ARABIAN OIL                (124.2, 184.7]
MICROSOFT                           (2.9, 63.6]
AMAZON.COM                          (2.9, 63.6]
ALIBABA GROUP HOLDING ADR 1:8               NaN
FACEBOOK CLASS A                    (2.9, 63.6]
TENCENT HOLDINGS                    (2.9, 63.6]
ALPHABET A                          (2.9, 63.6]
TESLA                                       NaN
TAIWAN SEMICON.MNFG.             (184.7, 245.2]
NESTLE 'N'                          (2.9, 63.6]
NVIDIA                              (2.9, 63.6]
SAMSUNG ELECTRONICS               (63.6, 124.2]
JP MORGAN CHASE & CO.               (2.9, 63.6]
KWEICHOW MOUTAI 'A'                         NaN
NETFLIX                                     NaN
Name: Total CO2 To Revenues, dtype: category
Categories (4, interval[float64]): [(2.9, 63.6] < (63.6, 124.2] < (124.2, 184.7] < (184.7, 245.2]]

For many purposes, qcut() is more useful as it bins the data based on sample quantiles. Therefore, every bin has approximately the same number of observations. You can also pass specific quantiles as a list to the function.

pd.qcut(companies_df['Total CO2 To Revenues'],4)
NAME
APPLE                             (3.129, 13.632]
SAUDI ARABIAN OIL                (61.155, 245.17]
MICROSOFT                         (13.632, 30.69]
AMAZON.COM                        (30.69, 61.155]
ALIBABA GROUP HOLDING ADR 1:8                 NaN
FACEBOOK CLASS A                  (13.632, 30.69]
TENCENT HOLDINGS                  (13.632, 30.69]
ALPHABET A                        (30.69, 61.155]
TESLA                                         NaN
TAIWAN SEMICON.MNFG.             (61.155, 245.17]
NESTLE 'N'                        (30.69, 61.155]
NVIDIA                            (3.129, 13.632]
SAMSUNG ELECTRONICS              (61.155, 245.17]
JP MORGAN CHASE & CO.             (3.129, 13.632]
KWEICHOW MOUTAI 'A'                           NaN
NETFLIX                                       NaN
Name: Total CO2 To Revenues, dtype: category
Categories (4, interval[float64]): [(3.129, 13.632] < (13.632, 30.69] < (30.69, 61.155] < (61.155, 245.17]]
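
As mentioned above, you can also pass specific quantiles as a list to qcut(). A minimal sketch that splits the variable at the 10 %, 50 % and 90 % quantiles:

# The cut points are quantiles between 0 and 1 (the list must start at 0 and end at 1)
pd.qcut(companies_df['Total CO2 To Revenues'], [0, 0.1, 0.5, 0.9, 1.0])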

Detecting and filtering outliers is easy with boolean indexing.

companies_df[companies_df['Board member compensation'] < 10000000]
Board member compensation Total CO2 To Revenues Emission Reduction Target % Compensation ($ millions)
NAME
APPLE 2832025.00 3.13 NaN 2.83 M$
SAUDI ARABIAN OIL 3220550.15 175.59 NaN 3.22 M$
MICROSOFT 4336000.00 29.16 75.0 4.34 M$
AMAZON.COM 6567669.00 40.14 50.0 6.57 M$
FACEBOOK CLASS A 2291182.00 25.69 75.0 2.29 M$
TENCENT HOLDINGS 2681758.80 15.81 NaN 2.68 M$
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0 4.46 M$
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN 0.04 M$
NETFLIX 3665870.00 NaN NaN 3.67 M$

Filtering is also possible with the summary statistics of variables. The following command picks up cases where the values of all variables deviate from the mean by less than two standard deviations.

companies_df.drop('Compensation ($ millions)',axis=1,inplace=True)
companies_df[(np.abs(companies_df-companies_df.mean()) < 2*companies_df.std(axis=0)).all(1)]
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
MICROSOFT 4336000.00 29.16 75.0
AMAZON.COM 6567669.00 40.14 50.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
NESTLE 'N' 12906002.11 58.52 35.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0

Winsorising is another common procedure in practical econometrics. In that method, the most extreme values are moved to specific quantiles, usually the 1 % and 99 % quantiles. It is easiest to implement with the clip() method. Notice how the following command winsorises all the variables at once (here at the 10 % and 90 % quantiles).

companies_df.clip(lower = companies_df.quantile(0.1), upper = companies_df.quantile(0.9),axis=1)
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
APPLE 2.832025e+06 6.731 NaN
SAUDI ARABIAN OIL 3.220550e+06 164.937 NaN
MICROSOFT 4.336000e+06 29.160 75.0
AMAZON.COM 6.567669e+06 40.140 50.0
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
FACEBOOK CLASS A 2.369297e+06 25.690 75.0
TENCENT HOLDINGS 2.681759e+06 15.810 NaN
ALPHABET A 1.643321e+07 32.220 50.0
TESLA 1.643321e+07 NaN NaN
TAIWAN SEMICON.MNFG. 1.258417e+07 164.937 23.6
NESTLE 'N' 1.290600e+07 58.520 35.0
NVIDIA NaN 7.100 25.0
SAMSUNG ELECTRONICS NaN 69.060 70.0
JP MORGAN CHASE & CO. 4.460719e+06 6.731 50.0
KWEICHOW MOUTAI 'A' 2.369297e+06 NaN NaN
NETFLIX 3.665870e+06 NaN NaN
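
As noted, winsorisation is usually done at the 1 % and 99 % quantiles. With only 16 companies the effect would be negligible here, but as a sketch the usual call looks like this:

# Winsorise every column at the 1 % and 99 % quantiles
companies_df.clip(lower=companies_df.quantile(0.01), upper=companies_df.quantile(0.99), axis=1)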

Random sampling from data is easy. You can use the sample() method for that. It is also possible to sample with replacement, which is needed in many numerical statistical methods, such as bootstrapping.

companies_df.sample(n=10,replace=True)
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
TENCENT HOLDINGS 2681758.80 15.81 NaN
FACEBOOK CLASS A 2291182.00 25.69 75.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
NVIDIA NaN 7.10 25.0
NETFLIX 3665870.00 NaN NaN
NETFLIX 3665870.00 NaN NaN
MICROSOFT 4336000.00 29.16 75.0
ALPHABET A 50294471.00 32.22 50.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
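
You can also sample a fraction of the rows and fix the random seed so that the sample is reproducible. A minimal sketch:

# Sample half of the rows without replacement; random_state makes the result reproducible
companies_df.sample(frac=0.5, random_state=42)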

If you want to randomise the order of rows in a dataframe, you can use Numpy’s random.permutation().

permut_index = np.random.permutation(len(companies_df))
companies_df.iloc[permut_index]
Board member compensation Total CO2 To Revenues Emission Reduction Target %
NAME
ALIBABA GROUP HOLDING ADR 1:8 NaN NaN NaN
SAMSUNG ELECTRONICS NaN 69.06 70.0
TENCENT HOLDINGS 2681758.80 15.81 NaN
ALPHABET A 50294471.00 32.22 50.0
FACEBOOK CLASS A 2291182.00 25.69 75.0
JP MORGAN CHASE & CO. 4460719.00 6.69 50.0
SAUDI ARABIAN OIL 3220550.15 175.59 NaN
MICROSOFT 4336000.00 29.16 75.0
TAIWAN SEMICON.MNFG. 12584165.55 245.17 18.0
APPLE 2832025.00 3.13 NaN
NETFLIX 3665870.00 NaN NaN
NESTLE 'N' 12906002.11 58.52 35.0
KWEICHOW MOUTAI 'A' 40219.20 NaN NaN
NVIDIA NaN 7.10 25.0
AMAZON.COM 6567669.00 40.14 50.0
TESLA 17315009.00 NaN NaN

Very often in econometrics, you need to transform your categorical variables into a collection of dummy variables. It is easily done in Pandas using the get_dummies() function.

pd.get_dummies(companies_df['Emission Reduction Target %'],prefix='Emiss_')
Emiss__18.0 Emiss__25.0 Emiss__35.0 Emiss__50.0 Emiss__70.0 Emiss__75.0
NAME
APPLE 0 0 0 0 0 0
SAUDI ARABIAN OIL 0 0 0 0 0 0
MICROSOFT 0 0 0 0 0 1
AMAZON.COM 0 0 0 1 0 0
ALIBABA GROUP HOLDING ADR 1:8 0 0 0 0 0 0
FACEBOOK CLASS A 0 0 0 0 0 1
TENCENT HOLDINGS 0 0 0 0 0 0
ALPHABET A 0 0 0 1 0 0
TESLA 0 0 0 0 0 0
TAIWAN SEMICON.MNFG. 1 0 0 0 0 0
NESTLE 'N' 0 0 1 0 0 0
NVIDIA 0 1 0 0 0 0
SAMSUNG ELECTRONICS 0 0 0 0 1 0
JP MORGAN CHASE & CO. 0 0 0 1 0 0
KWEICHOW MOUTAI 'A' 0 0 0 0 0 0
NETFLIX 0 0 0 0 0 0

Pandas has efficient methods to manipulate strings in a dataframe. Due to this, Pandas is very popular among researchers who need to use string data in their analysis. This also makes it a very important tool for accounting data analysis. All the built-in methods of Python string objects can be used to manipulate strings in a dataframe. We already discussed string methods in Chapter 1.

Regular expressions can also be used to manipulate string dataframes. Regular expressions are a very broad topic, and they take time to master. Let’s look at some very simple examples. If we want to split a sentence into a list of words, repeated whitespace makes the process difficult with the standard string methods. However, it is very easy with regular expressions.

import re

The regex for one or more whitespace characters is \s+.

splitter = re.compile(r'\s+')
splitter.split("This    is   a    sample        text.")
['This', 'is', 'a', 'sample', 'text.']

Regular expressions are a powerful but complex tool. You can search for email addresses in a text with the following regex pattern.

reg_pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
email_search = re.compile(reg_pattern, flags = re.IGNORECASE)
example = """The general format of an email address is local-part@domain,
and a specific example is jsmith@example.com. An address consists of two parts. 
The part before the @ symbol (local part) identifies the name of a mailbox. 
This is often the username of the recipient, e.g., jsmith. 
The part after the @ symbol (domain) is a domain name that represents 
the administrative realm for the mail box, e.g., a company's domain name, example.com."""
email_search.findall(example)
['jsmith@example.com']

We will not go into regular expressions further at this point, but we will see some applications of them in the following chapters. If you want to learn more about regular expressions, here is a good website: www.regular-expressions.info/index.html

If your string data contains missing values, the standard string methods will usually not work. Then you need to use the vectorised string methods of Pandas, available through the .str accessor.
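
A minimal sketch of this behaviour with a small, made-up Series:

names = pd.Series(['apple', np.nan, 'nvidia'])
# The vectorised .str methods simply return NaN for the missing value instead of raising an error
names.str.upper()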

You can slice strings normally.

companies_df.index.str[0:5]
Index(['APPLE', 'SAUDI', 'MICRO', 'AMAZO', 'ALIBA', 'FACEB', 'TENCE', 'ALPHA',
       'TESLA', 'TAIWA', 'NESTL', 'NVIDI', 'SAMSU', 'JP MO', 'KWEIC', 'NETFL'],
      dtype='object', name='NAME')

And you can use regular expressions, for example, to search strings.

companies_df.index.str.findall('am', flags = re.IGNORECASE)
Index([[], [], [], ['AM'], [], [], [], [], [], [], [], [], ['AM'], [], [], []], dtype='object', name='NAME')

Here is a good introduction to Pandas string methods: pandas.pydata.org/pandas-docs/stable/user_guide/text.html. If you scroll down, there is a full list of methods.

4.2. Advanced methods#

Let’s look at some more advanced Pandas methods next. We need a new dataset for that.

electricity_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/AER/USAirlines.csv',index_col=0)
electricity_df['firm'].replace({1:'A',2:'B',3:'C',4:'D',5:'E',6:'F'},inplace=True)

The dataset contains annual cost data for six US airline firms.

electricity_df
firm year output cost price load
1 A 1970 0.952757 1140640 106650 0.534487
2 A 1971 0.986757 1215690 110307 0.532328
3 A 1972 1.091980 1309570 110574 0.547736
4 A 1973 1.175780 1511530 121974 0.540846
5 A 1974 1.160170 1676730 196606 0.591167
... ... ... ... ... ... ...
86 F 1980 0.112640 381478 874818 0.517766
87 F 1981 0.154154 506969 1013170 0.580049
88 F 1982 0.186461 633388 930477 0.556024
89 F 1983 0.246847 804388 851676 0.537791
90 F 1984 0.304013 1009500 819476 0.525775

90 rows × 6 columns

There are many tools to deal with data that has more than two dimensions. The basic idea is to use hierarchical indexing in a two dimensional dataframe. For example, we can index the data by firm-years using set_index().

electricity_df.set_index(['firm','year'],inplace=True)
electricity_df
output cost price load
firm year
A 1970 0.952757 1140640 106650 0.534487
1971 0.986757 1215690 110307 0.532328
1972 1.091980 1309570 110574 0.547736
1973 1.175780 1511530 121974 0.540846
1974 1.160170 1676730 196606 0.591167
... ... ... ... ... ...
F 1980 0.112640 381478 874818 0.517766
1981 0.154154 506969 1013170 0.580049
1982 0.186461 633388 930477 0.556024
1983 0.246847 804388 851676 0.537791
1984 0.304013 1009500 819476 0.525775

90 rows × 4 columns

Now the rows are sorted so that all the years of a certain company are adjacent. With sort_index() you can order the rows by year instead.

electricity_df.sort_index(level=1)
output cost price load
firm year
A 1970 0.952757 1140640 106650 0.534487
B 1970 0.520635 569292 103795 0.490851
C 1970 0.262424 286298 118788 0.524334
D 1970 0.086393 145167 114987 0.432066
E 1970 0.051028 91361 118222 0.442875
... ... ... ... ... ...
B 1984 1.389740 4209390 821361 0.528775
C 1984 0.493317 1170470 844079 0.577078
D 1984 0.421411 1436970 831374 0.585525
E 1984 0.213279 610257 844622 0.635545
F 1984 0.304013 1009500 819476 0.525775

90 rows × 4 columns

Two-level indexing enables an easy way to pick subgroups from the data. For a series, you can just use the standard indexing style of Python.

electricity_df['output']['A']
year
1970    0.952757
1971    0.986757
1972    1.091980
1973    1.175780
1974    1.160170
1975    1.173760
1976    1.290510
1977    1.390670
1978    1.612730
1979    1.825440
1980    1.546040
1981    1.527900
1982    1.660200
1983    1.822310
1984    1.936460
Name: output, dtype: float64
electricity_df['output'][:,1975]
firm
A    1.173760
B    0.852892
C    0.367517
D    0.164272
E    0.073961
F    0.052462
Name: output, dtype: float64

With multi-index dataframes, you have to be a little bit more careful, because the subgroups are dataframes themselves. For example, you need to use loc to pick up the subgroups.

electricity_df.loc['A']
output cost price load
year
1970 0.952757 1140640 106650 0.534487
1971 0.986757 1215690 110307 0.532328
1972 1.091980 1309570 110574 0.547736
1973 1.175780 1511530 121974 0.540846
1974 1.160170 1676730 196606 0.591167
1975 1.173760 1823740 265609 0.575417
1976 1.290510 2022890 263451 0.594495
1977 1.390670 2314760 316411 0.597409
1978 1.612730 2639160 384110 0.638522
1979 1.825440 3247620 569251 0.676287
1980 1.546040 3787750 871636 0.605735
1981 1.527900 3867750 997239 0.614360
1982 1.660200 3996020 938002 0.633366
1983 1.822310 4282880 859572 0.650117
1984 1.936460 4748320 823411 0.625603
electricity_df.loc['B'].loc[1970]
output         0.520635
cost      569292.000000
price     103795.000000
load           0.490851
Name: 1970, dtype: float64

If you want to pick the values of a certain year, you can change the order of the index levels using swaplevel() (an alternative using xs() is sketched after the example).

electricity_df.swaplevel('firm','year').loc[1975]
output cost price load
firm
A 1.173760 1823740 265609 0.575417
B 0.852892 1358100 281704 0.558133
C 0.367517 510412 278721 0.607270
D 0.164272 373941 263148 0.532723
E 0.073961 156228 277930 0.556181
F 0.052462 133161 307923 0.495361
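
Alternatively, you can pick a cross-section from any index level directly with xs(), without swapping the levels first. A minimal sketch:

# Pick all firms for the year 1975 straight from the 'year' level
electricity_df.xs(1975, level='year')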

You can easily calculate descriptive statistics at multiple levels. Most of the statistics functions in Pandas have included a level parameter for that (but see the note about newer pandas versions after the examples).

electricity_df.sum()
output    4.904952e+01
cost      1.010271e+08
price     4.245147e+07
load      5.044141e+01
dtype: float64
electricity_df.sum(level=0)
output cost price load
firm
A 21.153464 39585050 6934803 8.957875
B 15.250419 31918255 6994301 8.206419
C 6.185664 10847838 7145233 8.768038
D 3.188515 9571326 7100522 8.215159
E 1.707548 4399139 7171182 8.497289
F 1.563908 4705537 7105430 7.796634
electricity_df.sum(level=1)
output cost price load
year
1970 1.910919 2301736 679554 2.873152
1971 1.976987 2506118 701726 2.920993
1972 2.295394 2858803 705968 3.141480
1973 2.579854 3408925 747634 3.146692
1974 2.647980 3941906 1280945 3.381160
1975 2.684864 4355582 1675035 3.325085
1976 2.936119 4848733 1832425 3.364455
1977 3.173497 5548301 2106566 3.402352
1978 3.731864 6454748 2310525 3.707459
1979 4.190756 8020356 3347542 3.740366
1980 3.853406 9661270 5148943 3.481546
1981 3.800075 10634533 6046129 3.513746
1982 4.016458 11167822 5675129 3.481910
1983 4.493125 12133405 5209027 3.482717
1984 4.758220 13184907 4984323 3.478301
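
Note that in pandas 2.0 and later the level parameter has been removed from the aggregation methods; the same result is obtained by grouping on an index level. A minimal sketch:

# The same firm-level sums as above, written with groupby instead of sum(level=0)
electricity_df.groupby(level=0).sum()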

If we want to remove the multi-index, we can use reset_index(). With the level parameter, you can decide which levels of the multi-index are removed.

electricity_df.reset_index(level=1)
year output cost price load
firm
A 1970 0.952757 1140640 106650 0.534487
A 1971 0.986757 1215690 110307 0.532328
A 1972 1.091980 1309570 110574 0.547736
A 1973 1.175780 1511530 121974 0.540846
A 1974 1.160170 1676730 196606 0.591167
... ... ... ... ... ...
F 1980 0.112640 381478 874818 0.517766
F 1981 0.154154 506969 1013170 0.580049
F 1982 0.186461 633388 930477 0.556024
F 1983 0.246847 804388 851676 0.537791
F 1984 0.304013 1009500 819476 0.525775

90 rows × 5 columns

electricity_df.reset_index(inplace=True)

Summary statistics can also be calculated using the groupby method.

electricity_df.groupby('firm').mean()
output cost price load
firm
A 1.410231 2.639003e+06 462320.200000 0.597192
B 1.016695 2.127884e+06 466286.733333 0.547095
C 0.412378 7.231892e+05 476348.866667 0.584536
D 0.212568 6.380884e+05 473368.133333 0.547677
E 0.113837 2.932759e+05 478078.800000 0.566486
F 0.104261 3.137025e+05 473695.333333 0.519776
electricity_df.groupby(['firm','year']).mean() # The mean of single values
output cost price load
firm year
A 1970 0.952757 1140640 106650 0.534487
1971 0.986757 1215690 110307 0.532328
1972 1.091980 1309570 110574 0.547736
1973 1.175780 1511530 121974 0.540846
1974 1.160170 1676730 196606 0.591167
... ... ... ... ... ...
F 1980 0.112640 381478 874818 0.517766
1981 0.154154 506969 1013170 0.580049
1982 0.186461 633388 930477 0.556024
1983 0.246847 804388 851676 0.537791
1984 0.304013 1009500 819476 0.525775

90 rows × 4 columns

stack and unstack can be used to reshape hierarchical index dataframes.

electricity_df.reset_index(inplace=True)
electricity_df
firm year output cost price load
0 A 1970 0.952757 1140640 106650 0.534487
1 A 1971 0.986757 1215690 110307 0.532328
2 A 1972 1.091980 1309570 110574 0.547736
3 A 1973 1.175780 1511530 121974 0.540846
4 A 1974 1.160170 1676730 196606 0.591167
... ... ... ... ... ... ...
85 F 1980 0.112640 381478 874818 0.517766
86 F 1981 0.154154 506969 1013170 0.580049
87 F 1982 0.186461 633388 930477 0.556024
88 F 1983 0.246847 804388 851676 0.537791
89 F 1984 0.304013 1009500 819476 0.525775

90 rows × 6 columns

Stack turns a dataframe into a series.

data_series= electricity_df.stack()
data_series
0   firm             A
    year          1970
    output    0.952757
    cost       1140640
    price       106650
                ...   
89  year          1984
    output    0.304013
    cost       1009500
    price       819476
    load      0.525775
Length: 540, dtype: object

unstack() can be used to rearrange the data back into a dataframe.

data_series.unstack()
firm year output cost price load
0 A 1970 0.952757 1140640 106650 0.534487
1 A 1971 0.986757 1215690 110307 0.532328
2 A 1972 1.09198 1309570 110574 0.547736
3 A 1973 1.17578 1511530 121974 0.540846
4 A 1974 1.16017 1676730 196606 0.591167
... ... ... ... ... ... ...
85 F 1980 0.11264 381478 874818 0.517766
86 F 1981 0.154154 506969 1013170 0.580049
87 F 1982 0.186461 633388 930477 0.556024
88 F 1983 0.246847 804388 851676 0.537791
89 F 1984 0.304013 1009500 819476 0.525775

90 rows × 6 columns

Yet another tool to reorganise data is pivot_table(). More about it later.

4.3. Merging datasets#

Merging dataframes is often difficult for Pandas beginners, and it can be a hassle. The usual cause of difficulties is forgetting the importance of the index in Pandas data types. Merging dataframes is not about gluing tables together; the merging is done according to key values or indices.

gasoline_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Gasoline.csv',index_col=0)
gasoline_df.set_index('country',inplace=True)
gasoline_df
year lgaspcar lincomep lrpmg lcarpcap
country
AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840
AUSTRIA 1961 4.100989 -6.426006 -0.351328 -9.608622
AUSTRIA 1962 4.073177 -6.407308 -0.379518 -9.457257
AUSTRIA 1963 4.059509 -6.370679 -0.414251 -9.343155
AUSTRIA 1964 4.037689 -6.322247 -0.445335 -9.237739
... ... ... ... ... ...
U.S.A. 1974 4.798626 -5.328694 -1.231467 -7.617558
U.S.A. 1975 4.804932 -5.346190 -1.200377 -7.607010
U.S.A. 1976 4.814891 -5.297946 -1.154682 -7.574748
U.S.A. 1977 4.811032 -5.256606 -1.175910 -7.553458
U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176

342 rows × 5 columns

Let’s split the data.

gaso2_df = gasoline_df[['lrpmg','lcarpcap']]
gasoline_df.drop(['lrpmg','lcarpcap'],axis=1,inplace=True)
gasoline_df
year lgaspcar lincomep
country
AUSTRIA 1960 4.173244 -6.474277
AUSTRIA 1961 4.100989 -6.426006
AUSTRIA 1962 4.073177 -6.407308
AUSTRIA 1963 4.059509 -6.370679
AUSTRIA 1964 4.037689 -6.322247
... ... ... ...
U.S.A. 1974 4.798626 -5.328694
U.S.A. 1975 4.804932 -5.346190
U.S.A. 1976 4.814891 -5.297946
U.S.A. 1977 4.811032 -5.256606
U.S.A. 1978 4.818454 -5.221232

342 rows × 3 columns

gaso2_df
lrpmg lcarpcap
country
AUSTRIA -0.334548 -9.766840
AUSTRIA -0.351328 -9.608622
AUSTRIA -0.379518 -9.457257
AUSTRIA -0.414251 -9.343155
AUSTRIA -0.445335 -9.237739
... ... ...
U.S.A. -1.231467 -7.617558
U.S.A. -1.200377 -7.607010
U.S.A. -1.154682 -7.574748
U.S.A. -1.175910 -7.553458
U.S.A. -1.212062 -7.536176

342 rows × 2 columns

merge() combines the dataframes according to the key values, here the country index.

pd.merge(gasoline_df,gaso2_df,on='country')
year lgaspcar lincomep lrpmg lcarpcap
country
AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840
AUSTRIA 1960 4.173244 -6.474277 -0.351328 -9.608622
AUSTRIA 1960 4.173244 -6.474277 -0.379518 -9.457257
AUSTRIA 1960 4.173244 -6.474277 -0.414251 -9.343155
AUSTRIA 1960 4.173244 -6.474277 -0.445335 -9.237739
... ... ... ... ... ...
U.S.A. 1978 4.818454 -5.221232 -1.231467 -7.617558
U.S.A. 1978 4.818454 -5.221232 -1.200377 -7.607010
U.S.A. 1978 4.818454 -5.221232 -1.154682 -7.574748
U.S.A. 1978 4.818454 -5.221232 -1.175910 -7.553458
U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176

6498 rows × 5 columns

You could also do this by using both indexes as merge keys with left_index and right_index.

pd.merge(gasoline_df,gaso2_df,left_index=True,right_index=True)
year lgaspcar lincomep lrpmg lcarpcap
country
AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840
AUSTRIA 1960 4.173244 -6.474277 -0.351328 -9.608622
AUSTRIA 1960 4.173244 -6.474277 -0.379518 -9.457257
AUSTRIA 1960 4.173244 -6.474277 -0.414251 -9.343155
AUSTRIA 1960 4.173244 -6.474277 -0.445335 -9.237739
... ... ... ... ... ...
U.S.A. 1978 4.818454 -5.221232 -1.231467 -7.617558
U.S.A. 1978 4.818454 -5.221232 -1.200377 -7.607010
U.S.A. 1978 4.818454 -5.221232 -1.154682 -7.574748
U.S.A. 1978 4.818454 -5.221232 -1.175910 -7.553458
U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176

6498 rows × 5 columns

You can also use join. It uses the index values for merging by default.

gasoline_df.join(gaso2_df)
year lgaspcar lincomep lrpmg lcarpcap
country
AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840
AUSTRIA 1960 4.173244 -6.474277 -0.351328 -9.608622
AUSTRIA 1960 4.173244 -6.474277 -0.379518 -9.457257
AUSTRIA 1960 4.173244 -6.474277 -0.414251 -9.343155
AUSTRIA 1960 4.173244 -6.474277 -0.445335 -9.237739
... ... ... ... ... ...
U.S.A. 1978 4.818454 -5.221232 -1.231467 -7.617558
U.S.A. 1978 4.818454 -5.221232 -1.200377 -7.607010
U.S.A. 1978 4.818454 -5.221232 -1.154682 -7.574748
U.S.A. 1978 4.818454 -5.221232 -1.175910 -7.553458
U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176

6498 rows × 5 columns

concat() is a more general tool for combining data. It is easiest to understand how it works if we use very simple datasets. Let’s create three simple Pandas Series.

x1 = pd.Series([20,30,40], index=['a','b','c'])
x2 = pd.Series([50,60], index=['e','f'])
x3 = pd.Series([70,80,90], index=['h','i','j'])

If we just feed them to concat, they are joined together as a longer Series.

pd.concat([x1,x2,x3])
a    20
b    30
c    40
e    50
f    60
h    70
i    80
j    90
dtype: int64

If you change the axis, the result will be a dataframe, because there are no overlapping index values.

pd.concat([x1,x2,x3],axis=1)
0 1 2
a 20.0 NaN NaN
b 30.0 NaN NaN
c 40.0 NaN NaN
e NaN 50.0 NaN
f NaN 60.0 NaN
h NaN NaN 70.0
i NaN NaN 80.0
j NaN NaN 90.0
x4 = pd.concat([x1,x2])
pd.concat([x4,x2],axis=1)
0 1
a 20 NaN
b 30 NaN
c 40 NaN
e 50 50.0
f 60 60.0

If you just want to keep the intersection of the series, you can use join='inner'.

pd.concat([x4,x2],axis=1,join='inner')
0 1
e 50 50
f 60 60
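
concat() can also build a hierarchical index for the result with the keys parameter, which makes it easy to see which original object each value came from. A minimal sketch with the same series:

# The keys become the outer level of a two-level index
pd.concat([x1, x2, x3], keys=['first', 'second', 'third'])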

4.4. Data aggregation#

Data aggregation is one of the most important steps in data preprocessing. Pandas has many tools for it.

groupby is an important tool in aggregation. Let’s load a new dataset for this.

import pandas as pd
comp_returns_df = pd.read_csv('comp_returns.csv',index_col=0)
comp_returns_df
NAME ISO COUNTRY CODE RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES - ACCOUNTING STANDARDS FOLLOWED
0 APPLE US 55.92 119.40 4.95 US standards (GAAP)
1 SAUDI ARABIAN OIL SA 32.25 16.96 NaN IFRS
2 MICROSOFT US 40.14 61.56 13.59 US standards (GAAP)
3 AMAZON.COM US 21.95 83.31 12.29 US standards (GAAP)
4 FACEBOOK CLASS A US 19.96 0.47 21.00 US standards (GAAP)
5 TENCENT HOLDINGS HK 24.68 53.74 7.72 IFRS
6 ALPHABET A US 18.12 2.26 15.71 US standards (GAAP)
7 TESLA US -14.94 202.77 10.73 US standards (GAAP)
8 WALMART US 20.22 72.95 NaN US standards (GAAP)
9 JOHNSON & JOHNSON US 25.36 46.57 13.28 US standards (GAAP)
10 TAIWAN SEMICON.MNFG. TW 20.94 11.89 8.08 IFRS
11 NESTLE 'N' CH 23.05 71.42 1.88 IFRS
12 PROCTER & GAMBLE US 27.78 76.10 2.79 US standards (GAAP)
13 VISA 'A' US 40.34 57.25 NaN US standards (GAAP)
14 MASTERCARD US 143.83 144.70 NaN US standards (GAAP)
15 NVIDIA US 25.95 16.31 22.49 US standards (GAAP)
16 3I GROUP GB 2.73 7.67 NaN IFRS
17 JP MORGAN CHASE & CO. US 14.91 220.24 NaN US standards (GAAP)
18 HOME DEPOT US NaN -1010.37 NaN US standards (GAAP)
19 KWEICHOW MOUTAI 'A' CN 33.12 9.14 0.85 Local standards
20 SAMSUNG ELECTRONICS KR 8.69 7.22 7.39 IFRS
21 UNITEDHEALTH GROUP US 25.40 71.33 NaN US standards (GAAP)
22 ROCHE HOLDING CH 44.71 47.58 19.38 IFRS
23 VERIZON COMMUNICATIONS US 33.64 181.59 NaN US standards (GAAP)
24 LVMH FR 20.82 68.97 0.28 IFRS
25 NETFLIX US 29.12 204.38 8.73 US standards (GAAP)
26 ADOBE (NAS) US 29.67 39.30 17.15 US standards (GAAP)
27 TOYOTA MOTOR JP 10.45 102.45 3.66 US standards (GAAP)
28 PAYPAL HOLDINGS US 15.24 29.48 8.49 US standards (GAAP)
29 SALESFORCE.COM US 0.51 9.04 14.76 US standards (GAAP)

Now we can group the data by the accounting standards variable and calculate the mean. Notice how Pandas automatically drops the string variables (the name and the country code), because you cannot calculate a mean from strings:

comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').mean()
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
ACCOUNTING STANDARDS FOLLOWED
IFRS 22.23375 35.68125 7.455000
Local standards 33.12000 9.14000 0.850000
US standards (GAAP) 29.17850 34.81381 12.115714
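
Note that depending on your pandas version the string columns may not be dropped silently; in pandas 2.0 and later you have to restrict the calculation to numeric columns yourself. A minimal sketch:

# Calculate the group means only for the numeric columns
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').mean(numeric_only=True)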

You can make a two-level grouping by adding a list of columns to groupby.

comp_returns_df.groupby(['ISO COUNTRY CODE','ACCOUNTING STANDARDS FOLLOWED']).mean()
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
ISO COUNTRY CODE ACCOUNTING STANDARDS FOLLOWED
CH IFRS 33.880000 59.500 10.630000
CN Local standards 33.120000 9.140 0.850000
FR IFRS 20.820000 68.970 0.280000
GB IFRS 2.730000 7.670 NaN
HK IFRS 24.680000 53.740 7.720000
JP US standards (GAAP) 10.450000 102.450 3.660000
KR IFRS 8.690000 7.220 7.390000
SA IFRS 32.250000 16.960 NaN
TW IFRS 20.940000 11.890 8.080000
US US standards (GAAP) 30.164211 31.432 12.766154

There are many other groupby methods, like size(). The full list of groupby methods is here: pandas.pydata.org/pandas-docs/stable/reference/groupby.html

comp_returns_df.groupby('ISO COUNTRY CODE').size()
ISO COUNTRY CODE
CH     2
CN     1
FR     1
GB     1
HK     1
JP     1
KR     1
SA     1
TW     1
US    20
dtype: int64

If we just use the groupby method without a following method, it will return a Pandas groupby object.

comp_returns_df.groupby('ISO COUNTRY CODE')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000160F39C3B80>

This object can be used for iteration.

for name,group in comp_returns_df.groupby('ISO COUNTRY CODE'):
    print(name)
    print(len(group))
CH
2
CN
1
FR
1
GB
1
HK
1
JP
1
KR
1
SA
1
TW
1
US
20

You could also group the data by columns using axis=1. It does not make much sense with this data, so we skip that example.

You can easily pick up just one column from a groupby object.

comp_returns_df.groupby('ISO COUNTRY CODE')['RETURN ON EQUITY - TOTAL (%)'].mean()
ISO COUNTRY CODE
CH    33.880000
CN    33.120000
FR    20.820000
GB     2.730000
HK    24.680000
JP    10.450000
KR     8.690000
SA    32.250000
TW    20.940000
US    30.164211
Name: RETURN ON EQUITY - TOTAL (%), dtype: float64

You could also group data by passing a dict, a function, etc. We will not go into these advanced methods here.

4.4.1. Data aggregation#

Now that we know the basics of groupby, we can analyse in more detail how it can be used to aggregate data. The built-in methods of groupby objects include count, sum, mean, median, std, var, min, max, prod, first and last. These are optimised for groupby objects, but many other methods work too. You can even define your own functions with groupby.agg().

def mean_median(arr):
    return arr.mean()-arr.median()
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').agg(mean_median)
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
ACCOUNTING STANDARDS FOLLOWED
IFRS 0.23875 3.41125 -0.100000
Local standards 0.00000 0.00000 0.000000
US standards (GAAP) 3.79850 -36.51619 -0.669286
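
agg() also accepts a list of functions (built-in names or your own), in which case every function is calculated for the selected column. A minimal sketch:

# Mean, standard deviation and the custom statistic for one variable in every group
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED')['RETURN ON EQUITY - TOTAL (%)'].agg(['mean', 'std', mean_median])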

Basically, any Pandas method that performs some kind of aggregation will work with groupby. For example, quantile() can be used:

comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').quantile(0.9)
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
ACCOUNTING STANDARDS FOLLOWED
IFRS 35.988 69.705 13.730
Local standards 33.120 9.140 0.850
US standards (GAAP) 41.898 202.770 19.845

Even though describe is not an aggregating function, it will also work.

comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').describe()
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
ACCOUNTING STANDARDS FOLLOWED
IFRS 8.0 22.23375 12.981632 2.73 17.7875 21.995 26.5725 44.71 8.0 35.68125 ... 57.5475 71.42 6.0 7.455000 6.711661 0.28 3.2575 7.555 7.9900 19.38
Local standards 1.0 33.12000 NaN 33.12 33.1200 33.120 33.1200 33.12 1.0 9.14000 ... 9.1400 9.14 1.0 0.850000 NaN 0.85 0.8500 0.850 0.8500 0.85
US standards (GAAP) 20.0 29.17850 30.798945 -14.94 17.4000 25.380 30.6625 143.83 21.0 34.81381 ... 119.4000 220.24 14.0 12.115714 6.023770 2.79 8.5500 12.785 15.4725 22.49

3 rows × 24 columns

You can make the previous table more readable by unstacking it.

comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').describe().unstack()
                                       ACCOUNTING STANDARDS FOLLOWED
RETURN ON EQUITY - TOTAL (%)    count  IFRS                              8.00000
                                       Local standards                   1.00000
                                       US standards (GAAP)              20.00000
                                mean   IFRS                             22.23375
                                       Local standards                  33.12000
                                                                          ...   
RESEARCH & DEVELOPMENT/SALES -  75%    Local standards                   0.85000
                                       US standards (GAAP)              15.47250
                                max    IFRS                             19.38000
                                       Local standards                   0.85000
                                       US standards (GAAP)              22.49000
Length: 72, dtype: float64

Apply() is the most versatile method to use with groupby objects.

def medalists(df,var = 'RETURN ON EQUITY - TOTAL (%)'):
    return df.sort_values(by=var)[-3:]
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').apply(medalists)
NAME ISO COUNTRY CODE RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES - ACCOUNTING STANDARDS FOLLOWED
ACCOUNTING STANDARDS FOLLOWED
IFRS 5 TENCENT HOLDINGS HK 24.68 53.74 7.72 IFRS
1 SAUDI ARABIAN OIL SA 32.25 16.96 NaN IFRS
22 ROCHE HOLDING CH 44.71 47.58 19.38 IFRS
Local standards 19 KWEICHOW MOUTAI 'A' CN 33.12 9.14 0.85 Local standards
US standards (GAAP) 0 APPLE US 55.92 119.40 4.95 US standards (GAAP)
14 MASTERCARD US 143.83 144.70 NaN US standards (GAAP)
18 HOME DEPOT US NaN -1010.37 NaN US standards (GAAP)

It is also possible to use the bins created with cut or qcut as a grouping criterion in groupby.

ROE_quartiles = pd.qcut(comp_returns_df['RETURN ON EQUITY - TOTAL (%)'],4)
comp_returns_df.groupby(ROE_quartiles).size()
RETURN ON EQUITY - TOTAL (%)
(-14.940999999999999, 18.12]    8
(18.12, 24.68]                  7
(24.68, 32.25]                  7
(32.25, 143.83]                 7
dtype: int64
comp_returns_df.groupby(ROE_quartiles).mean()
RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES -
RETURN ON EQUITY - TOTAL (%)
(-14.940999999999999, 18.12] 6.963750 72.641250 10.123333
(18.12, 24.68] 21.660000 51.821429 8.541667
(24.68, 32.25] 27.932857 67.278571 12.888000
(32.25, 143.83] 55.957143 88.745714 9.692500

Pivot tables are another option to organise data with Pandas. They are very popular in spreadsheet software, like Excel. A pivot table is very similar to the cross-tabulations we use in statistics. Pandas has a specific function for pivot tables.

Let’s load a new dataset.

large_df = pd.read_csv('large_table.csv',delimiter=';')
large_df.drop('Type',inplace=True,axis=1)
large_df
NAME ISO COUNTRY CODE RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES - INDUSTRY GROUP ACCOUNTING STANDARDS FOLLOWED CSR Sustainability External Audit Audit Committee CSR Sustainability Committee
0 APPLE US 55.92 119.40 4.95 34 US standards (GAAP) Y Y Y
1 SAUDI ARABIAN OIL SA 32.25 16.96 NaN 97 IFRS NaN Y Y
2 MICROSOFT US 40.14 61.56 13.59 58 US standards (GAAP) NaN Y Y
3 AMAZON.COM US 21.95 83.31 12.29 87 US standards (GAAP) Y Y Y
4 FACEBOOK CLASS A US 19.96 0.47 21.00 151 US standards (GAAP) Y Y Y
... ... ... ... ... ... ... ... ... ... ...
100 CHINA MERCHANTS BANK 'A' CN 16.84 301.52 NaN 102 IFRS Y Y Y
101 ENEL IT 7.00 202.46 NaN 169 IFRS Y Y Y
102 JD COM ADR 1:2 CN 17.28 12.28 2.18 87 US standards (GAAP) NaN Y N
103 KEYENCE JP 11.84 0.00 2.76 214 Local standards NaN N N
104 HSBC HOLDINGS GB 3.71 328.78 NaN 102 IFRS Y Y Y

105 rows × 10 columns

The default pivot_table aggregation type is mean.

large_df.columns
Index(['NAME', 'ISO COUNTRY CODE', 'RETURN ON EQUITY - TOTAL (%)',
       'TOTAL DEBT % COMMON EQUITY', 'RESEARCH & DEVELOPMENT/SALES -',
       'INDUSTRY GROUP', 'ACCOUNTING STANDARDS FOLLOWED',
       'CSR Sustainability External Audit', 'Audit Committee',
       'CSR Sustainability Committee'],
      dtype='object')
large_df.pivot_table(values='RETURN ON EQUITY - TOTAL (%)',
                     index='CSR Sustainability External Audit',columns='ACCOUNTING STANDARDS FOLLOWED')
ACCOUNTING STANDARDS FOLLOWED IFRS Local standards US standards (GAAP)
CSR Sustainability External Audit
N NaN NaN 22.380000
Y 18.968333 23.315 46.280556

You can use a hierarchical index and add margins (the “All” row and column) to the table with margins=True.

large_df.pivot_table(values='RETURN ON EQUITY - TOTAL (%)',
                     index=['CSR Sustainability External Audit','Audit Committee'],
                     columns='ACCOUNTING STANDARDS FOLLOWED',margins=True)
ACCOUNTING STANDARDS FOLLOWED IFRS Local standards US standards (GAAP) All
CSR Sustainability External Audit Audit Committee
N Y NaN NaN 22.380000 22.380000
Y N 24.660000 NaN 10.450000 17.555000
Y 18.720870 23.315 47.304286 35.547667
All 18.968333 23.315 44.442051 34.386308

You can also use hierarchical columns.

large_df.pivot_table(values='RESEARCH & DEVELOPMENT/SALES -',
                     columns=['CSR Sustainability Committee','Audit Committee'],
                     index='ISO COUNTRY CODE',margins=True)
CSR Sustainability Committee N Y All
Audit Committee N Y N Y
ISO COUNTRY CODE
CA NaN 19.890000 NaN NaN 19.890000
CH NaN NaN NaN 12.720000 12.720000
CN 0.850 2.180000 0.280000 NaN 1.103333
DE NaN NaN NaN 10.315000 10.315000
DK NaN NaN NaN 11.850000 11.850000
FR NaN NaN NaN 4.902500 4.902500
GB NaN NaN NaN 18.015000 18.015000
HK NaN 15.420000 NaN 7.720000 11.570000
IN NaN NaN NaN 0.220000 0.220000
JP 2.760 NaN 2.505000 1.890000 2.415000
KR NaN NaN NaN 7.390000 7.390000
NL NaN NaN NaN 14.710000 14.710000
TW NaN NaN NaN 8.080000 8.080000
US NaN 13.755000 NaN 10.564242 11.055128
All 1.805 13.335556 1.763333 9.906346 9.758333

Cross-tabs are a special case of pivot tables, where the values are frequencies.

large_df.pivot_table(columns='CSR Sustainability Committee',index='ISO COUNTRY CODE',
                     aggfunc='count',margins=True)
ACCOUNTING STANDARDS FOLLOWED Audit Committee CSR Sustainability External Audit INDUSTRY GROUP ... NAME RESEARCH & DEVELOPMENT/SALES - RETURN ON EQUITY - TOTAL (%) TOTAL DEBT % COMMON EQUITY
CSR Sustainability Committee N Y All N Y All N Y All N ... All N Y All N Y All N Y All
ISO COUNTRY CODE
CA 1.0 NaN NaN 1.0 NaN NaN 0.0 NaN NaN 1.0 ... NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN
CH NaN 3.0 3.0 NaN 3.0 3.0 NaN 3.0 3.0 NaN ... 3.0 NaN 3.0 3.0 NaN 3.0 3.0 NaN 3.0 3.0
CN 3.0 6.0 NaN 3.0 6.0 NaN 1.0 4.0 NaN 3.0 ... NaN 2.0 1.0 NaN 3.0 6.0 NaN 3.0 6.0 NaN
DE NaN 2.0 2.0 NaN 2.0 2.0 NaN 2.0 2.0 NaN ... 2.0 NaN 2.0 2.0 NaN 2.0 2.0 NaN 2.0 2.0
DK NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN ... 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0
FR NaN 4.0 4.0 NaN 4.0 4.0 NaN 4.0 4.0 NaN ... 4.0 NaN 4.0 4.0 NaN 4.0 4.0 NaN 4.0 4.0
GB NaN 3.0 2.0 NaN 3.0 2.0 NaN 3.0 2.0 NaN ... 2.0 NaN 2.0 2.0 NaN 3.0 2.0 NaN 3.0 2.0
HK 1.0 3.0 NaN 1.0 3.0 NaN 0.0 2.0 NaN 1.0 ... NaN 1.0 1.0 NaN 1.0 3.0 NaN 1.0 3.0 NaN
IN NaN 2.0 2.0 NaN 2.0 2.0 NaN 2.0 2.0 NaN ... 2.0 NaN 2.0 2.0 NaN 2.0 2.0 NaN 2.0 2.0
IT NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN ... NaN NaN 0.0 NaN NaN 1.0 NaN NaN 1.0 NaN
JP 1.0 3.0 2.0 1.0 3.0 2.0 0.0 2.0 2.0 1.0 ... 2.0 1.0 3.0 2.0 1.0 3.0 2.0 1.0 3.0 2.0
KR NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN ... 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0
NL 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 ... 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
SA NaN 1.0 NaN NaN 1.0 NaN NaN 0.0 NaN NaN ... NaN NaN 0.0 NaN NaN 1.0 NaN NaN 1.0 NaN
TW NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN ... 1.0 NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0
US 11.0 55.0 23.0 11.0 55.0 23.0 4.0 34.0 23.0 11.0 ... 23.0 6.0 33.0 23.0 11.0 51.0 23.0 11.0 55.0 23.0
All 2.0 40.0 42.0 2.0 40.0 42.0 2.0 40.0 42.0 2.0 ... 42.0 2.0 40.0 42.0 2.0 40.0 42.0 2.0 40.0 42.0

17 rows × 24 columns

As you can see, it will return counts for all variables. Therefore it is often more convenient to use pd.crosstab().

pd.crosstab(large_df['CSR Sustainability External Audit'],large_df['ISO COUNTRY CODE'],margins=True)
ISO COUNTRY CODE CH CN DE DK FR GB HK IN IT JP KR NL TW US All
CSR Sustainability External Audit
N 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3
Y 3 5 2 1 4 3 2 2 1 2 1 1 1 35 63
All 3 5 2 1 4 3 2 2 1 2 1 1 1 38 66