4. Pandas Data preprocessing#
You might be wondering why we are talking so much about data processing and manipulation in this book. “Where are all the fancy ML methods?” you might ask. Unfortunately, most of the time in data science goes to data preprocessing. It is often reported that 80 % of a data scientist’s time goes to data preparation: cleaning, transforming, rearranging and creating suitable features (feature engineering). So, to be a successful data scientist, you need to know how to play with data. Luckily, we have Pandas at our disposal, which is one of the most powerful data manipulation tools available.
Let’s look at some of the most common preprocessing situations we encounter when doing data science.
4.1. Basic methods#
Missing data is probably the most common issue with data.
import pandas as pd
companies_df = pd.read_csv('emissions.csv',index_col='NAME',delimiter=';')
companies_df
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
APPLE | 2832025.00 | 3.13 | NaN |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | NaN |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
ALIBABA GROUP HOLDING ADR 1:8 | NaN | NaN | NaN |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
TESLA | 17315009.00 | NaN | NaN |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
NVIDIA | NaN | 7.10 | 25.0 |
SAMSUNG ELECTRONICS | NaN | 69.06 | 70.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
KWEICHOW MOUTAI 'A' | 40219.20 | NaN | NaN |
NETFLIX | 3665870.00 | NaN | NaN |
Dropna can be used to filter out missing data. The how parameter defines whether all or any of the values in a row/column must be NaN for the row/column to be dropped.
companies_df.dropna(how='all')
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
APPLE | 2832025.00 | 3.13 | NaN |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | NaN |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
TESLA | 17315009.00 | NaN | NaN |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
NVIDIA | NaN | 7.10 | 25.0 |
SAMSUNG ELECTRONICS | NaN | 69.06 | 70.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
KWEICHOW MOUTAI 'A' | 40219.20 | NaN | NaN |
NETFLIX | 3665870.00 | NaN | NaN |
companies_df.dropna(how='any')
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
You can again use the axis parameter. However, it is not very meaningful here, because every column has NaN values (how='any' returns an empty table) and no column is entirely NaN (how='all' returns the original dataframe).
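For illustration, the column-wise version would look like this (a minimal sketch; with this data, how='any' drops every column and how='all' drops none):

companies_df.dropna(axis='columns', how='any')  # drops every column, since each contains at least one NaN
companies_df.dropna(axis='columns', how='all')  # drops nothing, since no column is entirely NaN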
With thresh you can define the minimum number of non-NaN values a row (or column) must have to be kept.
companies_df.dropna(thresh=2)
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
APPLE | 2832025.00 | 3.13 | NaN |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | NaN |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
NVIDIA | NaN | 7.10 | 25.0 |
SAMSUNG ELECTRONICS | NaN | 69.06 | 70.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
You can use fillna to fill NaN values with other values.
companies_df.fillna(0)
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
APPLE | 2832025.00 | 3.13 | 0.0 |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | 0.0 |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 0.00 | 0.0 |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | 0.0 |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
TESLA | 17315009.00 | 0.00 | 0.0 |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
NVIDIA | 0.00 | 7.10 | 25.0 |
SAMSUNG ELECTRONICS | 0.00 | 69.06 | 70.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
KWEICHOW MOUTAI 'A' | 40219.20 | 0.00 | 0.0 |
NETFLIX | 3665870.00 | 0.00 | 0.0 |
Using a dict, you can define different fill values for different columns. A very common choice is to use the mean value of each column.
companies_df.fillna({'Board member compensation':1000000, 'Total CO2 To Revenues':10,
'Emission Reduction Target %':0})
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
APPLE | 2832025.00 | 3.13 | 0.0 |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | 0.0 |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
ALIBABA GROUP HOLDING ADR 1:8 | 1000000.00 | 10.00 | 0.0 |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | 0.0 |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
TESLA | 17315009.00 | 10.00 | 0.0 |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
NVIDIA | 1000000.00 | 7.10 | 25.0 |
SAMSUNG ELECTRONICS | 1000000.00 | 69.06 | 70.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
KWEICHOW MOUTAI 'A' | 40219.20 | 10.00 | 0.0 |
NETFLIX | 3665870.00 | 10.00 | 0.0 |
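As an example of the mean-value approach mentioned above, you can pass the column means directly to fillna (a minimal sketch):

companies_df.fillna(companies_df.mean())  # fill each column's NaN values with that column's mean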
Interpolation methods that were available for reindexing can be used with fillna.
companies_df.fillna(method = 'bfill')
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
APPLE | 2832025.00 | 3.13 | 75.0 |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | 75.0 |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
ALIBABA GROUP HOLDING ADR 1:8 | 2291182.00 | 25.69 | 75.0 |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | 50.0 |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
TESLA | 17315009.00 | 245.17 | 18.0 |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
NVIDIA | 4460719.00 | 7.10 | 25.0 |
SAMSUNG ELECTRONICS | 4460719.00 | 69.06 | 70.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
KWEICHOW MOUTAI 'A' | 40219.20 | NaN | NaN |
NETFLIX | 3665870.00 | NaN | NaN |
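Forward filling works the same way (a minimal sketch; recent Pandas versions also offer ffill() and bfill() as direct methods):

companies_df.fillna(method='ffill')  # fill each NaN with the previous non-missing value in the column
companies_df.ffill()                 # equivalent shortcut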
duplicated() can be used to detect duplicated rows. It returns a boolean Series that marks every row which is an exact duplicate of an earlier row.
More about append() later. Here we just build a new dataframe that contains companies_df twice and then sort the index so that each company’s two identical rows are next to each other.
new_df = companies_df.append(companies_df)
new_df.sort_index(inplace=True)
new_df
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
ALIBABA GROUP HOLDING ADR 1:8 | NaN | NaN | NaN |
ALIBABA GROUP HOLDING ADR 1:8 | NaN | NaN | NaN |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
APPLE | 2832025.00 | 3.13 | NaN |
APPLE | 2832025.00 | 3.13 | NaN |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
KWEICHOW MOUTAI 'A' | 40219.20 | NaN | NaN |
KWEICHOW MOUTAI 'A' | 40219.20 | NaN | NaN |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
NETFLIX | 3665870.00 | NaN | NaN |
NETFLIX | 3665870.00 | NaN | NaN |
NVIDIA | NaN | 7.10 | 25.0 |
NVIDIA | NaN | 7.10 | 25.0 |
SAMSUNG ELECTRONICS | NaN | 69.06 | 70.0 |
SAMSUNG ELECTRONICS | NaN | 69.06 | 70.0 |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | NaN |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | NaN |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN |
TESLA | 17315009.00 | NaN | NaN |
TESLA | 17315009.00 | NaN | NaN |
new_df.duplicated()
NAME
ALIBABA GROUP HOLDING ADR 1:8 False
ALIBABA GROUP HOLDING ADR 1:8 True
ALPHABET A False
ALPHABET A True
AMAZON.COM False
AMAZON.COM True
APPLE False
APPLE True
FACEBOOK CLASS A False
FACEBOOK CLASS A True
JP MORGAN CHASE & CO. False
JP MORGAN CHASE & CO. True
KWEICHOW MOUTAI 'A' False
KWEICHOW MOUTAI 'A' True
MICROSOFT False
MICROSOFT True
NESTLE 'N' False
NESTLE 'N' True
NETFLIX False
NETFLIX True
NVIDIA False
NVIDIA True
SAMSUNG ELECTRONICS False
SAMSUNG ELECTRONICS True
SAUDI ARABIAN OIL False
SAUDI ARABIAN OIL True
TAIWAN SEMICON.MNFG. False
TAIWAN SEMICON.MNFG. True
TENCENT HOLDINGS False
TENCENT HOLDINGS True
TESLA False
TESLA True
dtype: bool
We can remove duplicated rows using drop_duplicates().
new_df.drop_duplicates()
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
ALIBABA GROUP HOLDING ADR 1:8 | NaN | NaN | NaN |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
APPLE | 2832025.00 | 3.13 | NaN |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
KWEICHOW MOUTAI 'A' | 40219.20 | NaN | NaN |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
NETFLIX | 3665870.00 | NaN | NaN |
NVIDIA | NaN | 7.10 | 25.0 |
SAMSUNG ELECTRONICS | NaN | 69.06 | 70.0 |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | NaN |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN |
TESLA | 17315009.00 | NaN | NaN |
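drop_duplicates() also has parameters for finer control; for example (a minimal sketch, parameter values chosen just for illustration):

new_df.drop_duplicates(keep='last')  # keep the last occurrence of each duplicate instead of the first
new_df.drop_duplicates(subset=['Emission Reduction Target %'])  # compare only the listed columns when detecting duplicates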
It is easy to apply a function transformation to rows, columns or individual cells of a dataframe. The map method can be used for that.
companies_df['Compensation ($ millions)'] = companies_df['Board member compensation'].map(
lambda x: "{:.2f}".format(x/1000000)+" M$")
companies_df
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | Compensation ($ millions) | |
---|---|---|---|---|
NAME | ||||
APPLE | 2832025.00 | 3.13 | NaN | 2.83 M$ |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | NaN | 3.22 M$ |
MICROSOFT | 4336000.00 | 29.16 | 75.0 | 4.34 M$ |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 | 6.57 M$ |
ALIBABA GROUP HOLDING ADR 1:8 | NaN | NaN | NaN | nan M$ |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 | 2.29 M$ |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN | 2.68 M$ |
ALPHABET A | 50294471.00 | 32.22 | 50.0 | 50.29 M$ |
TESLA | 17315009.00 | NaN | NaN | 17.32 M$ |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 | 12.58 M$ |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 | 12.91 M$ |
NVIDIA | NaN | 7.10 | 25.0 | nan M$ |
SAMSUNG ELECTRONICS | NaN | 69.06 | 70.0 | nan M$ |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 | 4.46 M$ |
KWEICHOW MOUTAI 'A' | 40219.20 | NaN | NaN | 0.04 M$ |
NETFLIX | 3665870.00 | NaN | NaN | 3.67 M$ |
Map() can also be used to transform the index. The following command reverses the company names.
companies_df.index.map(lambda x : x[::-1])
Index(['ELPPA', 'LIO NAIBARA IDUAS', 'TFOSORCIM', 'MOC.NOZAMA',
'8:1 RDA GNIDLOH PUORG ABABILA', 'A SSALC KOOBECAF', 'SGNIDLOH TNECNET',
'A TEBAHPLA', 'ALSET', '.GFNM.NOCIMES NAWIAT', ''N' ELTSEN', 'AIDIVN',
'SCINORTCELE GNUSMAS', '.OC & ESAHC NAGROM PJ', ''A' IATUOM WOHCIEWK',
'XILFTEN'],
dtype='object', name='NAME')
Of course, you can also use rename(). Using a dictionary, rename() can also be used to change only some of the labels.
companies_df.rename(index=str.title)
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | Compensation ($ millions) | |
---|---|---|---|---|
NAME | ||||
Apple | 2832025.00 | 3.13 | NaN | 2.83 M$ |
Saudi Arabian Oil | 3220550.15 | 175.59 | NaN | 3.22 M$ |
Microsoft | 4336000.00 | 29.16 | 75.0 | 4.34 M$ |
Amazon.Com | 6567669.00 | 40.14 | 50.0 | 6.57 M$ |
Alibaba Group Holding Adr 1:8 | NaN | NaN | NaN | nan M$ |
Facebook Class A | 2291182.00 | 25.69 | 75.0 | 2.29 M$ |
Tencent Holdings | 2681758.80 | 15.81 | NaN | 2.68 M$ |
Alphabet A | 50294471.00 | 32.22 | 50.0 | 50.29 M$ |
Tesla | 17315009.00 | NaN | NaN | 17.32 M$ |
Taiwan Semicon.Mnfg. | 12584165.55 | 245.17 | 18.0 | 12.58 M$ |
Nestle 'N' | 12906002.11 | 58.52 | 35.0 | 12.91 M$ |
Nvidia | NaN | 7.10 | 25.0 | nan M$ |
Samsung Electronics | NaN | 69.06 | 70.0 | nan M$ |
Jp Morgan Chase & Co. | 4460719.00 | 6.69 | 50.0 | 4.46 M$ |
Kweichow Moutai 'A' | 40219.20 | NaN | NaN | 0.04 M$ |
Netflix | 3665870.00 | NaN | NaN | 3.67 M$ |
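A dict-based rename that changes only some labels could look like this (a minimal sketch; the new labels are made up for illustration):

companies_df.rename(index={'FACEBOOK CLASS A': 'META PLATFORMS', 'ALPHABET A': 'GOOGLE'})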
Replace() can be used to replace any values, not just NaN values. You can also pass a dict or a list if you want to replace multiple values.
import numpy as np
companies_df.replace(np.nan,-999) # Pandas NaN is np.nan
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | Compensation ($ millions) | |
---|---|---|---|---|
NAME | ||||
APPLE | 2832025.00 | 3.13 | -999.0 | 2.83 M$ |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | -999.0 | 3.22 M$ |
MICROSOFT | 4336000.00 | 29.16 | 75.0 | 4.34 M$ |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 | 6.57 M$ |
ALIBABA GROUP HOLDING ADR 1:8 | -999.00 | -999.00 | -999.0 | nan M$ |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 | 2.29 M$ |
TENCENT HOLDINGS | 2681758.80 | 15.81 | -999.0 | 2.68 M$ |
ALPHABET A | 50294471.00 | 32.22 | 50.0 | 50.29 M$ |
TESLA | 17315009.00 | -999.00 | -999.0 | 17.32 M$ |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 | 12.58 M$ |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 | 12.91 M$ |
NVIDIA | -999.00 | 7.10 | 25.0 | nan M$ |
SAMSUNG ELECTRONICS | -999.00 | 69.06 | 70.0 | nan M$ |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 | 4.46 M$ |
KWEICHOW MOUTAI 'A' | 40219.20 | -999.00 | -999.0 | 0.04 M$ |
NETFLIX | 3665870.00 | -999.00 | -999.0 | 3.67 M$ |
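Passing a list or a dict works like this (a minimal sketch; the replacement values are arbitrary):

companies_df.replace([np.nan, 50.0], -999)        # replace several values with the same value
companies_df.replace({np.nan: -999, 50.0: 0.5})   # a different replacement for each value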
Dividing data into bins is a very important feature in Pandas. You can use pd.cut() (notice that it is not a dataframe method) to create a categorical object. pd.value_counts() can be used to calculate the number of observations in each bin. With the labels parameter you can define names for the different bins.
compensation_bins = [10000,100000,1000000,10000000,100000000]
comp_categ = pd.cut(companies_df['Board member compensation'],compensation_bins)
comp_categ
NAME
APPLE (1000000.0, 10000000.0]
SAUDI ARABIAN OIL (1000000.0, 10000000.0]
MICROSOFT (1000000.0, 10000000.0]
AMAZON.COM (1000000.0, 10000000.0]
ALIBABA GROUP HOLDING ADR 1:8 NaN
FACEBOOK CLASS A (1000000.0, 10000000.0]
TENCENT HOLDINGS (1000000.0, 10000000.0]
ALPHABET A (10000000.0, 100000000.0]
TESLA (10000000.0, 100000000.0]
TAIWAN SEMICON.MNFG. (10000000.0, 100000000.0]
NESTLE 'N' (10000000.0, 100000000.0]
NVIDIA NaN
SAMSUNG ELECTRONICS NaN
JP MORGAN CHASE & CO. (1000000.0, 10000000.0]
KWEICHOW MOUTAI 'A' (10000.0, 100000.0]
NETFLIX (1000000.0, 10000000.0]
Name: Board member compensation, dtype: category
Categories (4, interval[int64]): [(10000, 100000] < (100000, 1000000] < (1000000, 10000000] < (10000000, 100000000]]
comp_categ.values.categories
IntervalIndex([(10000, 100000], (100000, 1000000], (1000000, 10000000], (10000000, 100000000]],
closed='right',
dtype='interval[int64]')
comp_categ.values.codes
array([ 2, 2, 2, 2, -1, 2, 2, 3, 3, 3, 3, -1, -1, 2, 0, 2],
dtype=int8)
pd.value_counts(comp_categ)
(1000000, 10000000] 8
(10000000, 100000000] 4
(10000, 100000] 1
(100000, 1000000] 0
Name: Board member compensation, dtype: int64
pd.cut(companies_df['Board member compensation'],
compensation_bins,labels = ['Poor board members','Rich board members',
'Very rich board members','Insanely rich board members'])
NAME
APPLE Very rich board members
SAUDI ARABIAN OIL Very rich board members
MICROSOFT Very rich board members
AMAZON.COM Very rich board members
ALIBABA GROUP HOLDING ADR 1:8 NaN
FACEBOOK CLASS A Very rich board members
TENCENT HOLDINGS Very rich board members
ALPHABET A Insanely rich board members
TESLA Insanely rich board members
TAIWAN SEMICON.MNFG. Insanely rich board members
NESTLE 'N' Insanely rich board members
NVIDIA NaN
SAMSUNG ELECTRONICS NaN
JP MORGAN CHASE & CO. Very rich board members
KWEICHOW MOUTAI 'A' Poor board members
NETFLIX Very rich board members
Name: Board member compensation, dtype: category
Categories (4, object): ['Poor board members' < 'Rich board members' < 'Very rich board members' < 'Insanely rich board members']
If you pass a number to the bins parameter, it will return that many equal-length bins.
pd.cut(companies_df['Total CO2 To Revenues'],4,precision=1)
NAME
APPLE (2.9, 63.6]
SAUDI ARABIAN OIL (124.2, 184.7]
MICROSOFT (2.9, 63.6]
AMAZON.COM (2.9, 63.6]
ALIBABA GROUP HOLDING ADR 1:8 NaN
FACEBOOK CLASS A (2.9, 63.6]
TENCENT HOLDINGS (2.9, 63.6]
ALPHABET A (2.9, 63.6]
TESLA NaN
TAIWAN SEMICON.MNFG. (184.7, 245.2]
NESTLE 'N' (2.9, 63.6]
NVIDIA (2.9, 63.6]
SAMSUNG ELECTRONICS (63.6, 124.2]
JP MORGAN CHASE & CO. (2.9, 63.6]
KWEICHOW MOUTAI 'A' NaN
NETFLIX NaN
Name: Total CO2 To Revenues, dtype: category
Categories (4, interval[float64]): [(2.9, 63.6] < (63.6, 124.2] < (124.2, 184.7] < (184.7, 245.2]]
For many purposes, qcut() is more useful as it bins the data based on sample quantiles. Therefore, every bin has approximately the same number of observations. You can also pass specific quantiles as a list to the function.
pd.qcut(companies_df['Total CO2 To Revenues'],4)
NAME
APPLE (3.129, 13.632]
SAUDI ARABIAN OIL (61.155, 245.17]
MICROSOFT (13.632, 30.69]
AMAZON.COM (30.69, 61.155]
ALIBABA GROUP HOLDING ADR 1:8 NaN
FACEBOOK CLASS A (13.632, 30.69]
TENCENT HOLDINGS (13.632, 30.69]
ALPHABET A (30.69, 61.155]
TESLA NaN
TAIWAN SEMICON.MNFG. (61.155, 245.17]
NESTLE 'N' (30.69, 61.155]
NVIDIA (3.129, 13.632]
SAMSUNG ELECTRONICS (61.155, 245.17]
JP MORGAN CHASE & CO. (3.129, 13.632]
KWEICHOW MOUTAI 'A' NaN
NETFLIX NaN
Name: Total CO2 To Revenues, dtype: category
Categories (4, interval[float64]): [(3.129, 13.632] < (13.632, 30.69] < (30.69, 61.155] < (61.155, 245.17]]
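Passing specific quantiles as a list, as mentioned above, could look like this (a minimal sketch):

pd.qcut(companies_df['Total CO2 To Revenues'], [0, 0.1, 0.5, 0.9, 1.0])  # uneven quantile-based bins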
Detecting and filtering outliers is easy with boolean dataframes.
companies_df[companies_df['Board member compensation'] < 10000000]
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | Compensation ($ millions) | |
---|---|---|---|---|
NAME | ||||
APPLE | 2832025.00 | 3.13 | NaN | 2.83 M$ |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | NaN | 3.22 M$ |
MICROSOFT | 4336000.00 | 29.16 | 75.0 | 4.34 M$ |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 | 6.57 M$ |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 | 2.29 M$ |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN | 2.68 M$ |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 | 4.46 M$ |
KWEICHOW MOUTAI 'A' | 40219.20 | NaN | NaN | 0.04 M$ |
NETFLIX | 3665870.00 | NaN | NaN | 3.67 M$ |
Filtering is also possible with the summary statistics of the variables. The following command picks the cases where the values of all variables deviate from the mean by less than two standard deviations.
companies_df.drop('Compensation ($ millions)',axis=1,inplace=True)
companies_df[(np.abs(companies_df-companies_df.mean()) < 2*companies_df.std(axis=0)).all(1)]
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
Winsorising is another common procedure in practical econometrics. In that method, the most extreme values are moved to specific quantiles, usually the 1 % and 99 % quantiles. It is easiest to implement with the clip() method. Notice how the following command winsorises all the variables, here at the 10 % and 90 % quantiles.
companies_df.clip(lower = companies_df.quantile(0.1), upper = companies_df.quantile(0.9),axis=1)
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
APPLE | 2.832025e+06 | 6.731 | NaN |
SAUDI ARABIAN OIL | 3.220550e+06 | 164.937 | NaN |
MICROSOFT | 4.336000e+06 | 29.160 | 75.0 |
AMAZON.COM | 6.567669e+06 | 40.140 | 50.0 |
ALIBABA GROUP HOLDING ADR 1:8 | NaN | NaN | NaN |
FACEBOOK CLASS A | 2.369297e+06 | 25.690 | 75.0 |
TENCENT HOLDINGS | 2.681759e+06 | 15.810 | NaN |
ALPHABET A | 1.643321e+07 | 32.220 | 50.0 |
TESLA | 1.643321e+07 | NaN | NaN |
TAIWAN SEMICON.MNFG. | 1.258417e+07 | 164.937 | 23.6 |
NESTLE 'N' | 1.290600e+07 | 58.520 | 35.0 |
NVIDIA | NaN | 7.100 | 25.0 |
SAMSUNG ELECTRONICS | NaN | 69.060 | 70.0 |
JP MORGAN CHASE & CO. | 4.460719e+06 | 6.731 | 50.0 |
KWEICHOW MOUTAI 'A' | 2.369297e+06 | NaN | NaN |
NETFLIX | 3.665870e+06 | NaN | NaN |
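The 1 %/99 % winsorisation mentioned above follows the same pattern (a minimal sketch; with only 16 companies the effect is naturally small):

companies_df.clip(lower=companies_df.quantile(0.01), upper=companies_df.quantile(0.99), axis=1)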
Random sampling from data is easy. You can use the sample() method for that. It is also possible to sample with replacement, which is needed in many numerical statistical methods.
companies_df.sample(n=10,replace=True)
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | NaN |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN |
NVIDIA | NaN | 7.10 | 25.0 |
NETFLIX | 3665870.00 | NaN | NaN |
NETFLIX | 3665870.00 | NaN | NaN |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN |
If you want to randomise the order of the rows in a dataframe, you can use Numpy’s random.permutation().
permut_index = np.random.permutation(len(companies_df))
companies_df.iloc[permut_index]
Board member compensation | Total CO2 To Revenues | Emission Reduction Target % | |
---|---|---|---|
NAME | |||
ALIBABA GROUP HOLDING ADR 1:8 | NaN | NaN | NaN |
SAMSUNG ELECTRONICS | NaN | 69.06 | 70.0 |
TENCENT HOLDINGS | 2681758.80 | 15.81 | NaN |
ALPHABET A | 50294471.00 | 32.22 | 50.0 |
FACEBOOK CLASS A | 2291182.00 | 25.69 | 75.0 |
JP MORGAN CHASE & CO. | 4460719.00 | 6.69 | 50.0 |
SAUDI ARABIAN OIL | 3220550.15 | 175.59 | NaN |
MICROSOFT | 4336000.00 | 29.16 | 75.0 |
TAIWAN SEMICON.MNFG. | 12584165.55 | 245.17 | 18.0 |
APPLE | 2832025.00 | 3.13 | NaN |
NETFLIX | 3665870.00 | NaN | NaN |
NESTLE 'N' | 12906002.11 | 58.52 | 35.0 |
KWEICHOW MOUTAI 'A' | 40219.20 | NaN | NaN |
NVIDIA | NaN | 7.10 | 25.0 |
AMAZON.COM | 6567669.00 | 40.14 | 50.0 |
TESLA | 17315009.00 | NaN | NaN |
Very often in econometrics, you need to transform your categorical variables into a collection of dummy variables. It is easily done in Pandas using the get_dummies() function.
pd.get_dummies(companies_df['Emission Reduction Target %'],prefix='Emiss_')
Emiss__18.0 | Emiss__25.0 | Emiss__35.0 | Emiss__50.0 | Emiss__70.0 | Emiss__75.0 | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | 0 | 0 | 0 | 0 | 0 | 0 |
SAUDI ARABIAN OIL | 0 | 0 | 0 | 0 | 0 | 0 |
MICROSOFT | 0 | 0 | 0 | 0 | 0 | 1 |
AMAZON.COM | 0 | 0 | 0 | 1 | 0 | 0 |
ALIBABA GROUP HOLDING ADR 1:8 | 0 | 0 | 0 | 0 | 0 | 0 |
FACEBOOK CLASS A | 0 | 0 | 0 | 0 | 0 | 1 |
TENCENT HOLDINGS | 0 | 0 | 0 | 0 | 0 | 0 |
ALPHABET A | 0 | 0 | 0 | 1 | 0 | 0 |
TESLA | 0 | 0 | 0 | 0 | 0 | 0 |
TAIWAN SEMICON.MNFG. | 1 | 0 | 0 | 0 | 0 | 0 |
NESTLE 'N' | 0 | 0 | 1 | 0 | 0 | 0 |
NVIDIA | 0 | 1 | 0 | 0 | 0 | 0 |
SAMSUNG ELECTRONICS | 0 | 0 | 0 | 0 | 1 | 0 |
JP MORGAN CHASE & CO. | 0 | 0 | 0 | 1 | 0 | 0 |
KWEICHOW MOUTAI 'A' | 0 | 0 | 0 | 0 | 0 | 0 |
NETFLIX | 0 | 0 | 0 | 0 | 0 | 0 |
Pandas has efficient methods to manipulate strings in a dataframe. Because of this, Pandas is very popular among researchers who need to use string data in their analysis. This also makes it a very important tool for accounting data analysis. All the built-in methods of Python string objects can be used to manipulate strings in a dataframe. We already discussed string methods in Chapter 1.
Regular expressions can also be used to manipulate string dataframes. Regular expressions are a very broad topic, and it takes time to master them. Let’s look at some very simple examples. If we want to split a sentence into a list of words, repeated whitespaces make the process difficult with the standard string methods. However, it is very easy with regular expressions.
import re
The regex for multiple whitespaces is \s+.
splitter = re.compile(r'\s+')
splitter.split("This is a sample text.")
['This', 'is', 'a', 'sample', 'text.']
Regular expressions are a powerful but complex tool. You can search for email addresses in a text with the following regex pattern.
reg_pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
email_search = re.compile(reg_pattern, flags = re.IGNORECASE)
example = """The general format of an email address is local-part@domain,
and a specific example is jsmith@example.com. An address consists of two parts.
The part before the @ symbol (local part) identifies the name of a mailbox.
This is often the username of the recipient, e.g., jsmith.
The part after the @ symbol (domain) is a domain name that represents
the administrative realm for the mail box, e.g., a company's domain name, example.com."""
email_search.findall(example)
['jsmith@example.com']
We do not go into regular expressions more at this point, but we will see some applications of them in the following chapters. If you want to learn more about regular expressions, here is a good website: www.regular-expressions.info/index.html
If your string data contains missing values, the standard string methods will usually not work. Then you need to use the dataframe string methods (the str accessor).
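For example, the str accessor of a Series tolerates missing values, whereas applying the plain Python method would fail (a minimal sketch with a made-up Series):

text_series = pd.Series(['apple', np.nan, 'nvidia'])
text_series.str.upper()        # works; the missing value stays NaN
# text_series.map(str.upper)   # would raise an error because of the NaN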
You can slice strings normally.
companies_df.index.str[0:5]
Index(['APPLE', 'SAUDI', 'MICRO', 'AMAZO', 'ALIBA', 'FACEB', 'TENCE', 'ALPHA',
'TESLA', 'TAIWA', 'NESTL', 'NVIDI', 'SAMSU', 'JP MO', 'KWEIC', 'NETFL'],
dtype='object', name='NAME')
And you can use regular expressions, for example, to search strings.
companies_df.index.str.findall('am', flags = re.IGNORECASE)
Index([[], [], [], ['AM'], [], [], [], [], [], [], [], [], ['AM'], [], [], []], dtype='object', name='NAME')
Here is a good introduction to Pandas string methods: pandas.pydata.org/pandas-docs/stable/user_guide/text.html. If you scroll down, there is a full list of methods.
4.2. Advanced methods#
Let’s look at some more advanced Pandas methods next. We need a new dataset for that.
electricity_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/AER/USAirlines.csv',index_col=0)
electricity_df['firm'].replace({1:'A',2:'B',3:'C',4:'D',5:'E',6:'F'},inplace=True)
The data contains yearly cost function variables (output, cost, price, load) for six firms.
electricity_df
firm | year | output | cost | price | load | |
---|---|---|---|---|---|---|
1 | A | 1970 | 0.952757 | 1140640 | 106650 | 0.534487 |
2 | A | 1971 | 0.986757 | 1215690 | 110307 | 0.532328 |
3 | A | 1972 | 1.091980 | 1309570 | 110574 | 0.547736 |
4 | A | 1973 | 1.175780 | 1511530 | 121974 | 0.540846 |
5 | A | 1974 | 1.160170 | 1676730 | 196606 | 0.591167 |
... | ... | ... | ... | ... | ... | ... |
86 | F | 1980 | 0.112640 | 381478 | 874818 | 0.517766 |
87 | F | 1981 | 0.154154 | 506969 | 1013170 | 0.580049 |
88 | F | 1982 | 0.186461 | 633388 | 930477 | 0.556024 |
89 | F | 1983 | 0.246847 | 804388 | 851676 | 0.537791 |
90 | F | 1984 | 0.304013 | 1009500 | 819476 | 0.525775 |
90 rows × 6 columns
There are many tools to deal with data that has more than two dimensions. The basic idea is to use hierarchical indexing in a two-dimensional dataframe. For example, we can index the data by firm-years using set_index().
electricity_df.set_index(['firm','year'],inplace=True)
electricity_df
output | cost | price | load | ||
---|---|---|---|---|---|
firm | year | ||||
A | 1970 | 0.952757 | 1140640 | 106650 | 0.534487 |
1971 | 0.986757 | 1215690 | 110307 | 0.532328 | |
1972 | 1.091980 | 1309570 | 110574 | 0.547736 | |
1973 | 1.175780 | 1511530 | 121974 | 0.540846 | |
1974 | 1.160170 | 1676730 | 196606 | 0.591167 | |
... | ... | ... | ... | ... | ... |
F | 1980 | 0.112640 | 381478 | 874818 | 0.517766 |
1981 | 0.154154 | 506969 | 1013170 | 0.580049 | |
1982 | 0.186461 | 633388 | 930477 | 0.556024 | |
1983 | 0.246847 | 804388 | 851676 | 0.537791 | |
1984 | 0.304013 | 1009500 | 819476 | 0.525775 |
90 rows × 4 columns
Now the values are sorted so that all the years of a certain company are in adjacent rows. With sort_index you can order the values by year instead.
electricity_df.sort_index(level=1)
output | cost | price | load | ||
---|---|---|---|---|---|
firm | year | ||||
A | 1970 | 0.952757 | 1140640 | 106650 | 0.534487 |
B | 1970 | 0.520635 | 569292 | 103795 | 0.490851 |
C | 1970 | 0.262424 | 286298 | 118788 | 0.524334 |
D | 1970 | 0.086393 | 145167 | 114987 | 0.432066 |
E | 1970 | 0.051028 | 91361 | 118222 | 0.442875 |
... | ... | ... | ... | ... | ... |
B | 1984 | 1.389740 | 4209390 | 821361 | 0.528775 |
C | 1984 | 0.493317 | 1170470 | 844079 | 0.577078 |
D | 1984 | 0.421411 | 1436970 | 831374 | 0.585525 |
E | 1984 | 0.213279 | 610257 | 844622 | 0.635545 |
F | 1984 | 0.304013 | 1009500 | 819476 | 0.525775 |
90 rows × 4 columns
Two-level indexing enables an easy way to pick subgroups from the data. For a series, you can just use the standard indexing style of Python.
electricity_df['output']['A']
year
1970 0.952757
1971 0.986757
1972 1.091980
1973 1.175780
1974 1.160170
1975 1.173760
1976 1.290510
1977 1.390670
1978 1.612730
1979 1.825440
1980 1.546040
1981 1.527900
1982 1.660200
1983 1.822310
1984 1.936460
Name: output, dtype: float64
electricity_df['output'][:,1975]
firm
A 1.173760
B 0.852892
C 0.367517
D 0.164272
E 0.073961
F 0.052462
Name: output, dtype: float64
With multi-index dataframes, you have to be a little bit more careful, because the subgroups are dataframes themselves. For example, you need to use loc to pick up the subgroups.
electricity_df.loc['A']
output | cost | price | load | |
---|---|---|---|---|
year | ||||
1970 | 0.952757 | 1140640 | 106650 | 0.534487 |
1971 | 0.986757 | 1215690 | 110307 | 0.532328 |
1972 | 1.091980 | 1309570 | 110574 | 0.547736 |
1973 | 1.175780 | 1511530 | 121974 | 0.540846 |
1974 | 1.160170 | 1676730 | 196606 | 0.591167 |
1975 | 1.173760 | 1823740 | 265609 | 0.575417 |
1976 | 1.290510 | 2022890 | 263451 | 0.594495 |
1977 | 1.390670 | 2314760 | 316411 | 0.597409 |
1978 | 1.612730 | 2639160 | 384110 | 0.638522 |
1979 | 1.825440 | 3247620 | 569251 | 0.676287 |
1980 | 1.546040 | 3787750 | 871636 | 0.605735 |
1981 | 1.527900 | 3867750 | 997239 | 0.614360 |
1982 | 1.660200 | 3996020 | 938002 | 0.633366 |
1983 | 1.822310 | 4282880 | 859572 | 0.650117 |
1984 | 1.936460 | 4748320 | 823411 | 0.625603 |
electricity_df.loc['B'].loc[1970]
output 0.520635
cost 569292.000000
price 103795.000000
load 0.490851
Name: 1970, dtype: float64
If you want to pick the values of a certain year, you can change the order of the index levels using swaplevel.
electricity_df.swaplevel('firm','year').loc[1975]
output | cost | price | load | |
---|---|---|---|---|
firm | ||||
A | 1.173760 | 1823740 | 265609 | 0.575417 |
B | 0.852892 | 1358100 | 281704 | 0.558133 |
C | 0.367517 | 510412 | 278721 | 0.607270 |
D | 0.164272 | 373941 | 263148 | 0.532723 |
E | 0.073961 | 156228 | 277930 | 0.556181 |
F | 0.052462 | 133161 | 307923 | 0.495361 |
You can easily calculate descriptive statistics at multiple levels. Most of the stat functions in Pandas include a level parameter for that.
electricity_df.sum()
output 4.904952e+01
cost 1.010271e+08
price 4.245147e+07
load 5.044141e+01
dtype: float64
electricity_df.sum(level=0)
output | cost | price | load | |
---|---|---|---|---|
firm | ||||
A | 21.153464 | 39585050 | 6934803 | 8.957875 |
B | 15.250419 | 31918255 | 6994301 | 8.206419 |
C | 6.185664 | 10847838 | 7145233 | 8.768038 |
D | 3.188515 | 9571326 | 7100522 | 8.215159 |
E | 1.707548 | 4399139 | 7171182 | 8.497289 |
F | 1.563908 | 4705537 | 7105430 | 7.796634 |
electricity_df.sum(level=1)
output | cost | price | load | |
---|---|---|---|---|
year | ||||
1970 | 1.910919 | 2301736 | 679554 | 2.873152 |
1971 | 1.976987 | 2506118 | 701726 | 2.920993 |
1972 | 2.295394 | 2858803 | 705968 | 3.141480 |
1973 | 2.579854 | 3408925 | 747634 | 3.146692 |
1974 | 2.647980 | 3941906 | 1280945 | 3.381160 |
1975 | 2.684864 | 4355582 | 1675035 | 3.325085 |
1976 | 2.936119 | 4848733 | 1832425 | 3.364455 |
1977 | 3.173497 | 5548301 | 2106566 | 3.402352 |
1978 | 3.731864 | 6454748 | 2310525 | 3.707459 |
1979 | 4.190756 | 8020356 | 3347542 | 3.740366 |
1980 | 3.853406 | 9661270 | 5148943 | 3.481546 |
1981 | 3.800075 | 10634533 | 6046129 | 3.513746 |
1982 | 4.016458 | 11167822 | 5675129 | 3.481910 |
1983 | 4.493125 | 12133405 | 5209027 | 3.482717 |
1984 | 4.758220 | 13184907 | 4984323 | 3.478301 |
If we want to remove the multi-index, we can use reset_index(). With the level parameter, you can decide which levels of the multi-index are removed.
electricity_df.reset_index(level=1)
year | output | cost | price | load | |
---|---|---|---|---|---|
firm | |||||
A | 1970 | 0.952757 | 1140640 | 106650 | 0.534487 |
A | 1971 | 0.986757 | 1215690 | 110307 | 0.532328 |
A | 1972 | 1.091980 | 1309570 | 110574 | 0.547736 |
A | 1973 | 1.175780 | 1511530 | 121974 | 0.540846 |
A | 1974 | 1.160170 | 1676730 | 196606 | 0.591167 |
... | ... | ... | ... | ... | ... |
F | 1980 | 0.112640 | 381478 | 874818 | 0.517766 |
F | 1981 | 0.154154 | 506969 | 1013170 | 0.580049 |
F | 1982 | 0.186461 | 633388 | 930477 | 0.556024 |
F | 1983 | 0.246847 | 804388 | 851676 | 0.537791 |
F | 1984 | 0.304013 | 1009500 | 819476 | 0.525775 |
90 rows × 5 columns
electricity_df.reset_index(inplace=True)
Summary statistics can also be calculated using the groupby method.
electricity_df.groupby('firm').mean()
output | cost | price | load | |
---|---|---|---|---|
firm | ||||
A | 1.410231 | 2.639003e+06 | 462320.200000 | 0.597192 |
B | 1.016695 | 2.127884e+06 | 466286.733333 | 0.547095 |
C | 0.412378 | 7.231892e+05 | 476348.866667 | 0.584536 |
D | 0.212568 | 6.380884e+05 | 473368.133333 | 0.547677 |
E | 0.113837 | 2.932759e+05 | 478078.800000 | 0.566486 |
F | 0.104261 | 3.137025e+05 | 473695.333333 | 0.519776 |
electricity_df.groupby(['firm','year']).mean() # The mean of single values
output | cost | price | load | ||
---|---|---|---|---|---|
firm | year | ||||
A | 1970 | 0.952757 | 1140640 | 106650 | 0.534487 |
1971 | 0.986757 | 1215690 | 110307 | 0.532328 | |
1972 | 1.091980 | 1309570 | 110574 | 0.547736 | |
1973 | 1.175780 | 1511530 | 121974 | 0.540846 | |
1974 | 1.160170 | 1676730 | 196606 | 0.591167 | |
... | ... | ... | ... | ... | ... |
F | 1980 | 0.112640 | 381478 | 874818 | 0.517766 |
1981 | 0.154154 | 506969 | 1013170 | 0.580049 | |
1982 | 0.186461 | 633388 | 930477 | 0.556024 | |
1983 | 0.246847 | 804388 | 851676 | 0.537791 | |
1984 | 0.304013 | 1009500 | 819476 | 0.525775 |
90 rows × 4 columns
stack and unstack can be used to reshape hierarchical index dataframes.
electricity_df.reset_index(inplace=True)
electricity_df
firm | year | output | cost | price | load | |
---|---|---|---|---|---|---|
0 | A | 1970 | 0.952757 | 1140640 | 106650 | 0.534487 |
1 | A | 1971 | 0.986757 | 1215690 | 110307 | 0.532328 |
2 | A | 1972 | 1.091980 | 1309570 | 110574 | 0.547736 |
3 | A | 1973 | 1.175780 | 1511530 | 121974 | 0.540846 |
4 | A | 1974 | 1.160170 | 1676730 | 196606 | 0.591167 |
... | ... | ... | ... | ... | ... | ... |
85 | F | 1980 | 0.112640 | 381478 | 874818 | 0.517766 |
86 | F | 1981 | 0.154154 | 506969 | 1013170 | 0.580049 |
87 | F | 1982 | 0.186461 | 633388 | 930477 | 0.556024 |
88 | F | 1983 | 0.246847 | 804388 | 851676 | 0.537791 |
89 | F | 1984 | 0.304013 | 1009500 | 819476 | 0.525775 |
90 rows × 6 columns
Stack turns a dataframe into a series.
data_series= electricity_df.stack()
data_series
0 firm A
year 1970
output 0.952757
cost 1140640
price 106650
...
89 year 1984
output 0.304013
cost 1009500
price 819476
load 0.525775
Length: 540, dtype: object
unstack() can be used to rearrange the data back into a dataframe.
data_series.unstack()
firm | year | output | cost | price | load | |
---|---|---|---|---|---|---|
0 | A | 1970 | 0.952757 | 1140640 | 106650 | 0.534487 |
1 | A | 1971 | 0.986757 | 1215690 | 110307 | 0.532328 |
2 | A | 1972 | 1.09198 | 1309570 | 110574 | 0.547736 |
3 | A | 1973 | 1.17578 | 1511530 | 121974 | 0.540846 |
4 | A | 1974 | 1.16017 | 1676730 | 196606 | 0.591167 |
... | ... | ... | ... | ... | ... | ... |
85 | F | 1980 | 0.11264 | 381478 | 874818 | 0.517766 |
86 | F | 1981 | 0.154154 | 506969 | 1013170 | 0.580049 |
87 | F | 1982 | 0.186461 | 633388 | 930477 | 0.556024 |
88 | F | 1983 | 0.246847 | 804388 | 851676 | 0.537791 |
89 | F | 1984 | 0.304013 | 1009500 | 819476 | 0.525775 |
90 rows × 6 columns
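The stack/unstack pair becomes more useful with a hierarchical index. For example, the firm-year means computed earlier can be spread into a wide firm-by-year table (a minimal sketch):

electricity_df.groupby(['firm', 'year'])['cost'].mean().unstack()  # firms as rows, years as columns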
Yet another tool to reorganise data is pivot_table. More about it later.
4.3. Merging datasets#
Merging dataframes is often difficult for Pandas beginners and can be a hassle. The usual cause of difficulties is forgetting the importance of the index in Pandas data structures. Merging dataframes is not about gluing tables together; the merging is done according to index (or key) values.
gasoline_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Gasoline.csv',index_col=0)
gasoline_df.set_index('country',inplace=True)
gasoline_df
year | lgaspcar | lincomep | lrpmg | lcarpcap | |
---|---|---|---|---|---|
country | |||||
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.334548 | -9.766840 |
AUSTRIA | 1961 | 4.100989 | -6.426006 | -0.351328 | -9.608622 |
AUSTRIA | 1962 | 4.073177 | -6.407308 | -0.379518 | -9.457257 |
AUSTRIA | 1963 | 4.059509 | -6.370679 | -0.414251 | -9.343155 |
AUSTRIA | 1964 | 4.037689 | -6.322247 | -0.445335 | -9.237739 |
... | ... | ... | ... | ... | ... |
U.S.A. | 1974 | 4.798626 | -5.328694 | -1.231467 | -7.617558 |
U.S.A. | 1975 | 4.804932 | -5.346190 | -1.200377 | -7.607010 |
U.S.A. | 1976 | 4.814891 | -5.297946 | -1.154682 | -7.574748 |
U.S.A. | 1977 | 4.811032 | -5.256606 | -1.175910 | -7.553458 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.212062 | -7.536176 |
342 rows × 5 columns
Let’s split the data.
gaso2_df = gasoline_df[['lrpmg','lcarpcap']]
gasoline_df.drop(['lrpmg','lcarpcap'],axis=1,inplace=True)
gasoline_df
year | lgaspcar | lincomep | |
---|---|---|---|
country | |||
AUSTRIA | 1960 | 4.173244 | -6.474277 |
AUSTRIA | 1961 | 4.100989 | -6.426006 |
AUSTRIA | 1962 | 4.073177 | -6.407308 |
AUSTRIA | 1963 | 4.059509 | -6.370679 |
AUSTRIA | 1964 | 4.037689 | -6.322247 |
... | ... | ... | ... |
U.S.A. | 1974 | 4.798626 | -5.328694 |
U.S.A. | 1975 | 4.804932 | -5.346190 |
U.S.A. | 1976 | 4.814891 | -5.297946 |
U.S.A. | 1977 | 4.811032 | -5.256606 |
U.S.A. | 1978 | 4.818454 | -5.221232 |
342 rows × 3 columns
gaso2_df
lrpmg | lcarpcap | |
---|---|---|
country | ||
AUSTRIA | -0.334548 | -9.766840 |
AUSTRIA | -0.351328 | -9.608622 |
AUSTRIA | -0.379518 | -9.457257 |
AUSTRIA | -0.414251 | -9.343155 |
AUSTRIA | -0.445335 | -9.237739 |
... | ... | ... |
U.S.A. | -1.231467 | -7.617558 |
U.S.A. | -1.200377 | -7.607010 |
U.S.A. | -1.154682 | -7.574748 |
U.S.A. | -1.175910 | -7.553458 |
U.S.A. | -1.212062 | -7.536176 |
342 rows × 2 columns
Merge combines the dataframes according to the key values, here the country index. Because country is not a unique key, this is a many-to-many merge: every row of a country in one dataframe is matched with every row of the same country in the other, which is why the result has 6498 rows instead of 342.
pd.merge(gasoline_df,gaso2_df,on='country')
year | lgaspcar | lincomep | lrpmg | lcarpcap | |
---|---|---|---|---|---|
country | |||||
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.334548 | -9.766840 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.351328 | -9.608622 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.379518 | -9.457257 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.414251 | -9.343155 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.445335 | -9.237739 |
... | ... | ... | ... | ... | ... |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.231467 | -7.617558 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.200377 | -7.607010 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.154682 | -7.574748 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.175910 | -7.553458 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.212062 | -7.536176 |
6498 rows × 5 columns
You could also do this by using both indices for merging.
pd.merge(gasoline_df,gaso2_df,left_index=True,right_index=True)
year | lgaspcar | lincomep | lrpmg | lcarpcap | |
---|---|---|---|---|---|
country | |||||
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.334548 | -9.766840 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.351328 | -9.608622 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.379518 | -9.457257 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.414251 | -9.343155 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.445335 | -9.237739 |
... | ... | ... | ... | ... | ... |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.231467 | -7.617558 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.200377 | -7.607010 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.154682 | -7.574748 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.175910 | -7.553458 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.212062 | -7.536176 |
6498 rows × 5 columns
You can also use join. It uses the index values for merging by default.
gasoline_df.join(gaso2_df)
year | lgaspcar | lincomep | lrpmg | lcarpcap | |
---|---|---|---|---|---|
country | |||||
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.334548 | -9.766840 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.351328 | -9.608622 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.379518 | -9.457257 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.414251 | -9.343155 |
AUSTRIA | 1960 | 4.173244 | -6.474277 | -0.445335 | -9.237739 |
... | ... | ... | ... | ... | ... |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.231467 | -7.617558 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.200377 | -7.607010 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.154682 | -7.574748 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.175910 | -7.553458 |
U.S.A. | 1978 | 4.818454 | -5.221232 | -1.212062 | -7.536176 |
6498 rows × 5 columns
Concat is a more general tool for combining data. It is easiest to understand how it works if we use very simple datasets. Let’s create three simple Pandas Series.
x1 = pd.Series([20,30,40], index=['a','b','c'])
x2 = pd.Series([50,60], index=['e','f'])
x3 = pd.Series([70,80,90], index=['h','i','j'])
If we just feed them to concat, they are joined together as a longer Series.
pd.concat([x1,x2,x3])
a 20
b 30
c 40
e 50
f 60
h 70
i 80
j 90
dtype: int64
If you change the axis, each series becomes its own column of a dataframe. Because there are no overlapping index values, most of the cells are NaN.
pd.concat([x1,x2,x3],axis=1)
0 | 1 | 2 | |
---|---|---|---|
a | 20.0 | NaN | NaN |
b | 30.0 | NaN | NaN |
c | 40.0 | NaN | NaN |
e | NaN | 50.0 | NaN |
f | NaN | 60.0 | NaN |
h | NaN | NaN | 70.0 |
i | NaN | NaN | 80.0 |
j | NaN | NaN | 90.0 |
x4 = pd.concat([x1,x2])
pd.concat([x4,x2],axis=1)
0 | 1 | |
---|---|---|
a | 20 | NaN |
b | 30 | NaN |
c | 40 | NaN |
e | 50 | 50.0 |
f | 60 | 60.0 |
If you just want to keep the intersection of the indices, you can use join='inner'.
pd.concat([x4,x2],axis=1,join='inner')
0 | 1 | |
---|---|---|
e | 50 | 50 |
f | 60 | 60 |
4.4. Data aggregation#
Data aggregation is one of the most important steps in data preprocessing. Pandas has many tools for that.
Groupby is an important tool in aggregation. Let’s load a new dataset for this.
import pandas as pd
comp_returns_df = pd.read_csv('comp_returns.csv',index_col=0)
comp_returns_df
NAME | ISO COUNTRY CODE | RETURN ON EQUITY - TOTAL (%) | TOTAL DEBT % COMMON EQUITY | RESEARCH & DEVELOPMENT/SALES - | ACCOUNTING STANDARDS FOLLOWED | |
---|---|---|---|---|---|---|
0 | APPLE | US | 55.92 | 119.40 | 4.95 | US standards (GAAP) |
1 | SAUDI ARABIAN OIL | SA | 32.25 | 16.96 | NaN | IFRS |
2 | MICROSOFT | US | 40.14 | 61.56 | 13.59 | US standards (GAAP) |
3 | AMAZON.COM | US | 21.95 | 83.31 | 12.29 | US standards (GAAP) |
4 | FACEBOOK CLASS A | US | 19.96 | 0.47 | 21.00 | US standards (GAAP) |
5 | TENCENT HOLDINGS | HK | 24.68 | 53.74 | 7.72 | IFRS |
6 | ALPHABET A | US | 18.12 | 2.26 | 15.71 | US standards (GAAP) |
7 | TESLA | US | -14.94 | 202.77 | 10.73 | US standards (GAAP) |
8 | WALMART | US | 20.22 | 72.95 | NaN | US standards (GAAP) |
9 | JOHNSON & JOHNSON | US | 25.36 | 46.57 | 13.28 | US standards (GAAP) |
10 | TAIWAN SEMICON.MNFG. | TW | 20.94 | 11.89 | 8.08 | IFRS |
11 | NESTLE 'N' | CH | 23.05 | 71.42 | 1.88 | IFRS |
12 | PROCTER & GAMBLE | US | 27.78 | 76.10 | 2.79 | US standards (GAAP) |
13 | VISA 'A' | US | 40.34 | 57.25 | NaN | US standards (GAAP) |
14 | MASTERCARD | US | 143.83 | 144.70 | NaN | US standards (GAAP) |
15 | NVIDIA | US | 25.95 | 16.31 | 22.49 | US standards (GAAP) |
16 | 3I GROUP | GB | 2.73 | 7.67 | NaN | IFRS |
17 | JP MORGAN CHASE & CO. | US | 14.91 | 220.24 | NaN | US standards (GAAP) |
18 | HOME DEPOT | US | NaN | -1010.37 | NaN | US standards (GAAP) |
19 | KWEICHOW MOUTAI 'A' | CN | 33.12 | 9.14 | 0.85 | Local standards |
20 | SAMSUNG ELECTRONICS | KR | 8.69 | 7.22 | 7.39 | IFRS |
21 | UNITEDHEALTH GROUP | US | 25.40 | 71.33 | NaN | US standards (GAAP) |
22 | ROCHE HOLDING | CH | 44.71 | 47.58 | 19.38 | IFRS |
23 | VERIZON COMMUNICATIONS | US | 33.64 | 181.59 | NaN | US standards (GAAP) |
24 | LVMH | FR | 20.82 | 68.97 | 0.28 | IFRS |
25 | NETFLIX | US | 29.12 | 204.38 | 8.73 | US standards (GAAP) |
26 | ADOBE (NAS) | US | 29.67 | 39.30 | 17.15 | US standards (GAAP) |
27 | TOYOTA MOTOR | JP | 10.45 | 102.45 | 3.66 | US standards (GAAP) |
28 | PAYPAL HOLDINGS | US | 15.24 | 29.48 | 8.49 | US standards (GAAP) |
29 | SALESFORCE.COM | US | 0.51 | 9.04 | 14.76 | US standards (GAAP) |
Now, we can group the data by the accounting standards variable and calculate the mean. Notice how Pandas automatically drops the string variables (name and country code), because you cannot calculate a mean from strings:
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').mean()
RETURN ON EQUITY - TOTAL (%) | TOTAL DEBT % COMMON EQUITY | RESEARCH & DEVELOPMENT/SALES - | |
---|---|---|---|
ACCOUNTING STANDARDS FOLLOWED | |||
IFRS | 22.23375 | 35.68125 | 7.455000 |
Local standards | 33.12000 | 9.14000 | 0.850000 |
US standards (GAAP) | 29.17850 | 34.81381 | 12.115714 |
You can make a two-level grouping by adding a list of columns to groupby.
comp_returns_df.groupby(['ISO COUNTRY CODE','ACCOUNTING STANDARDS FOLLOWED']).mean()
RETURN ON EQUITY - TOTAL (%) | TOTAL DEBT % COMMON EQUITY | RESEARCH & DEVELOPMENT/SALES - | ||
---|---|---|---|---|
ISO COUNTRY CODE | ACCOUNTING STANDARDS FOLLOWED | |||
CH | IFRS | 33.880000 | 59.500 | 10.630000 |
CN | Local standards | 33.120000 | 9.140 | 0.850000 |
FR | IFRS | 20.820000 | 68.970 | 0.280000 |
GB | IFRS | 2.730000 | 7.670 | NaN |
HK | IFRS | 24.680000 | 53.740 | 7.720000 |
JP | US standards (GAAP) | 10.450000 | 102.450 | 3.660000 |
KR | IFRS | 8.690000 | 7.220 | 7.390000 |
SA | IFRS | 32.250000 | 16.960 | NaN |
TW | IFRS | 20.940000 | 11.890 | 8.080000 |
US | US standards (GAAP) | 30.164211 | 31.432 | 12.766154 |
There are many other groupby methods, like size(). The full list of groupby methods is here: pandas.pydata.org/pandas-docs/stable/reference/groupby.html
comp_returns_df.groupby('ISO COUNTRY CODE').size()
ISO COUNTRY CODE
CH 2
CN 1
FR 1
GB 1
HK 1
JP 1
KR 1
SA 1
TW 1
US 20
dtype: int64
If we just use the groupby method without a following method, it will return a Pandas groupby object.
comp_returns_df.groupby('ISO COUNTRY CODE')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000160F39C3B80>
This object can be used for iteration.
for name,group in comp_returns_df.groupby('ISO COUNTRY CODE'):
print(name)
print(len(group))
CH
2
CN
1
FR
1
GB
1
HK
1
JP
1
KR
1
SA
1
TW
1
US
20
You could also group data according to columns using axis=1. It does not make much sense with this data, so we skip that example.
You can easily pick up just one column from a groupby object.
comp_returns_df.groupby('ISO COUNTRY CODE')['RETURN ON EQUITY - TOTAL (%)'].mean()
ISO COUNTRY CODE
CH 33.880000
CN 33.120000
FR 20.820000
GB 2.730000
HK 24.680000
JP 10.450000
KR 8.690000
SA 32.250000
TW 20.940000
US 30.164211
Name: RETURN ON EQUITY - TOTAL (%), dtype: float64
You could also group data by defining a dict, a function, etc. We do not go into these advanced methods here.
4.4.1. Data aggregation#
Now that we know the basics of groupby, we can analyse in more detail how it can be used to aggregate data. The built-in groupby aggregation methods are count, sum, mean, median, std, var, min, max, prod, first and last. These are optimised for groupby objects, but many other methods work too. Actually, you can apply your own functions with groupby.agg().
def mean_median(arr):
return arr.mean()-arr.median()
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').agg(mean_median)
RETURN ON EQUITY - TOTAL (%) | TOTAL DEBT % COMMON EQUITY | RESEARCH & DEVELOPMENT/SALES - | |
---|---|---|---|
ACCOUNTING STANDARDS FOLLOWED | |||
IFRS | 0.23875 | 3.41125 | -0.100000 |
Local standards | 0.00000 | 0.00000 | 0.000000 |
US standards (GAAP) | 3.79850 | -36.51619 | -0.669286 |
Basically, any Pandas method that performs some kind of aggregation will work with groupby. For example, quantile can be used.
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').quantile(0.9)
RETURN ON EQUITY - TOTAL (%) | TOTAL DEBT % COMMON EQUITY | RESEARCH & DEVELOPMENT/SALES - | |
---|---|---|---|
ACCOUNTING STANDARDS FOLLOWED | |||
IFRS | 35.988 | 69.705 | 13.730 |
Local standards | 33.120 | 9.140 | 0.850 |
US standards (GAAP) | 41.898 | 202.770 | 19.845 |
Even though describe is not an aggregating function, it will also work.
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').describe()
RETURN ON EQUITY - TOTAL (%) | TOTAL DEBT % COMMON EQUITY | RESEARCH & DEVELOPMENT/SALES - | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
ACCOUNTING STANDARDS FOLLOWED | |||||||||||||||||||||
IFRS | 8.0 | 22.23375 | 12.981632 | 2.73 | 17.7875 | 21.995 | 26.5725 | 44.71 | 8.0 | 35.68125 | ... | 57.5475 | 71.42 | 6.0 | 7.455000 | 6.711661 | 0.28 | 3.2575 | 7.555 | 7.9900 | 19.38 |
Local standards | 1.0 | 33.12000 | NaN | 33.12 | 33.1200 | 33.120 | 33.1200 | 33.12 | 1.0 | 9.14000 | ... | 9.1400 | 9.14 | 1.0 | 0.850000 | NaN | 0.85 | 0.8500 | 0.850 | 0.8500 | 0.85 |
US standards (GAAP) | 20.0 | 29.17850 | 30.798945 | -14.94 | 17.4000 | 25.380 | 30.6625 | 143.83 | 21.0 | 34.81381 | ... | 119.4000 | 220.24 | 14.0 | 12.115714 | 6.023770 | 2.79 | 8.5500 | 12.785 | 15.4725 | 22.49 |
3 rows × 24 columns
You can make the previous table more readable by unstacking it.
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').describe().unstack()
ACCOUNTING STANDARDS FOLLOWED
RETURN ON EQUITY - TOTAL (%) count IFRS 8.00000
Local standards 1.00000
US standards (GAAP) 20.00000
mean IFRS 22.23375
Local standards 33.12000
...
RESEARCH & DEVELOPMENT/SALES - 75% Local standards 0.85000
US standards (GAAP) 15.47250
max IFRS 19.38000
Local standards 0.85000
US standards (GAAP) 22.49000
Length: 72, dtype: float64
Apply() is the most versatile method to use with groupby objects.
def medalists(df,var = 'RETURN ON EQUITY - TOTAL (%)'):
return df.sort_values(by=var)[-3:]
comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').apply(medalists)
NAME | ISO COUNTRY CODE | RETURN ON EQUITY - TOTAL (%) | TOTAL DEBT % COMMON EQUITY | RESEARCH & DEVELOPMENT/SALES - | ACCOUNTING STANDARDS FOLLOWED | ||
---|---|---|---|---|---|---|---|
ACCOUNTING STANDARDS FOLLOWED | |||||||
IFRS | 5 | TENCENT HOLDINGS | HK | 24.68 | 53.74 | 7.72 | IFRS |
1 | SAUDI ARABIAN OIL | SA | 32.25 | 16.96 | NaN | IFRS | |
22 | ROCHE HOLDING | CH | 44.71 | 47.58 | 19.38 | IFRS | |
Local standards | 19 | KWEICHOW MOUTAI 'A' | CN | 33.12 | 9.14 | 0.85 | Local standards |
US standards (GAAP) | 0 | APPLE | US | 55.92 | 119.40 | 4.95 | US standards (GAAP) |
14 | MASTERCARD | US | 143.83 | 144.70 | NaN | US standards (GAAP) | |
18 | HOME DEPOT | US | NaN | -1010.37 | NaN | US standards (GAAP) |
It is also possible to use the bins created with cut or qcut as a grouping criterion in groupby.
ROE_quartiles = pd.qcut(comp_returns_df['RETURN ON EQUITY - TOTAL (%)'],4)
comp_returns_df.groupby(ROE_quartiles).size()
RETURN ON EQUITY - TOTAL (%)
(-14.940999999999999, 18.12] 8
(18.12, 24.68] 7
(24.68, 32.25] 7
(32.25, 143.83] 7
dtype: int64
comp_returns_df.groupby(ROE_quartiles).mean()
RETURN ON EQUITY - TOTAL (%) | TOTAL DEBT % COMMON EQUITY | RESEARCH & DEVELOPMENT/SALES - | |
---|---|---|---|
RETURN ON EQUITY - TOTAL (%) | |||
(-14.940999999999999, 18.12] | 6.963750 | 72.641250 | 10.123333 |
(18.12, 24.68] | 21.660000 | 51.821429 | 8.541667 |
(24.68, 32.25] | 27.932857 | 67.278571 | 12.888000 |
(32.25, 143.83] | 55.957143 | 88.745714 | 9.692500 |
Pivot tables are another option to organise data with Pandas. They are very popular in spreadsheet software, like Excel. A pivot table is very similar to the cross-tabs that we use in statistics. Pandas has a specific function for pivot tables.
Let’s load a new dataset.
large_df = pd.read_csv('large_table.csv',delimiter=';')
large_df.drop('Type',inplace=True,axis=1)
large_df
NAME | ISO COUNTRY CODE | RETURN ON EQUITY - TOTAL (%) | TOTAL DEBT % COMMON EQUITY | RESEARCH & DEVELOPMENT/SALES - | INDUSTRY GROUP | ACCOUNTING STANDARDS FOLLOWED | CSR Sustainability External Audit | Audit Committee | CSR Sustainability Committee | |
---|---|---|---|---|---|---|---|---|---|---|
0 | APPLE | US | 55.92 | 119.40 | 4.95 | 34 | US standards (GAAP) | Y | Y | Y |
1 | SAUDI ARABIAN OIL | SA | 32.25 | 16.96 | NaN | 97 | IFRS | NaN | Y | Y |
2 | MICROSOFT | US | 40.14 | 61.56 | 13.59 | 58 | US standards (GAAP) | NaN | Y | Y |
3 | AMAZON.COM | US | 21.95 | 83.31 | 12.29 | 87 | US standards (GAAP) | Y | Y | Y |
4 | FACEBOOK CLASS A | US | 19.96 | 0.47 | 21.00 | 151 | US standards (GAAP) | Y | Y | Y |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
100 | CHINA MERCHANTS BANK 'A' | CN | 16.84 | 301.52 | NaN | 102 | IFRS | Y | Y | Y |
101 | ENEL | IT | 7.00 | 202.46 | NaN | 169 | IFRS | Y | Y | Y |
102 | JD COM ADR 1:2 | CN | 17.28 | 12.28 | 2.18 | 87 | US standards (GAAP) | NaN | Y | N |
103 | KEYENCE | JP | 11.84 | 0.00 | 2.76 | 214 | Local standards | NaN | N | N |
104 | HSBC HOLDINGS | GB | 3.71 | 328.78 | NaN | 102 | IFRS | Y | Y | Y |
105 rows × 10 columns
The default pivot_table aggregation type is mean.
large_df.columns
Index(['NAME', 'ISO COUNTRY CODE', 'RETURN ON EQUITY - TOTAL (%)',
'TOTAL DEBT % COMMON EQUITY', 'RESEARCH & DEVELOPMENT/SALES -',
'INDUSTRY GROUP', 'ACCOUNTING STANDARDS FOLLOWED',
'CSR Sustainability External Audit', 'Audit Committee',
'CSR Sustainability Committee'],
dtype='object')
large_df.pivot_table(values='RETURN ON EQUITY - TOTAL (%)',
index='CSR Sustainability External Audit',columns='ACCOUNTING STANDARDS FOLLOWED')
ACCOUNTING STANDARDS FOLLOWED | IFRS | Local standards | US standards (GAAP) |
---|---|---|---|
CSR Sustainability External Audit | |||
N | NaN | NaN | 22.380000 |
Y | 18.968333 | 23.315 | 46.280556 |
You can use a hierarchical index and add margins (the All row and column) to the table.
large_df.pivot_table(values='RETURN ON EQUITY - TOTAL (%)',
index=['CSR Sustainability External Audit','Audit Committee'],
columns='ACCOUNTING STANDARDS FOLLOWED',margins=True)
ACCOUNTING STANDARDS FOLLOWED | IFRS | Local standards | US standards (GAAP) | All | |
---|---|---|---|---|---|
CSR Sustainability External Audit | Audit Committee | ||||
N | Y | NaN | NaN | 22.380000 | 22.380000 |
Y | N | 24.660000 | NaN | 10.450000 | 17.555000 |
Y | 18.720870 | 23.315 | 47.304286 | 35.547667 | |
All | 18.968333 | 23.315 | 44.442051 | 34.386308 |
You can also use hierarchical columns.
large_df.pivot_table(values='RESEARCH & DEVELOPMENT/SALES -',
columns=['CSR Sustainability Committee','Audit Committee'],
index='ISO COUNTRY CODE',margins=True)
CSR Sustainability Committee | N | Y | All | ||
---|---|---|---|---|---|
Audit Committee | N | Y | N | Y | |
ISO COUNTRY CODE | |||||
CA | NaN | 19.890000 | NaN | NaN | 19.890000 |
CH | NaN | NaN | NaN | 12.720000 | 12.720000 |
CN | 0.850 | 2.180000 | 0.280000 | NaN | 1.103333 |
DE | NaN | NaN | NaN | 10.315000 | 10.315000 |
DK | NaN | NaN | NaN | 11.850000 | 11.850000 |
FR | NaN | NaN | NaN | 4.902500 | 4.902500 |
GB | NaN | NaN | NaN | 18.015000 | 18.015000 |
HK | NaN | 15.420000 | NaN | 7.720000 | 11.570000 |
IN | NaN | NaN | NaN | 0.220000 | 0.220000 |
JP | 2.760 | NaN | 2.505000 | 1.890000 | 2.415000 |
KR | NaN | NaN | NaN | 7.390000 | 7.390000 |
NL | NaN | NaN | NaN | 14.710000 | 14.710000 |
TW | NaN | NaN | NaN | 8.080000 | 8.080000 |
US | NaN | 13.755000 | NaN | 10.564242 | 11.055128 |
All | 1.805 | 13.335556 | 1.763333 | 9.906346 | 9.758333 |
Cross-tabs are a special case of pivot tables, where the values are frequencies.
large_df.pivot_table(columns='CSR Sustainability Committee',index='ISO COUNTRY CODE',
aggfunc='count',margins=True)
ACCOUNTING STANDARDS FOLLOWED | Audit Committee | CSR Sustainability External Audit | INDUSTRY GROUP | ... | NAME | RESEARCH & DEVELOPMENT/SALES - | RETURN ON EQUITY - TOTAL (%) | TOTAL DEBT % COMMON EQUITY | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CSR Sustainability Committee | N | Y | All | N | Y | All | N | Y | All | N | ... | All | N | Y | All | N | Y | All | N | Y | All |
ISO COUNTRY CODE | |||||||||||||||||||||
CA | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 0.0 | NaN | NaN | 1.0 | ... | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN |
CH | NaN | 3.0 | 3.0 | NaN | 3.0 | 3.0 | NaN | 3.0 | 3.0 | NaN | ... | 3.0 | NaN | 3.0 | 3.0 | NaN | 3.0 | 3.0 | NaN | 3.0 | 3.0 |
CN | 3.0 | 6.0 | NaN | 3.0 | 6.0 | NaN | 1.0 | 4.0 | NaN | 3.0 | ... | NaN | 2.0 | 1.0 | NaN | 3.0 | 6.0 | NaN | 3.0 | 6.0 | NaN |
DE | NaN | 2.0 | 2.0 | NaN | 2.0 | 2.0 | NaN | 2.0 | 2.0 | NaN | ... | 2.0 | NaN | 2.0 | 2.0 | NaN | 2.0 | 2.0 | NaN | 2.0 | 2.0 |
DK | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | ... | 1.0 | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 |
FR | NaN | 4.0 | 4.0 | NaN | 4.0 | 4.0 | NaN | 4.0 | 4.0 | NaN | ... | 4.0 | NaN | 4.0 | 4.0 | NaN | 4.0 | 4.0 | NaN | 4.0 | 4.0 |
GB | NaN | 3.0 | 2.0 | NaN | 3.0 | 2.0 | NaN | 3.0 | 2.0 | NaN | ... | 2.0 | NaN | 2.0 | 2.0 | NaN | 3.0 | 2.0 | NaN | 3.0 | 2.0 |
HK | 1.0 | 3.0 | NaN | 1.0 | 3.0 | NaN | 0.0 | 2.0 | NaN | 1.0 | ... | NaN | 1.0 | 1.0 | NaN | 1.0 | 3.0 | NaN | 1.0 | 3.0 | NaN |
IN | NaN | 2.0 | 2.0 | NaN | 2.0 | 2.0 | NaN | 2.0 | 2.0 | NaN | ... | 2.0 | NaN | 2.0 | 2.0 | NaN | 2.0 | 2.0 | NaN | 2.0 | 2.0 |
IT | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | ... | NaN | NaN | 0.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN |
JP | 1.0 | 3.0 | 2.0 | 1.0 | 3.0 | 2.0 | 0.0 | 2.0 | 2.0 | 1.0 | ... | 2.0 | 1.0 | 3.0 | 2.0 | 1.0 | 3.0 | 2.0 | 1.0 | 3.0 | 2.0 |
KR | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | ... | 1.0 | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 |
NL | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
SA | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 0.0 | NaN | NaN | ... | NaN | NaN | 0.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN |
TW | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | ... | 1.0 | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 |
US | 11.0 | 55.0 | 23.0 | 11.0 | 55.0 | 23.0 | 4.0 | 34.0 | 23.0 | 11.0 | ... | 23.0 | 6.0 | 33.0 | 23.0 | 11.0 | 51.0 | 23.0 | 11.0 | 55.0 | 23.0 |
All | 2.0 | 40.0 | 42.0 | 2.0 | 40.0 | 42.0 | 2.0 | 40.0 | 42.0 | 2.0 | ... | 42.0 | 2.0 | 40.0 | 42.0 | 2.0 | 40.0 | 42.0 | 2.0 | 40.0 | 42.0 |
17 rows × 24 columns
As you can see, it returns counts for all variables. Therefore it is more convenient to use crosstab.
pd.crosstab(large_df['CSR Sustainability External Audit'],large_df['ISO COUNTRY CODE'],margins=True)
ISO COUNTRY CODE | CH | CN | DE | DK | FR | GB | HK | IN | IT | JP | KR | NL | TW | US | All |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CSR Sustainability External Audit | |||||||||||||||
N | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3 |
Y | 3 | 5 | 2 | 1 | 4 | 3 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 35 | 63 |
All | 3 | 5 | 2 | 1 | 4 | 3 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 38 | 66 |