{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas Data preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might be wondering why we are talking so much about data processing and manipulation in this book. \"Where are all the fancy ML methods\" you might ask. Unfortunately, most of the time in data science goes to data pre-processing. It is often reported that 80 % of a data scientist's time goes to data preparation: cleaning, transforming, rearranging and creating suitable features (feature engineering). So, to be a succesfull data scientist, you need know how to play with data. Luckily, we have Pandas at our disposal, wich is one of the most powerful data manipulation tools available.\n", "\n", "Let's look at some of the most common preprocessing situatsions we encounter when doing data science." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing data is probably the most common issue with data." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "companies_df = pd.read_csv('emissions.csv',index_col='NAME',delimiter=';')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
APPLE2832025.003.13NaN
SAUDI ARABIAN OIL3220550.15175.59NaN
MICROSOFT4336000.0029.1675.0
AMAZON.COM6567669.0040.1450.0
ALIBABA GROUP HOLDING ADR 1:8NaNNaNNaN
FACEBOOK CLASS A2291182.0025.6975.0
TENCENT HOLDINGS2681758.8015.81NaN
ALPHABET A50294471.0032.2250.0
TESLA17315009.00NaNNaN
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
NESTLE 'N'12906002.1158.5235.0
NVIDIANaN7.1025.0
SAMSUNG ELECTRONICSNaN69.0670.0
JP MORGAN CHASE & CO.4460719.006.6950.0
KWEICHOW MOUTAI 'A'40219.20NaNNaN
NETFLIX3665870.00NaNNaN
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "APPLE 2832025.00 \n", "SAUDI ARABIAN OIL 3220550.15 \n", "MICROSOFT 4336000.00 \n", "AMAZON.COM 6567669.00 \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "FACEBOOK CLASS A 2291182.00 \n", "TENCENT HOLDINGS 2681758.80 \n", "ALPHABET A 50294471.00 \n", "TESLA 17315009.00 \n", "TAIWAN SEMICON.MNFG. 12584165.55 \n", "NESTLE 'N' 12906002.11 \n", "NVIDIA NaN \n", "SAMSUNG ELECTRONICS NaN \n", "JP MORGAN CHASE & CO. 4460719.00 \n", "KWEICHOW MOUTAI 'A' 40219.20 \n", "NETFLIX 3665870.00 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "APPLE 3.13 \n", "SAUDI ARABIAN OIL 175.59 \n", "MICROSOFT 29.16 \n", "AMAZON.COM 40.14 \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "FACEBOOK CLASS A 25.69 \n", "TENCENT HOLDINGS 15.81 \n", "ALPHABET A 32.22 \n", "TESLA NaN \n", "TAIWAN SEMICON.MNFG. 245.17 \n", "NESTLE 'N' 58.52 \n", "NVIDIA 7.10 \n", "SAMSUNG ELECTRONICS 69.06 \n", "JP MORGAN CHASE & CO. 6.69 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NETFLIX NaN \n", "\n", " Emission Reduction Target % \n", "NAME \n", "APPLE NaN \n", "SAUDI ARABIAN OIL NaN \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "FACEBOOK CLASS A 75.0 \n", "TENCENT HOLDINGS NaN \n", "ALPHABET A 50.0 \n", "TESLA NaN \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "NESTLE 'N' 35.0 \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NETFLIX NaN " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Dropna** can be used to filter out missing data. The **how** parameter defines do all/any of the values in rows/columns need to be zero for dropping." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
APPLE2832025.003.13NaN
SAUDI ARABIAN OIL3220550.15175.59NaN
MICROSOFT4336000.0029.1675.0
AMAZON.COM6567669.0040.1450.0
FACEBOOK CLASS A2291182.0025.6975.0
TENCENT HOLDINGS2681758.8015.81NaN
ALPHABET A50294471.0032.2250.0
TESLA17315009.00NaNNaN
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
NESTLE 'N'12906002.1158.5235.0
NVIDIANaN7.1025.0
SAMSUNG ELECTRONICSNaN69.0670.0
JP MORGAN CHASE & CO.4460719.006.6950.0
KWEICHOW MOUTAI 'A'40219.20NaNNaN
NETFLIX3665870.00NaNNaN
\n", "
" ], "text/plain": [ " Board member compensation Total CO2 To Revenues \\\n", "NAME \n", "APPLE 2832025.00 3.13 \n", "SAUDI ARABIAN OIL 3220550.15 175.59 \n", "MICROSOFT 4336000.00 29.16 \n", "AMAZON.COM 6567669.00 40.14 \n", "FACEBOOK CLASS A 2291182.00 25.69 \n", "TENCENT HOLDINGS 2681758.80 15.81 \n", "ALPHABET A 50294471.00 32.22 \n", "TESLA 17315009.00 NaN \n", "TAIWAN SEMICON.MNFG. 12584165.55 245.17 \n", "NESTLE 'N' 12906002.11 58.52 \n", "NVIDIA NaN 7.10 \n", "SAMSUNG ELECTRONICS NaN 69.06 \n", "JP MORGAN CHASE & CO. 4460719.00 6.69 \n", "KWEICHOW MOUTAI 'A' 40219.20 NaN \n", "NETFLIX 3665870.00 NaN \n", "\n", " Emission Reduction Target % \n", "NAME \n", "APPLE NaN \n", "SAUDI ARABIAN OIL NaN \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "FACEBOOK CLASS A 75.0 \n", "TENCENT HOLDINGS NaN \n", "ALPHABET A 50.0 \n", "TESLA NaN \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "NESTLE 'N' 35.0 \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NETFLIX NaN " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.dropna(how='all')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
MICROSOFT4336000.0029.1675.0
AMAZON.COM6567669.0040.1450.0
FACEBOOK CLASS A2291182.0025.6975.0
ALPHABET A50294471.0032.2250.0
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
NESTLE 'N'12906002.1158.5235.0
JP MORGAN CHASE & CO.4460719.006.6950.0
\n", "
" ], "text/plain": [ " Board member compensation Total CO2 To Revenues \\\n", "NAME \n", "MICROSOFT 4336000.00 29.16 \n", "AMAZON.COM 6567669.00 40.14 \n", "FACEBOOK CLASS A 2291182.00 25.69 \n", "ALPHABET A 50294471.00 32.22 \n", "TAIWAN SEMICON.MNFG. 12584165.55 245.17 \n", "NESTLE 'N' 12906002.11 58.52 \n", "JP MORGAN CHASE & CO. 4460719.00 6.69 \n", "\n", " Emission Reduction Target % \n", "NAME \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "FACEBOOK CLASS A 75.0 \n", "ALPHABET A 50.0 \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "NESTLE 'N' 35.0 \n", "JP MORGAN CHASE & CO. 50.0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.dropna(how='any')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can again use the **axis** parameter. However, it is not meaningful to use here because every column has NaN-values (**any** returns an empty table), and not all values are NaN in any column (**all** returns the original dataframe)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With **thresh** you can define how many of the values are allowed to be NaN." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
APPLE2832025.003.13NaN
SAUDI ARABIAN OIL3220550.15175.59NaN
MICROSOFT4336000.0029.1675.0
AMAZON.COM6567669.0040.1450.0
FACEBOOK CLASS A2291182.0025.6975.0
TENCENT HOLDINGS2681758.8015.81NaN
ALPHABET A50294471.0032.2250.0
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
NESTLE 'N'12906002.1158.5235.0
NVIDIANaN7.1025.0
SAMSUNG ELECTRONICSNaN69.0670.0
JP MORGAN CHASE & CO.4460719.006.6950.0
\n", "
" ], "text/plain": [ " Board member compensation Total CO2 To Revenues \\\n", "NAME \n", "APPLE 2832025.00 3.13 \n", "SAUDI ARABIAN OIL 3220550.15 175.59 \n", "MICROSOFT 4336000.00 29.16 \n", "AMAZON.COM 6567669.00 40.14 \n", "FACEBOOK CLASS A 2291182.00 25.69 \n", "TENCENT HOLDINGS 2681758.80 15.81 \n", "ALPHABET A 50294471.00 32.22 \n", "TAIWAN SEMICON.MNFG. 12584165.55 245.17 \n", "NESTLE 'N' 12906002.11 58.52 \n", "NVIDIA NaN 7.10 \n", "SAMSUNG ELECTRONICS NaN 69.06 \n", "JP MORGAN CHASE & CO. 4460719.00 6.69 \n", "\n", " Emission Reduction Target % \n", "NAME \n", "APPLE NaN \n", "SAUDI ARABIAN OIL NaN \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "FACEBOOK CLASS A 75.0 \n", "TENCENT HOLDINGS NaN \n", "ALPHABET A 50.0 \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "NESTLE 'N' 35.0 \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "JP MORGAN CHASE & CO. 50.0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.dropna(thresh=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use **fillna** to fill NaN values with other values." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
APPLE2832025.003.130.0
SAUDI ARABIAN OIL3220550.15175.590.0
MICROSOFT4336000.0029.1675.0
AMAZON.COM6567669.0040.1450.0
ALIBABA GROUP HOLDING ADR 1:80.000.000.0
FACEBOOK CLASS A2291182.0025.6975.0
TENCENT HOLDINGS2681758.8015.810.0
ALPHABET A50294471.0032.2250.0
TESLA17315009.000.000.0
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
NESTLE 'N'12906002.1158.5235.0
NVIDIA0.007.1025.0
SAMSUNG ELECTRONICS0.0069.0670.0
JP MORGAN CHASE & CO.4460719.006.6950.0
KWEICHOW MOUTAI 'A'40219.200.000.0
NETFLIX3665870.000.000.0
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "APPLE 2832025.00 \n", "SAUDI ARABIAN OIL 3220550.15 \n", "MICROSOFT 4336000.00 \n", "AMAZON.COM 6567669.00 \n", "ALIBABA GROUP HOLDING ADR 1:8 0.00 \n", "FACEBOOK CLASS A 2291182.00 \n", "TENCENT HOLDINGS 2681758.80 \n", "ALPHABET A 50294471.00 \n", "TESLA 17315009.00 \n", "TAIWAN SEMICON.MNFG. 12584165.55 \n", "NESTLE 'N' 12906002.11 \n", "NVIDIA 0.00 \n", "SAMSUNG ELECTRONICS 0.00 \n", "JP MORGAN CHASE & CO. 4460719.00 \n", "KWEICHOW MOUTAI 'A' 40219.20 \n", "NETFLIX 3665870.00 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "APPLE 3.13 \n", "SAUDI ARABIAN OIL 175.59 \n", "MICROSOFT 29.16 \n", "AMAZON.COM 40.14 \n", "ALIBABA GROUP HOLDING ADR 1:8 0.00 \n", "FACEBOOK CLASS A 25.69 \n", "TENCENT HOLDINGS 15.81 \n", "ALPHABET A 32.22 \n", "TESLA 0.00 \n", "TAIWAN SEMICON.MNFG. 245.17 \n", "NESTLE 'N' 58.52 \n", "NVIDIA 7.10 \n", "SAMSUNG ELECTRONICS 69.06 \n", "JP MORGAN CHASE & CO. 6.69 \n", "KWEICHOW MOUTAI 'A' 0.00 \n", "NETFLIX 0.00 \n", "\n", " Emission Reduction Target % \n", "NAME \n", "APPLE 0.0 \n", "SAUDI ARABIAN OIL 0.0 \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "ALIBABA GROUP HOLDING ADR 1:8 0.0 \n", "FACEBOOK CLASS A 75.0 \n", "TENCENT HOLDINGS 0.0 \n", "ALPHABET A 50.0 \n", "TESLA 0.0 \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "NESTLE 'N' 35.0 \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "KWEICHOW MOUTAI 'A' 0.0 \n", "NETFLIX 0.0 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.fillna(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a dict, you can define different fill values for different columns. A very common choice is to use the mean value of each column." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
APPLE2832025.003.130.0
SAUDI ARABIAN OIL3220550.15175.590.0
MICROSOFT4336000.0029.1675.0
AMAZON.COM6567669.0040.1450.0
ALIBABA GROUP HOLDING ADR 1:81000000.0010.000.0
FACEBOOK CLASS A2291182.0025.6975.0
TENCENT HOLDINGS2681758.8015.810.0
ALPHABET A50294471.0032.2250.0
TESLA17315009.0010.000.0
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
NESTLE 'N'12906002.1158.5235.0
NVIDIA1000000.007.1025.0
SAMSUNG ELECTRONICS1000000.0069.0670.0
JP MORGAN CHASE & CO.4460719.006.6950.0
KWEICHOW MOUTAI 'A'40219.2010.000.0
NETFLIX3665870.0010.000.0
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "APPLE 2832025.00 \n", "SAUDI ARABIAN OIL 3220550.15 \n", "MICROSOFT 4336000.00 \n", "AMAZON.COM 6567669.00 \n", "ALIBABA GROUP HOLDING ADR 1:8 1000000.00 \n", "FACEBOOK CLASS A 2291182.00 \n", "TENCENT HOLDINGS 2681758.80 \n", "ALPHABET A 50294471.00 \n", "TESLA 17315009.00 \n", "TAIWAN SEMICON.MNFG. 12584165.55 \n", "NESTLE 'N' 12906002.11 \n", "NVIDIA 1000000.00 \n", "SAMSUNG ELECTRONICS 1000000.00 \n", "JP MORGAN CHASE & CO. 4460719.00 \n", "KWEICHOW MOUTAI 'A' 40219.20 \n", "NETFLIX 3665870.00 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "APPLE 3.13 \n", "SAUDI ARABIAN OIL 175.59 \n", "MICROSOFT 29.16 \n", "AMAZON.COM 40.14 \n", "ALIBABA GROUP HOLDING ADR 1:8 10.00 \n", "FACEBOOK CLASS A 25.69 \n", "TENCENT HOLDINGS 15.81 \n", "ALPHABET A 32.22 \n", "TESLA 10.00 \n", "TAIWAN SEMICON.MNFG. 245.17 \n", "NESTLE 'N' 58.52 \n", "NVIDIA 7.10 \n", "SAMSUNG ELECTRONICS 69.06 \n", "JP MORGAN CHASE & CO. 6.69 \n", "KWEICHOW MOUTAI 'A' 10.00 \n", "NETFLIX 10.00 \n", "\n", " Emission Reduction Target % \n", "NAME \n", "APPLE 0.0 \n", "SAUDI ARABIAN OIL 0.0 \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "ALIBABA GROUP HOLDING ADR 1:8 0.0 \n", "FACEBOOK CLASS A 75.0 \n", "TENCENT HOLDINGS 0.0 \n", "ALPHABET A 50.0 \n", "TESLA 0.0 \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "NESTLE 'N' 35.0 \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "KWEICHOW MOUTAI 'A' 0.0 \n", "NETFLIX 0.0 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.fillna({'Board member compensation':1000000, 'Total CO2 To Revenues':10,\n", " 'Emission Reduction Target %':0})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interpolation methods that were available for reindexing can be used with fillna." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
APPLE2832025.003.1375.0
SAUDI ARABIAN OIL3220550.15175.5975.0
MICROSOFT4336000.0029.1675.0
AMAZON.COM6567669.0040.1450.0
ALIBABA GROUP HOLDING ADR 1:82291182.0025.6975.0
FACEBOOK CLASS A2291182.0025.6975.0
TENCENT HOLDINGS2681758.8015.8150.0
ALPHABET A50294471.0032.2250.0
TESLA17315009.00245.1718.0
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
NESTLE 'N'12906002.1158.5235.0
NVIDIA4460719.007.1025.0
SAMSUNG ELECTRONICS4460719.0069.0670.0
JP MORGAN CHASE & CO.4460719.006.6950.0
KWEICHOW MOUTAI 'A'40219.20NaNNaN
NETFLIX3665870.00NaNNaN
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "APPLE 2832025.00 \n", "SAUDI ARABIAN OIL 3220550.15 \n", "MICROSOFT 4336000.00 \n", "AMAZON.COM 6567669.00 \n", "ALIBABA GROUP HOLDING ADR 1:8 2291182.00 \n", "FACEBOOK CLASS A 2291182.00 \n", "TENCENT HOLDINGS 2681758.80 \n", "ALPHABET A 50294471.00 \n", "TESLA 17315009.00 \n", "TAIWAN SEMICON.MNFG. 12584165.55 \n", "NESTLE 'N' 12906002.11 \n", "NVIDIA 4460719.00 \n", "SAMSUNG ELECTRONICS 4460719.00 \n", "JP MORGAN CHASE & CO. 4460719.00 \n", "KWEICHOW MOUTAI 'A' 40219.20 \n", "NETFLIX 3665870.00 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "APPLE 3.13 \n", "SAUDI ARABIAN OIL 175.59 \n", "MICROSOFT 29.16 \n", "AMAZON.COM 40.14 \n", "ALIBABA GROUP HOLDING ADR 1:8 25.69 \n", "FACEBOOK CLASS A 25.69 \n", "TENCENT HOLDINGS 15.81 \n", "ALPHABET A 32.22 \n", "TESLA 245.17 \n", "TAIWAN SEMICON.MNFG. 245.17 \n", "NESTLE 'N' 58.52 \n", "NVIDIA 7.10 \n", "SAMSUNG ELECTRONICS 69.06 \n", "JP MORGAN CHASE & CO. 6.69 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NETFLIX NaN \n", "\n", " Emission Reduction Target % \n", "NAME \n", "APPLE 75.0 \n", "SAUDI ARABIAN OIL 75.0 \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "ALIBABA GROUP HOLDING ADR 1:8 75.0 \n", "FACEBOOK CLASS A 75.0 \n", "TENCENT HOLDINGS 50.0 \n", "ALPHABET A 50.0 \n", "TESLA 18.0 \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "NESTLE 'N' 35.0 \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NETFLIX NaN " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.fillna(method = 'bfill')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**duplicated()** can be used to drop duplicated rows. It drops only duplicates that are next to each other" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More about **append()** later. Here we just build a new dataframe with companies_df in it twice. And then we sort the index so that every company is twice in the new dataframe." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "new_df = companies_df.append(companies_df)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "new_df.sort_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
ALIBABA GROUP HOLDING ADR 1:8NaNNaNNaN
ALIBABA GROUP HOLDING ADR 1:8NaNNaNNaN
ALPHABET A50294471.0032.2250.0
ALPHABET A50294471.0032.2250.0
AMAZON.COM6567669.0040.1450.0
AMAZON.COM6567669.0040.1450.0
APPLE2832025.003.13NaN
APPLE2832025.003.13NaN
FACEBOOK CLASS A2291182.0025.6975.0
FACEBOOK CLASS A2291182.0025.6975.0
JP MORGAN CHASE & CO.4460719.006.6950.0
JP MORGAN CHASE & CO.4460719.006.6950.0
KWEICHOW MOUTAI 'A'40219.20NaNNaN
KWEICHOW MOUTAI 'A'40219.20NaNNaN
MICROSOFT4336000.0029.1675.0
MICROSOFT4336000.0029.1675.0
NESTLE 'N'12906002.1158.5235.0
NESTLE 'N'12906002.1158.5235.0
NETFLIX3665870.00NaNNaN
NETFLIX3665870.00NaNNaN
NVIDIANaN7.1025.0
NVIDIANaN7.1025.0
SAMSUNG ELECTRONICSNaN69.0670.0
SAMSUNG ELECTRONICSNaN69.0670.0
SAUDI ARABIAN OIL3220550.15175.59NaN
SAUDI ARABIAN OIL3220550.15175.59NaN
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
TENCENT HOLDINGS2681758.8015.81NaN
TENCENT HOLDINGS2681758.8015.81NaN
TESLA17315009.00NaNNaN
TESLA17315009.00NaNNaN
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "ALPHABET A 50294471.00 \n", "ALPHABET A 50294471.00 \n", "AMAZON.COM 6567669.00 \n", "AMAZON.COM 6567669.00 \n", "APPLE 2832025.00 \n", "APPLE 2832025.00 \n", "FACEBOOK CLASS A 2291182.00 \n", "FACEBOOK CLASS A 2291182.00 \n", "JP MORGAN CHASE & CO. 4460719.00 \n", "JP MORGAN CHASE & CO. 4460719.00 \n", "KWEICHOW MOUTAI 'A' 40219.20 \n", "KWEICHOW MOUTAI 'A' 40219.20 \n", "MICROSOFT 4336000.00 \n", "MICROSOFT 4336000.00 \n", "NESTLE 'N' 12906002.11 \n", "NESTLE 'N' 12906002.11 \n", "NETFLIX 3665870.00 \n", "NETFLIX 3665870.00 \n", "NVIDIA NaN \n", "NVIDIA NaN \n", "SAMSUNG ELECTRONICS NaN \n", "SAMSUNG ELECTRONICS NaN \n", "SAUDI ARABIAN OIL 3220550.15 \n", "SAUDI ARABIAN OIL 3220550.15 \n", "TAIWAN SEMICON.MNFG. 12584165.55 \n", "TAIWAN SEMICON.MNFG. 12584165.55 \n", "TENCENT HOLDINGS 2681758.80 \n", "TENCENT HOLDINGS 2681758.80 \n", "TESLA 17315009.00 \n", "TESLA 17315009.00 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "ALPHABET A 32.22 \n", "ALPHABET A 32.22 \n", "AMAZON.COM 40.14 \n", "AMAZON.COM 40.14 \n", "APPLE 3.13 \n", "APPLE 3.13 \n", "FACEBOOK CLASS A 25.69 \n", "FACEBOOK CLASS A 25.69 \n", "JP MORGAN CHASE & CO. 6.69 \n", "JP MORGAN CHASE & CO. 6.69 \n", "KWEICHOW MOUTAI 'A' NaN \n", "KWEICHOW MOUTAI 'A' NaN \n", "MICROSOFT 29.16 \n", "MICROSOFT 29.16 \n", "NESTLE 'N' 58.52 \n", "NESTLE 'N' 58.52 \n", "NETFLIX NaN \n", "NETFLIX NaN \n", "NVIDIA 7.10 \n", "NVIDIA 7.10 \n", "SAMSUNG ELECTRONICS 69.06 \n", "SAMSUNG ELECTRONICS 69.06 \n", "SAUDI ARABIAN OIL 175.59 \n", "SAUDI ARABIAN OIL 175.59 \n", "TAIWAN SEMICON.MNFG. 245.17 \n", "TAIWAN SEMICON.MNFG. 245.17 \n", "TENCENT HOLDINGS 15.81 \n", "TENCENT HOLDINGS 15.81 \n", "TESLA NaN \n", "TESLA NaN \n", "\n", " Emission Reduction Target % \n", "NAME \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "ALPHABET A 50.0 \n", "ALPHABET A 50.0 \n", "AMAZON.COM 50.0 \n", "AMAZON.COM 50.0 \n", "APPLE NaN \n", "APPLE NaN \n", "FACEBOOK CLASS A 75.0 \n", "FACEBOOK CLASS A 75.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "KWEICHOW MOUTAI 'A' NaN \n", "KWEICHOW MOUTAI 'A' NaN \n", "MICROSOFT 75.0 \n", "MICROSOFT 75.0 \n", "NESTLE 'N' 35.0 \n", "NESTLE 'N' 35.0 \n", "NETFLIX NaN \n", "NETFLIX NaN \n", "NVIDIA 25.0 \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "SAUDI ARABIAN OIL NaN \n", "SAUDI ARABIAN OIL NaN \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "TENCENT HOLDINGS NaN \n", "TENCENT HOLDINGS NaN \n", "TESLA NaN \n", "TESLA NaN " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NAME\n", "ALIBABA GROUP HOLDING ADR 1:8 False\n", "ALIBABA GROUP HOLDING ADR 1:8 True\n", "ALPHABET A False\n", "ALPHABET A True\n", "AMAZON.COM False\n", "AMAZON.COM True\n", "APPLE False\n", "APPLE True\n", "FACEBOOK CLASS A False\n", "FACEBOOK CLASS A True\n", "JP MORGAN CHASE & CO. False\n", "JP MORGAN CHASE & CO. True\n", "KWEICHOW MOUTAI 'A' False\n", "KWEICHOW MOUTAI 'A' True\n", "MICROSOFT False\n", "MICROSOFT True\n", "NESTLE 'N' False\n", "NESTLE 'N' True\n", "NETFLIX False\n", "NETFLIX True\n", "NVIDIA False\n", "NVIDIA True\n", "SAMSUNG ELECTRONICS False\n", "SAMSUNG ELECTRONICS True\n", "SAUDI ARABIAN OIL False\n", "SAUDI ARABIAN OIL True\n", "TAIWAN SEMICON.MNFG. False\n", "TAIWAN SEMICON.MNFG. True\n", "TENCENT HOLDINGS False\n", "TENCENT HOLDINGS True\n", "TESLA False\n", "TESLA True\n", "dtype: bool" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df.duplicated()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can remove duplicated rows using **drop_duplicated()**." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
ALIBABA GROUP HOLDING ADR 1:8NaNNaNNaN
ALPHABET A50294471.0032.2250.0
AMAZON.COM6567669.0040.1450.0
APPLE2832025.003.13NaN
FACEBOOK CLASS A2291182.0025.6975.0
JP MORGAN CHASE & CO.4460719.006.6950.0
KWEICHOW MOUTAI 'A'40219.20NaNNaN
MICROSOFT4336000.0029.1675.0
NESTLE 'N'12906002.1158.5235.0
NETFLIX3665870.00NaNNaN
NVIDIANaN7.1025.0
SAMSUNG ELECTRONICSNaN69.0670.0
SAUDI ARABIAN OIL3220550.15175.59NaN
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
TENCENT HOLDINGS2681758.8015.81NaN
TESLA17315009.00NaNNaN
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "ALPHABET A 50294471.00 \n", "AMAZON.COM 6567669.00 \n", "APPLE 2832025.00 \n", "FACEBOOK CLASS A 2291182.00 \n", "JP MORGAN CHASE & CO. 4460719.00 \n", "KWEICHOW MOUTAI 'A' 40219.20 \n", "MICROSOFT 4336000.00 \n", "NESTLE 'N' 12906002.11 \n", "NETFLIX 3665870.00 \n", "NVIDIA NaN \n", "SAMSUNG ELECTRONICS NaN \n", "SAUDI ARABIAN OIL 3220550.15 \n", "TAIWAN SEMICON.MNFG. 12584165.55 \n", "TENCENT HOLDINGS 2681758.80 \n", "TESLA 17315009.00 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "ALPHABET A 32.22 \n", "AMAZON.COM 40.14 \n", "APPLE 3.13 \n", "FACEBOOK CLASS A 25.69 \n", "JP MORGAN CHASE & CO. 6.69 \n", "KWEICHOW MOUTAI 'A' NaN \n", "MICROSOFT 29.16 \n", "NESTLE 'N' 58.52 \n", "NETFLIX NaN \n", "NVIDIA 7.10 \n", "SAMSUNG ELECTRONICS 69.06 \n", "SAUDI ARABIAN OIL 175.59 \n", "TAIWAN SEMICON.MNFG. 245.17 \n", "TENCENT HOLDINGS 15.81 \n", "TESLA NaN \n", "\n", " Emission Reduction Target % \n", "NAME \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "ALPHABET A 50.0 \n", "AMAZON.COM 50.0 \n", "APPLE NaN \n", "FACEBOOK CLASS A 75.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "KWEICHOW MOUTAI 'A' NaN \n", "MICROSOFT 75.0 \n", "NESTLE 'N' 35.0 \n", "NETFLIX NaN \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "SAUDI ARABIAN OIL NaN \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "TENCENT HOLDINGS NaN \n", "TESLA NaN " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is easy to apply function transformation to rows, columns or individual cells of a dataframe. **map** -metohd can be used for that." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "companies_df['Compensation ($ millions)'] = companies_df['Board member compensation'].map(\n", " lambda x: \"{:.2f}\".format(x/1000000)+\" M$\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %Compensation ($ millions)
NAME
APPLE2832025.003.13NaN2.83 M$
SAUDI ARABIAN OIL3220550.15175.59NaN3.22 M$
MICROSOFT4336000.0029.1675.04.34 M$
AMAZON.COM6567669.0040.1450.06.57 M$
ALIBABA GROUP HOLDING ADR 1:8NaNNaNNaNnan M$
FACEBOOK CLASS A2291182.0025.6975.02.29 M$
TENCENT HOLDINGS2681758.8015.81NaN2.68 M$
ALPHABET A50294471.0032.2250.050.29 M$
TESLA17315009.00NaNNaN17.32 M$
TAIWAN SEMICON.MNFG.12584165.55245.1718.012.58 M$
NESTLE 'N'12906002.1158.5235.012.91 M$
NVIDIANaN7.1025.0nan M$
SAMSUNG ELECTRONICSNaN69.0670.0nan M$
JP MORGAN CHASE & CO.4460719.006.6950.04.46 M$
KWEICHOW MOUTAI 'A'40219.20NaNNaN0.04 M$
NETFLIX3665870.00NaNNaN3.67 M$
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "APPLE 2832025.00 \n", "SAUDI ARABIAN OIL 3220550.15 \n", "MICROSOFT 4336000.00 \n", "AMAZON.COM 6567669.00 \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "FACEBOOK CLASS A 2291182.00 \n", "TENCENT HOLDINGS 2681758.80 \n", "ALPHABET A 50294471.00 \n", "TESLA 17315009.00 \n", "TAIWAN SEMICON.MNFG. 12584165.55 \n", "NESTLE 'N' 12906002.11 \n", "NVIDIA NaN \n", "SAMSUNG ELECTRONICS NaN \n", "JP MORGAN CHASE & CO. 4460719.00 \n", "KWEICHOW MOUTAI 'A' 40219.20 \n", "NETFLIX 3665870.00 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "APPLE 3.13 \n", "SAUDI ARABIAN OIL 175.59 \n", "MICROSOFT 29.16 \n", "AMAZON.COM 40.14 \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "FACEBOOK CLASS A 25.69 \n", "TENCENT HOLDINGS 15.81 \n", "ALPHABET A 32.22 \n", "TESLA NaN \n", "TAIWAN SEMICON.MNFG. 245.17 \n", "NESTLE 'N' 58.52 \n", "NVIDIA 7.10 \n", "SAMSUNG ELECTRONICS 69.06 \n", "JP MORGAN CHASE & CO. 6.69 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NETFLIX NaN \n", "\n", " Emission Reduction Target % \\\n", "NAME \n", "APPLE NaN \n", "SAUDI ARABIAN OIL NaN \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "FACEBOOK CLASS A 75.0 \n", "TENCENT HOLDINGS NaN \n", "ALPHABET A 50.0 \n", "TESLA NaN \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "NESTLE 'N' 35.0 \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NETFLIX NaN \n", "\n", " Compensation ($ millions) \n", "NAME \n", "APPLE 2.83 M$ \n", "SAUDI ARABIAN OIL 3.22 M$ \n", "MICROSOFT 4.34 M$ \n", "AMAZON.COM 6.57 M$ \n", "ALIBABA GROUP HOLDING ADR 1:8 nan M$ \n", "FACEBOOK CLASS A 2.29 M$ \n", "TENCENT HOLDINGS 2.68 M$ \n", "ALPHABET A 50.29 M$ \n", "TESLA 17.32 M$ \n", "TAIWAN SEMICON.MNFG. 12.58 M$ \n", "NESTLE 'N' 12.91 M$ \n", "NVIDIA nan M$ \n", "SAMSUNG ELECTRONICS nan M$ \n", "JP MORGAN CHASE & CO. 4.46 M$ \n", "KWEICHOW MOUTAI 'A' 0.04 M$ \n", "NETFLIX 3.67 M$ " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Map()** can also be used to transform index. The following command turns the company names backwards." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['ELPPA', 'LIO NAIBARA IDUAS', 'TFOSORCIM', 'MOC.NOZAMA',\n", " '8:1 RDA GNIDLOH PUORG ABABILA', 'A SSALC KOOBECAF', 'SGNIDLOH TNECNET',\n", " 'A TEBAHPLA', 'ALSET', '.GFNM.NOCIMES NAWIAT', ''N' ELTSEN', 'AIDIVN',\n", " 'SCINORTCELE GNUSMAS', '.OC & ESAHC NAGROM PJ', ''A' IATUOM WOHCIEWK',\n", " 'XILFTEN'],\n", " dtype='object', name='NAME')" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.index.map(lambda x : x[::-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, you can also use **rename()**. Using a dictionary, **rename()** can also be used to change only some of the labels." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %Compensation ($ millions)
NAME
Apple2832025.003.13NaN2.83 M$
Saudi Arabian Oil3220550.15175.59NaN3.22 M$
Microsoft4336000.0029.1675.04.34 M$
Amazon.Com6567669.0040.1450.06.57 M$
Alibaba Group Holding Adr 1:8NaNNaNNaNnan M$
Facebook Class A2291182.0025.6975.02.29 M$
Tencent Holdings2681758.8015.81NaN2.68 M$
Alphabet A50294471.0032.2250.050.29 M$
Tesla17315009.00NaNNaN17.32 M$
Taiwan Semicon.Mnfg.12584165.55245.1718.012.58 M$
Nestle 'N'12906002.1158.5235.012.91 M$
NvidiaNaN7.1025.0nan M$
Samsung ElectronicsNaN69.0670.0nan M$
Jp Morgan Chase & Co.4460719.006.6950.04.46 M$
Kweichow Moutai 'A'40219.20NaNNaN0.04 M$
Netflix3665870.00NaNNaN3.67 M$
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "Apple 2832025.00 \n", "Saudi Arabian Oil 3220550.15 \n", "Microsoft 4336000.00 \n", "Amazon.Com 6567669.00 \n", "Alibaba Group Holding Adr 1:8 NaN \n", "Facebook Class A 2291182.00 \n", "Tencent Holdings 2681758.80 \n", "Alphabet A 50294471.00 \n", "Tesla 17315009.00 \n", "Taiwan Semicon.Mnfg. 12584165.55 \n", "Nestle 'N' 12906002.11 \n", "Nvidia NaN \n", "Samsung Electronics NaN \n", "Jp Morgan Chase & Co. 4460719.00 \n", "Kweichow Moutai 'A' 40219.20 \n", "Netflix 3665870.00 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "Apple 3.13 \n", "Saudi Arabian Oil 175.59 \n", "Microsoft 29.16 \n", "Amazon.Com 40.14 \n", "Alibaba Group Holding Adr 1:8 NaN \n", "Facebook Class A 25.69 \n", "Tencent Holdings 15.81 \n", "Alphabet A 32.22 \n", "Tesla NaN \n", "Taiwan Semicon.Mnfg. 245.17 \n", "Nestle 'N' 58.52 \n", "Nvidia 7.10 \n", "Samsung Electronics 69.06 \n", "Jp Morgan Chase & Co. 6.69 \n", "Kweichow Moutai 'A' NaN \n", "Netflix NaN \n", "\n", " Emission Reduction Target % \\\n", "NAME \n", "Apple NaN \n", "Saudi Arabian Oil NaN \n", "Microsoft 75.0 \n", "Amazon.Com 50.0 \n", "Alibaba Group Holding Adr 1:8 NaN \n", "Facebook Class A 75.0 \n", "Tencent Holdings NaN \n", "Alphabet A 50.0 \n", "Tesla NaN \n", "Taiwan Semicon.Mnfg. 18.0 \n", "Nestle 'N' 35.0 \n", "Nvidia 25.0 \n", "Samsung Electronics 70.0 \n", "Jp Morgan Chase & Co. 50.0 \n", "Kweichow Moutai 'A' NaN \n", "Netflix NaN \n", "\n", " Compensation ($ millions) \n", "NAME \n", "Apple 2.83 M$ \n", "Saudi Arabian Oil 3.22 M$ \n", "Microsoft 4.34 M$ \n", "Amazon.Com 6.57 M$ \n", "Alibaba Group Holding Adr 1:8 nan M$ \n", "Facebook Class A 2.29 M$ \n", "Tencent Holdings 2.68 M$ \n", "Alphabet A 50.29 M$ \n", "Tesla 17.32 M$ \n", "Taiwan Semicon.Mnfg. 12.58 M$ \n", "Nestle 'N' 12.91 M$ \n", "Nvidia nan M$ \n", "Samsung Electronics nan M$ \n", "Jp Morgan Chase & Co. 4.46 M$ \n", "Kweichow Moutai 'A' 0.04 M$ \n", "Netflix 3.67 M$ " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.rename(index=str.title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Replace()** can be used to replace any values, not just NaN values. You can pass also dict/list, if you want to replace multiple values." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %Compensation ($ millions)
NAME
APPLE2832025.003.13-999.02.83 M$
SAUDI ARABIAN OIL3220550.15175.59-999.03.22 M$
MICROSOFT4336000.0029.1675.04.34 M$
AMAZON.COM6567669.0040.1450.06.57 M$
ALIBABA GROUP HOLDING ADR 1:8-999.00-999.00-999.0nan M$
FACEBOOK CLASS A2291182.0025.6975.02.29 M$
TENCENT HOLDINGS2681758.8015.81-999.02.68 M$
ALPHABET A50294471.0032.2250.050.29 M$
TESLA17315009.00-999.00-999.017.32 M$
TAIWAN SEMICON.MNFG.12584165.55245.1718.012.58 M$
NESTLE 'N'12906002.1158.5235.012.91 M$
NVIDIA-999.007.1025.0nan M$
SAMSUNG ELECTRONICS-999.0069.0670.0nan M$
JP MORGAN CHASE & CO.4460719.006.6950.04.46 M$
KWEICHOW MOUTAI 'A'40219.20-999.00-999.00.04 M$
NETFLIX3665870.00-999.00-999.03.67 M$
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "APPLE 2832025.00 \n", "SAUDI ARABIAN OIL 3220550.15 \n", "MICROSOFT 4336000.00 \n", "AMAZON.COM 6567669.00 \n", "ALIBABA GROUP HOLDING ADR 1:8 -999.00 \n", "FACEBOOK CLASS A 2291182.00 \n", "TENCENT HOLDINGS 2681758.80 \n", "ALPHABET A 50294471.00 \n", "TESLA 17315009.00 \n", "TAIWAN SEMICON.MNFG. 12584165.55 \n", "NESTLE 'N' 12906002.11 \n", "NVIDIA -999.00 \n", "SAMSUNG ELECTRONICS -999.00 \n", "JP MORGAN CHASE & CO. 4460719.00 \n", "KWEICHOW MOUTAI 'A' 40219.20 \n", "NETFLIX 3665870.00 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "APPLE 3.13 \n", "SAUDI ARABIAN OIL 175.59 \n", "MICROSOFT 29.16 \n", "AMAZON.COM 40.14 \n", "ALIBABA GROUP HOLDING ADR 1:8 -999.00 \n", "FACEBOOK CLASS A 25.69 \n", "TENCENT HOLDINGS 15.81 \n", "ALPHABET A 32.22 \n", "TESLA -999.00 \n", "TAIWAN SEMICON.MNFG. 245.17 \n", "NESTLE 'N' 58.52 \n", "NVIDIA 7.10 \n", "SAMSUNG ELECTRONICS 69.06 \n", "JP MORGAN CHASE & CO. 6.69 \n", "KWEICHOW MOUTAI 'A' -999.00 \n", "NETFLIX -999.00 \n", "\n", " Emission Reduction Target % \\\n", "NAME \n", "APPLE -999.0 \n", "SAUDI ARABIAN OIL -999.0 \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "ALIBABA GROUP HOLDING ADR 1:8 -999.0 \n", "FACEBOOK CLASS A 75.0 \n", "TENCENT HOLDINGS -999.0 \n", "ALPHABET A 50.0 \n", "TESLA -999.0 \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "NESTLE 'N' 35.0 \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "KWEICHOW MOUTAI 'A' -999.0 \n", "NETFLIX -999.0 \n", "\n", " Compensation ($ millions) \n", "NAME \n", "APPLE 2.83 M$ \n", "SAUDI ARABIAN OIL 3.22 M$ \n", "MICROSOFT 4.34 M$ \n", "AMAZON.COM 6.57 M$ \n", "ALIBABA GROUP HOLDING ADR 1:8 nan M$ \n", "FACEBOOK CLASS A 2.29 M$ \n", "TENCENT HOLDINGS 2.68 M$ \n", "ALPHABET A 50.29 M$ \n", "TESLA 17.32 M$ \n", "TAIWAN SEMICON.MNFG. 12.58 M$ \n", "NESTLE 'N' 12.91 M$ \n", "NVIDIA nan M$ \n", "SAMSUNG ELECTRONICS nan M$ \n", "JP MORGAN CHASE & CO. 4.46 M$ \n", "KWEICHOW MOUTAI 'A' 0.04 M$ \n", "NETFLIX 3.67 M$ " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.replace(np.nan,-999) # Pandas NaN is np.nan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dividing data to **bins** is a very important feature in Pandas. You can use **pd.cut()** (notice that it is not a dataframe method) to creata a categorical object. **pd.value_counts()** can be used to calculate the number of observations in each bin. With the **labels** parameter you can define names for different bins." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "compensation_bins = [10000,100000,1000000,10000000,100000000]" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "comp_categ = pd.cut(companies_df['Board member compensation'],compensation_bins)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NAME\n", "APPLE (1000000.0, 10000000.0]\n", "SAUDI ARABIAN OIL (1000000.0, 10000000.0]\n", "MICROSOFT (1000000.0, 10000000.0]\n", "AMAZON.COM (1000000.0, 10000000.0]\n", "ALIBABA GROUP HOLDING ADR 1:8 NaN\n", "FACEBOOK CLASS A (1000000.0, 10000000.0]\n", "TENCENT HOLDINGS (1000000.0, 10000000.0]\n", "ALPHABET A (10000000.0, 100000000.0]\n", "TESLA (10000000.0, 100000000.0]\n", "TAIWAN SEMICON.MNFG. (10000000.0, 100000000.0]\n", "NESTLE 'N' (10000000.0, 100000000.0]\n", "NVIDIA NaN\n", "SAMSUNG ELECTRONICS NaN\n", "JP MORGAN CHASE & CO. (1000000.0, 10000000.0]\n", "KWEICHOW MOUTAI 'A' (10000.0, 100000.0]\n", "NETFLIX (1000000.0, 10000000.0]\n", "Name: Board member compensation, dtype: category\n", "Categories (4, interval[int64]): [(10000, 100000] < (100000, 1000000] < (1000000, 10000000] < (10000000, 100000000]]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_categ" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "IntervalIndex([(10000, 100000], (100000, 1000000], (1000000, 10000000], (10000000, 100000000]],\n", " closed='right',\n", " dtype='interval[int64]')" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_categ.values.categories" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 2, 2, 2, 2, -1, 2, 2, 3, 3, 3, 3, -1, -1, 2, 0, 2],\n", " dtype=int8)" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_categ.values.codes" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1000000, 10000000] 8\n", "(10000000, 100000000] 4\n", "(10000, 100000] 1\n", "(100000, 1000000] 0\n", "Name: Board member compensation, dtype: int64" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.value_counts(comp_categ)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NAME\n", "APPLE Very rich board members\n", "SAUDI ARABIAN OIL Very rich board members\n", "MICROSOFT Very rich board members\n", "AMAZON.COM Very rich board members\n", "ALIBABA GROUP HOLDING ADR 1:8 NaN\n", "FACEBOOK CLASS A Very rich board members\n", "TENCENT HOLDINGS Very rich board members\n", "ALPHABET A Insanely rich board members\n", "TESLA Insanely rich board members\n", "TAIWAN SEMICON.MNFG. Insanely rich board members\n", "NESTLE 'N' Insanely rich board members\n", "NVIDIA NaN\n", "SAMSUNG ELECTRONICS NaN\n", "JP MORGAN CHASE & CO. Very rich board members\n", "KWEICHOW MOUTAI 'A' Poor board members\n", "NETFLIX Very rich board members\n", "Name: Board member compensation, dtype: category\n", "Categories (4, object): ['Poor board members' < 'Rich board members' < 'Very rich board members' < 'Insanely rich board members']" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.cut(companies_df['Board member compensation'],\n", " compensation_bins,labels = ['Poor board members','Rich board members',\n", " 'Very rich board members','Insanely rich board members'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you pass a number to the **bins** parameter, it will return that many equal-length bins." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NAME\n", "APPLE (2.9, 63.6]\n", "SAUDI ARABIAN OIL (124.2, 184.7]\n", "MICROSOFT (2.9, 63.6]\n", "AMAZON.COM (2.9, 63.6]\n", "ALIBABA GROUP HOLDING ADR 1:8 NaN\n", "FACEBOOK CLASS A (2.9, 63.6]\n", "TENCENT HOLDINGS (2.9, 63.6]\n", "ALPHABET A (2.9, 63.6]\n", "TESLA NaN\n", "TAIWAN SEMICON.MNFG. (184.7, 245.2]\n", "NESTLE 'N' (2.9, 63.6]\n", "NVIDIA (2.9, 63.6]\n", "SAMSUNG ELECTRONICS (63.6, 124.2]\n", "JP MORGAN CHASE & CO. (2.9, 63.6]\n", "KWEICHOW MOUTAI 'A' NaN\n", "NETFLIX NaN\n", "Name: Total CO2 To Revenues, dtype: category\n", "Categories (4, interval[float64]): [(2.9, 63.6] < (63.6, 124.2] < (124.2, 184.7] < (184.7, 245.2]]" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.cut(companies_df['Total CO2 To Revenues'],4,precision=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For many purposes, **qcut()** is more useful as it bins the data based on sample quantiles. Therefore, every bins has approximately the same number of observations. You can also pass specific quantiles as a list to the function." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NAME\n", "APPLE (3.129, 13.632]\n", "SAUDI ARABIAN OIL (61.155, 245.17]\n", "MICROSOFT (13.632, 30.69]\n", "AMAZON.COM (30.69, 61.155]\n", "ALIBABA GROUP HOLDING ADR 1:8 NaN\n", "FACEBOOK CLASS A (13.632, 30.69]\n", "TENCENT HOLDINGS (13.632, 30.69]\n", "ALPHABET A (30.69, 61.155]\n", "TESLA NaN\n", "TAIWAN SEMICON.MNFG. (61.155, 245.17]\n", "NESTLE 'N' (30.69, 61.155]\n", "NVIDIA (3.129, 13.632]\n", "SAMSUNG ELECTRONICS (61.155, 245.17]\n", "JP MORGAN CHASE & CO. (3.129, 13.632]\n", "KWEICHOW MOUTAI 'A' NaN\n", "NETFLIX NaN\n", "Name: Total CO2 To Revenues, dtype: category\n", "Categories (4, interval[float64]): [(3.129, 13.632] < (13.632, 30.69] < (30.69, 61.155] < (61.155, 245.17]]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.qcut(companies_df['Total CO2 To Revenues'],4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Detecting and filtering outliers is easy with boolean dataframes" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %Compensation ($ millions)
NAME
APPLE2832025.003.13NaN2.83 M$
SAUDI ARABIAN OIL3220550.15175.59NaN3.22 M$
MICROSOFT4336000.0029.1675.04.34 M$
AMAZON.COM6567669.0040.1450.06.57 M$
FACEBOOK CLASS A2291182.0025.6975.02.29 M$
TENCENT HOLDINGS2681758.8015.81NaN2.68 M$
JP MORGAN CHASE & CO.4460719.006.6950.04.46 M$
KWEICHOW MOUTAI 'A'40219.20NaNNaN0.04 M$
NETFLIX3665870.00NaNNaN3.67 M$
\n", "
" ], "text/plain": [ " Board member compensation Total CO2 To Revenues \\\n", "NAME \n", "APPLE 2832025.00 3.13 \n", "SAUDI ARABIAN OIL 3220550.15 175.59 \n", "MICROSOFT 4336000.00 29.16 \n", "AMAZON.COM 6567669.00 40.14 \n", "FACEBOOK CLASS A 2291182.00 25.69 \n", "TENCENT HOLDINGS 2681758.80 15.81 \n", "JP MORGAN CHASE & CO. 4460719.00 6.69 \n", "KWEICHOW MOUTAI 'A' 40219.20 NaN \n", "NETFLIX 3665870.00 NaN \n", "\n", " Emission Reduction Target % Compensation ($ millions) \n", "NAME \n", "APPLE NaN 2.83 M$ \n", "SAUDI ARABIAN OIL NaN 3.22 M$ \n", "MICROSOFT 75.0 4.34 M$ \n", "AMAZON.COM 50.0 6.57 M$ \n", "FACEBOOK CLASS A 75.0 2.29 M$ \n", "TENCENT HOLDINGS NaN 2.68 M$ \n", "JP MORGAN CHASE & CO. 50.0 4.46 M$ \n", "KWEICHOW MOUTAI 'A' NaN 0.04 M$ \n", "NETFLIX NaN 3.67 M$ " ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df[companies_df['Board member compensation'] < 10000000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filtering is also possible with the summary statistics of variables. The following command picks up cases where the values of **all** variables deviate from the mean less than two standard deviations." ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "companies_df.drop('Compensation ($ millions)',axis=1,inplace=True)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
MICROSOFT4336000.0029.1675.0
AMAZON.COM6567669.0040.1450.0
FACEBOOK CLASS A2291182.0025.6975.0
NESTLE 'N'12906002.1158.5235.0
JP MORGAN CHASE & CO.4460719.006.6950.0
\n", "
" ], "text/plain": [ " Board member compensation Total CO2 To Revenues \\\n", "NAME \n", "MICROSOFT 4336000.00 29.16 \n", "AMAZON.COM 6567669.00 40.14 \n", "FACEBOOK CLASS A 2291182.00 25.69 \n", "NESTLE 'N' 12906002.11 58.52 \n", "JP MORGAN CHASE & CO. 4460719.00 6.69 \n", "\n", " Emission Reduction Target % \n", "NAME \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "FACEBOOK CLASS A 75.0 \n", "NESTLE 'N' 35.0 \n", "JP MORGAN CHASE & CO. 50.0 " ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df[(np.abs(companies_df-companies_df.mean()) < 2*companies_df.std(axis=0)).all(1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Winsorising is another common procedure in practical econometrics. In that method, the most extreme values are moved to specific quantiles, usually 1% and 99% quantiles. It is easiest to implement with the **clip()** method. Notice how the following command winsorise all the variables." ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
APPLE2.832025e+066.731NaN
SAUDI ARABIAN OIL3.220550e+06164.937NaN
MICROSOFT4.336000e+0629.16075.0
AMAZON.COM6.567669e+0640.14050.0
ALIBABA GROUP HOLDING ADR 1:8NaNNaNNaN
FACEBOOK CLASS A2.369297e+0625.69075.0
TENCENT HOLDINGS2.681759e+0615.810NaN
ALPHABET A1.643321e+0732.22050.0
TESLA1.643321e+07NaNNaN
TAIWAN SEMICON.MNFG.1.258417e+07164.93723.6
NESTLE 'N'1.290600e+0758.52035.0
NVIDIANaN7.10025.0
SAMSUNG ELECTRONICSNaN69.06070.0
JP MORGAN CHASE & CO.4.460719e+066.73150.0
KWEICHOW MOUTAI 'A'2.369297e+06NaNNaN
NETFLIX3.665870e+06NaNNaN
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "APPLE 2.832025e+06 \n", "SAUDI ARABIAN OIL 3.220550e+06 \n", "MICROSOFT 4.336000e+06 \n", "AMAZON.COM 6.567669e+06 \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "FACEBOOK CLASS A 2.369297e+06 \n", "TENCENT HOLDINGS 2.681759e+06 \n", "ALPHABET A 1.643321e+07 \n", "TESLA 1.643321e+07 \n", "TAIWAN SEMICON.MNFG. 1.258417e+07 \n", "NESTLE 'N' 1.290600e+07 \n", "NVIDIA NaN \n", "SAMSUNG ELECTRONICS NaN \n", "JP MORGAN CHASE & CO. 4.460719e+06 \n", "KWEICHOW MOUTAI 'A' 2.369297e+06 \n", "NETFLIX 3.665870e+06 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "APPLE 6.731 \n", "SAUDI ARABIAN OIL 164.937 \n", "MICROSOFT 29.160 \n", "AMAZON.COM 40.140 \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "FACEBOOK CLASS A 25.690 \n", "TENCENT HOLDINGS 15.810 \n", "ALPHABET A 32.220 \n", "TESLA NaN \n", "TAIWAN SEMICON.MNFG. 164.937 \n", "NESTLE 'N' 58.520 \n", "NVIDIA 7.100 \n", "SAMSUNG ELECTRONICS 69.060 \n", "JP MORGAN CHASE & CO. 6.731 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NETFLIX NaN \n", "\n", " Emission Reduction Target % \n", "NAME \n", "APPLE NaN \n", "SAUDI ARABIAN OIL NaN \n", "MICROSOFT 75.0 \n", "AMAZON.COM 50.0 \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "FACEBOOK CLASS A 75.0 \n", "TENCENT HOLDINGS NaN \n", "ALPHABET A 50.0 \n", "TESLA NaN \n", "TAIWAN SEMICON.MNFG. 23.6 \n", "NESTLE 'N' 35.0 \n", "NVIDIA 25.0 \n", "SAMSUNG ELECTRONICS 70.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NETFLIX NaN " ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.clip(lower = companies_df.quantile(0.1), upper = companies_df.quantile(0.9),axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Random sampling from data is easy. You can use the **sample()** method for that. It is also possible to sample with replacement that is needed in many numerical statistical methods." ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
SAUDI ARABIAN OIL3220550.15175.59NaN
TENCENT HOLDINGS2681758.8015.81NaN
FACEBOOK CLASS A2291182.0025.6975.0
TENCENT HOLDINGS2681758.8015.81NaN
NVIDIANaN7.1025.0
NETFLIX3665870.00NaNNaN
NETFLIX3665870.00NaNNaN
MICROSOFT4336000.0029.1675.0
ALPHABET A50294471.0032.2250.0
TENCENT HOLDINGS2681758.8015.81NaN
\n", "
" ], "text/plain": [ " Board member compensation Total CO2 To Revenues \\\n", "NAME \n", "SAUDI ARABIAN OIL 3220550.15 175.59 \n", "TENCENT HOLDINGS 2681758.80 15.81 \n", "FACEBOOK CLASS A 2291182.00 25.69 \n", "TENCENT HOLDINGS 2681758.80 15.81 \n", "NVIDIA NaN 7.10 \n", "NETFLIX 3665870.00 NaN \n", "NETFLIX 3665870.00 NaN \n", "MICROSOFT 4336000.00 29.16 \n", "ALPHABET A 50294471.00 32.22 \n", "TENCENT HOLDINGS 2681758.80 15.81 \n", "\n", " Emission Reduction Target % \n", "NAME \n", "SAUDI ARABIAN OIL NaN \n", "TENCENT HOLDINGS NaN \n", "FACEBOOK CLASS A 75.0 \n", "TENCENT HOLDINGS NaN \n", "NVIDIA 25.0 \n", "NETFLIX NaN \n", "NETFLIX NaN \n", "MICROSOFT 75.0 \n", "ALPHABET A 50.0 \n", "TENCENT HOLDINGS NaN " ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.sample(n=10,replace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to randomise the order of values in a dataframe, you can use Numpy's **random.permutation()**" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "permut_index = np.random.permutation(len(companies_df))" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Board member compensationTotal CO2 To RevenuesEmission Reduction Target %
NAME
ALIBABA GROUP HOLDING ADR 1:8NaNNaNNaN
SAMSUNG ELECTRONICSNaN69.0670.0
TENCENT HOLDINGS2681758.8015.81NaN
ALPHABET A50294471.0032.2250.0
FACEBOOK CLASS A2291182.0025.6975.0
JP MORGAN CHASE & CO.4460719.006.6950.0
SAUDI ARABIAN OIL3220550.15175.59NaN
MICROSOFT4336000.0029.1675.0
TAIWAN SEMICON.MNFG.12584165.55245.1718.0
APPLE2832025.003.13NaN
NETFLIX3665870.00NaNNaN
NESTLE 'N'12906002.1158.5235.0
KWEICHOW MOUTAI 'A'40219.20NaNNaN
NVIDIANaN7.1025.0
AMAZON.COM6567669.0040.1450.0
TESLA17315009.00NaNNaN
\n", "
" ], "text/plain": [ " Board member compensation \\\n", "NAME \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "SAMSUNG ELECTRONICS NaN \n", "TENCENT HOLDINGS 2681758.80 \n", "ALPHABET A 50294471.00 \n", "FACEBOOK CLASS A 2291182.00 \n", "JP MORGAN CHASE & CO. 4460719.00 \n", "SAUDI ARABIAN OIL 3220550.15 \n", "MICROSOFT 4336000.00 \n", "TAIWAN SEMICON.MNFG. 12584165.55 \n", "APPLE 2832025.00 \n", "NETFLIX 3665870.00 \n", "NESTLE 'N' 12906002.11 \n", "KWEICHOW MOUTAI 'A' 40219.20 \n", "NVIDIA NaN \n", "AMAZON.COM 6567669.00 \n", "TESLA 17315009.00 \n", "\n", " Total CO2 To Revenues \\\n", "NAME \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "SAMSUNG ELECTRONICS 69.06 \n", "TENCENT HOLDINGS 15.81 \n", "ALPHABET A 32.22 \n", "FACEBOOK CLASS A 25.69 \n", "JP MORGAN CHASE & CO. 6.69 \n", "SAUDI ARABIAN OIL 175.59 \n", "MICROSOFT 29.16 \n", "TAIWAN SEMICON.MNFG. 245.17 \n", "APPLE 3.13 \n", "NETFLIX NaN \n", "NESTLE 'N' 58.52 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NVIDIA 7.10 \n", "AMAZON.COM 40.14 \n", "TESLA NaN \n", "\n", " Emission Reduction Target % \n", "NAME \n", "ALIBABA GROUP HOLDING ADR 1:8 NaN \n", "SAMSUNG ELECTRONICS 70.0 \n", "TENCENT HOLDINGS NaN \n", "ALPHABET A 50.0 \n", "FACEBOOK CLASS A 75.0 \n", "JP MORGAN CHASE & CO. 50.0 \n", "SAUDI ARABIAN OIL NaN \n", "MICROSOFT 75.0 \n", "TAIWAN SEMICON.MNFG. 18.0 \n", "APPLE NaN \n", "NETFLIX NaN \n", "NESTLE 'N' 35.0 \n", "KWEICHOW MOUTAI 'A' NaN \n", "NVIDIA 25.0 \n", "AMAZON.COM 50.0 \n", "TESLA NaN " ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.iloc[permut_index]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Very often in econometrics, you need to transform your categorical variables to a collection of dummy variables. It easily done in Pandas using the **get_dummies()** function." ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Emiss__18.0Emiss__25.0Emiss__35.0Emiss__50.0Emiss__70.0Emiss__75.0
NAME
APPLE000000
SAUDI ARABIAN OIL000000
MICROSOFT000001
AMAZON.COM000100
ALIBABA GROUP HOLDING ADR 1:8000000
FACEBOOK CLASS A000001
TENCENT HOLDINGS000000
ALPHABET A000100
TESLA000000
TAIWAN SEMICON.MNFG.100000
NESTLE 'N'001000
NVIDIA010000
SAMSUNG ELECTRONICS000010
JP MORGAN CHASE & CO.000100
KWEICHOW MOUTAI 'A'000000
NETFLIX000000
\n", "
" ], "text/plain": [ " Emiss__18.0 Emiss__25.0 Emiss__35.0 \\\n", "NAME \n", "APPLE 0 0 0 \n", "SAUDI ARABIAN OIL 0 0 0 \n", "MICROSOFT 0 0 0 \n", "AMAZON.COM 0 0 0 \n", "ALIBABA GROUP HOLDING ADR 1:8 0 0 0 \n", "FACEBOOK CLASS A 0 0 0 \n", "TENCENT HOLDINGS 0 0 0 \n", "ALPHABET A 0 0 0 \n", "TESLA 0 0 0 \n", "TAIWAN SEMICON.MNFG. 1 0 0 \n", "NESTLE 'N' 0 0 1 \n", "NVIDIA 0 1 0 \n", "SAMSUNG ELECTRONICS 0 0 0 \n", "JP MORGAN CHASE & CO. 0 0 0 \n", "KWEICHOW MOUTAI 'A' 0 0 0 \n", "NETFLIX 0 0 0 \n", "\n", " Emiss__50.0 Emiss__70.0 Emiss__75.0 \n", "NAME \n", "APPLE 0 0 0 \n", "SAUDI ARABIAN OIL 0 0 0 \n", "MICROSOFT 0 0 1 \n", "AMAZON.COM 1 0 0 \n", "ALIBABA GROUP HOLDING ADR 1:8 0 0 0 \n", "FACEBOOK CLASS A 0 0 1 \n", "TENCENT HOLDINGS 0 0 0 \n", "ALPHABET A 1 0 0 \n", "TESLA 0 0 0 \n", "TAIWAN SEMICON.MNFG. 0 0 0 \n", "NESTLE 'N' 0 0 0 \n", "NVIDIA 0 0 0 \n", "SAMSUNG ELECTRONICS 0 1 0 \n", "JP MORGAN CHASE & CO. 1 0 0 \n", "KWEICHOW MOUTAI 'A' 0 0 0 \n", "NETFLIX 0 0 0 " ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(companies_df['Emission Reduction Target %'],prefix='Emiss_')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas has efficient methods to manipulate strings in a dataframe. Due to this, Pandas is very popular among researchers that need to use string data in their analysis. This makes it also a very important tool for accounting data analysis. All the Python string object's built-in methods can be used to manipulate strings in a dataframe. We already discussed string-methods in Chapter 1.\n", "\n", "Regular expressions can also be used to manipulate string-dataframes. Regular expressions is a very broad topic, and it takes time to master it. Let's look at some very simple examples. If we want to split a sentence to a words-list, repeated whitespaces make the process difficult with the standard string-methods. However, it is very easy with regular expressions." ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The regex for multiple whitespaces is **\\s+**." ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['This', 'is', 'a', 'sample', 'text.']" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splitter = re.compile('\\s+')\n", "splitter.split(\"This is a sample text.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regular experssions is a powerful tool, but very complex. You can search email-addresses from a text with the following regex-command." ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [], "source": [ "reg_pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [], "source": [ "email_search = re.compile(reg_pattern, flags = re.IGNORECASE)" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [], "source": [ "example = \"\"\"The general format of an email address is local-part@domain,\n", "and a specific example is jsmith@example.com. An address consists of two parts. \n", "The part before the @ symbol (local part) identifies the name of a mailbox. \n", "This is often the username of the recipient, e.g., jsmith. \n", "The part after the @ symbol (domain) is a domain name that represents \n", "the administrative realm for the mail box, e.g., a company's domain name, example.com.\"\"\"" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['jsmith@example.com']" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email_search.findall(example)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We do not go into regular expression more at this point. But we will see some applications of them in the following chapters. If you want to learn more about regular expressions, here is a good website: [www.regular-expressions.info/index.html](https://www.regular-expressions.info/index.html)\n", "\n", "If your string data contains missing values, usually the standard string methods will not work. Then you need to use the dataframe string methods.\n", "\n", "You can slice strings normally." ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['APPLE', 'SAUDI', 'MICRO', 'AMAZO', 'ALIBA', 'FACEB', 'TENCE', 'ALPHA',\n", " 'TESLA', 'TAIWA', 'NESTL', 'NVIDI', 'SAMSU', 'JP MO', 'KWEIC', 'NETFL'],\n", " dtype='object', name='NAME')" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.index.str[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And you can use regular experssions, for example, to search strings." ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index([[], [], [], ['AM'], [], [], [], [], [], [], [], [], ['AM'], [], [], []], dtype='object', name='NAME')" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_df.index.str.findall('am', flags = re.IGNORECASE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a good introduction to Pandas string methods. [pandas.pydata.org/pandas-docs/stable/user_guide/text.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html). If you scroll down, there is a full list methods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Advanced methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look some more advanced methods of Pandas next. We need a new dataset for that." ] }, { "cell_type": "code", "execution_count": 282, "metadata": {}, "outputs": [], "source": [ "electricity_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/AER/USAirlines.csv',index_col=0)" ] }, { "cell_type": "code", "execution_count": 283, "metadata": {}, "outputs": [], "source": [ "electricity_df['firm'].replace({1:'A',2:'B',3:'C',4:'D',5:'E',6:'F'},inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cost function of electricity producers." ] }, { "cell_type": "code", "execution_count": 284, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firmyearoutputcostpriceload
1A19700.95275711406401066500.534487
2A19710.98675712156901103070.532328
3A19721.09198013095701105740.547736
4A19731.17578015115301219740.540846
5A19741.16017016767301966060.591167
.....................
86F19800.1126403814788748180.517766
87F19810.15415450696910131700.580049
88F19820.1864616333889304770.556024
89F19830.2468478043888516760.537791
90F19840.30401310095008194760.525775
\n", "

90 rows × 6 columns

\n", "
" ], "text/plain": [ " firm year output cost price load\n", "1 A 1970 0.952757 1140640 106650 0.534487\n", "2 A 1971 0.986757 1215690 110307 0.532328\n", "3 A 1972 1.091980 1309570 110574 0.547736\n", "4 A 1973 1.175780 1511530 121974 0.540846\n", "5 A 1974 1.160170 1676730 196606 0.591167\n", ".. ... ... ... ... ... ...\n", "86 F 1980 0.112640 381478 874818 0.517766\n", "87 F 1981 0.154154 506969 1013170 0.580049\n", "88 F 1982 0.186461 633388 930477 0.556024\n", "89 F 1983 0.246847 804388 851676 0.537791\n", "90 F 1984 0.304013 1009500 819476 0.525775\n", "\n", "[90 rows x 6 columns]" ] }, "execution_count": 284, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many tools to deal with data that has more than two dimensions. The basic idea is to use hierarchical indexing in a two dimensional dataframe. For example, we can index the data by firm-years using **set_index()**." ] }, { "cell_type": "code", "execution_count": 285, "metadata": {}, "outputs": [], "source": [ "electricity_df.set_index(['firm','year'],inplace=True)" ] }, { "cell_type": "code", "execution_count": 286, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
outputcostpriceload
firmyear
A19700.95275711406401066500.534487
19710.98675712156901103070.532328
19721.09198013095701105740.547736
19731.17578015115301219740.540846
19741.16017016767301966060.591167
..................
F19800.1126403814788748180.517766
19810.15415450696910131700.580049
19820.1864616333889304770.556024
19830.2468478043888516760.537791
19840.30401310095008194760.525775
\n", "

90 rows × 4 columns

\n", "
" ], "text/plain": [ " output cost price load\n", "firm year \n", "A 1970 0.952757 1140640 106650 0.534487\n", " 1971 0.986757 1215690 110307 0.532328\n", " 1972 1.091980 1309570 110574 0.547736\n", " 1973 1.175780 1511530 121974 0.540846\n", " 1974 1.160170 1676730 196606 0.591167\n", "... ... ... ... ...\n", "F 1980 0.112640 381478 874818 0.517766\n", " 1981 0.154154 506969 1013170 0.580049\n", " 1982 0.186461 633388 930477 0.556024\n", " 1983 0.246847 804388 851676 0.537791\n", " 1984 0.304013 1009500 819476 0.525775\n", "\n", "[90 rows x 4 columns]" ] }, "execution_count": 286, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the values are sorted so that all the years of a certain company are in adjacent rows. With **sort_index** you can order the values according to years." ] }, { "cell_type": "code", "execution_count": 288, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
outputcostpriceload
firmyear
A19700.95275711406401066500.534487
B19700.5206355692921037950.490851
C19700.2624242862981187880.524334
D19700.0863931451671149870.432066
E19700.051028913611182220.442875
..................
B19841.38974042093908213610.528775
C19840.49331711704708440790.577078
D19840.42141114369708313740.585525
E19840.2132796102578446220.635545
F19840.30401310095008194760.525775
\n", "

90 rows × 4 columns

\n", "
" ], "text/plain": [ " output cost price load\n", "firm year \n", "A 1970 0.952757 1140640 106650 0.534487\n", "B 1970 0.520635 569292 103795 0.490851\n", "C 1970 0.262424 286298 118788 0.524334\n", "D 1970 0.086393 145167 114987 0.432066\n", "E 1970 0.051028 91361 118222 0.442875\n", "... ... ... ... ...\n", "B 1984 1.389740 4209390 821361 0.528775\n", "C 1984 0.493317 1170470 844079 0.577078\n", "D 1984 0.421411 1436970 831374 0.585525\n", "E 1984 0.213279 610257 844622 0.635545\n", "F 1984 0.304013 1009500 819476 0.525775\n", "\n", "[90 rows x 4 columns]" ] }, "execution_count": 288, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df.sort_index(level=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two-level indexing enables an easy way to pick subgroups from the data. For a series, you can just use the standard indexing style of Python." ] }, { "cell_type": "code", "execution_count": 252, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "year\n", "1970 0.952757\n", "1971 0.986757\n", "1972 1.091980\n", "1973 1.175780\n", "1974 1.160170\n", "1975 1.173760\n", "1976 1.290510\n", "1977 1.390670\n", "1978 1.612730\n", "1979 1.825440\n", "1980 1.546040\n", "1981 1.527900\n", "1982 1.660200\n", "1983 1.822310\n", "1984 1.936460\n", "Name: output, dtype: float64" ] }, "execution_count": 252, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df['output']['A']" ] }, { "cell_type": "code", "execution_count": 253, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "firm\n", "A 1.173760\n", "B 0.852892\n", "C 0.367517\n", "D 0.164272\n", "E 0.073961\n", "F 0.052462\n", "Name: output, dtype: float64" ] }, "execution_count": 253, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df['output'][:,1975]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With multi-index dataframes, you have to be a little bit more careful, because the subgroups are dataframes themselves. For example, you need to use **loc** to pick up the subgroups." ] }, { "cell_type": "code", "execution_count": 254, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
outputcostpriceload
year
19700.95275711406401066500.534487
19710.98675712156901103070.532328
19721.09198013095701105740.547736
19731.17578015115301219740.540846
19741.16017016767301966060.591167
19751.17376018237402656090.575417
19761.29051020228902634510.594495
19771.39067023147603164110.597409
19781.61273026391603841100.638522
19791.82544032476205692510.676287
19801.54604037877508716360.605735
19811.52790038677509972390.614360
19821.66020039960209380020.633366
19831.82231042828808595720.650117
19841.93646047483208234110.625603
\n", "
" ], "text/plain": [ " output cost price load\n", "year \n", "1970 0.952757 1140640 106650 0.534487\n", "1971 0.986757 1215690 110307 0.532328\n", "1972 1.091980 1309570 110574 0.547736\n", "1973 1.175780 1511530 121974 0.540846\n", "1974 1.160170 1676730 196606 0.591167\n", "1975 1.173760 1823740 265609 0.575417\n", "1976 1.290510 2022890 263451 0.594495\n", "1977 1.390670 2314760 316411 0.597409\n", "1978 1.612730 2639160 384110 0.638522\n", "1979 1.825440 3247620 569251 0.676287\n", "1980 1.546040 3787750 871636 0.605735\n", "1981 1.527900 3867750 997239 0.614360\n", "1982 1.660200 3996020 938002 0.633366\n", "1983 1.822310 4282880 859572 0.650117\n", "1984 1.936460 4748320 823411 0.625603" ] }, "execution_count": 254, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df.loc['A']" ] }, { "cell_type": "code", "execution_count": 255, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "output 0.520635\n", "cost 569292.000000\n", "price 103795.000000\n", "load 0.490851\n", "Name: 1970, dtype: float64" ] }, "execution_count": 255, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df.loc['B'].loc[1970]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to pick values of a certain year, you change the order of indices using **swaplevel**." ] }, { "cell_type": "code", "execution_count": 256, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
outputcostpriceload
firm
A1.17376018237402656090.575417
B0.85289213581002817040.558133
C0.3675175104122787210.607270
D0.1642723739412631480.532723
E0.0739611562282779300.556181
F0.0524621331613079230.495361
\n", "
" ], "text/plain": [ " output cost price load\n", "firm \n", "A 1.173760 1823740 265609 0.575417\n", "B 0.852892 1358100 281704 0.558133\n", "C 0.367517 510412 278721 0.607270\n", "D 0.164272 373941 263148 0.532723\n", "E 0.073961 156228 277930 0.556181\n", "F 0.052462 133161 307923 0.495361" ] }, "execution_count": 256, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df.swaplevel('firm','year').loc[1975]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can easily calculate descriptive statistics at multiple levels. Most of the stat functions in Pandas include a **level** parameter for that. " ] }, { "cell_type": "code", "execution_count": 265, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "output 4.904952e+01\n", "cost 1.010271e+08\n", "price 4.245147e+07\n", "load 5.044141e+01\n", "dtype: float64" ] }, "execution_count": 265, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df.sum()" ] }, { "cell_type": "code", "execution_count": 267, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
outputcostpriceload
firm
A21.1534643958505069348038.957875
B15.2504193191825569943018.206419
C6.1856641084783871452338.768038
D3.188515957132671005228.215159
E1.707548439913971711828.497289
F1.563908470553771054307.796634
\n", "
" ], "text/plain": [ " output cost price load\n", "firm \n", "A 21.153464 39585050 6934803 8.957875\n", "B 15.250419 31918255 6994301 8.206419\n", "C 6.185664 10847838 7145233 8.768038\n", "D 3.188515 9571326 7100522 8.215159\n", "E 1.707548 4399139 7171182 8.497289\n", "F 1.563908 4705537 7105430 7.796634" ] }, "execution_count": 267, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df.sum(level=0)" ] }, { "cell_type": "code", "execution_count": 268, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
outputcostpriceload
year
19701.91091923017366795542.873152
19711.97698725061187017262.920993
19722.29539428588037059683.141480
19732.57985434089257476343.146692
19742.647980394190612809453.381160
19752.684864435558216750353.325085
19762.936119484873318324253.364455
19773.173497554830121065663.402352
19783.731864645474823105253.707459
19794.190756802035633475423.740366
19803.853406966127051489433.481546
19813.8000751063453360461293.513746
19824.0164581116782256751293.481910
19834.4931251213340552090273.482717
19844.7582201318490749843233.478301
\n", "
" ], "text/plain": [ " output cost price load\n", "year \n", "1970 1.910919 2301736 679554 2.873152\n", "1971 1.976987 2506118 701726 2.920993\n", "1972 2.295394 2858803 705968 3.141480\n", "1973 2.579854 3408925 747634 3.146692\n", "1974 2.647980 3941906 1280945 3.381160\n", "1975 2.684864 4355582 1675035 3.325085\n", "1976 2.936119 4848733 1832425 3.364455\n", "1977 3.173497 5548301 2106566 3.402352\n", "1978 3.731864 6454748 2310525 3.707459\n", "1979 4.190756 8020356 3347542 3.740366\n", "1980 3.853406 9661270 5148943 3.481546\n", "1981 3.800075 10634533 6046129 3.513746\n", "1982 4.016458 11167822 5675129 3.481910\n", "1983 4.493125 12133405 5209027 3.482717\n", "1984 4.758220 13184907 4984323 3.478301" ] }, "execution_count": 268, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df.sum(level=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to remove multi-index, we can use **reset_index()**. With the **level** parameter, you can decide how many levels from the multi-index are removed." ] }, { "cell_type": "code", "execution_count": 257, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearoutputcostpriceload
firm
A19700.95275711406401066500.534487
A19710.98675712156901103070.532328
A19721.09198013095701105740.547736
A19731.17578015115301219740.540846
A19741.16017016767301966060.591167
..................
F19800.1126403814788748180.517766
F19810.15415450696910131700.580049
F19820.1864616333889304770.556024
F19830.2468478043888516760.537791
F19840.30401310095008194760.525775
\n", "

90 rows × 5 columns

\n", "
" ], "text/plain": [ " year output cost price load\n", "firm \n", "A 1970 0.952757 1140640 106650 0.534487\n", "A 1971 0.986757 1215690 110307 0.532328\n", "A 1972 1.091980 1309570 110574 0.547736\n", "A 1973 1.175780 1511530 121974 0.540846\n", "A 1974 1.160170 1676730 196606 0.591167\n", "... ... ... ... ... ...\n", "F 1980 0.112640 381478 874818 0.517766\n", "F 1981 0.154154 506969 1013170 0.580049\n", "F 1982 0.186461 633388 930477 0.556024\n", "F 1983 0.246847 804388 851676 0.537791\n", "F 1984 0.304013 1009500 819476 0.525775\n", "\n", "[90 rows x 5 columns]" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df.reset_index(level=1)" ] }, { "cell_type": "code", "execution_count": 275, "metadata": {}, "outputs": [], "source": [ "electricity_df.reset_index(inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Summary statistics can also be calculated using the **groupby** method." ] }, { "cell_type": "code", "execution_count": 271, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
outputcostpriceload
firm
A1.4102312.639003e+06462320.2000000.597192
B1.0166952.127884e+06466286.7333330.547095
C0.4123787.231892e+05476348.8666670.584536
D0.2125686.380884e+05473368.1333330.547677
E0.1138372.932759e+05478078.8000000.566486
F0.1042613.137025e+05473695.3333330.519776
\n", "
" ], "text/plain": [ " output cost price load\n", "firm \n", "A 1.410231 2.639003e+06 462320.200000 0.597192\n", "B 1.016695 2.127884e+06 466286.733333 0.547095\n", "C 0.412378 7.231892e+05 476348.866667 0.584536\n", "D 0.212568 6.380884e+05 473368.133333 0.547677\n", "E 0.113837 2.932759e+05 478078.800000 0.566486\n", "F 0.104261 3.137025e+05 473695.333333 0.519776" ] }, "execution_count": 271, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df.groupby('firm').mean()" ] }, { "cell_type": "code", "execution_count": 276, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
outputcostpriceload
firmyear
A19700.95275711406401066500.534487
19710.98675712156901103070.532328
19721.09198013095701105740.547736
19731.17578015115301219740.540846
19741.16017016767301966060.591167
..................
F19800.1126403814788748180.517766
19810.15415450696910131700.580049
19820.1864616333889304770.556024
19830.2468478043888516760.537791
19840.30401310095008194760.525775
\n", "

90 rows × 4 columns

\n", "
" ], "text/plain": [ " output cost price load\n", "firm year \n", "A 1970 0.952757 1140640 106650 0.534487\n", " 1971 0.986757 1215690 110307 0.532328\n", " 1972 1.091980 1309570 110574 0.547736\n", " 1973 1.175780 1511530 121974 0.540846\n", " 1974 1.160170 1676730 196606 0.591167\n", "... ... ... ... ...\n", "F 1980 0.112640 381478 874818 0.517766\n", " 1981 0.154154 506969 1013170 0.580049\n", " 1982 0.186461 633388 930477 0.556024\n", " 1983 0.246847 804388 851676 0.537791\n", " 1984 0.304013 1009500 819476 0.525775\n", "\n", "[90 rows x 4 columns]" ] }, "execution_count": 276, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df.groupby(['firm','year']).mean() # The mean of single values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**stack** and **unstack** can be used to reshape hierarchical index dataframes." ] }, { "cell_type": "code", "execution_count": 315, "metadata": {}, "outputs": [], "source": [ "electricity_df.reset_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 316, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firmyearoutputcostpriceload
0A19700.95275711406401066500.534487
1A19710.98675712156901103070.532328
2A19721.09198013095701105740.547736
3A19731.17578015115301219740.540846
4A19741.16017016767301966060.591167
.....................
85F19800.1126403814788748180.517766
86F19810.15415450696910131700.580049
87F19820.1864616333889304770.556024
88F19830.2468478043888516760.537791
89F19840.30401310095008194760.525775
\n", "

90 rows × 6 columns

\n", "
" ], "text/plain": [ " firm year output cost price load\n", "0 A 1970 0.952757 1140640 106650 0.534487\n", "1 A 1971 0.986757 1215690 110307 0.532328\n", "2 A 1972 1.091980 1309570 110574 0.547736\n", "3 A 1973 1.175780 1511530 121974 0.540846\n", "4 A 1974 1.160170 1676730 196606 0.591167\n", ".. ... ... ... ... ... ...\n", "85 F 1980 0.112640 381478 874818 0.517766\n", "86 F 1981 0.154154 506969 1013170 0.580049\n", "87 F 1982 0.186461 633388 930477 0.556024\n", "88 F 1983 0.246847 804388 851676 0.537791\n", "89 F 1984 0.304013 1009500 819476 0.525775\n", "\n", "[90 rows x 6 columns]" ] }, "execution_count": 316, "metadata": {}, "output_type": "execute_result" } ], "source": [ "electricity_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Stack** turns a dataframe into a series." ] }, { "cell_type": "code", "execution_count": 319, "metadata": {}, "outputs": [], "source": [ "data_series= electricity_df.stack()" ] }, { "cell_type": "code", "execution_count": 320, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 firm A\n", " year 1970\n", " output 0.952757\n", " cost 1140640\n", " price 106650\n", " ... \n", "89 year 1984\n", " output 0.304013\n", " cost 1009500\n", " price 819476\n", " load 0.525775\n", "Length: 540, dtype: object" ] }, "execution_count": 320, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**unstack()** can be used to rearragne data back to a dataframe." ] }, { "cell_type": "code", "execution_count": 321, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firmyearoutputcostpriceload
0A19700.95275711406401066500.534487
1A19710.98675712156901103070.532328
2A19721.0919813095701105740.547736
3A19731.1757815115301219740.540846
4A19741.1601716767301966060.591167
.....................
85F19800.112643814788748180.517766
86F19810.15415450696910131700.580049
87F19820.1864616333889304770.556024
88F19830.2468478043888516760.537791
89F19840.30401310095008194760.525775
\n", "

90 rows × 6 columns

\n", "
" ], "text/plain": [ " firm year output cost price load\n", "0 A 1970 0.952757 1140640 106650 0.534487\n", "1 A 1971 0.986757 1215690 110307 0.532328\n", "2 A 1972 1.09198 1309570 110574 0.547736\n", "3 A 1973 1.17578 1511530 121974 0.540846\n", "4 A 1974 1.16017 1676730 196606 0.591167\n", ".. ... ... ... ... ... ...\n", "85 F 1980 0.11264 381478 874818 0.517766\n", "86 F 1981 0.154154 506969 1013170 0.580049\n", "87 F 1982 0.186461 633388 930477 0.556024\n", "88 F 1983 0.246847 804388 851676 0.537791\n", "89 F 1984 0.304013 1009500 819476 0.525775\n", "\n", "[90 rows x 6 columns]" ] }, "execution_count": 321, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_series.unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yet another tool to reorganise data is **pivot_table**. More of it later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Merging dataframes is often difficult for Pandas beginners. It can be a hassle. The usual cause of difficulties is to forget the importance of index with Pandas datatypes. Merging dataframes is not about gluing tables together. The merging is done according to indices." ] }, { "cell_type": "code", "execution_count": 291, "metadata": {}, "outputs": [], "source": [ "gasoline_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Gasoline.csv',index_col=0)" ] }, { "cell_type": "code", "execution_count": 293, "metadata": {}, "outputs": [], "source": [ "gasoline_df.set_index('country',inplace=True)" ] }, { "cell_type": "code", "execution_count": 294, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearlgaspcarlincomeplrpmglcarpcap
country
AUSTRIA19604.173244-6.474277-0.334548-9.766840
AUSTRIA19614.100989-6.426006-0.351328-9.608622
AUSTRIA19624.073177-6.407308-0.379518-9.457257
AUSTRIA19634.059509-6.370679-0.414251-9.343155
AUSTRIA19644.037689-6.322247-0.445335-9.237739
..................
U.S.A.19744.798626-5.328694-1.231467-7.617558
U.S.A.19754.804932-5.346190-1.200377-7.607010
U.S.A.19764.814891-5.297946-1.154682-7.574748
U.S.A.19774.811032-5.256606-1.175910-7.553458
U.S.A.19784.818454-5.221232-1.212062-7.536176
\n", "

342 rows × 5 columns

\n", "
" ], "text/plain": [ " year lgaspcar lincomep lrpmg lcarpcap\n", "country \n", "AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840\n", "AUSTRIA 1961 4.100989 -6.426006 -0.351328 -9.608622\n", "AUSTRIA 1962 4.073177 -6.407308 -0.379518 -9.457257\n", "AUSTRIA 1963 4.059509 -6.370679 -0.414251 -9.343155\n", "AUSTRIA 1964 4.037689 -6.322247 -0.445335 -9.237739\n", "... ... ... ... ... ...\n", "U.S.A. 1974 4.798626 -5.328694 -1.231467 -7.617558\n", "U.S.A. 1975 4.804932 -5.346190 -1.200377 -7.607010\n", "U.S.A. 1976 4.814891 -5.297946 -1.154682 -7.574748\n", "U.S.A. 1977 4.811032 -5.256606 -1.175910 -7.553458\n", "U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176\n", "\n", "[342 rows x 5 columns]" ] }, "execution_count": 294, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gasoline_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's split the data." ] }, { "cell_type": "code", "execution_count": 295, "metadata": {}, "outputs": [], "source": [ "gaso2_df = gasoline_df[['lrpmg','lcarpcap']]" ] }, { "cell_type": "code", "execution_count": 297, "metadata": {}, "outputs": [], "source": [ "gasoline_df.drop(['lrpmg','lcarpcap'],axis=1,inplace=True)" ] }, { "cell_type": "code", "execution_count": 298, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearlgaspcarlincomep
country
AUSTRIA19604.173244-6.474277
AUSTRIA19614.100989-6.426006
AUSTRIA19624.073177-6.407308
AUSTRIA19634.059509-6.370679
AUSTRIA19644.037689-6.322247
............
U.S.A.19744.798626-5.328694
U.S.A.19754.804932-5.346190
U.S.A.19764.814891-5.297946
U.S.A.19774.811032-5.256606
U.S.A.19784.818454-5.221232
\n", "

342 rows × 3 columns

\n", "
" ], "text/plain": [ " year lgaspcar lincomep\n", "country \n", "AUSTRIA 1960 4.173244 -6.474277\n", "AUSTRIA 1961 4.100989 -6.426006\n", "AUSTRIA 1962 4.073177 -6.407308\n", "AUSTRIA 1963 4.059509 -6.370679\n", "AUSTRIA 1964 4.037689 -6.322247\n", "... ... ... ...\n", "U.S.A. 1974 4.798626 -5.328694\n", "U.S.A. 1975 4.804932 -5.346190\n", "U.S.A. 1976 4.814891 -5.297946\n", "U.S.A. 1977 4.811032 -5.256606\n", "U.S.A. 1978 4.818454 -5.221232\n", "\n", "[342 rows x 3 columns]" ] }, "execution_count": 298, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gasoline_df" ] }, { "cell_type": "code", "execution_count": 299, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lrpmglcarpcap
country
AUSTRIA-0.334548-9.766840
AUSTRIA-0.351328-9.608622
AUSTRIA-0.379518-9.457257
AUSTRIA-0.414251-9.343155
AUSTRIA-0.445335-9.237739
.........
U.S.A.-1.231467-7.617558
U.S.A.-1.200377-7.607010
U.S.A.-1.154682-7.574748
U.S.A.-1.175910-7.553458
U.S.A.-1.212062-7.536176
\n", "

342 rows × 2 columns

\n", "
" ], "text/plain": [ " lrpmg lcarpcap\n", "country \n", "AUSTRIA -0.334548 -9.766840\n", "AUSTRIA -0.351328 -9.608622\n", "AUSTRIA -0.379518 -9.457257\n", "AUSTRIA -0.414251 -9.343155\n", "AUSTRIA -0.445335 -9.237739\n", "... ... ...\n", "U.S.A. -1.231467 -7.617558\n", "U.S.A. -1.200377 -7.607010\n", "U.S.A. -1.154682 -7.574748\n", "U.S.A. -1.175910 -7.553458\n", "U.S.A. -1.212062 -7.536176\n", "\n", "[342 rows x 2 columns]" ] }, "execution_count": 299, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gaso2_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Merge** combindes according to the index values" ] }, { "cell_type": "code", "execution_count": 301, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearlgaspcarlincomeplrpmglcarpcap
country
AUSTRIA19604.173244-6.474277-0.334548-9.766840
AUSTRIA19604.173244-6.474277-0.351328-9.608622
AUSTRIA19604.173244-6.474277-0.379518-9.457257
AUSTRIA19604.173244-6.474277-0.414251-9.343155
AUSTRIA19604.173244-6.474277-0.445335-9.237739
..................
U.S.A.19784.818454-5.221232-1.231467-7.617558
U.S.A.19784.818454-5.221232-1.200377-7.607010
U.S.A.19784.818454-5.221232-1.154682-7.574748
U.S.A.19784.818454-5.221232-1.175910-7.553458
U.S.A.19784.818454-5.221232-1.212062-7.536176
\n", "

6498 rows × 5 columns

\n", "
" ], "text/plain": [ " year lgaspcar lincomep lrpmg lcarpcap\n", "country \n", "AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840\n", "AUSTRIA 1960 4.173244 -6.474277 -0.351328 -9.608622\n", "AUSTRIA 1960 4.173244 -6.474277 -0.379518 -9.457257\n", "AUSTRIA 1960 4.173244 -6.474277 -0.414251 -9.343155\n", "AUSTRIA 1960 4.173244 -6.474277 -0.445335 -9.237739\n", "... ... ... ... ... ...\n", "U.S.A. 1978 4.818454 -5.221232 -1.231467 -7.617558\n", "U.S.A. 1978 4.818454 -5.221232 -1.200377 -7.607010\n", "U.S.A. 1978 4.818454 -5.221232 -1.154682 -7.574748\n", "U.S.A. 1978 4.818454 -5.221232 -1.175910 -7.553458\n", "U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176\n", "\n", "[6498 rows x 5 columns]" ] }, "execution_count": 301, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(gasoline_df,gaso2_df,on='country')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could also do this by *activating* both indices for merging." ] }, { "cell_type": "code", "execution_count": 304, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearlgaspcarlincomeplrpmglcarpcap
country
AUSTRIA19604.173244-6.474277-0.334548-9.766840
AUSTRIA19604.173244-6.474277-0.351328-9.608622
AUSTRIA19604.173244-6.474277-0.379518-9.457257
AUSTRIA19604.173244-6.474277-0.414251-9.343155
AUSTRIA19604.173244-6.474277-0.445335-9.237739
..................
U.S.A.19784.818454-5.221232-1.231467-7.617558
U.S.A.19784.818454-5.221232-1.200377-7.607010
U.S.A.19784.818454-5.221232-1.154682-7.574748
U.S.A.19784.818454-5.221232-1.175910-7.553458
U.S.A.19784.818454-5.221232-1.212062-7.536176
\n", "

6498 rows × 5 columns

\n", "
" ], "text/plain": [ " year lgaspcar lincomep lrpmg lcarpcap\n", "country \n", "AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840\n", "AUSTRIA 1960 4.173244 -6.474277 -0.351328 -9.608622\n", "AUSTRIA 1960 4.173244 -6.474277 -0.379518 -9.457257\n", "AUSTRIA 1960 4.173244 -6.474277 -0.414251 -9.343155\n", "AUSTRIA 1960 4.173244 -6.474277 -0.445335 -9.237739\n", "... ... ... ... ... ...\n", "U.S.A. 1978 4.818454 -5.221232 -1.231467 -7.617558\n", "U.S.A. 1978 4.818454 -5.221232 -1.200377 -7.607010\n", "U.S.A. 1978 4.818454 -5.221232 -1.154682 -7.574748\n", "U.S.A. 1978 4.818454 -5.221232 -1.175910 -7.553458\n", "U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176\n", "\n", "[6498 rows x 5 columns]" ] }, "execution_count": 304, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(gasoline_df,gaso2_df,left_index=True,right_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use **join**. It uses the index values for merging by default." ] }, { "cell_type": "code", "execution_count": 303, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearlgaspcarlincomeplrpmglcarpcap
country
AUSTRIA19604.173244-6.474277-0.334548-9.766840
AUSTRIA19604.173244-6.474277-0.351328-9.608622
AUSTRIA19604.173244-6.474277-0.379518-9.457257
AUSTRIA19604.173244-6.474277-0.414251-9.343155
AUSTRIA19604.173244-6.474277-0.445335-9.237739
..................
U.S.A.19784.818454-5.221232-1.231467-7.617558
U.S.A.19784.818454-5.221232-1.200377-7.607010
U.S.A.19784.818454-5.221232-1.154682-7.574748
U.S.A.19784.818454-5.221232-1.175910-7.553458
U.S.A.19784.818454-5.221232-1.212062-7.536176
\n", "

6498 rows × 5 columns

\n", "
" ], "text/plain": [ " year lgaspcar lincomep lrpmg lcarpcap\n", "country \n", "AUSTRIA 1960 4.173244 -6.474277 -0.334548 -9.766840\n", "AUSTRIA 1960 4.173244 -6.474277 -0.351328 -9.608622\n", "AUSTRIA 1960 4.173244 -6.474277 -0.379518 -9.457257\n", "AUSTRIA 1960 4.173244 -6.474277 -0.414251 -9.343155\n", "AUSTRIA 1960 4.173244 -6.474277 -0.445335 -9.237739\n", "... ... ... ... ... ...\n", "U.S.A. 1978 4.818454 -5.221232 -1.231467 -7.617558\n", "U.S.A. 1978 4.818454 -5.221232 -1.200377 -7.607010\n", "U.S.A. 1978 4.818454 -5.221232 -1.154682 -7.574748\n", "U.S.A. 1978 4.818454 -5.221232 -1.175910 -7.553458\n", "U.S.A. 1978 4.818454 -5.221232 -1.212062 -7.536176\n", "\n", "[6498 rows x 5 columns]" ] }, "execution_count": 303, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gasoline_df.join(gaso2_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Concat** is a more general tool for merging data. It is easiest to understand how it works, if we use very simple datasets. Let's create three simple Pandas Series." ] }, { "cell_type": "code", "execution_count": 305, "metadata": {}, "outputs": [], "source": [ "x1 = pd.Series([20,30,40], index=['a','b','c'])\n", "x2 = pd.Series([50,60], index=['e','f'])\n", "x3 = pd.Series([70,80,90], index=['h','i','j'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we just feed them to **concat**, they are joined together as a longer Series." ] }, { "cell_type": "code", "execution_count": 306, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 20\n", "b 30\n", "c 40\n", "e 50\n", "f 60\n", "h 70\n", "i 80\n", "j 90\n", "dtype: int64" ] }, "execution_count": 306, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([x1,x2,x3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you change the axis, the result will be a dataframe, because there are no overlapping index values." ] }, { "cell_type": "code", "execution_count": 307, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
a20.0NaNNaN
b30.0NaNNaN
c40.0NaNNaN
eNaN50.0NaN
fNaN60.0NaN
hNaNNaN70.0
iNaNNaN80.0
jNaNNaN90.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "a 20.0 NaN NaN\n", "b 30.0 NaN NaN\n", "c 40.0 NaN NaN\n", "e NaN 50.0 NaN\n", "f NaN 60.0 NaN\n", "h NaN NaN 70.0\n", "i NaN NaN 80.0\n", "j NaN NaN 90.0" ] }, "execution_count": 307, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([x1,x2,x3],axis=1)" ] }, { "cell_type": "code", "execution_count": 308, "metadata": {}, "outputs": [], "source": [ "x4 = pd.concat([x1,x2])" ] }, { "cell_type": "code", "execution_count": 309, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
a20NaN
b30NaN
c40NaN
e5050.0
f6060.0
\n", "
" ], "text/plain": [ " 0 1\n", "a 20 NaN\n", "b 30 NaN\n", "c 40 NaN\n", "e 50 50.0\n", "f 60 60.0" ] }, "execution_count": 309, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([x4,x2],axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you just want to keep the intersction of the series, you can use **join='inner'**" ] }, { "cell_type": "code", "execution_count": 310, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
e5050
f6060
\n", "
" ], "text/plain": [ " 0 1\n", "e 50 50\n", "f 60 60" ] }, "execution_count": 310, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([x4,x2],axis=1,join='inner')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data aggreagation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data aggregation is one the most important steps in data preprocessing. Pandas has many tools for that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Groupby** is an important tool in aggragation. Let's load a new dataset for this." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "comp_returns_df = pd.read_csv('comp_returns.csv',index_col=0)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NAMEISO COUNTRY CODERETURN ON EQUITY - TOTAL (%)TOTAL DEBT % COMMON EQUITYRESEARCH & DEVELOPMENT/SALES -ACCOUNTING STANDARDS FOLLOWED
0APPLEUS55.92119.404.95US standards (GAAP)
1SAUDI ARABIAN OILSA32.2516.96NaNIFRS
2MICROSOFTUS40.1461.5613.59US standards (GAAP)
3AMAZON.COMUS21.9583.3112.29US standards (GAAP)
4FACEBOOK CLASS AUS19.960.4721.00US standards (GAAP)
5TENCENT HOLDINGSHK24.6853.747.72IFRS
6ALPHABET AUS18.122.2615.71US standards (GAAP)
7TESLAUS-14.94202.7710.73US standards (GAAP)
8WALMARTUS20.2272.95NaNUS standards (GAAP)
9JOHNSON & JOHNSONUS25.3646.5713.28US standards (GAAP)
10TAIWAN SEMICON.MNFG.TW20.9411.898.08IFRS
11NESTLE 'N'CH23.0571.421.88IFRS
12PROCTER & GAMBLEUS27.7876.102.79US standards (GAAP)
13VISA 'A'US40.3457.25NaNUS standards (GAAP)
14MASTERCARDUS143.83144.70NaNUS standards (GAAP)
15NVIDIAUS25.9516.3122.49US standards (GAAP)
163I GROUPGB2.737.67NaNIFRS
17JP MORGAN CHASE & CO.US14.91220.24NaNUS standards (GAAP)
18HOME DEPOTUSNaN-1010.37NaNUS standards (GAAP)
19KWEICHOW MOUTAI 'A'CN33.129.140.85Local standards
20SAMSUNG ELECTRONICSKR8.697.227.39IFRS
21UNITEDHEALTH GROUPUS25.4071.33NaNUS standards (GAAP)
22ROCHE HOLDINGCH44.7147.5819.38IFRS
23VERIZON COMMUNICATIONSUS33.64181.59NaNUS standards (GAAP)
24LVMHFR20.8268.970.28IFRS
25NETFLIXUS29.12204.388.73US standards (GAAP)
26ADOBE (NAS)US29.6739.3017.15US standards (GAAP)
27TOYOTA MOTORJP10.45102.453.66US standards (GAAP)
28PAYPAL HOLDINGSUS15.2429.488.49US standards (GAAP)
29SALESFORCE.COMUS0.519.0414.76US standards (GAAP)
\n", "
" ], "text/plain": [ " NAME ISO COUNTRY CODE RETURN ON EQUITY - TOTAL (%) \\\n", "0 APPLE US 55.92 \n", "1 SAUDI ARABIAN OIL SA 32.25 \n", "2 MICROSOFT US 40.14 \n", "3 AMAZON.COM US 21.95 \n", "4 FACEBOOK CLASS A US 19.96 \n", "5 TENCENT HOLDINGS HK 24.68 \n", "6 ALPHABET A US 18.12 \n", "7 TESLA US -14.94 \n", "8 WALMART US 20.22 \n", "9 JOHNSON & JOHNSON US 25.36 \n", "10 TAIWAN SEMICON.MNFG. TW 20.94 \n", "11 NESTLE 'N' CH 23.05 \n", "12 PROCTER & GAMBLE US 27.78 \n", "13 VISA 'A' US 40.34 \n", "14 MASTERCARD US 143.83 \n", "15 NVIDIA US 25.95 \n", "16 3I GROUP GB 2.73 \n", "17 JP MORGAN CHASE & CO. US 14.91 \n", "18 HOME DEPOT US NaN \n", "19 KWEICHOW MOUTAI 'A' CN 33.12 \n", "20 SAMSUNG ELECTRONICS KR 8.69 \n", "21 UNITEDHEALTH GROUP US 25.40 \n", "22 ROCHE HOLDING CH 44.71 \n", "23 VERIZON COMMUNICATIONS US 33.64 \n", "24 LVMH FR 20.82 \n", "25 NETFLIX US 29.12 \n", "26 ADOBE (NAS) US 29.67 \n", "27 TOYOTA MOTOR JP 10.45 \n", "28 PAYPAL HOLDINGS US 15.24 \n", "29 SALESFORCE.COM US 0.51 \n", "\n", " TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES - \\\n", "0 119.40 4.95 \n", "1 16.96 NaN \n", "2 61.56 13.59 \n", "3 83.31 12.29 \n", "4 0.47 21.00 \n", "5 53.74 7.72 \n", "6 2.26 15.71 \n", "7 202.77 10.73 \n", "8 72.95 NaN \n", "9 46.57 13.28 \n", "10 11.89 8.08 \n", "11 71.42 1.88 \n", "12 76.10 2.79 \n", "13 57.25 NaN \n", "14 144.70 NaN \n", "15 16.31 22.49 \n", "16 7.67 NaN \n", "17 220.24 NaN \n", "18 -1010.37 NaN \n", "19 9.14 0.85 \n", "20 7.22 7.39 \n", "21 71.33 NaN \n", "22 47.58 19.38 \n", "23 181.59 NaN \n", "24 68.97 0.28 \n", "25 204.38 8.73 \n", "26 39.30 17.15 \n", "27 102.45 3.66 \n", "28 29.48 8.49 \n", "29 9.04 14.76 \n", "\n", " ACCOUNTING STANDARDS FOLLOWED \n", "0 US standards (GAAP) \n", "1 IFRS \n", "2 US standards (GAAP) \n", "3 US standards (GAAP) \n", "4 US standards (GAAP) \n", "5 IFRS \n", "6 US standards (GAAP) \n", "7 US standards (GAAP) \n", "8 US standards (GAAP) \n", "9 US standards (GAAP) \n", "10 IFRS \n", "11 IFRS \n", "12 US standards (GAAP) \n", "13 US standards (GAAP) \n", "14 US standards (GAAP) \n", "15 US standards (GAAP) \n", "16 IFRS \n", "17 US standards (GAAP) \n", "18 US standards (GAAP) \n", "19 Local standards \n", "20 IFRS \n", "21 US standards (GAAP) \n", "22 IFRS \n", "23 US standards (GAAP) \n", "24 IFRS \n", "25 US standards (GAAP) \n", "26 US standards (GAAP) \n", "27 US standards (GAAP) \n", "28 US standards (GAAP) \n", "29 US standards (GAAP) " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can group data by the accounting standards variable and calculate the mean. Notice how Pandas automatically drops the country variable, because you cannot calculate a mean from strings:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RETURN ON EQUITY - TOTAL (%)TOTAL DEBT % COMMON EQUITYRESEARCH & DEVELOPMENT/SALES -
ACCOUNTING STANDARDS FOLLOWED
IFRS22.2337535.681257.455000
Local standards33.120009.140000.850000
US standards (GAAP)29.1785034.8138112.115714
\n", "
" ], "text/plain": [ " RETURN ON EQUITY - TOTAL (%) \\\n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 22.23375 \n", "Local standards 33.12000 \n", "US standards (GAAP) 29.17850 \n", "\n", " TOTAL DEBT % COMMON EQUITY \\\n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 35.68125 \n", "Local standards 9.14000 \n", "US standards (GAAP) 34.81381 \n", "\n", " RESEARCH & DEVELOPMENT/SALES - \n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 7.455000 \n", "Local standards 0.850000 \n", "US standards (GAAP) 12.115714 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can make a two-level grouping by adding a list of columns to **groupby**." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RETURN ON EQUITY - TOTAL (%)TOTAL DEBT % COMMON EQUITYRESEARCH & DEVELOPMENT/SALES -
ISO COUNTRY CODEACCOUNTING STANDARDS FOLLOWED
CHIFRS33.88000059.50010.630000
CNLocal standards33.1200009.1400.850000
FRIFRS20.82000068.9700.280000
GBIFRS2.7300007.670NaN
HKIFRS24.68000053.7407.720000
JPUS standards (GAAP)10.450000102.4503.660000
KRIFRS8.6900007.2207.390000
SAIFRS32.25000016.960NaN
TWIFRS20.94000011.8908.080000
USUS standards (GAAP)30.16421131.43212.766154
\n", "
" ], "text/plain": [ " RETURN ON EQUITY - TOTAL (%) \\\n", "ISO COUNTRY CODE ACCOUNTING STANDARDS FOLLOWED \n", "CH IFRS 33.880000 \n", "CN Local standards 33.120000 \n", "FR IFRS 20.820000 \n", "GB IFRS 2.730000 \n", "HK IFRS 24.680000 \n", "JP US standards (GAAP) 10.450000 \n", "KR IFRS 8.690000 \n", "SA IFRS 32.250000 \n", "TW IFRS 20.940000 \n", "US US standards (GAAP) 30.164211 \n", "\n", " TOTAL DEBT % COMMON EQUITY \\\n", "ISO COUNTRY CODE ACCOUNTING STANDARDS FOLLOWED \n", "CH IFRS 59.500 \n", "CN Local standards 9.140 \n", "FR IFRS 68.970 \n", "GB IFRS 7.670 \n", "HK IFRS 53.740 \n", "JP US standards (GAAP) 102.450 \n", "KR IFRS 7.220 \n", "SA IFRS 16.960 \n", "TW IFRS 11.890 \n", "US US standards (GAAP) 31.432 \n", "\n", " RESEARCH & DEVELOPMENT/SALES - \n", "ISO COUNTRY CODE ACCOUNTING STANDARDS FOLLOWED \n", "CH IFRS 10.630000 \n", "CN Local standards 0.850000 \n", "FR IFRS 0.280000 \n", "GB IFRS NaN \n", "HK IFRS 7.720000 \n", "JP US standards (GAAP) 3.660000 \n", "KR IFRS 7.390000 \n", "SA IFRS NaN \n", "TW IFRS 8.080000 \n", "US US standards (GAAP) 12.766154 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby(['ISO COUNTRY CODE','ACCOUNTING STANDARDS FOLLOWED']).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many other groupby-methods, like **size()**. The full list of **groupby** -methods are here: [pandas.pydata.org/pandas-docs/stable/reference/groupby.html](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html)" ] }, { "cell_type": "code", "execution_count": 349, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ISO COUNTRY CODE\n", "CH 2\n", "CN 1\n", "FR 1\n", "GB 1\n", "HK 1\n", "JP 1\n", "KR 1\n", "SA 1\n", "TW 1\n", "US 20\n", "dtype: int64" ] }, "execution_count": 349, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby('ISO COUNTRY CODE').size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we just use the **groupby** method without a following method, it will return a Pandas groupby object." ] }, { "cell_type": "code", "execution_count": 350, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 350, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby('ISO COUNTRY CODE')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This object can be used for iteration." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CH\n", "2\n", "CN\n", "1\n", "FR\n", "1\n", "GB\n", "1\n", "HK\n", "1\n", "JP\n", "1\n", "KR\n", "1\n", "SA\n", "1\n", "TW\n", "1\n", "US\n", "20\n" ] } ], "source": [ "for name,group in comp_returns_df.groupby('ISO COUNTRY CODE'):\n", " print(name)\n", " print(len(group))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could also group data according to columns using **axis=1**. It makes not much sense with this data, so we skip that example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can easily pick up just one column from a groupby object." ] }, { "cell_type": "code", "execution_count": 353, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ISO COUNTRY CODE\n", "CH 33.880000\n", "CN 33.120000\n", "FR 20.820000\n", "GB 2.730000\n", "HK 24.680000\n", "JP 10.450000\n", "KR 8.690000\n", "SA 32.250000\n", "TW 20.940000\n", "US 30.164211\n", "Name: RETURN ON EQUITY - TOTAL (%), dtype: float64" ] }, "execution_count": 353, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby('ISO COUNTRY CODE')['RETURN ON EQUITY - TOTAL (%)'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could group data by defining a dict, a function etc. We do not go to these advanced methods here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data aggregation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we know the basics of **groupby**, we can analyse more how it can be used to aggregate data. The methods of **groupby** are **count, sum, mean, median, std, var, min, max, prod, first and last**. These are opitmised for groupby objects, but many other methods work too. Actually, you can define your own functions with **groupby.agg()**." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def mean_median(arr):\n", " return arr.mean()-arr.median()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RETURN ON EQUITY - TOTAL (%)TOTAL DEBT % COMMON EQUITYRESEARCH & DEVELOPMENT/SALES -
ACCOUNTING STANDARDS FOLLOWED
IFRS0.238753.41125-0.100000
Local standards0.000000.000000.000000
US standards (GAAP)3.79850-36.51619-0.669286
\n", "
" ], "text/plain": [ " RETURN ON EQUITY - TOTAL (%) \\\n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 0.23875 \n", "Local standards 0.00000 \n", "US standards (GAAP) 3.79850 \n", "\n", " TOTAL DEBT % COMMON EQUITY \\\n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 3.41125 \n", "Local standards 0.00000 \n", "US standards (GAAP) -36.51619 \n", "\n", " RESEARCH & DEVELOPMENT/SALES - \n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS -0.100000 \n", "Local standards 0.000000 \n", "US standards (GAAP) -0.669286 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').agg(mean_median)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basically, those Pandas methods will work with **groupby** that are some kind of aggregations. For example, quantile can be used" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RETURN ON EQUITY - TOTAL (%)TOTAL DEBT % COMMON EQUITYRESEARCH & DEVELOPMENT/SALES -
ACCOUNTING STANDARDS FOLLOWED
IFRS35.98869.70513.730
Local standards33.1209.1400.850
US standards (GAAP)41.898202.77019.845
\n", "
" ], "text/plain": [ " RETURN ON EQUITY - TOTAL (%) \\\n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 35.988 \n", "Local standards 33.120 \n", "US standards (GAAP) 41.898 \n", "\n", " TOTAL DEBT % COMMON EQUITY \\\n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 69.705 \n", "Local standards 9.140 \n", "US standards (GAAP) 202.770 \n", "\n", " RESEARCH & DEVELOPMENT/SALES - \n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 13.730 \n", "Local standards 0.850 \n", "US standards (GAAP) 19.845 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').quantile(0.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though **describe** is not an aggregating function, it will also work." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RETURN ON EQUITY - TOTAL (%)TOTAL DEBT % COMMON EQUITYRESEARCH & DEVELOPMENT/SALES -
countmeanstdmin25%50%75%maxcountmean...75%maxcountmeanstdmin25%50%75%max
ACCOUNTING STANDARDS FOLLOWED
IFRS8.022.2337512.9816322.7317.787521.99526.572544.718.035.68125...57.547571.426.07.4550006.7116610.283.25757.5557.990019.38
Local standards1.033.12000NaN33.1233.120033.12033.120033.121.09.14000...9.14009.141.00.850000NaN0.850.85000.8500.85000.85
US standards (GAAP)20.029.1785030.798945-14.9417.400025.38030.6625143.8321.034.81381...119.4000220.2414.012.1157146.0237702.798.550012.78515.472522.49
\n", "

3 rows × 24 columns

\n", "
" ], "text/plain": [ " RETURN ON EQUITY - TOTAL (%) \\\n", " count mean \n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 8.0 22.23375 \n", "Local standards 1.0 33.12000 \n", "US standards (GAAP) 20.0 29.17850 \n", "\n", " \\\n", " std min 25% 50% 75% \n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 12.981632 2.73 17.7875 21.995 26.5725 \n", "Local standards NaN 33.12 33.1200 33.120 33.1200 \n", "US standards (GAAP) 30.798945 -14.94 17.4000 25.380 30.6625 \n", "\n", " TOTAL DEBT % COMMON EQUITY \\\n", " max count mean \n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 44.71 8.0 35.68125 \n", "Local standards 33.12 1.0 9.14000 \n", "US standards (GAAP) 143.83 21.0 34.81381 \n", "\n", " ... \\\n", " ... 75% max \n", "ACCOUNTING STANDARDS FOLLOWED ... \n", "IFRS ... 57.5475 71.42 \n", "Local standards ... 9.1400 9.14 \n", "US standards (GAAP) ... 119.4000 220.24 \n", "\n", " RESEARCH & DEVELOPMENT/SALES - \\\n", " count mean \n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 6.0 7.455000 \n", "Local standards 1.0 0.850000 \n", "US standards (GAAP) 14.0 12.115714 \n", "\n", " \n", " std min 25% 50% 75% max \n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 6.711661 0.28 3.2575 7.555 7.9900 19.38 \n", "Local standards NaN 0.85 0.8500 0.850 0.8500 0.85 \n", "US standards (GAAP) 6.023770 2.79 8.5500 12.785 15.4725 22.49 \n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can make the previous table more readable by unstacking it." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " ACCOUNTING STANDARDS FOLLOWED\n", "RETURN ON EQUITY - TOTAL (%) count IFRS 8.00000\n", " Local standards 1.00000\n", " US standards (GAAP) 20.00000\n", " mean IFRS 22.23375\n", " Local standards 33.12000\n", " ... \n", "RESEARCH & DEVELOPMENT/SALES - 75% Local standards 0.85000\n", " US standards (GAAP) 15.47250\n", " max IFRS 19.38000\n", " Local standards 0.85000\n", " US standards (GAAP) 22.49000\n", "Length: 72, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').describe().unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Apply()** is the most versatile method to use with groupby objects." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def medalists(df,var = 'RETURN ON EQUITY - TOTAL (%)'):\n", " return df.sort_values(by=var)[-3:]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NAMEISO COUNTRY CODERETURN ON EQUITY - TOTAL (%)TOTAL DEBT % COMMON EQUITYRESEARCH & DEVELOPMENT/SALES -ACCOUNTING STANDARDS FOLLOWED
ACCOUNTING STANDARDS FOLLOWED
IFRS5TENCENT HOLDINGSHK24.6853.747.72IFRS
1SAUDI ARABIAN OILSA32.2516.96NaNIFRS
22ROCHE HOLDINGCH44.7147.5819.38IFRS
Local standards19KWEICHOW MOUTAI 'A'CN33.129.140.85Local standards
US standards (GAAP)0APPLEUS55.92119.404.95US standards (GAAP)
14MASTERCARDUS143.83144.70NaNUS standards (GAAP)
18HOME DEPOTUSNaN-1010.37NaNUS standards (GAAP)
\n", "
" ], "text/plain": [ " NAME ISO COUNTRY CODE \\\n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 5 TENCENT HOLDINGS HK \n", " 1 SAUDI ARABIAN OIL SA \n", " 22 ROCHE HOLDING CH \n", "Local standards 19 KWEICHOW MOUTAI 'A' CN \n", "US standards (GAAP) 0 APPLE US \n", " 14 MASTERCARD US \n", " 18 HOME DEPOT US \n", "\n", " RETURN ON EQUITY - TOTAL (%) \\\n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 5 24.68 \n", " 1 32.25 \n", " 22 44.71 \n", "Local standards 19 33.12 \n", "US standards (GAAP) 0 55.92 \n", " 14 143.83 \n", " 18 NaN \n", "\n", " TOTAL DEBT % COMMON EQUITY \\\n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 5 53.74 \n", " 1 16.96 \n", " 22 47.58 \n", "Local standards 19 9.14 \n", "US standards (GAAP) 0 119.40 \n", " 14 144.70 \n", " 18 -1010.37 \n", "\n", " RESEARCH & DEVELOPMENT/SALES - \\\n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 5 7.72 \n", " 1 NaN \n", " 22 19.38 \n", "Local standards 19 0.85 \n", "US standards (GAAP) 0 4.95 \n", " 14 NaN \n", " 18 NaN \n", "\n", " ACCOUNTING STANDARDS FOLLOWED \n", "ACCOUNTING STANDARDS FOLLOWED \n", "IFRS 5 IFRS \n", " 1 IFRS \n", " 22 IFRS \n", "Local standards 19 Local standards \n", "US standards (GAAP) 0 US standards (GAAP) \n", " 14 US standards (GAAP) \n", " 18 US standards (GAAP) " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby('ACCOUNTING STANDARDS FOLLOWED').apply(medalists)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to use the bins created with **cut** or **qcut** as a grouping criteria in **groupby**." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "ROE_quartiles = pd.qcut(comp_returns_df['RETURN ON EQUITY - TOTAL (%)'],4)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RETURN ON EQUITY - TOTAL (%)\n", "(-14.940999999999999, 18.12] 8\n", "(18.12, 24.68] 7\n", "(24.68, 32.25] 7\n", "(32.25, 143.83] 7\n", "dtype: int64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby(ROE_quartiles).size()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RETURN ON EQUITY - TOTAL (%)TOTAL DEBT % COMMON EQUITYRESEARCH & DEVELOPMENT/SALES -
RETURN ON EQUITY - TOTAL (%)
(-14.940999999999999, 18.12]6.96375072.64125010.123333
(18.12, 24.68]21.66000051.8214298.541667
(24.68, 32.25]27.93285767.27857112.888000
(32.25, 143.83]55.95714388.7457149.692500
\n", "
" ], "text/plain": [ " RETURN ON EQUITY - TOTAL (%) \\\n", "RETURN ON EQUITY - TOTAL (%) \n", "(-14.940999999999999, 18.12] 6.963750 \n", "(18.12, 24.68] 21.660000 \n", "(24.68, 32.25] 27.932857 \n", "(32.25, 143.83] 55.957143 \n", "\n", " TOTAL DEBT % COMMON EQUITY \\\n", "RETURN ON EQUITY - TOTAL (%) \n", "(-14.940999999999999, 18.12] 72.641250 \n", "(18.12, 24.68] 51.821429 \n", "(24.68, 32.25] 67.278571 \n", "(32.25, 143.83] 88.745714 \n", "\n", " RESEARCH & DEVELOPMENT/SALES - \n", "RETURN ON EQUITY - TOTAL (%) \n", "(-14.940999999999999, 18.12] 10.123333 \n", "(18.12, 24.68] 8.541667 \n", "(24.68, 32.25] 12.888000 \n", "(32.25, 143.83] 9.692500 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_returns_df.groupby(ROE_quartiles).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pivot tables are another option to organise data with Pandas. They are very popular in spreadsheet softwares, like Excel.It is very similar to cross-tabs that we use in statistics. Pandas has a specific function for pivot tables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load a new data." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "large_df = pd.read_csv('large_table.csv',delimiter=';')" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "large_df.drop('Type',inplace=True,axis=1)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NAMEISO COUNTRY CODERETURN ON EQUITY - TOTAL (%)TOTAL DEBT % COMMON EQUITYRESEARCH & DEVELOPMENT/SALES -INDUSTRY GROUPACCOUNTING STANDARDS FOLLOWEDCSR Sustainability External AuditAudit CommitteeCSR Sustainability Committee
0APPLEUS55.92119.404.9534US standards (GAAP)YYY
1SAUDI ARABIAN OILSA32.2516.96NaN97IFRSNaNYY
2MICROSOFTUS40.1461.5613.5958US standards (GAAP)NaNYY
3AMAZON.COMUS21.9583.3112.2987US standards (GAAP)YYY
4FACEBOOK CLASS AUS19.960.4721.00151US standards (GAAP)YYY
.................................
100CHINA MERCHANTS BANK 'A'CN16.84301.52NaN102IFRSYYY
101ENELIT7.00202.46NaN169IFRSYYY
102JD COM ADR 1:2CN17.2812.282.1887US standards (GAAP)NaNYN
103KEYENCEJP11.840.002.76214Local standardsNaNNN
104HSBC HOLDINGSGB3.71328.78NaN102IFRSYYY
\n", "

105 rows × 10 columns

\n", "
" ], "text/plain": [ " NAME ISO COUNTRY CODE RETURN ON EQUITY - TOTAL (%) \\\n", "0 APPLE US 55.92 \n", "1 SAUDI ARABIAN OIL SA 32.25 \n", "2 MICROSOFT US 40.14 \n", "3 AMAZON.COM US 21.95 \n", "4 FACEBOOK CLASS A US 19.96 \n", ".. ... ... ... \n", "100 CHINA MERCHANTS BANK 'A' CN 16.84 \n", "101 ENEL IT 7.00 \n", "102 JD COM ADR 1:2 CN 17.28 \n", "103 KEYENCE JP 11.84 \n", "104 HSBC HOLDINGS GB 3.71 \n", "\n", " TOTAL DEBT % COMMON EQUITY RESEARCH & DEVELOPMENT/SALES - \\\n", "0 119.40 4.95 \n", "1 16.96 NaN \n", "2 61.56 13.59 \n", "3 83.31 12.29 \n", "4 0.47 21.00 \n", ".. ... ... \n", "100 301.52 NaN \n", "101 202.46 NaN \n", "102 12.28 2.18 \n", "103 0.00 2.76 \n", "104 328.78 NaN \n", "\n", " INDUSTRY GROUP ACCOUNTING STANDARDS FOLLOWED \\\n", "0 34 US standards (GAAP) \n", "1 97 IFRS \n", "2 58 US standards (GAAP) \n", "3 87 US standards (GAAP) \n", "4 151 US standards (GAAP) \n", ".. ... ... \n", "100 102 IFRS \n", "101 169 IFRS \n", "102 87 US standards (GAAP) \n", "103 214 Local standards \n", "104 102 IFRS \n", "\n", " CSR Sustainability External Audit Audit Committee \\\n", "0 Y Y \n", "1 NaN Y \n", "2 NaN Y \n", "3 Y Y \n", "4 Y Y \n", ".. ... ... \n", "100 Y Y \n", "101 Y Y \n", "102 NaN Y \n", "103 NaN N \n", "104 Y Y \n", "\n", " CSR Sustainability Committee \n", "0 Y \n", "1 Y \n", "2 Y \n", "3 Y \n", "4 Y \n", ".. ... \n", "100 Y \n", "101 Y \n", "102 N \n", "103 N \n", "104 Y \n", "\n", "[105 rows x 10 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "large_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default **pivot_table** aggregation type is mean." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['NAME', 'ISO COUNTRY CODE', 'RETURN ON EQUITY - TOTAL (%)',\n", " 'TOTAL DEBT % COMMON EQUITY', 'RESEARCH & DEVELOPMENT/SALES -',\n", " 'INDUSTRY GROUP', 'ACCOUNTING STANDARDS FOLLOWED',\n", " 'CSR Sustainability External Audit', 'Audit Committee',\n", " 'CSR Sustainability Committee'],\n", " dtype='object')" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "large_df.columns" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ACCOUNTING STANDARDS FOLLOWEDIFRSLocal standardsUS standards (GAAP)
CSR Sustainability External Audit
NNaNNaN22.380000
Y18.96833323.31546.280556
\n", "
" ], "text/plain": [ "ACCOUNTING STANDARDS FOLLOWED IFRS Local standards \\\n", "CSR Sustainability External Audit \n", "N NaN NaN \n", "Y 18.968333 23.315 \n", "\n", "ACCOUNTING STANDARDS FOLLOWED US standards (GAAP) \n", "CSR Sustainability External Audit \n", "N 22.380000 \n", "Y 46.280556 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "large_df.pivot_table(values='RETURN ON EQUITY - TOTAL (%)',\n", " index='CSR Sustainability External Audit',columns='ACCOUNTING STANDARDS FOLLOWED')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use hierarchical index and add marginal sums to the table." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ACCOUNTING STANDARDS FOLLOWEDIFRSLocal standardsUS standards (GAAP)All
CSR Sustainability External AuditAudit Committee
NYNaNNaN22.38000022.380000
YN24.660000NaN10.45000017.555000
Y18.72087023.31547.30428635.547667
All18.96833323.31544.44205134.386308
\n", "
" ], "text/plain": [ "ACCOUNTING STANDARDS FOLLOWED IFRS Local standards \\\n", "CSR Sustainability External Audit Audit Committee \n", "N Y NaN NaN \n", "Y N 24.660000 NaN \n", " Y 18.720870 23.315 \n", "All 18.968333 23.315 \n", "\n", "ACCOUNTING STANDARDS FOLLOWED US standards (GAAP) \\\n", "CSR Sustainability External Audit Audit Committee \n", "N Y 22.380000 \n", "Y N 10.450000 \n", " Y 47.304286 \n", "All 44.442051 \n", "\n", "ACCOUNTING STANDARDS FOLLOWED All \n", "CSR Sustainability External Audit Audit Committee \n", "N Y 22.380000 \n", "Y N 17.555000 \n", " Y 35.547667 \n", "All 34.386308 " ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "large_df.pivot_table(values='RETURN ON EQUITY - TOTAL (%)',\n", " index=['CSR Sustainability External Audit','Audit Committee'],\n", " columns='ACCOUNTING STANDARDS FOLLOWED',margins=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use hierarchical columns" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CSR Sustainability CommitteeNYAll
Audit CommitteeNYNY
ISO COUNTRY CODE
CANaN19.890000NaNNaN19.890000
CHNaNNaNNaN12.72000012.720000
CN0.8502.1800000.280000NaN1.103333
DENaNNaNNaN10.31500010.315000
DKNaNNaNNaN11.85000011.850000
FRNaNNaNNaN4.9025004.902500
GBNaNNaNNaN18.01500018.015000
HKNaN15.420000NaN7.72000011.570000
INNaNNaNNaN0.2200000.220000
JP2.760NaN2.5050001.8900002.415000
KRNaNNaNNaN7.3900007.390000
NLNaNNaNNaN14.71000014.710000
TWNaNNaNNaN8.0800008.080000
USNaN13.755000NaN10.56424211.055128
All1.80513.3355561.7633339.9063469.758333
\n", "
" ], "text/plain": [ "CSR Sustainability Committee N Y All\n", "Audit Committee N Y N Y \n", "ISO COUNTRY CODE \n", "CA NaN 19.890000 NaN NaN 19.890000\n", "CH NaN NaN NaN 12.720000 12.720000\n", "CN 0.850 2.180000 0.280000 NaN 1.103333\n", "DE NaN NaN NaN 10.315000 10.315000\n", "DK NaN NaN NaN 11.850000 11.850000\n", "FR NaN NaN NaN 4.902500 4.902500\n", "GB NaN NaN NaN 18.015000 18.015000\n", "HK NaN 15.420000 NaN 7.720000 11.570000\n", "IN NaN NaN NaN 0.220000 0.220000\n", "JP 2.760 NaN 2.505000 1.890000 2.415000\n", "KR NaN NaN NaN 7.390000 7.390000\n", "NL NaN NaN NaN 14.710000 14.710000\n", "TW NaN NaN NaN 8.080000 8.080000\n", "US NaN 13.755000 NaN 10.564242 11.055128\n", "All 1.805 13.335556 1.763333 9.906346 9.758333" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "large_df.pivot_table(values='RESEARCH & DEVELOPMENT/SALES -',\n", " columns=['CSR Sustainability Committee','Audit Committee'],\n", " index='ISO COUNTRY CODE',margins=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cross-tabs are a special case of pivot tables, where the values are frequencies." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ACCOUNTING STANDARDS FOLLOWEDAudit CommitteeCSR Sustainability External AuditINDUSTRY GROUP...NAMERESEARCH & DEVELOPMENT/SALES -RETURN ON EQUITY - TOTAL (%)TOTAL DEBT % COMMON EQUITY
CSR Sustainability CommitteeNYAllNYAllNYAllN...AllNYAllNYAllNYAll
ISO COUNTRY CODE
CA1.0NaNNaN1.0NaNNaN0.0NaNNaN1.0...NaN1.0NaNNaN1.0NaNNaN1.0NaNNaN
CHNaN3.03.0NaN3.03.0NaN3.03.0NaN...3.0NaN3.03.0NaN3.03.0NaN3.03.0
CN3.06.0NaN3.06.0NaN1.04.0NaN3.0...NaN2.01.0NaN3.06.0NaN3.06.0NaN
DENaN2.02.0NaN2.02.0NaN2.02.0NaN...2.0NaN2.02.0NaN2.02.0NaN2.02.0
DKNaN1.01.0NaN1.01.0NaN1.01.0NaN...1.0NaN1.01.0NaN1.01.0NaN1.01.0
FRNaN4.04.0NaN4.04.0NaN4.04.0NaN...4.0NaN4.04.0NaN4.04.0NaN4.04.0
GBNaN3.02.0NaN3.02.0NaN3.02.0NaN...2.0NaN2.02.0NaN3.02.0NaN3.02.0
HK1.03.0NaN1.03.0NaN0.02.0NaN1.0...NaN1.01.0NaN1.03.0NaN1.03.0NaN
INNaN2.02.0NaN2.02.0NaN2.02.0NaN...2.0NaN2.02.0NaN2.02.0NaN2.02.0
ITNaN1.0NaNNaN1.0NaNNaN1.0NaNNaN...NaNNaN0.0NaNNaN1.0NaNNaN1.0NaN
JP1.03.02.01.03.02.00.02.02.01.0...2.01.03.02.01.03.02.01.03.02.0
KRNaN1.01.0NaN1.01.0NaN1.01.0NaN...1.0NaN1.01.0NaN1.01.0NaN1.01.0
NL1.01.01.01.01.01.00.01.01.01.0...1.00.01.01.01.01.01.01.01.01.0
SANaN1.0NaNNaN1.0NaNNaN0.0NaNNaN...NaNNaN0.0NaNNaN1.0NaNNaN1.0NaN
TWNaN1.01.0NaN1.01.0NaN1.01.0NaN...1.0NaN1.01.0NaN1.01.0NaN1.01.0
US11.055.023.011.055.023.04.034.023.011.0...23.06.033.023.011.051.023.011.055.023.0
All2.040.042.02.040.042.02.040.042.02.0...42.02.040.042.02.040.042.02.040.042.0
\n", "

17 rows × 24 columns

\n", "
" ], "text/plain": [ " ACCOUNTING STANDARDS FOLLOWED \\\n", "CSR Sustainability Committee N Y All \n", "ISO COUNTRY CODE \n", "CA 1.0 NaN NaN \n", "CH NaN 3.0 3.0 \n", "CN 3.0 6.0 NaN \n", "DE NaN 2.0 2.0 \n", "DK NaN 1.0 1.0 \n", "FR NaN 4.0 4.0 \n", "GB NaN 3.0 2.0 \n", "HK 1.0 3.0 NaN \n", "IN NaN 2.0 2.0 \n", "IT NaN 1.0 NaN \n", "JP 1.0 3.0 2.0 \n", "KR NaN 1.0 1.0 \n", "NL 1.0 1.0 1.0 \n", "SA NaN 1.0 NaN \n", "TW NaN 1.0 1.0 \n", "US 11.0 55.0 23.0 \n", "All 2.0 40.0 42.0 \n", "\n", " Audit Committee \\\n", "CSR Sustainability Committee N Y All \n", "ISO COUNTRY CODE \n", "CA 1.0 NaN NaN \n", "CH NaN 3.0 3.0 \n", "CN 3.0 6.0 NaN \n", "DE NaN 2.0 2.0 \n", "DK NaN 1.0 1.0 \n", "FR NaN 4.0 4.0 \n", "GB NaN 3.0 2.0 \n", "HK 1.0 3.0 NaN \n", "IN NaN 2.0 2.0 \n", "IT NaN 1.0 NaN \n", "JP 1.0 3.0 2.0 \n", "KR NaN 1.0 1.0 \n", "NL 1.0 1.0 1.0 \n", "SA NaN 1.0 NaN \n", "TW NaN 1.0 1.0 \n", "US 11.0 55.0 23.0 \n", "All 2.0 40.0 42.0 \n", "\n", " CSR Sustainability External Audit \\\n", "CSR Sustainability Committee N Y All \n", "ISO COUNTRY CODE \n", "CA 0.0 NaN NaN \n", "CH NaN 3.0 3.0 \n", "CN 1.0 4.0 NaN \n", "DE NaN 2.0 2.0 \n", "DK NaN 1.0 1.0 \n", "FR NaN 4.0 4.0 \n", "GB NaN 3.0 2.0 \n", "HK 0.0 2.0 NaN \n", "IN NaN 2.0 2.0 \n", "IT NaN 1.0 NaN \n", "JP 0.0 2.0 2.0 \n", "KR NaN 1.0 1.0 \n", "NL 0.0 1.0 1.0 \n", "SA NaN 0.0 NaN \n", "TW NaN 1.0 1.0 \n", "US 4.0 34.0 23.0 \n", "All 2.0 40.0 42.0 \n", "\n", " INDUSTRY GROUP ... NAME \\\n", "CSR Sustainability Committee N ... All \n", "ISO COUNTRY CODE ... \n", "CA 1.0 ... NaN \n", "CH NaN ... 3.0 \n", "CN 3.0 ... NaN \n", "DE NaN ... 2.0 \n", "DK NaN ... 1.0 \n", "FR NaN ... 4.0 \n", "GB NaN ... 2.0 \n", "HK 1.0 ... NaN \n", "IN NaN ... 2.0 \n", "IT NaN ... NaN \n", "JP 1.0 ... 2.0 \n", "KR NaN ... 1.0 \n", "NL 1.0 ... 1.0 \n", "SA NaN ... NaN \n", "TW NaN ... 1.0 \n", "US 11.0 ... 23.0 \n", "All 2.0 ... 42.0 \n", "\n", " RESEARCH & DEVELOPMENT/SALES - \\\n", "CSR Sustainability Committee N Y All \n", "ISO COUNTRY CODE \n", "CA 1.0 NaN NaN \n", "CH NaN 3.0 3.0 \n", "CN 2.0 1.0 NaN \n", "DE NaN 2.0 2.0 \n", "DK NaN 1.0 1.0 \n", "FR NaN 4.0 4.0 \n", "GB NaN 2.0 2.0 \n", "HK 1.0 1.0 NaN \n", "IN NaN 2.0 2.0 \n", "IT NaN 0.0 NaN \n", "JP 1.0 3.0 2.0 \n", "KR NaN 1.0 1.0 \n", "NL 0.0 1.0 1.0 \n", "SA NaN 0.0 NaN \n", "TW NaN 1.0 1.0 \n", "US 6.0 33.0 23.0 \n", "All 2.0 40.0 42.0 \n", "\n", " RETURN ON EQUITY - TOTAL (%) \\\n", "CSR Sustainability Committee N Y All \n", "ISO COUNTRY CODE \n", "CA 1.0 NaN NaN \n", "CH NaN 3.0 3.0 \n", "CN 3.0 6.0 NaN \n", "DE NaN 2.0 2.0 \n", "DK NaN 1.0 1.0 \n", "FR NaN 4.0 4.0 \n", "GB NaN 3.0 2.0 \n", "HK 1.0 3.0 NaN \n", "IN NaN 2.0 2.0 \n", "IT NaN 1.0 NaN \n", "JP 1.0 3.0 2.0 \n", "KR NaN 1.0 1.0 \n", "NL 1.0 1.0 1.0 \n", "SA NaN 1.0 NaN \n", "TW NaN 1.0 1.0 \n", "US 11.0 51.0 23.0 \n", "All 2.0 40.0 42.0 \n", "\n", " TOTAL DEBT % COMMON EQUITY \n", "CSR Sustainability Committee N Y All \n", "ISO COUNTRY CODE \n", "CA 1.0 NaN NaN \n", "CH NaN 3.0 3.0 \n", "CN 3.0 6.0 NaN \n", "DE NaN 2.0 2.0 \n", "DK NaN 1.0 1.0 \n", "FR NaN 4.0 4.0 \n", "GB NaN 3.0 2.0 \n", "HK 1.0 3.0 NaN \n", "IN NaN 2.0 2.0 \n", "IT NaN 1.0 NaN \n", "JP 1.0 3.0 2.0 \n", "KR NaN 1.0 1.0 \n", "NL 1.0 1.0 1.0 \n", "SA NaN 1.0 NaN \n", "TW NaN 1.0 1.0 \n", "US 11.0 55.0 23.0 \n", "All 2.0 40.0 42.0 \n", "\n", "[17 rows x 24 columns]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "large_df.pivot_table(columns='CSR Sustainability Committee',index='ISO COUNTRY CODE',\n", " aggfunc='count',margins=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, it will return values for all variables. Therefore it is more convenient to use **crosstab**." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ISO COUNTRY CODECHCNDEDKFRGBHKINITJPKRNLTWUSAll
CSR Sustainability External Audit
N000000000000033
Y35214322121113563
All35214322121113866
\n", "
" ], "text/plain": [ "ISO COUNTRY CODE CH CN DE DK FR GB HK IN IT JP KR \\\n", "CSR Sustainability External Audit \n", "N 0 0 0 0 0 0 0 0 0 0 0 \n", "Y 3 5 2 1 4 3 2 2 1 2 1 \n", "All 3 5 2 1 4 3 2 2 1 2 1 \n", "\n", "ISO COUNTRY CODE NL TW US All \n", "CSR Sustainability External Audit \n", "N 0 0 3 3 \n", "Y 1 1 35 63 \n", "All 1 1 38 66 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(large_df['CSR Sustainability External Audit'],large_df['ISO COUNTRY CODE'],margins=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }