3. Pandas data basics#
We will mostly use Pandas dataframe as our data type. This and the next chapter will introduce the dataframe in detail. Pandas tries to pick up the best features of Numpy, spreadsheets and relational databases, and actually succeeds mostly in it. Pandas is an unbelievably powerful data manipulation tool.
Before we start exploring Pandas, it is good to understand how the standard data types in Python work. So, let’s begin with them.
3.1. Standard data types#
3.1.1. List#
A list is probably the most used data type in Python. It is a variable-length collection of items. A list can be defined in Python using either square brackets or the list function. Lists in Python are mutable. You can change the values of a list afterwards.
example_list = [1,2,'three',4]
example_list[2]
'three'
example_list[2] = 3
example_list
[1, 2, 3, 4]
You can add a new element to the end of the list with append().
example_list.append(5)
example_list
[1, 2, 3, 4, 5]
You can also add an item to a specific location in the list using insert().
example_list.insert(3,3.5)
example_list
[1, 2, 3, 3.5, 4, 5]
With pop() you can remove items from the list.
example_list.pop(3)
3.5
example_list
[1, 2, 3, 4, 5]
You can remove a specific item from the list using remove().
example_list.remove(3)
example_list
[1, 2, 4, 5]
Lists are commonly used to acquire items from generators. Here is an example:
small_gen = (x**2 for x in range(10)) # a generator
list_of_gen = [item for item in small_gen]
list_of_gen
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
You can check if an item is or is not in the list with in and is not.
16 in list_of_gen
True
17 in list_of_gen
False
With + you can combine lists.
example_list + example_list
[1, 2, 4, 5, 1, 2, 4, 5]
You can extend a list with multiple elements with extend.
example_list.extend([6,7,8])
example_list
[1, 2, 4, 5, 6, 7, 8]
You can use sort() to sort a list. It organises the original list. Pandas, in contrary, creates a new object in this kind of operations.
unsorted_list = [7,3,6,4,8,2]
unsorted_list.sort()
unsorted_list
[2, 3, 4, 6, 7, 8]
Slicing is an important concept in Python. The sections of most sequency types can be selected with slice notation.
small_list = list(range(10))
small_list[1:3]
[1, 2]
small_list[0:3]
[0, 1, 2]
small_list[:3]
[0, 1, 2]
small_list[3:]
[3, 4, 5, 6, 7, 8, 9]
small_list[-3:]
[7, 8, 9]
small_list[-5:-3]
[5, 6]
Step can be used to define the frequency of slicing.
small_list[::2]
[0, 2, 4, 6, 8]
small_list[1:8:2]
[1, 3, 5, 7]
small_list[::-1]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
3.1.2. Tuple#
A tuple is an immutable sequence of Python objects. Because it is immutable, it has also a fixed length. You can create tuples using parantheses.
small_tuple = (10,20,30,40)
small_tuple
(10, 20, 30, 40)
Otherwise, tuples are very similar to lists in Python. For example, you can access elements with square brackets.
small_tuple[2]
30
And you can concatenate them using +.
small_tuple + small_tuple
(10, 20, 30, 40, 10, 20, 30, 40)
But, they are immutable. You can not change values in them.
small_tuple[2] = 35
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-37-19b2de519515> in <module>
----> 1 small_tuple[2] = 35
TypeError: 'tuple' object does not support item assignment
Python will automatically unpack the tuple, if it is fed to a collection of variables (remember how a function returns a tuple).
a,b,c,d = small_tuple
a
10
b
20
Because a tuple is immutable, there are not many tuple methods. One useful is count, which counts the occurrences of a value.
small_tuple.count(40)
1
3.1.3. Dict#
Dict is the most complicated built-in data type in Python. It is a mutable collection of key-value pairs of Python objects. Dicts are created using curly braces.
sample_dict = {'a': 500, 'b' : 1000, 'c' : 1500, 'd': 2000}
You can access and add elements of a dict using brackes with a key inside them.
sample_dict['a']
500
sample_dict['e'] = 2500
sample_dict
{'a': 500, 'b': 1000, 'c': 1500, 'd': 2000, 'e': 2500}
There is no restrction that the keys should be strings and the values should be numbers.
sample_dict[6] = 'some_text'
sample_dict
{'a': 500, 'b': 1000, 'c': 1500, 'd': 2000, 'e': 2500, 6: 'some_text'}
You can delete an item from a dict using del.
del sample_dict[6]
sample_dict
{'a': 500, 'b': 1000, 'c': 1500, 'd': 2000, 'e': 2500}
The keys and values methods can be used to get iterators of the dict’s keys and values. (Remember that the list method was commonly used to extract the contents of an iterator.)
list(sample_dict.keys())
['a', 'b', 'c', 'd', 'e']
list(sample_dict.values())
[500, 1000, 1500, 2000, 2500]
You can merge two dictionaries with update.
sample_dict.update({'f':3000,'g':3500})
sample_dict
{'a': 500, 'b': 1000, 'c': 1500, 'd': 2000, 'e': 2500, 'f': 3000, 'g': 3500}
3.1.4. Set#
A set in Python is a collection of unique elements. They can be created with set or with curly braces.
set([5,5,5,6,6,6,7,7,7,8,8,8])
{5, 6, 7, 8}
{4,4,4,5,5,5,6,6,6,7,7,7}
{4, 5, 6, 7}
A Python set is very similar to a mathematical set, so it has methods for mathematical set operations.
x = {3,4,5,6,7,8}
y = {6,7,8,9,10}
x.union(y)
{3, 4, 5, 6, 7, 8, 9, 10}
x.intersection(y)
{6, 7, 8}
x.difference(y)
{3, 4, 5}
x.symmetric_difference(y)
{3, 4, 5, 9, 10}
3.1.5. Comprehensions#
Comprehensions, especially the list ones, are a great feature in Python. With them, you can efficiently create collections with a single statement. Let’s look some examples. (% is the module operator)
squares = [x**2 for x in range(10) if x%2==0]
squares
[0, 4, 16, 36, 64]
You could do exactly the same with a much longer for loop.
squares = []
for x in range(10):
if x%2==0:
squares.append(x**2)
squares
[0, 4, 16, 36, 64]
Here is another example:
fd = open('quotes.txt',encoding='utf-8')
[text.upper().rstrip() for text in fd if len(text) < 50]
['“JAVA IS, IN MANY WAYS, C++–.”(MICHAEL FELDMAN)',
'EVERYTHING CAN BE IMPROVED. ~CLARENCE W. BARRON',
"DON'T WATER YOUR WEEDS. ~HARVEY MACKAY",
'"NECESSITY IS THE MOTHER OF INVENTION" - PLATO',
'"INFORMATION IS NOT KNOWLEDGE." ALBERT EINSTEIN',
'DELETED CODE IS DEBUGGED CODE. — JEFF SICKEL',
'THE BEST CODE IS NO CODE AT ALL.',
'NO CODE IS FASTER THAN NO CODE. — MERB MOTTO']
Python also has set and dict comprehensions that work similarly. Here is an example. First, we set our stream position to the beginning
fd.seek(0)
0
The following command creates a dict where the keys are the first five letters of each quote and the values are the lengths of the quotes (if the quote is less than 50 characters long).
{item[0:5] : len(item) for item in fd if len(item) < 50}
{'“Java': 48,
'Every': 49,
"Don't": 40,
'"Nece': 47,
'"Info': 48,
'Delet': 46,
'The b': 33,
'No co': 46}
Nested list comprehensions are useful, but somewhat difficult to grasp. Let’s look some examples.
fd.seek(0)
0
In the following, we split the quotes to a list of words, creating a list of lists
quotes = [quote.rstrip().split(" ") for quote in fd if len(quote) < 45]
quotes
[["Don't", 'water', 'your', 'weeds.', '', '~Harvey', 'MacKay'],
['The', 'best', 'code', 'is', 'no', 'code', 'at', 'all.']]
Using a nested list comprehension, we can now collect individual words to a list.
[word for quote in quotes for word in quote]
["Don't",
'water',
'your',
'weeds.',
'',
'~Harvey',
'MacKay',
'The',
'best',
'code',
'is',
'no',
'code',
'at',
'all.']
Always remember to close the file.
fd.close()
3.2. Pandas data structures#
Two main data structures of Pandas are Series and Dataframe. Let’s analys both of them.
3.2.1. Pandas series#
Let’s start by importing Pandas.
import pandas as pd
A pandas series is a one-dimensional array of Numpy-type objects, and an associated array of labels, called index.
first_pandas = pd.Series([5,3,7,3,7,1])
first_pandas
0 5
1 3
2 7
3 3
4 7
5 1
dtype: int64
If we do not pass an index when defining a series, it will be given the default index of sequential values.
first_pandas.index
RangeIndex(start=0, stop=6, step=1)
You can define a custom index when creating a series.
second_pandas = pd.Series([1,2,3], index=['first','second','third'])
second_pandas
first 1
second 2
third 3
dtype: int64
A pandas series works like a dict. You can use the index value to pick up a single item.
second_pandas['second']
2
You can also use boolean indexing. Notice how the following command picks up the values on index locations 0, 2 and 4.
first_pandas[first_pandas > 4]
0 5
2 7
4 7
dtype: int64
You can use Numpy operations to your Pandas series.
import numpy as np
first_pandas*2
0 10
1 6
2 14
3 6
4 14
5 2
dtype: int64
np.log(first_pandas)
0 1.609438
1 1.098612
2 1.945910
3 1.098612
4 1.945910
5 0.000000
dtype: float64
As you have noticed, Pandas Series is very similar to a Python dict. Actually, you can very easily create a Pandas series from a dict.
sample_dict = {'a': 500, 'b' : 1000, 'c' : 1500, 'd': 2000}
sample_series = pd.Series(sample_dict)
sample_series
a 500
b 1000
c 1500
d 2000
dtype: int64
Important feature in Pandas is that everything is aligned by the index.
sample_series2 = pd.Series([100,200,300,400], index=['c','a','d','b'])
sample_series + sample_series2
a 700
b 1400
c 1600
d 2300
dtype: int64
If you look carefully, you can see that the summing was done “index-wise”.
You can give a name to the values and index of your Series.
sample_series.name = 'some_values'
sample_series.index.name = 'some_letters'
sample_series
some_letters
a 500
b 1000
c 1500
d 2000
Name: some_values, dtype: int64
You can also change your index values very easily.
sample_series.index = ['aa','bb','cc','dd']
sample_series
aa 500
bb 1000
cc 1500
dd 2000
Name: some_values, dtype: int64
3.2.2. Dataframe#
Pandas Dataframe is a rectangular table of objects with similar features as Pandas Series. Each column can have different type of values. In a way, it is a 2-dimensional extension of Pandas Series. For example, it has both a row and a column index.
A dataframe can be constucted in many ways. One popular option is to use a dict of equal-length lists.
company_data = {'year': ['2015','2016','2017','2018','2019','2020'],
'sales': [102000, 101000, 105000, 115000, 111000, 109000],
'profit': [5000, 6000, 8000, 7000, 9000, 3000]}
company_df = pd.DataFrame(company_data)
company_df
year | sales | profit | |
---|---|---|---|
0 | 2015 | 102000 | 5000 |
1 | 2016 | 101000 | 6000 |
2 | 2017 | 105000 | 8000 |
3 | 2018 | 115000 | 7000 |
4 | 2019 | 111000 | 9000 |
5 | 2020 | 109000 | 3000 |
You can define the order of columns when creating a dataframe.
pd.DataFrame(company_data, columns = ['year','profit','sales'])
year | profit | sales | |
---|---|---|---|
0 | 2015 | 5000 | 102000 |
1 | 2016 | 6000 | 101000 |
2 | 2017 | 8000 | 105000 |
3 | 2018 | 7000 | 115000 |
4 | 2019 | 9000 | 111000 |
5 | 2020 | 3000 | 109000 |
If you add a column to the list that is not included in the dict, it will be added to the dataframe with missing values.
company_df = pd.DataFrame(company_data, columns = ['year','sales','profit','total_assets'])
company_df
year | sales | profit | total_assets | |
---|---|---|---|---|
0 | 2015 | 102000 | 5000 | NaN |
1 | 2016 | 101000 | 6000 | NaN |
2 | 2017 | 105000 | 8000 | NaN |
3 | 2018 | 115000 | 7000 | NaN |
4 | 2019 | 111000 | 9000 | NaN |
5 | 2020 | 109000 | 3000 | NaN |
You can pick up one column from a dataframe using the usual bracket-notation. It will return the column as a Pandas series.
company_df['profit']
0 5000
1 6000
2 8000
3 7000
4 9000
5 3000
Name: profit, dtype: int64
Value assignment to a dataframe is easy. You can either add one value that is repeated in every instance of a column.
company_df['total_assets'] = 105000
company_df
year | sales | profit | total_assets | |
---|---|---|---|---|
0 | 2015 | 102000 | 5000 | 105000 |
1 | 2016 | 101000 | 6000 | 105000 |
2 | 2017 | 105000 | 8000 | 105000 |
3 | 2018 | 115000 | 7000 | 105000 |
4 | 2019 | 111000 | 9000 | 105000 |
5 | 2020 | 109000 | 3000 | 105000 |
Or you can add a full list of values.
company_df['total_assets'] = np.random.randint(105000,120000,6)
company_df
year | sales | profit | total_assets | |
---|---|---|---|---|
0 | 2015 | 102000 | 5000 | 109004 |
1 | 2016 | 101000 | 6000 | 109098 |
2 | 2017 | 105000 | 8000 | 108761 |
3 | 2018 | 115000 | 7000 | 113786 |
4 | 2019 | 111000 | 9000 | 108386 |
5 | 2020 | 109000 | 3000 | 112777 |
Remember that if you add a Pandas series to a datarame, the values will be matched according to their index values! For missing index values, a NaN-value will be added.
debt_sr = pd.Series([9000,6000,8000,10000],index=[2,0,5,3])
debt_sr
2 9000
0 6000
5 8000
3 10000
dtype: int64
company_df['debt'] = debt_sr
company_df
year | sales | profit | total_assets | debt | |
---|---|---|---|---|---|
0 | 2015 | 102000 | 5000 | 109004 | 6000.0 |
1 | 2016 | 101000 | 6000 | 109098 | NaN |
2 | 2017 | 105000 | 8000 | 108761 | 9000.0 |
3 | 2018 | 115000 | 7000 | 113786 | 10000.0 |
4 | 2019 | 111000 | 9000 | 108386 | NaN |
5 | 2020 | 109000 | 3000 | 112777 | 8000.0 |
There are many ways to remove columns. One option is to use del that works also with other Python data types.
del company_df['debt']
company_df
year | sales | profit | total_assets | |
---|---|---|---|---|
0 | 2015 | 102000 | 5000 | 109004 |
1 | 2016 | 101000 | 6000 | 109098 |
2 | 2017 | 105000 | 8000 | 108761 |
3 | 2018 | 115000 | 7000 | 113786 |
4 | 2019 | 111000 | 9000 | 108386 |
5 | 2020 | 109000 | 3000 | 112777 |
A dict of dicts can also by used to create a dataframe. The keys of the outer dict are the columns and the keys of the inner dict are the index values. (Btw., In real applications we are not “writing” our data. Almost always we load the data from a file, using for example, Pandas read_csv(). So, do not worry.)
company_dict = {'company a': {2010: 10.9, 2011: 8.7, 2012: 9.5},'company b': {2010: 11.2, 2011: 9.6, 2012: 8.8}}
new_df = pd.DataFrame(company_dict)
new_df
company a | company b | |
---|---|---|
2010 | 10.9 | 11.2 |
2011 | 8.7 | 9.6 |
2012 | 9.5 | 8.8 |
Transposing data is done in the same way as in Numpy, just by adding .T after the name of the dataframe.
new_df.T
2010 | 2011 | 2012 | |
---|---|---|---|
company a | 10.9 | 8.7 | 9.5 |
company b | 11.2 | 9.6 | 8.8 |
There are still many other ways to construct a dataframe, and we will use some of them in the following chapters. A detailed introduction to constructing a dataframe can be found here: pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html
There is actually a third data type in Pandas, Index objects. They are used to hold the labels of columns and index of a dataframe or series. A Pandas index is immutable and has similiraties with Python sets. However, the Pandas index can contain duplicate items.
new_df.index
Int64Index([2010, 2011, 2012], dtype='int64')
3.3. Basic Pandas functions#
Let’s use a more realistic data for the following examples.( More about read_csv() in the later chapters.)
big_companies_df = pd.read_csv('big_companies.csv',index_col='NAME',delimiter=';')
big_companies_df
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
SAUDI ARABIAN OIL | 0.21 | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
FACEBOOK CLASS A | 0.00 | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
NESTLE 'N' | 2.47 | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
With reindex(), you can rearrange the values of the dataframe. Let’s load Numpy.
import numpy as np
The following command shuffles the order of the index values (company names).
random_order = np.random.permutation(big_companies_df.index)
random_order
array(['MICROSOFT', 'APPLE', 'WALMART', 'ALPHABET A', "NESTLE 'N'",
'TESLA', 'SAUDI ARABIAN OIL', 'TAIWAN SEMICON.MNFG.',
'JOHNSON & JOHNSON', 'TENCENT HOLDINGS', 'AMAZON.COM',
'FACEBOOK CLASS A', 'ALIBABA GROUP HOLDING ADR 1:8'], dtype=object)
big_companies_df.reindex(index = random_order)
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
APPLE | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
NESTLE 'N' | 2.47 | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
SAUDI ARABIAN OIL | 0.21 | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
FACEBOOK CLASS A | 0.00 | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
You can also rearrange columns with reindex().
random_columns = np.random.permutation(big_companies_df.columns)
random_columns
array(['TOTAL DEBT % COMMON EQUITY', 'RETURN ON EQUITY - TOTAL (%)',
'P/E - RATIO', 'CASH FLOW/SALES', 'RETURN ON INVESTED CAPITAL',
'DIVIDEND YIELD'], dtype=object)
big_companies_df.reindex(columns = random_columns)
TOTAL DEBT % COMMON EQUITY | RETURN ON EQUITY - TOTAL (%) | P/E - RATIO | CASH FLOW/SALES | RETURN ON INVESTED CAPITAL | DIVIDEND YIELD | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | 119.40 | 55.92 | 35.4 | 28.01 | 27.65 | 0.71 |
SAUDI ARABIAN OIL | 16.96 | 32.25 | 29.5 | 31.73 | 28.34 | 0.21 |
MICROSOFT | 61.56 | 40.14 | 36.5 | 43.46 | 24.89 | 1.06 |
AMAZON.COM | 83.31 | 21.95 | 121.1 | 14.60 | 12.97 | 0.00 |
ALIBABA GROUP HOLDING ADR 1:8 | 16.60 | 23.77 | 31.0 | 31.23 | 17.39 | 0.00 |
FACEBOOK CLASS A | 0.47 | 19.96 | 32.0 | 41.11 | 19.94 | 0.00 |
TENCENT HOLDINGS | 53.74 | 24.68 | 42.6 | 35.55 | 15.92 | 0.23 |
ALPHABET A | 2.26 | 18.12 | 33.2 | 33.18 | 17.76 | 0.00 |
TESLA | 202.77 | -14.94 | 1036.3 | 11.21 | -1.61 | 0.00 |
WALMART | 72.95 | 20.22 | 26.6 | 4.88 | 12.55 | 1.54 |
JOHNSON & JOHNSON | 46.57 | 25.36 | 26.7 | 23.83 | 17.38 | 2.71 |
TAIWAN SEMICON.MNFG. | 11.89 | 20.94 | 24.7 | 57.91 | 18.96 | 2.31 |
NESTLE 'N' | 71.42 | 23.05 | 23.3 | 16.85 | 14.21 | 2.47 |
If we have names in the reindex list that are not included in the original index/columns, these are added to the dataframe with missing values. You can prevent those missing values, for example, with ffill in the parameter method (there is also bfill and nearest).
new_df = big_companies_df.reindex(['APPLE','MICROSOFT','TESLA'])
new_df
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
Ffill is a forward fill. With strings, it adds the values of the original dataframe forward, according to the alphabetical order of the strings. Because APPLE is after AMAZON.COM, ALPHABET and ALIBABA… in alphabetical order, NaN values are put to these new rows.
new_df.reindex(big_companies_df.index,method = 'ffill')
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
SAUDI ARABIAN OIL | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
AMAZON.COM | NaN | NaN | NaN | NaN | NaN | NaN |
ALIBABA GROUP HOLDING ADR 1:8 | NaN | NaN | NaN | NaN | NaN | NaN |
FACEBOOK CLASS A | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
TENCENT HOLDINGS | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
ALPHABET A | NaN | NaN | NaN | NaN | NaN | NaN |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
WALMART | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
JOHNSON & JOHNSON | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
TAIWAN SEMICON.MNFG. | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
NESTLE 'N' | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
It is easy to drop values by index values from a series/dataframe using drop(). By default, Pandas uses index. Defining axis=1, you can drop columns. Another option is to drop both using index and columns
big_companies_df.drop('APPLE')
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
SAUDI ARABIAN OIL | 0.21 | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
FACEBOOK CLASS A | 0.00 | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
NESTLE 'N' | 2.47 | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
big_companies_df.drop('DIVIDEND YIELD',axis=1)
CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|
NAME | |||||
APPLE | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
SAUDI ARABIAN OIL | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
MICROSOFT | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
AMAZON.COM | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
FACEBOOK CLASS A | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
TENCENT HOLDINGS | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
ALPHABET A | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
TESLA | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
WALMART | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
JOHNSON & JOHNSON | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |
TAIWAN SEMICON.MNFG. | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
NESTLE 'N' | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
big_companies_df.drop(index=['APPLE','TESLA'],columns=['DIVIDEND YIELD','P/E - RATIO'])
CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|
NAME | ||||
SAUDI ARABIAN OIL | 31.73 | 32.25 | 28.34 | 16.96 |
MICROSOFT | 43.46 | 40.14 | 24.89 | 61.56 |
AMAZON.COM | 14.60 | 21.95 | 12.97 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 31.23 | 23.77 | 17.39 | 16.60 |
FACEBOOK CLASS A | 41.11 | 19.96 | 19.94 | 0.47 |
TENCENT HOLDINGS | 35.55 | 24.68 | 15.92 | 53.74 |
ALPHABET A | 33.18 | 18.12 | 17.76 | 2.26 |
WALMART | 4.88 | 20.22 | 12.55 | 72.95 |
JOHNSON & JOHNSON | 23.83 | 25.36 | 17.38 | 46.57 |
TAIWAN SEMICON.MNFG. | 57.91 | 20.94 | 18.96 | 11.89 |
NESTLE 'N' | 16.85 | 23.05 | 14.21 | 71.42 |
An important thing to notice is that most of the Pandas methods are non-destructive. In the previous operations, we have made many manipulations to the big_companies_df dataframe. However, it is still the same.
big_companies_df
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
SAUDI ARABIAN OIL | 0.21 | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
FACEBOOK CLASS A | 0.00 | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
NESTLE 'N' | 2.47 | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
Pandas makes almost all manipulations so that the manipulated dataframe is returned as a new object. If you want to change the original dataframe, almost all the methods have inplace=True for that.
Indexing, selecting values and slicing Pandas dataframes is very similar to other Python objects. One important difference is that when using labels for slicing, the last label is inclusive.
big_companies_df['MICROSOFT':'TESLA']
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
FACEBOOK CLASS A | 0.00 | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
You can also pick up columns using double brackets, put the slicing is not possible.
big_companies_df[['DIVIDEND YIELD','CASH FLOW/SALES']]
DIVIDEND YIELD | CASH FLOW/SALES | |
---|---|---|
NAME | ||
APPLE | 0.71 | 28.01 |
SAUDI ARABIAN OIL | 0.21 | 31.73 |
MICROSOFT | 1.06 | 43.46 |
AMAZON.COM | 0.00 | 14.60 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 |
FACEBOOK CLASS A | 0.00 | 41.11 |
TENCENT HOLDINGS | 0.23 | 35.55 |
ALPHABET A | 0.00 | 33.18 |
TESLA | 0.00 | 11.21 |
WALMART | 1.54 | 4.88 |
JOHNSON & JOHNSON | 2.71 | 23.83 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 |
NESTLE 'N' | 2.47 | 16.85 |
You can also pick up values using booleans.
big_companies_df[big_companies_df['P/E - RATIO'] > 40]
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
You can build a boolean dataframe easily. In the following, we check if a value in the dataframe is larger than 10.
big_companies_df > 10
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | False | True | True | True | True | True |
SAUDI ARABIAN OIL | False | True | True | True | True | True |
MICROSOFT | False | True | True | True | True | True |
AMAZON.COM | False | True | True | True | True | True |
ALIBABA GROUP HOLDING ADR 1:8 | False | True | True | True | True | True |
FACEBOOK CLASS A | False | True | True | True | True | False |
TENCENT HOLDINGS | False | True | True | True | True | True |
ALPHABET A | False | True | True | True | True | False |
TESLA | False | True | False | False | True | True |
WALMART | False | False | True | True | True | True |
JOHNSON & JOHNSON | False | True | True | True | True | True |
TAIWAN SEMICON.MNFG. | False | True | True | True | True | True |
NESTLE 'N' | False | True | True | True | True | True |
To make the indexing more clear, Pandas has operators loc and iloc that can be used to select the rows and columns from a dataframe using Numpy-like notation. loc works with labels and iloc with integer positions.
big_companies_df
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
SAUDI ARABIAN OIL | 0.21 | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
FACEBOOK CLASS A | 0.00 | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
NESTLE 'N' | 2.47 | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
Notice that with loc we can use slicing also with columns.
big_companies_df.loc['TENCENT HOLDINGS':'TAIWAN SEMICON.MNFG.','DIVIDEND YIELD':'P/E - RATIO']
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | |
---|---|---|---|---|---|
NAME | |||||
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 |
big_companies_df.iloc[6:12,0:5]
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | |
---|---|---|---|---|---|
NAME | |||||
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 |
When adding dataframes, any index-column pairs that are missing from either dataframe, will be replaced with NaN-values. Let’s look an example. We create two dataframes with partial data. Both are missing one (different) row and one column.
partial_data_df = big_companies_df.drop(index='APPLE',columns='DIVIDEND YIELD')
partial_data_df
CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|
NAME | |||||
SAUDI ARABIAN OIL | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
MICROSOFT | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
AMAZON.COM | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
FACEBOOK CLASS A | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
TENCENT HOLDINGS | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
ALPHABET A | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
TESLA | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
WALMART | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
JOHNSON & JOHNSON | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |
TAIWAN SEMICON.MNFG. | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
NESTLE 'N' | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
partial_data2_df = big_companies_df.drop(index='TESLA',columns='CASH FLOW/SALES')
partial_data2_df
DIVIDEND YIELD | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|
NAME | |||||
APPLE | 0.71 | 55.92 | 27.65 | 35.4 | 119.40 |
SAUDI ARABIAN OIL | 0.21 | 32.25 | 28.34 | 29.5 | 16.96 |
MICROSOFT | 1.06 | 40.14 | 24.89 | 36.5 | 61.56 |
AMAZON.COM | 0.00 | 21.95 | 12.97 | 121.1 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 23.77 | 17.39 | 31.0 | 16.60 |
FACEBOOK CLASS A | 0.00 | 19.96 | 19.94 | 32.0 | 0.47 |
TENCENT HOLDINGS | 0.23 | 24.68 | 15.92 | 42.6 | 53.74 |
ALPHABET A | 0.00 | 18.12 | 17.76 | 33.2 | 2.26 |
WALMART | 1.54 | 20.22 | 12.55 | 26.6 | 72.95 |
JOHNSON & JOHNSON | 2.71 | 25.36 | 17.38 | 26.7 | 46.57 |
TAIWAN SEMICON.MNFG. | 2.31 | 20.94 | 18.96 | 24.7 | 11.89 |
NESTLE 'N' | 2.47 | 23.05 | 14.21 | 23.3 | 71.42 |
Those cells that are found in both dataframes have their values summed. If a specific cell is only in one of the dataframes, a NaN value is inserted to that location in the resulting dataframe.
partial_data_df + partial_data2_df
CASH FLOW/SALES | DIVIDEND YIELD | P/E - RATIO | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
ALIBABA GROUP HOLDING ADR 1:8 | NaN | NaN | 62.0 | 47.54 | 34.78 | 33.20 |
ALPHABET A | NaN | NaN | 66.4 | 36.24 | 35.52 | 4.52 |
AMAZON.COM | NaN | NaN | 242.2 | 43.90 | 25.94 | 166.62 |
APPLE | NaN | NaN | NaN | NaN | NaN | NaN |
FACEBOOK CLASS A | NaN | NaN | 64.0 | 39.92 | 39.88 | 0.94 |
JOHNSON & JOHNSON | NaN | NaN | 53.4 | 50.72 | 34.76 | 93.14 |
MICROSOFT | NaN | NaN | 73.0 | 80.28 | 49.78 | 123.12 |
NESTLE 'N' | NaN | NaN | 46.6 | 46.10 | 28.42 | 142.84 |
SAUDI ARABIAN OIL | NaN | NaN | 59.0 | 64.50 | 56.68 | 33.92 |
TAIWAN SEMICON.MNFG. | NaN | NaN | 49.4 | 41.88 | 37.92 | 23.78 |
TENCENT HOLDINGS | NaN | NaN | 85.2 | 49.36 | 31.84 | 107.48 |
TESLA | NaN | NaN | NaN | NaN | NaN | NaN |
WALMART | NaN | NaN | 53.2 | 40.44 | 25.10 | 145.90 |
If you want to have some other than NaN values to these locations, you can use the add() method with the fill_value parameter. This will replace the missing values with a specified value. Notice how APPLE CASH FLOW/SALES and TESLA/DIVIDEND YIELD still have missing values. This is because they are missing from both partial dataframes.
partial_data_df.add(partial_data2_df,fill_value=0)
CASH FLOW/SALES | DIVIDEND YIELD | P/E - RATIO | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
ALIBABA GROUP HOLDING ADR 1:8 | 31.23 | 0.00 | 62.0 | 47.54 | 34.78 | 33.20 |
ALPHABET A | 33.18 | 0.00 | 66.4 | 36.24 | 35.52 | 4.52 |
AMAZON.COM | 14.60 | 0.00 | 242.2 | 43.90 | 25.94 | 166.62 |
APPLE | NaN | 0.71 | 35.4 | 55.92 | 27.65 | 119.40 |
FACEBOOK CLASS A | 41.11 | 0.00 | 64.0 | 39.92 | 39.88 | 0.94 |
JOHNSON & JOHNSON | 23.83 | 2.71 | 53.4 | 50.72 | 34.76 | 93.14 |
MICROSOFT | 43.46 | 1.06 | 73.0 | 80.28 | 49.78 | 123.12 |
NESTLE 'N' | 16.85 | 2.47 | 46.6 | 46.10 | 28.42 | 142.84 |
SAUDI ARABIAN OIL | 31.73 | 0.21 | 59.0 | 64.50 | 56.68 | 33.92 |
TAIWAN SEMICON.MNFG. | 57.91 | 2.31 | 49.4 | 41.88 | 37.92 | 23.78 |
TENCENT HOLDINGS | 35.55 | 0.23 | 85.2 | 49.36 | 31.84 | 107.48 |
TESLA | 11.21 | NaN | 1036.3 | -14.94 | -1.61 | 202.77 |
WALMART | 4.88 | 1.54 | 53.2 | 40.44 | 25.10 | 145.90 |
You can also do arithmetic operations between a dataframe and a Pandas series. One thing to note is that these operations use broadcasting like Numpy arrays. Broadcasting is a very important concept in ML. You can read more about it here. numpy.org/doc/stable/user/basics.broadcasting.html
In its simplest, broadcasting means that if we are subtracting a Series from a Dataframe, the Series is subtracted from every row (or column) of the dataframe.
In the following, we subtract the values of TESLA from every row of our dataframe.
sample_series = big_companies_df.loc['TESLA']
sample_series
DIVIDEND YIELD 0.00
CASH FLOW/SALES 11.21
RETURN ON EQUITY - TOTAL (%) -14.94
RETURN ON INVESTED CAPITAL -1.61
P/E - RATIO 1036.30
TOTAL DEBT % COMMON EQUITY 202.77
Name: TESLA, dtype: float64
The index of the series is matched with the columns of the dataframe.
big_companies_df-sample_series
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | 0.71 | 16.80 | 70.86 | 29.26 | -1000.9 | -83.37 |
SAUDI ARABIAN OIL | 0.21 | 20.52 | 47.19 | 29.95 | -1006.8 | -185.81 |
MICROSOFT | 1.06 | 32.25 | 55.08 | 26.50 | -999.8 | -141.21 |
AMAZON.COM | 0.00 | 3.39 | 36.89 | 14.58 | -915.2 | -119.46 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 20.02 | 38.71 | 19.00 | -1005.3 | -186.17 |
FACEBOOK CLASS A | 0.00 | 29.90 | 34.90 | 21.55 | -1004.3 | -202.30 |
TENCENT HOLDINGS | 0.23 | 24.34 | 39.62 | 17.53 | -993.7 | -149.03 |
ALPHABET A | 0.00 | 21.97 | 33.06 | 19.37 | -1003.1 | -200.51 |
TESLA | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 |
WALMART | 1.54 | -6.33 | 35.16 | 14.16 | -1009.7 | -129.82 |
JOHNSON & JOHNSON | 2.71 | 12.62 | 40.30 | 18.99 | -1009.6 | -156.20 |
TAIWAN SEMICON.MNFG. | 2.31 | 46.70 | 35.88 | 20.57 | -1011.6 | -190.88 |
NESTLE 'N' | 2.47 | 5.64 | 37.99 | 15.82 | -1013.0 | -131.35 |
If the index/columns do not match, the resulting dataframe is a union.
sample_series = sample_series.append(pd.Series(1,index=['extra row']))
big_companies_df-sample_series
CASH FLOW/SALES | DIVIDEND YIELD | P/E - RATIO | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | TOTAL DEBT % COMMON EQUITY | extra row | |
---|---|---|---|---|---|---|---|
NAME | |||||||
APPLE | 16.80 | 0.71 | -1000.9 | 70.86 | 29.26 | -83.37 | NaN |
SAUDI ARABIAN OIL | 20.52 | 0.21 | -1006.8 | 47.19 | 29.95 | -185.81 | NaN |
MICROSOFT | 32.25 | 1.06 | -999.8 | 55.08 | 26.50 | -141.21 | NaN |
AMAZON.COM | 3.39 | 0.00 | -915.2 | 36.89 | 14.58 | -119.46 | NaN |
ALIBABA GROUP HOLDING ADR 1:8 | 20.02 | 0.00 | -1005.3 | 38.71 | 19.00 | -186.17 | NaN |
FACEBOOK CLASS A | 29.90 | 0.00 | -1004.3 | 34.90 | 21.55 | -202.30 | NaN |
TENCENT HOLDINGS | 24.34 | 0.23 | -993.7 | 39.62 | 17.53 | -149.03 | NaN |
ALPHABET A | 21.97 | 0.00 | -1003.1 | 33.06 | 19.37 | -200.51 | NaN |
TESLA | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | NaN |
WALMART | -6.33 | 1.54 | -1009.7 | 35.16 | 14.16 | -129.82 | NaN |
JOHNSON & JOHNSON | 12.62 | 2.71 | -1009.6 | 40.30 | 18.99 | -156.20 | NaN |
TAIWAN SEMICON.MNFG. | 46.70 | 2.31 | -1011.6 | 35.88 | 20.57 | -190.88 | NaN |
NESTLE 'N' | 5.64 | 2.47 | -1013.0 | 37.99 | 15.82 | -131.35 | NaN |
Basically all the Numpy’s element-wise array methods work with Pandas objects.
Tesla looks a little bit better now (no negative returns). :)
np.abs(big_companies_df)
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
SAUDI ARABIAN OIL | 0.21 | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
FACEBOOK CLASS A | 0.00 | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
TESLA | 0.00 | 11.21 | 14.94 | 1.61 | 1036.3 | 202.77 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
NESTLE 'N' | 2.47 | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
If you want to apply a function to rows or columns of a dataframe, you can use apply.
differ = lambda x:np.max(x)-np.min(x)
differ calculates the difference between the maximum and minimum of each column.
big_companies_df.apply(differ)
DIVIDEND YIELD 2.71
CASH FLOW/SALES 53.03
RETURN ON EQUITY - TOTAL (%) 70.86
RETURN ON INVESTED CAPITAL 29.95
P/E - RATIO 1013.00
TOTAL DEBT % COMMON EQUITY 202.30
dtype: float64
We can also do this row-wise.
big_companies_df.apply(differ,axis=1)
NAME
APPLE 118.69
SAUDI ARABIAN OIL 32.04
MICROSOFT 60.50
AMAZON.COM 121.10
ALIBABA GROUP HOLDING ADR 1:8 31.23
FACEBOOK CLASS A 41.11
TENCENT HOLDINGS 53.51
ALPHABET A 33.20
TESLA 1051.24
WALMART 71.41
JOHNSON & JOHNSON 43.86
TAIWAN SEMICON.MNFG. 55.60
NESTLE 'N' 68.95
dtype: float64
For element-wise manipulations, you can use applymap().
big_companies_df.applymap(lambda x: -x)
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | -0.71 | -28.01 | -55.92 | -27.65 | -35.4 | -119.40 |
SAUDI ARABIAN OIL | -0.21 | -31.73 | -32.25 | -28.34 | -29.5 | -16.96 |
MICROSOFT | -1.06 | -43.46 | -40.14 | -24.89 | -36.5 | -61.56 |
AMAZON.COM | -0.00 | -14.60 | -21.95 | -12.97 | -121.1 | -83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | -0.00 | -31.23 | -23.77 | -17.39 | -31.0 | -16.60 |
FACEBOOK CLASS A | -0.00 | -41.11 | -19.96 | -19.94 | -32.0 | -0.47 |
TENCENT HOLDINGS | -0.23 | -35.55 | -24.68 | -15.92 | -42.6 | -53.74 |
ALPHABET A | -0.00 | -33.18 | -18.12 | -17.76 | -33.2 | -2.26 |
TESLA | -0.00 | -11.21 | 14.94 | 1.61 | -1036.3 | -202.77 |
WALMART | -1.54 | -4.88 | -20.22 | -12.55 | -26.6 | -72.95 |
JOHNSON & JOHNSON | -2.71 | -23.83 | -25.36 | -17.38 | -26.7 | -46.57 |
TAIWAN SEMICON.MNFG. | -2.31 | -57.91 | -20.94 | -18.96 | -24.7 | -11.89 |
NESTLE 'N' | -2.47 | -16.85 | -23.05 | -14.21 | -23.3 | -71.42 |
sort_index can be used for sorting.
big_companies_df.sort_index()
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
APPLE | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
FACEBOOK CLASS A | 0.00 | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
NESTLE 'N' | 2.47 | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
SAUDI ARABIAN OIL | 0.21 | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
big_companies_df.sort_index(axis=1)
CASH FLOW/SALES | DIVIDEND YIELD | P/E - RATIO | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
APPLE | 28.01 | 0.71 | 35.4 | 55.92 | 27.65 | 119.40 |
SAUDI ARABIAN OIL | 31.73 | 0.21 | 29.5 | 32.25 | 28.34 | 16.96 |
MICROSOFT | 43.46 | 1.06 | 36.5 | 40.14 | 24.89 | 61.56 |
AMAZON.COM | 14.60 | 0.00 | 121.1 | 21.95 | 12.97 | 83.31 |
ALIBABA GROUP HOLDING ADR 1:8 | 31.23 | 0.00 | 31.0 | 23.77 | 17.39 | 16.60 |
FACEBOOK CLASS A | 41.11 | 0.00 | 32.0 | 19.96 | 19.94 | 0.47 |
TENCENT HOLDINGS | 35.55 | 0.23 | 42.6 | 24.68 | 15.92 | 53.74 |
ALPHABET A | 33.18 | 0.00 | 33.2 | 18.12 | 17.76 | 2.26 |
TESLA | 11.21 | 0.00 | 1036.3 | -14.94 | -1.61 | 202.77 |
WALMART | 4.88 | 1.54 | 26.6 | 20.22 | 12.55 | 72.95 |
JOHNSON & JOHNSON | 23.83 | 2.71 | 26.7 | 25.36 | 17.38 | 46.57 |
TAIWAN SEMICON.MNFG. | 57.91 | 2.31 | 24.7 | 20.94 | 18.96 | 11.89 |
NESTLE 'N' | 16.85 | 2.47 | 23.3 | 23.05 | 14.21 | 71.42 |
If you want sort by values, you can use sort_values()
big_companies_df.sort_values(by = 'P/E - RATIO',ascending=False)
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
APPLE | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
FACEBOOK CLASS A | 0.00 | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
SAUDI ARABIAN OIL | 0.21 | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
NESTLE 'N' | 2.47 | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
You can use many colums when sorting by values.
big_companies_df.sort_values(by = ['DIVIDEND YIELD','P/E - RATIO'],ascending=[True,False])
DIVIDEND YIELD | CASH FLOW/SALES | RETURN ON EQUITY - TOTAL (%) | RETURN ON INVESTED CAPITAL | P/E - RATIO | TOTAL DEBT % COMMON EQUITY | |
---|---|---|---|---|---|---|
NAME | ||||||
TESLA | 0.00 | 11.21 | -14.94 | -1.61 | 1036.3 | 202.77 |
AMAZON.COM | 0.00 | 14.60 | 21.95 | 12.97 | 121.1 | 83.31 |
ALPHABET A | 0.00 | 33.18 | 18.12 | 17.76 | 33.2 | 2.26 |
FACEBOOK CLASS A | 0.00 | 41.11 | 19.96 | 19.94 | 32.0 | 0.47 |
ALIBABA GROUP HOLDING ADR 1:8 | 0.00 | 31.23 | 23.77 | 17.39 | 31.0 | 16.60 |
SAUDI ARABIAN OIL | 0.21 | 31.73 | 32.25 | 28.34 | 29.5 | 16.96 |
TENCENT HOLDINGS | 0.23 | 35.55 | 24.68 | 15.92 | 42.6 | 53.74 |
APPLE | 0.71 | 28.01 | 55.92 | 27.65 | 35.4 | 119.40 |
MICROSOFT | 1.06 | 43.46 | 40.14 | 24.89 | 36.5 | 61.56 |
WALMART | 1.54 | 4.88 | 20.22 | 12.55 | 26.6 | 72.95 |
TAIWAN SEMICON.MNFG. | 2.31 | 57.91 | 20.94 | 18.96 | 24.7 | 11.89 |
NESTLE 'N' | 2.47 | 16.85 | 23.05 | 14.21 | 23.3 | 71.42 |
JOHNSON & JOHNSON | 2.71 | 23.83 | 25.36 | 17.38 | 26.7 | 46.57 |