22. Processing unstructured data#

22.1. Extracting accounting data from different document types#

22.1.1. JSON#

JSON stands for JavaScript Object Notation. It is a popular format for transporting and storing data, especially in web applications. Python has a built-in library for encoding and decoding JSON. Below are basic examples of how to use it.

import json

The format of JSON is very similar to a Python dictionary.

json_example = '{ "Company":"Nokia", "Country":"Finland", "ROA":0.12}'

loads() turns JSON strings into dictionaries.

result_dict = json.loads(json_example)
result_dict
{'Company': 'Nokia', 'Country': 'Finland', 'ROA': 0.12}
result_dict['ROA']
0.12

dumps() can be used to convert dictionaries into JSON strings.

pyth_dict = {"Company":"Apple", "Country":"USA", "ROA":0.17}
results_json = json.dumps(pyth_dict)
results_json # Notice the outer single quotation marks.
'{"Company": "Apple", "Country": "USA", "ROA": 0.17}'

More specifically, dumps() converts Python objects into JSON types according to the following rules (a short example follows the list):

  • Python dict to JSON object

  • Python list to JSON array

  • Python tuple to JSON array

  • Python string to JSON string

  • Python int to JSON number

  • Python float to JSON number

  • Python boolean to JSON boolean

  • Python None to JSON null
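
As a minimal sketch of these rules, the snippet below serializes a dictionary that mixes the types listed above; the keys and values are made up for illustration.

import json

report = {
    "Company": "Nokia",        # str   -> JSON string
    "Years": [2019, 2020],     # list  -> JSON array
    "ROA": (0.12, 0.15),       # tuple -> JSON array
    "Listed": True,            # bool  -> JSON true
    "Auditor": None            # None  -> JSON null
}

print(json.dumps(report, indent=2))  # indent=2 pretty-prints the result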

Very often, data providers set up API services to connect applications to databases. These APIs usually exchange data between the client and the server in XML or JSON format, so it is essential to know how to read these formats.

With the json library, you can use the built-in open() function to open a JSON file and json.loads() to parse its contents into a Python object (json.load() does the same directly from a file object).

with open('company_tickers.json') as fd:
    comp_dict = json.loads(fd.read())
list(comp_dict.items())[:10]
[('0', {'cik_str': 1750, 'ticker': 'AIR', 'title': 'AAR CORP'}),
 ('1', {'cik_str': 1800, 'ticker': 'ABT', 'title': 'ABBOTT LABORATORIES'}),
 ('2', {'cik_str': 1961, 'ticker': 'WDDD', 'title': 'WORLDS INC'}),
 ('3', {'cik_str': 2034, 'ticker': 'ACET', 'title': 'ACETO CORP'}),
 ('4', {'cik_str': 2034, 'ticker': 'ACETQ', 'title': 'ACETO CORP'}),
 ('5', {'cik_str': 2098, 'ticker': 'ACU', 'title': 'ACME UNITED CORP'}),
 ('6',
  {'cik_str': 2135,
   'ticker': 'ACS',
   'title': 'AFFILIATED COMPUTER SERVICES INC'}),
 ('7',
  {'cik_str': 2178,
   'ticker': 'AE',
   'title': 'ADAMS RESOURCES & ENERGY, INC.'}),
 ('8', {'cik_str': 2186, 'ticker': 'BKTI', 'title': 'BK Technologies Corp'}),
 ('9',
  {'cik_str': 2230,
   'ticker': 'ADX',
   'title': 'ADAMS DIVERSIFIED EQUITY FUND, INC.'})]

Pandas also has functions to read JSON files.

import pandas as pd
json_df = pd.read_json('company_tickers.json')
json_df
                0                    1           2  ...                    13631                    13632
cik_str      1750                 1800        1961  ...                  1791091                  1791091
ticker        AIR                  ABT        WDDD  ...                     HCCO                    HCCOU
title    AAR CORP  ABBOTT LABORATORIES  WORLDS INC  ...  Healthcare Merger Corp.  Healthcare Merger Corp.

3 rows × 13633 columns

json_df.transpose()
cik_str ticker title
0 1750 AIR AAR CORP
1 1800 ABT ABBOTT LABORATORIES
2 1961 WDDD WORLDS INC
3 2034 ACET ACETO CORP
4 2034 ACETQ ACETO CORP
... ... ... ...
13628 1789760 CIICU CIIG Merger Corp.
13629 1789832 HESM Hess Midstream LP
13630 1790625 LIVK LIV Capital Acquisition Corp.
13631 1791091 HCCO Healthcare Merger Corp.
13632 1791091 HCCOU Healthcare Merger Corp.

13633 rows × 3 columns
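
As an alternative to transposing, read_json has an orient parameter; with orient='index' the top-level keys of the file become the row index directly. A minimal sketch, assuming the same dict-of-dicts structure as above:

json_df = pd.read_json('company_tickers.json', orient='index')
json_df.head()  # one company per row, with columns cik_str, ticker and title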

22.1.2. XML#

XML is another important format when you are acquiring data from the web through APIs. The following link has a lot of information on how to use Python to process XML documents; a short parsing example is given after the link.

https://docs.python.org/3/library/xml.etree.elementtree.html
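
As a minimal sketch of the standard-library approach, the snippet below parses a small, made-up XML fragment of company data with xml.etree.ElementTree; the tags and values are invented for illustration.

import xml.etree.ElementTree as ET

xml_example = """
<companies>
  <company country="Finland">
    <name>Nokia</name>
    <roa>0.12</roa>
  </company>
  <company country="USA">
    <name>Apple</name>
    <roa>0.17</roa>
  </company>
</companies>
"""

root = ET.fromstring(xml_example)
for company in root.findall('company'):
    # Attributes come from get(), child element values from .text
    print(company.find('name').text, company.get('country'),
          float(company.find('roa').text))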

22.1.3. XBRL#

XBRL is a very important data format for accounting and finance. It is a universal reporting standard that allows financial statement information to be downloaded directly into spreadsheets. The following links provide information on how to process XBRL documents in Python; a small sketch with the python-xbrl library is given after the links.

https://pypi.org/project/python-xbrl/

https://www.codeproject.com/Articles/1227268/Accessing-Financial-Reports-in-the-EDGAR-Database

https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python
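
As a hedged sketch of the python-xbrl library linked above (the class names and arguments follow its README and may differ between versions; the filename is a made-up example):

from xbrl import XBRLParser, GAAP, GAAPSerializer

xbrl_parser = XBRLParser()
# Parse a (hypothetical) XBRL instance document
xbrl = xbrl_parser.parse(open('sam-20131228.xml'))
# Extract US-GAAP concepts for the given period
gaap_obj = xbrl_parser.parseGAAP(xbrl, doc_date="20131228",
                                 context="current", ignore_errors=0)
# Serialize the parsed concepts into a plain dictionary
result = GAAPSerializer().dump(gaap_obj)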

22.1.4. PDF#

Very often, information from companies comes in the form of PDF documents. For analysis in Python, they need to be converted to text. There are many ways to do that, such as specialized software or web services.

Python also has many libraries for processing PDF documents. In my opinion, PDFMiner is one of the best.

Below is a function that efficiently converts PDF documents to text. I found it on the internet, so no credit to me.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# Import this to raise an exception whenever text extraction from a PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
# From pdfinterp, import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator

def convert_pdfminer(fname):
    with open(fname, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed(fname)
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        # The aggregator collects the layout objects of each processed page
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        text = ''
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            layout = device.get_result()
            # Keep only the text boxes and text lines of the layout
            for lt_obj in layout:
                if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                    text += lt_obj.get_text()
    return text
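
A usage sketch, assuming a file named annual_report.pdf (a hypothetical filename) in the working directory:

report_text = convert_pdfminer('annual_report.pdf')
print(report_text[:500])  # the first 500 characters of the extracted text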

22.2. Extracting data from the Internet#

22.2.1. Social media#

Tweepy is an excellent library for the Twitter API.

import tweepy

You need permission from Twitter to use their API. In return, they will send you the necessary keys. Replace the CONSUMER_KEY and CONSUMER_SECRET placeholders below with your personal keys.

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
try:
    redirect_url = auth.get_authorization_url()
except tweepy.TweepError:
    print('Error! Failed to get request token.')
api = tweepy.API(auth)
user = api.get_user('twitter')
user.name
'Twitter'
user.followers_count
59478986
search_words = "#blockchain"
date_since = "2020-01-01"
tweets = tweepy.Cursor(api.search,
              q=search_words,
              lang="en",
              since=date_since).items(10)
for tweet in tweets:
    print(tweet.text)
RT @jamilahmed_16: Buy 1 Get 2 offer #Edureka Courses

https://t.co/BdDunPHkJV

#node #nodejs #javascript #npm #coding #webdesign #js #sql…
RT @PaniniAmerica: Tomorrow at 10 a.m. CST! @PaniniAmerica launches the 2021 Immaculate Blockchain initiative — a 10-week release featuring…
Buy 1 Get 2 offer #Edureka Courses

https://t.co/BdDunPHkJV

#node #nodejs #javascript #npm #coding #webdesign #js… https://t.co/e6WxLOp7FY
RT @EpikNFT: #NFTCommunity let's go 🔥🚀

Put your best NFT's in the comments👇

Retweet and like for exposure BLAST!

#nftart #nftcollector #…
RT @ethrollcasino: The ETR sale starts in just 48 hours!

Visit out website for more info! https://t.co/tCn1QKe2FY

#EthRollHype #EthRollMo…
RT @MinaProtocol: Mina's first virtual summit #IlluminateGenesis💡is now LIVE! 🎊 

Join our free #blockchain event, featuring panels and ses…
RT @Gemini: 4/8 SKALE $SKL is an ERC-777 standard utility token used to support application deployment on the @SkaleNetwork, a layer-2 #blo…
@georginacastens @rariblecom 🎁 Don’t buy an old shiny rock, buy LBY Coin.💎 

#blockchain #interoperability #thefastestblockchain #Libonomy
RT @ethrollcasino: The ETR sale starts in just 48 hours!

Visit out website for more info! https://t.co/tCn1QKe2FY

#EthRollHype #EthRollMo…
RT @FinanceChainge: Seeing Chainge through the eyes of our community members is always inspiring and pushes us forward knowing our app's po…
# A Cursor is exhausted after one pass, so create a new one for the next query
tweets = tweepy.Cursor(api.search,
                       q=search_words,
                       lang="en",
                       since=date_since).items(10)

users_locs = [[tweet.user.screen_name, tweet.user.location] for tweet in tweets]
users_locs
[['AlbanoVfx', 'Brazil'],
 ['cg1907BC', 'Paderborn, Germany'],
 ['RaverCrypto', ''],
 ['Jojojo63407320', ''],
 ['AbdulBa58799299', ''],
 ['Rana01744715781', ''],
 ['flipzide', 'Metaverse'],
 ['RajeshKSuri', 'New Delhi'],
 ['SurajKu83832715', ''],
 ['robertcastelo', 'London']]
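
The list of pairs is easy to move into a dataframe for further analysis; a minimal sketch (the column names are my own choice):

users_df = pd.DataFrame(users_locs, columns=['screen_name', 'location'])
users_df.head()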

22.2.2. Structured data#

Below is an example that collects information from the internet into a structured format (a dataframe) using the Pandas read_html() function. The source page is in Finnish: Nimi is the company name, Tulos/osake is earnings per share, Osinkotuotto is dividend yield, Osinko/osake is dividend per share, and Lainoitusaste is the loan ratio. Note that the key-figure labels parsed from the page appear to be shifted by one column, which is why each work_df column below is filled from the following source column.

test_df = pd.read_html('https://www.nordnet.fi/markkinakatsaus/osakekurssit?selectedTab=keyFigures&sortField=pe&sortOrder=asc&exchangeCountry=FI',decimal=',')
work_df = pd.DataFrame()

work_df['Name'] = test_df[0]['Nimi']

# Each cell holds two space-separated tokens; keep only the first (the value).
# The parsed labels are shifted by one, so each column is read from the next one.
work_df['P/E'] = [i for [i, j] in test_df[0]['P/S'].str.split(' ')]
work_df['P/S'] = [i for [i, j] in test_df[0]['P/B'].str.split(' ')]
work_df['P/B'] = [i for [i, j] in test_df[0]['Tulos/osake'].str.split(' ')]

# Some cells are floats/NaN and cannot be split; keep those values as they are
temp_list = []
for value in test_df[0]['Osinkotuotto']:
    try:
        temp_list.append(value.split(' ')[0])
    except AttributeError:
        temp_list.append(value)

work_df['Tulos/osake'] = temp_list

temp_list = []
for value in test_df[0]['Osinko/osake']:
    try:
        temp_list.append(value.split(' ')[0])
    except AttributeError:
        temp_list.append(value)

work_df['Osinkotuotto'] = temp_list

temp_list = []
for value in test_df[0]['Unnamed: 11']:
    try:
        temp_list.append(value.split(' ')[0])
    except AttributeError:
        temp_list.append(value)

work_df['Lainoitusaste'] = temp_list
work_df
                  Name    P/E   P/S    P/B Tulos/osake Osinkotuotto Lainoitusaste
0          Nexstim Oyj   0.26  2.72  84.03        0.22            –             0
1           Fondia Oyj   7.61  1.46   4.93        0.97            –            10
2    Sievi Capital plc   8.38  4.88   0.89        0.12            –             0
3          Scanfil Plc   8.54  0.63   2.06        0.68         2.62            40
4   Honkarakenne Oyj B   8.58  0.46   1.72        0.43            –            10
..                 ...    ...   ...    ...         ...          ...           ...
95    KONE Corporation  39.22  3.67  12.74        1.81         2.39            85
96          Gofore Plc  42.74  2.63   5.78         0.3         1.56            10
97    Aallon Group Oyj  48.31  2.01   4.07         0.2         1.87             0
98    Telia Company AB  48.54  1.68    1.9        0.71         4.79            85
99        LeadDesk Oyj  52.63  8.53  13.76        0.41            –            10

100 rows × 7 columns

import numpy as np
import matplotlib.pyplot as plt

work_df['P/E'] = work_df['P/E'].astype('float')
work_df['P/S'] = work_df['P/S'].astype('float')
work_df['P/B'] = work_df['P/B'].astype('float')

# Missing values are marked with '–'; replace them with NaN before converting
work_df['Tulos/osake'].replace('–', np.nan, inplace=True)
work_df['Tulos/osake'] = work_df['Tulos/osake'].astype('float')

work_df['Osinkotuotto'].replace('–', np.nan, inplace=True)
work_df['Osinkotuotto'] = work_df['Osinkotuotto'].astype('float')

work_df['Lainoitusaste'].replace('–', np.nan, inplace=True)
work_df['Lainoitusaste'] = work_df['Lainoitusaste'].astype('float')

work_df.hist(figsize=(10, 10))
plt.show()
[Figure: histograms of the work_df key-figure columns]

22.3. Processing text data#

The straightforward way to process text data is to use regular expressions. They are very efficient but, unfortunately, difficult to master. Luckily, many NLP libraries have good text-processing tools available. Especially good is the Natural Language Toolkit (https://www.nltk.org/).

22.3.1. Regular expressions#

https://www.regular-expressions.info/quickstart.html
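
As a minimal sketch with the standard-library re module, the pattern below picks percentages out of a made-up sentence; both the text and the pattern are my own illustration.

import re

text = "Nokia's ROA was 12 % in 2020, up from 9 % in 2019."

# \d+ matches one or more digits; \s? allows an optional space before %
percentages = re.findall(r"\d+\s?%", text)
print(percentages)  # ['12 %', '9 %']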

22.4. Processing video and images#

22.4.1. Satellite data#

Below is an example where NO2 observations from the Sentinel 5P satellite (https://sentinel.esa.int/web/sentinel/missions/sentinel-5p) are processed into usable image form in Python.

from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

ESA provides data from Sentinel satellites using NetCDF file format. From pro.arcgis.com: “NetCDF (network Common Data Form) is a file format for storing multidimensional scientific data (variables) such as temperature, humidity, pressure, wind speed, and direction. Each of these variables can be displayed through a dimension (such as time) in ArcGIS by making a layer or table view from the netCDF file.”

my_example_nc_file = 'S5P_NRTI_L2__NO2____20200309T105605_20200309T110105_12457_01_010302_20200309T114222.nc'
fh = Dataset(my_example_nc_file, mode='r')
print(fh.groups['PRODUCT'].variables['nitrogendioxide_tropospheric_column_precision'])
<class 'netCDF4._netCDF4.Variable'>
float32 nitrogendioxide_tropospheric_column_precision(time, scanline, ground_pixel)
    units: mol m-2
    standard_name: troposphere_mole_content_of_nitrogen_dioxide standard_error
    long_name: Precision of the tropospheric vertical column of nitrogen dioxide
    coordinates: longitude latitude
    multiplication_factor_to_convert_to_molecules_percm2: 6.022141e+19
    _FillValue: 9.96921e+36
path = /PRODUCT
unlimited dimensions: 
current shape = (1, 373, 450)
filling on
lons = fh.groups['PRODUCT'].variables['longitude'][:][0,:,:]
lats = fh.groups['PRODUCT'].variables['latitude'][:][0,:,:]
no2 = fh.groups['PRODUCT'].variables['nitrogendioxide_tropospheric_column_precision'][0,:,:]
print (lons.shape)
print (lats.shape)
print (no2.shape)
(373, 450)
(373, 450)
(373, 450)
no2_units = fh.groups['PRODUCT'].variables['nitrogendioxide_tropospheric_column_precision'].units
no2_units
'mol m-2'
from matplotlib.colors import LogNorm
from mpl_toolkits.basemap import Basemap
lon_0 = lons.mean()
lat_0 = lats.mean()

m = Basemap(width=3000000,height=2500000,
            resolution='l',projection='stere',lat_0=lat_0,lon_0=lon_0)

xi, yi = m(lons, lats)
plt.figure(figsize=(10,10))
cs = m.pcolor(xi,yi,np.squeeze(no2),norm=LogNorm(), cmap='jet')
m.drawparallels(np.arange(-80., 81., 10.), labels=[1,0,0,0], fontsize=10)
m.drawmeridians(np.arange(-180., 181., 10.), labels=[0,0,0,1], fontsize=10)

# Add Coastlines, States, and Country Boundaries
m.drawcoastlines(linewidth=2.5)
m.drawcountries(linewidth=2.5)
cbar = m.colorbar(cs, location='bottom', pad="10%")
cbar.set_label(no2_units)
plt.savefig('test.png')
[Figure: tropospheric NO2 column precision plotted on a stereographic Basemap with coastlines, country borders and a colorbar in mol m-2]
plt.figure(figsize=(10,10))
plt.imshow(no2,cmap='hot')
[Figure: the NO2 array rendered as a heatmap with plt.imshow]

22.5. Speech recognition and synthesis#

22.6. Feature engineering#