2. NLP example - LDA and other summarisation tools

In this example, we analyse a collection of academic journal articles with Latent Dirichlet Allocation (LDA) and other summarisation tools.

To manipulate and list files in directories, we need the os library.

import os

I have the data in the acc_journals folder, which is under the work folder. Unfortunately, this data is not publicly available. If you want to follow this example, just put a collection of txt files into an “acc_journals” folder (under the work folder) and follow the steps.

data_path = './acc_journals/'

os.listdir() returns a list of the filenames inside data_path.

files = os.listdir(data_path)

The name of the first text file is “2019_1167.txt”.

files[0]
'2019_1167.txt'

In total, we have 2126 articles.

len(files)
2126

The filenames have the publication year as the first four digits of the name. With the following code, we can collect the publication years into a list.

file_years = []
for file in files:
    file_years.append(int(file[:4])) # Pick the first four characters of the filename string and turn them into an integer.

The following list comprehension counts how many documents we have for each year.

[file_years.count(a) for a in set(file_years)]
[260, 296, 297, 134, 128, 129, 174, 234, 226, 248]

In this example, we will need NumPy to manipulate arrays and Matplotlib for plots.

import numpy as np
import matplotlib.pyplot as plt

Let’s plot the number of documents per year.

plt.style.use('fivethirtyeight') # Define the style of figures.
plt.figure(figsize=[10,6]) # Define the size of figures.
# The first argument: years, the second argument: the list of document frequencies for different years (built using list comprehension)
plt.bar(list(set(file_years)),[file_years.count(a) for a in set(file_years)]) 
plt.xticks(list(set(file_years))) # The years below the bars
plt.show()
_images/2_2_LDA_example_16_0.png

The common practice is to remove stopwords from documents. These are common filler words that do not carry information about the content of a document. We use the stopwords list from the NLTK library (www.nltk.org). Here is a short description of NLTK from their web page: “NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.”
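
If the NLTK stopword corpus has not been downloaded yet, it can be fetched once with NLTK's standard download function (a one-time step):

import nltk
nltk.download('stopwords') # Downloads the NLTK stopword lists.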

from nltk.corpus import stopwords

stop_words = stopwords.words("english")

Here are some example stopwords.

stop_words[0:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Usually, the stopword list is extended with uninformative words specific to the corpus being analysed. That is done below. These additional words are found by inspecting the results at different steps of the analysis and then repeating the analysis.

stop_words.extend(['fulltext','document','downloaded','download','emerald','emeraldinsight',
                   'accepted','com','received','revised','archive','journal','available','current',
                   'issue','full','text','https','doi','org','www','ieee','cid','et','al','pp',
                   'vol','fig','reproduction','prohibited','reproduced','permission','accounting','figure','chapter'])

The following code goes through every file in the files list and reads its content as an item into the raw_text list. Thus, we have a list of 2126 items, where each item is the raw text of one document.

raw_text = []
for file in files:
    with open(os.path.join(data_path, file), 'r', errors='ignore') as fd: # errors='ignore' skips undecodable characters.
        raw_text.append(fd.read())

Here is an example of raw text from the first document.

raw_text[0][3200:3500]
'currency exchanges and markets have daily dollar volume of around $50 billion.1 Over 300 "cryptofunds" have emerged (hedge funds that invest solely in cryptocurrencies), attracting around $10 billion in assets under management (Rooney and Levy 2018).2 Recently, bitcoin futures have commenced trading'

For the following steps, we need Gensim, which is a multipurpose NLP library especially designed for topic modelling.

Here is a short description from the Gensim Github-page (github.com/RaRe-Technologies/gensim):

“Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. Features:

  • All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core),

  • Intuitive interfaces

    • Easy to plug in your own input corpus/datastream (trivial streaming API)

    • Easy to extend with other Vector Space algorithms (trivial transformation API)

  • Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.

  • Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.”

import gensim

Gensim has a convenient simple_preprocess() function that performs many text-cleaning steps automatically. It converts a document into a list of lowercase tokens, ignoring tokens that are too short (fewer than two characters) or too long (more than 15 characters). The following code goes through all the raw texts and applies simple_preprocess() to them. Thus, docs_cleaned is a list whose items are lists of tokens.
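
As a quick illustration, here is what simple_preprocess() does to a short made-up sentence (the actual corpus is cleaned right after this): the text is lowercased and split into tokens, while punctuation, numbers and one-character tokens are dropped.

gensim.utils.simple_preprocess("Bitcoin futures commenced trading in December 2017!")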

docs_cleaned = []
for item in raw_text:
    tokens = gensim.utils.simple_preprocess(item)
    docs_cleaned.append(tokens)

Here is an example from the first document after cleaning. The documents are now lists of tokens, and here the list is joined back as a string of text.

" ".join(docs_cleaned[0][300:400])
'for assistance relating to data acknowledges financial support from the capital markets co operative research centre acknowledges financial support from the australian research council arc de supplementary data can be found on the review of financial studies web site send correspondence to talis putnins uts business school university of technology sydney po box broadway nsw australia telephone mail talis putnins uts edu au the author published by oxford university press on behalf of the society for financial studies all rights reserved for permissions please mail journals permissions oup com doi rfs hhz downloaded from https academic oup com rfs article'

Next, we remove the stopwords from the documents.

docs_nostops = []
for item in docs_cleaned:
    red_tokens = [word for word in item if word not in stop_words]
    docs_nostops.append(red_tokens)

An example text from the first document after removing the stopwords.

" ".join(docs_nostops[0][300:400])
'hedge funds invest solely attracting around billion assets management rooney levy recently bitcoin futures commenced trading cme cboe catering institutional demand trading hedging bitcoin fringe asset quickly maturing rapid growth anonymity provide users created considerable regulatory challenges application million cryptocurrency exchange traded fund etf rejected securities exchange commission sec march several rejected amid concerns including lack regulation chinese government banned residents trading made initial coin offerings icos illegal september central bank heads bank england mark carney publicly expressed concerns many potential benefits including faster efficient settlement payments regulatory concerns center around use illegal trade drugs hacks thefts illegal pornography even'

Next, we remove everything but nouns, adjectives, verbs and adverbs, as recognised by our language model. As our language model, we use the large English model from spaCy (spacy.io). Here are some key features of spaCy from their web page:

  • Non-destructive tokenisation

  • Named entity recognition

  • Support for 59+ languages

  • 46 statistical models for 16 languages

  • Pretrained word vectors

  • State-of-the-art speed

  • Easy deep learning integration

  • Part-of-speech tagging

  • Labelled dependency parsing

  • Syntax-driven sentence segmentation

  • Built-in visualisers for syntax and NER

  • Convenient string-to-hash mapping

  • Export to NumPy data arrays

  • Efficient binary serialisation

  • Easy model packaging and deployment

  • Robust, rigorously evaluated accuracy

First, we load the library.

import spacy

Then we define that only certain part-of-speech (POS) tags are allowed.

allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']

The en_core_web_lg model is quite large (about 700 MB), so it takes a while to download. We do not need the dependency parser or named-entity recognition, so we disable them.
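
If the model has not been installed yet, it can be downloaded once from the command line with spaCy's standard model download command:

python -m spacy download en_core_web_lg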

nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])

The following code goes through the documents, removes words that our language model does not recognise as nouns, adjectives, verbs or adverbs, and lemmatises the remaining words (i.e., converts them to their base form with token.lemma_).

docs_lemmas = []
for red_tokens in docs_nostops:
    doc = nlp(" ".join(red_tokens)) # We need to join the list of tokens back to a single string.
    docs_lemmas.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])

Here is again an example text from the first document. Things are looking good. We have a clean collection of words that are meaningful when we try to figure out what is discussed in the text.

" ".join(docs_lemmas[0][300:400])
'silk road marketplace combine public nature blockchain provide unique laboratory analyze illegal ecosystem evolve bitcoin network individual identity mask pseudo anonymity character alpha public nature blockchain allow link bitcoin transaction individual user market participant identify user release see next autonomous future commence trade contract bitcoin price approximately bitcoin time future launch contract notional value academic library user review financial study seize authority bitcoin seizure combine source provide sample user know involved illegal activity starting point analysis apply different empirical approach go sample estimate population illegal activity first approach exploit trade network user know involve illegal activity illegal user use bitcoin'

Next, we build our bigram model. Bigrams are two-word pairs that naturally belong together, like the words New and York. We connect these words before we do the LDA analysis.

bigram = gensim.models.Phrases(docs_lemmas,threshold = 80, min_count=3) 
bigram_mod = gensim.models.phrases.Phraser(bigram)

docs_bigrams = [bigram_mod[doc] for doc in docs_lemmas]

In this sample text, the model creates the bigrams academic_library, starting_point and bitcoin_seizure. The formed bigrams are quite good; it is often difficult to set the parameters of Phrases so that only reasonable bigrams are formed.

" ".join(docs_bigrams[0][300:400])
'evolve bitcoin network individual identity mask pseudo anonymity character alpha public nature blockchain allow link bitcoin transaction individual user market participant identify user release see next autonomous future commence trade contract bitcoin price approximately bitcoin time future launch contract notional value academic_library user review financial study seize authority bitcoin_seizure combine source provide sample user know involved illegal activity starting_point analysis apply different empirical approach go sample estimate population illegal activity first approach exploit trade network user know involve illegal activity illegal user use bitcoin_blockchain reconstruct complete network transaction market participant apply type network cluster analysis identify distinct community datum legal'

2.1. LDA

[Image: lda]

Okay, let’s start our LDA analysis. For this, we need the corpora module from Gensim.

import gensim.corpora as corpora

First, with Dictionary, we build our dictionary from the list docs_bigrams. Then we filter the most extreme cases out of the dictionary:

  • no_below = 2 : drop words that appear in fewer than two documents

  • no_above = 0.7 : drop words that appear in more than 70 % of the documents

  • keep_n = 50000 : keep (at most) the 50,000 most frequent remaining words

id2word = corpora.Dictionary(docs_bigrams)
id2word.filter_extremes(no_below=2, no_above=0.7, keep_n=50000)

Then, we build our corpus by indexing the words of the documents using our id2word dictionary.

corpus = [id2word.doc2bow(text) for text in docs_bigrams]

corpus contains, for every document, a list of tuples, where the first item of each tuple is a word index and the second is the frequency of that word in the document.

For example, in the first document, the word with index 0 in our dictionary appears once, the word with index 1 appears four times, etc.

corpus[0][0:10]
[(0, 1),
 (1, 4),
 (2, 3),
 (3, 50),
 (4, 5),
 (5, 4),
 (6, 5),
 (7, 1),
 (8, 2),
 (9, 2)]

We can check what those words are just by indexing our dictionary id2word. As you can see, 'academic_library' (the word with index 3) appears 50 times in the first document.

[id2word[i] for i in range(10)]
['abnormal',
 'absolute',
 'ac',
 'academic_library',
 'accept',
 'access',
 'accessible',
 'accompany',
 'accuracy',
 'achieve']

This is the main step of our analysis: we build the LDA model. There are many parameters in Gensim's LdaModel. Luckily, the default parameters work quite well. You can read more about the parameters at radimrehurek.com/gensim/models/ldamodel.html

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=len(corpus),
                                           passes=10,
                                           alpha='asymmetric',
                                           per_word_topics=False,
                                            eta = 'auto')

pyLDAvis is a useful library to visualise the results: github.com/bmabey/pyLDAvis.

import pyLDAvis.gensim

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)

With pyLDAvis, we get an interactive figure with intertopic distances and the most important words for each topic. The more separated the topic “bubbles” are, the better the model.

vis
[pyLDAvis output: an interactive Intertopic Distance Map (via multidimensional scaling) of the ten topics, with a slider to adjust the relevance metric λ and a bar chart of the Top-30 Most Salient Terms (saliency and relevance as defined by Chuang et al. 2012 and Sievert & Shirley 2014)]

We need pandas to present the most important words in a dataframe.

import pandas as pd

The following code builds a dataframe of the ten most important words for each topic. Our task would now be to infer what each topic is about from these words.

top_words_df = pd.DataFrame()
for i in range(10):
    temp_words = lda_model.show_topic(i,10)
    just_words = [name for (name,_) in temp_words]
    top_words_df['Topic ' + str(i+1)] = just_words

top_words_df
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
0 privacy project network team manager price capability consumer price sale
1 security innovation investment task marketing optimal organizational marketing patient participant
2 perceive idea innovation member analytic demand construct brand demand retailer
3 patient organizational country network people distribution patent medium period trust
4 website action capability organizational employee policy strategic price policy consumer
5 web digital strategic project financial inform item purchase capacity store
6 participant activity platform job risk contract feature search inventory app
7 consumer community standard employee big operation communication network risk game
8 intention software digital communication corporate solution manager content operation sample
9 attitude goal risk tie return supplier operation sale pricing task

The following steps build a figure showing the evolution of each topic over time. The values are calculated by evaluating the weight of each topic in every document and then averaging these weights over the documents of each year.

evolution = np.zeros([len(corpus),10]) # We pre-build the numpy array filled with zeroes.
ind = 0
for bow in corpus:
    topics = lda_model.get_document_topics(bow)
    for topic in topics:
        evolution[ind,topic[0]] = topic[1]
    ind+=1

We create a pandas dataframe from the NumPy array, add the years, and set the date as the index of the dataframe.

evolution_df = pd.DataFrame(evolution)
evolution_df['Year'] = file_years
evolution_df['Date'] = pd.to_datetime(evolution_df['Year'],format = "%Y") # Change Year to datetime-object.
evolution_df.set_index('Date',inplace=True) # Set Date as an index of the dataframe
evolution_df.drop('Year',axis=1,inplace = True)
plt.style.use('fivethirtyeight')
fig,axs = plt.subplots(5,2,figsize = [15,10])
for ax,column in zip(axs.flat,evolution_df.groupby('Date').mean().columns):
    ax.plot(evolution_df.groupby('Date').mean()[column])
    ax.set_title(column,{'fontsize':14})
plt.subplots_adjust(hspace=0.4)
_images/2_2_LDA_example_80_0.png

Next, we plot the marginal topic distribution, i.e., the relative importance of the topics. It is calculated by summing the topic weights of all documents.

First, we calculate the topic weights for every document.

doc_tops = []
for doc in corpus:
    doc_tops.append([item for (_,item) in lda_model.get_document_topics(doc)])
doc_tops_df = pd.DataFrame(doc_tops,columns=top_words_df.columns)

Then, we sum (and plot) these weights.

doc_tops_df = doc_tops_df/doc_tops_df.sum().sum()

doc_tops_df.sum(axis=0).plot.bar()
_images/2_2_LDA_example_85_1.png

Next, let's search for the most representative document of each topic, i.e., the document that has the largest weight for that topic.

doc_topics = []
for doc in corpus:
    doc_topics.append([item for (_,item) in lda_model.get_document_topics(doc)])               

temp_df = pd.DataFrame(doc_topics)

With idxmax(), we can pick, for each topic, the index of the document that has the largest weight.

temp_df.idxmax()
0    1172
1    1775
2     858
3    1705
4    1644
5     133
6     517
7    1203
8     303
dtype: int64

We can now use files to connect indices to documents. For example, the most representative document of Topic 1 (index 0) is “PRACTICING SAFE COMPUTING: A MULTIMETHOD EMPIRICAL EXAMINATION OF HOME COMPUTER USER SECURITY BEHAVIORAL INTENTIONS”

files[1172]
'2010_2729.txt'

raw_text[1172][0:500]
'Anderson & Agarwal/Practicing Safe Computing\n\nSPECIAL ISSUE\n\nPRACTICING SAFE COMPUTING: A MULTIMETHOD EMPIRICAL EXAMINATION OF HOME COMPUTER USER SECURITY BEHAVIORAL INTENTIONS1\n\nBy: Catherine L. Anderson Decision, Operations, and Information Technologies Department Robert H. Smith School of Business University of Maryland Van Munching Hall College Park, MD 20742-1815 U.S.A. Catherine_Anderson@rhsmith.umd.edu\nRitu Agarwal Center for Health Information and Decision Systems Robert H. Smith School '

Let’s build a master table that has the document names and other information.

master_df = pd.DataFrame({'Article':files})

First, we add the most important topic of each article to the table.

top_topic = []
for doc in corpus:
    weights = [weight for (_, weight) in lda_model.get_document_topics(doc)]
    top_topic.append(weights.index(max(weights)) + 1) # +1 so that topic numbering starts from 1.

master_df['Top topic'] = top_topic

With head(), we can check the first ten rows of our dataframe.

master_df.head(10)
Article Top topic
0 2019_1167.txt 1
1 2017_27218.txt 3
2 2012_40089.txt 3
3 2012_2684.txt 2
4 2015_16392.txt 1
5 2019_3891.txt 1
6 2011_1162.txt 1
7 2015_19684.txt 4
8 2010_2866.txt 2
9 2018_30301.txt 2

value_counts() on the “Top topic” column can be used to check in how many documents each topic is the most important one.

master_df['Top topic'].value_counts()
1    794
2    604
3    402
4    210
5     81
6     23
7      9
9      2
8      1
Name: Top topic, dtype: int64

2.2. Summarisation

Let's do something else next. Gensim also includes efficient summarisation functions (these are no longer related to LDA).

From the summarization module, we can use summarize() to automatically build a short summary of a document. Notice that we use the original documents for this and not the preprocessed ones. ratio = 0.01 means that the length of the summary should be about 1 % of the original document.

summary = []
for file in raw_text:
    summary.append(gensim.summarization.summarize(file.replace('\n',' '),ratio=0.01))

Below is an example summary, here for the second document.

summary[1]
'*NLM Title Abbreviation:*     J Appl Psychol *Publisher:*     US : American Psychological Association *ISSN:*     0021-9010 (Print)     1939-1854 (Electronic) *ISBN:*     978-1-4338-9042-0 *Language:*     English *Keywords:*     ability, personality, interests, motivation, individual differences *Abstract:*     This article reviews 100 years of research on individual differences     and their measurement, with a focus on research published in the     Journal of Applied Psychology.\nWe focus on 3 major individual     differences domains: (a) knowledge, skill, and ability, including     both the cognitive and physical domains; (b) personality, including     integrity, emotional intelligence, stable motivational attributes     (e.g., achievement motivation, core self-evaluations), and     creativity; and (c) vocational interests.\n(PsycINFO Database Record (c) 2017 APA, all     rights reserved) *Document Type:*     Journal Article *Subjects:*     *Individual Differences; *Measurement; *Motivation; *Occupational     Interests; *Personality; Ability; Interests *PsycINFO Classification:*     Occupational & Employment Testing (2228)     Occupational Interests & Guidance (3610) *Population:*     Human *Tests & Measures:*     Vocational Preference Inventory *Methodology:*     Literature Review *Supplemental Data:*     Tables and Figures Internet *Format Covered:*     Electronic *Publication Type:*     Journal; Peer Reviewed Journal *Publication History:*     First Posted: Feb 2, 2017; Accepted: Jun 27, 2016; Revised: Jun 2,     2016; First Submitted: Aug 12, 2015 *Release Date:*     20170202 *Correction Date:*     20170309 *Copyright:*     American Psychological Association.\n85)  The development of standardized measures of attributes on which individuals differ emerged very early in psychology’s history, and has been a major theme in research published in the /Journal of Applied Psychology/ (/JAP/).\nThus, ability, personality, interest patterns, and motivational traits (e.g., achievement motivation, core self-evaluations [CSEs]) fall under the individual differences umbrella, whereas variables that are transient, such as mood, or that are closely linked to the specifics of the work setting (e.g., turnover intentions or perceived organizational climate), do not.\nThe very first volume of the journal contained research on topics that are core topics of research today, including criterion-related validity (Terman et al., 1917 <#c252>), bias and group differences (Sunne, 1917 <#c245>), measurement issues (Miner, 1917 <#c181>; Yerkes, 1917 <#c282>), and relationships with learning and the development of knowledge and skill (Bingham, 1917 <#c26>).\nOver the years, studies have reflected the tension between viewing cognitive abilities as enduring capacities that are largely innate versus treating them as measures of developed capabilities, a distinction that has implications for the study of criterion related validity, group differences, aging effects, and the structure of human abilities (Kuncel & Beatty, 2013 <#c155>).\nIn addition to the direct study of knowledge and skill measurement, these individual differences have been a part of research on job analysis, leadership, career development, performance appraisal, training, and skill acquisition, among others.\nIn a well-cited /JAP/ article, Ghiselli and Barthol (1953) <#c94> summarized this first era by reviewing 113 studies about the validity of personality inventories for predicting job performance.\nBy using only measures designed to assess the 
FFM, Hurtz and Donovan’s (2000) <#c129> meta-analysis addressed potential construct-related validity concerns of prior meta-analyses and found evidence for personality as a predictor of task and contextual performance.\nSeveral meta-analyses, large-scale studies, and simulations were published in /JAP/ that demonstrated the limited impact of socially desirable responding and faking on the construct-related and criterion-related validity of personality measures (Ellingson, Smith, & Sackett, 2001 <#c66>; Ones, Viswesvaran, & Reiss, 1996 <#c200>; and, later on, Hogan, Barrett, & Hogan, 2007 <#c112>; Schmitt & Oswald, 2006 <#c229>).\nIn /JAP/, research has also appeared measuring personality with conditional reasoning tests (Bing et al., 2007 <#c25>), SJTs (Motowidlo, Hooper, & Jackson, 2006 <#c185>), structured interviews (Van Iddekinge, Raymark, & Roth, 2005 <#c268>), and ideal point models (Stark, Chernyshenko, Drasgow, & Williams, 2006 <#c240>), though not all of these approaches demonstrated incremental prediction over self-reports.\nSome of the more notable advancements are that (a) personality can interact with the situation to affect behavior (e.g., Tett & Burnett’s [2003] <#c253> trait activation theory); (b) motivational forces mediate the effects of personality (e.g., Barrick, Stewart, & Piotrowski, 2002 <#c18>), with both implicit and explicit motives being important (Frost, Ko, & James, 2007 <#c86>; Lang, Zettler, Ewen, & Hülsheger, 2012 <#c157>); (c) personality is stable, yet also prone to change, across life (Woods & Hampson, 2010 <#c279>); (d) personality both affects and is affected by work (Wille & De Fruyt, 2014 <#c275>); and (e) personality traits represent stable distributions of variable personality states (Judge, Simon, Hurst, & Kelley, 2014 <#c147>; Minbashian, Wood, & Beckmann, 2010 <#c180>).\n/JAP/ has published several articles that have contributed to the understanding of integrity tests, including Hogan and Hogan (1989) <#c113> on the measurement of employee reliability, and studies that examined faking (Cunningham, Wong, & Barbee, 1994 <#c51>), subgroup differences (Ones & Viswesvaran, 1998 <#c199>), and difficulties in predicting low base rate behaviors (Murphy, 1987 <#c188>).\nIt is worth noting that other individual difference measures have also been examined in /JAP/ as predictors of CWB, including biodata (Rosenbaum, 1976 <#c217>), conditional reasoning (Bing et al., 2007 <#c25>; LeBreton, Barksdale, Robin, & James, 2007 <#c161>), and the Big Five personality dimensions (Berry, Ones, & Sackett, 2007 <#c22>).\nFourth, many studies during this period focused on the vocational interests of particular groups or of individuals in specific jobs or occupations, including tuberculosis patients (Shultz & Rush, 1942 <#c235>), industrial psychology students (Lawshe & Deutsch, 1952 <#c159>), retired YMCA secretaries (Verburg, 1952 <#c271>), and female computer programmers (Perry & Cannon, 1968 <#c206>).'
master_df['Summaries'] = summary

master_df
Article Top topic Summaries
0 2019_1167.txt 1 We add to the literature on the economics of c...
1 2017_27218.txt 3 *NLM Title Abbreviation:* J Appl Psychol *...
2 2012_40089.txt 3 Service Quality in Software-as-a-Service: Deve...
3 2012_2684.txt 2 RESEARCH ARTICLE UNDERSTANDING USER REVISIONS...
4 2015_16392.txt 1 *Document Type:* Article *Subject Terms:* ...
... ... ... ...
2121 2016_2942.txt 1 {rajiv.kohli@mason.wm.edu} Sharon Swee-Lin Tan...
2122 2016_81.txt 2 J Bus Ethics (2016) 138:349364 DOI 10.1007/s10...
2123 2018_13221.txt 6 Sinha Carlson School of Management, University...
2124 2016_23236.txt 1 *Document Type:* Article *Subject Terms:* ...
2125 2016_38632.txt 4 To understand the trade-offs involved in decid...

2126 rows × 3 columns

The same summarization module also includes a function that searches for keywords in the documents. It works in the same way as summarize().

keywords = []
for file in raw_text:
    keywords.append(gensim.summarization.keywords(file.replace('\n',' '),ratio=0.01).replace('\n',' '))

Here are the keywords for the second document.

keywords[1]
'journal journals psychology psychological personality person personal research researchers researched performance performing perform performed performances testing tests test tested measurement measures measureable measure measured measuring ability abilities difference different differing differs study studies studied individual differences individuals differ job jobs interests interesting interested including include includes included traits trait'
master_df['Keywords'] = keywords

master_df
Article Top topic Summaries Keywords
0 2019_1167.txt 1 We add to the literature on the economics of c... bitcoin bitcoins user users transactions trans...
1 2017_27218.txt 3 *NLM Title Abbreviation:* J Appl Psychol *... journal journals psychology psychological pers...
2 2012_40089.txt 3 Service Quality in Software-as-a-Service: Deve... service services saas research researchers cus...
3 2012_2684.txt 2 RESEARCH ARTICLE UNDERSTANDING USER REVISIONS... use uses usefulness features feature user user...
4 2015_16392.txt 1 *Document Type:* Article *Subject Terms:* ... automation automated automate work working hum...
... ... ... ... ...
2121 2016_2942.txt 1 {rajiv.kohli@mason.wm.edu} Sharon Swee-Lin Tan... data ehr ehrs research researchers health pati...
2122 2016_81.txt 2 J Bus Ethics (2016) 138:349364 DOI 10.1007/s10... organizational relationship relationships altr...
2123 2018_13221.txt 6 Sinha Carlson School of Management, University... firms firm production product products modelin...
2124 2016_23236.txt 1 *Document Type:* Article *Subject Terms:* ... reviewer reviews review reviewers reviewed rev...
2125 2016_38632.txt 4 To understand the trade-offs involved in decid... time timing hire hiring hired management manag...

2126 rows × 4 columns

2.3. Similarities

As a next example, we analyse the similarities between the documents.

Gensim has a specific class for that: SparseMatrixSimilarity.

from gensim.similarities import SparseMatrixSimilarity

Its parameters are the word-frequency corpus and the number of features, which here is the length of the dictionary.

index = SparseMatrixSimilarity(corpus,num_features=len(id2word))

sim_matrix = index[corpus]

In the similarity matrix, the diagonal has the value 1 (the similarity of a document with itself). We replace those values with zero so that we can find the most similar other documents in the corpus.

for i in range(len(sim_matrix)):
    sim_matrix[i,i] = 0

Seaborn (https://seaborn.pydata.org/) has a convenient heatmap function to plot similarities. Here is a description of Seaborn from their web page: “Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.”

import seaborn as sns

sns.heatmap(sim_matrix)
_images/2_2_LDA_example_124_1.png

We find the most similar article for each document by locating the index with the largest value on each row.

master_df['Most_similar_article'] = np.argmax(sim_matrix,axis=1)

master_df
Article Top topic Summaries Keywords Most_similar_article
0 2019_1167.txt 1 We add to the literature on the economics of c... bitcoin bitcoins user users transactions trans... 1306
1 2017_27218.txt 3 *NLM Title Abbreviation:* J Appl Psychol *... journal journals psychology psychological pers... 2090
2 2012_40089.txt 3 Service Quality in Software-as-a-Service: Deve... service services saas research researchers cus... 28
3 2012_2684.txt 2 RESEARCH ARTICLE UNDERSTANDING USER REVISIONS... use uses usefulness features feature user user... 1796
4 2015_16392.txt 1 *Document Type:* Article *Subject Terms:* ... automation automated automate work working hum... 1280
... ... ... ... ... ...
2121 2016_2942.txt 1 {rajiv.kohli@mason.wm.edu} Sharon Swee-Lin Tan... data ehr ehrs research researchers health pati... 301
2122 2016_81.txt 2 J Bus Ethics (2016) 138:349364 DOI 10.1007/s10... organizational relationship relationships altr... 2070
2123 2018_13221.txt 6 Sinha Carlson School of Management, University... firms firm production product products modelin... 1942
2124 2016_23236.txt 1 *Document Type:* Article *Subject Terms:* ... reviewer reviews review reviewers reviewed rev... 1917
2125 2016_38632.txt 4 To understand the trade-offs involved in decid... time timing hire hiring hired management manag... 412

2126 rows × 5 columns

2.4. Cluster model

Next, we build a topic analysis using a different approach. We create a TF-IDF model from the corpus and apply K-means clustering to the resulting TF-IDF matrix.

Explanation of TF-IDF from Wikipedia: “In information retrieval, TF–IDF or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modelling. The TF–IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.”
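
As a rough illustration of the idea (this is not gensim's exact formula, which uses a base-2 logarithm and normalises the document vectors), the weight of a term grows with its frequency in a document and shrinks when the term appears in many documents:

import math

# A toy corpus of three tiny "documents".
toy_docs = [["price", "demand", "price"],
            ["price", "network"],
            ["network", "patient"]]
N = len(toy_docs)

def tf_idf(term, doc):
    tf = doc.count(term)                   # term frequency in this document
    df = sum(term in d for d in toy_docs)  # number of documents containing the term
    return tf * math.log(N / df)           # frequent in the document but rare in the corpus => high weight

print(tf_idf("price", toy_docs[0]))    # ~0.81: frequent here, but also common in the corpus
print(tf_idf("patient", toy_docs[2]))  # ~1.10: appears only once, but in no other document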

Explanation of K-means clustering from Wikipedia: “k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centres or cluster centroid), serving as a prototype of the cluster…k-means clustering minimizes within-cluster variances (squared Euclidean distances)…”

[Image: kmeans]
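
As a minimal sketch of the idea on toy data (using scikit-learn's KMeans, which we also use below), six two-dimensional points form two obvious groups; the fitted model reports the cluster labels, the centroids and the within-cluster sum of squared distances (inertia) that k-means minimises.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]])
km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(km_toy.labels_)           # cluster assignment of each point (two groups of three)
print(km_toy.cluster_centers_)  # the two cluster centroids
print(km_toy.inertia_)          # within-cluster sum of squared distances minimised by k-means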

From gensim.models, we pick up TfidfModel.

from gensim.models import TfidfModel

As parameters, we need the corpus and the dictionary.

tf_idf_model = TfidfModel(corpus,id2word=id2word)

We use the model to build a TF-IDF-transformed corpus.

tform_corpus = tf_idf_model[corpus]

corpus2csc converts a streamed corpus in bag-of-words format into a sparse matrix, with documents as columns.

spar_matr = gensim.matutils.corpus2csc(tform_corpus)

spar_matr
<32237x2126 sparse matrix of type '<class 'numpy.float64'>'
	with 1897708 stored elements in Compressed Sparse Column format>

Next, we convert the sparse matrix to a normal (dense) array. We also need to transpose it for the K-means model, so that documents are rows.

tfidf_matrix = spar_matr.toarray().transpose()

tfidf_matrix
array([[0.00125196, 0.00243009, 0.00474072, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.00594032, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.0077254 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.00396927, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
tfidf_matrix.shape
(2126, 32237)

Scikit-learn has a function to form a K-means clustering model from a matrix. It is done below. We use ten clusters.

from sklearn.cluster import KMeans

num_clusters = 10
km = KMeans(n_clusters=num_clusters)

km.fit(tfidf_matrix)

clusters = km.labels_.tolist()
master_df['Tf_idf_clusters'] = clusters

km.cluster_centers_
array([[ 1.33476883e-03,  4.71424426e-03,  2.62007897e-04, ...,
         2.71050543e-20,  3.38813179e-21,  0.00000000e+00],
       [ 6.67748803e-04,  9.98513608e-04,  9.39093177e-04, ...,
         1.00166900e-03,  6.77626358e-21,  5.77802403e-05],
       [ 1.96181867e-04,  3.44260097e-04,  1.83757076e-04, ...,
        -5.42101086e-20,  0.00000000e+00, -3.38813179e-21],
       ...,
       [ 0.00000000e+00,  3.26892198e-03,  1.67826509e-03, ...,
        -8.13151629e-20, -1.69406589e-21, -3.38813179e-21],
       [ 5.84538250e-04,  1.55948712e-03,  2.28571545e-04, ...,
         0.00000000e+00, -3.38813179e-21, -3.38813179e-21],
       [ 5.10233572e-04,  4.11021862e-03,  1.24048649e-03, ...,
        -2.71050543e-20,  1.69406589e-21, -3.38813179e-21]])

master_df
Article Top topic Summaries Keywords Most_similar_article Tf_idf_clusters
0 2019_1167.txt 1 We add to the literature on the economics of c... bitcoin bitcoins user users transactions trans... 1306 5
1 2017_27218.txt 3 *NLM Title Abbreviation:* J Appl Psychol *... journal journals psychology psychological pers... 2090 4
2 2012_40089.txt 3 Service Quality in Software-as-a-Service: Deve... service services saas research researchers cus... 28 4
3 2012_2684.txt 2 RESEARCH ARTICLE UNDERSTANDING USER REVISIONS... use uses usefulness features feature user user... 1796 4
4 2015_16392.txt 1 *Document Type:* Article *Subject Terms:* ... automation automated automate work working hum... 1280 2
... ... ... ... ... ... ...
2121 2016_2942.txt 1 {rajiv.kohli@mason.wm.edu} Sharon Swee-Lin Tan... data ehr ehrs research researchers health pati... 301 8
2122 2016_81.txt 2 J Bus Ethics (2016) 138:349364 DOI 10.1007/s10... organizational relationship relationships altr... 2070 4
2123 2018_13221.txt 6 Sinha Carlson School of Management, University... firms firm production product products modelin... 1942 5
2124 2016_23236.txt 1 *Document Type:* Article *Subject Terms:* ... reviewer reviews review reviewers reviewed rev... 1917 5
2125 2016_38632.txt 4 To understand the trade-offs involved in decid... time timing hire hiring hired management manag... 412 5

2126 rows × 6 columns

Let’s collect the ten most important words for each cluster.

centroids = km.cluster_centers_.argsort()[:, ::-1] # Sort the words according to their importance.

for i in range(num_clusters):
    j=i+1
    print("Cluster %d words:" % j, end='')
    for ind in centroids[i, :10]:
        print(' %s' % id2word.id2token[ind],end=',')
    print()
    print()
Cluster 1 words: patent, movie, tweet, invention, citation, inventor, innovation, twitter, follower, release,

Cluster 2 words: analytic, innovation, digital, capability, governance, supply_chain, sustainability, right_reserve, big, platform,

Cluster 3 words: persistent_linking, site_ehost, true_db, learning_please, academic_licensee, syllabus_mean, newsletter_content, electronic_reserve, authorize_ebscohost, course_pack,

Cluster 4 words: auction, bidder, bid, advertiser, bidding, price, pricing, combinatorial_auction, seller, valuation,

Cluster 5 words: team, employee, job, organizational, project, ethic, security, task, member, community,

Cluster 6 words: consumer, privacy, web_delivery, rating, disclosure, investor, participant, website, app, wom,

Cluster 7 words: brand, marketing, retailer, consumer, advertising, purchase, marketer, elasticity, promotion, sale,

Cluster 8 words: optimization, optimal, theorem, algorithm, node, approximation, scheduling, constraint, parameter, distribution,

Cluster 9 words: patient, hospital, physician, care, healthcare, medical, clinical, hit, health, health_care,

Cluster 10 words: price, inventory, retailer, supplier, pricing, contract, profit, demand, forecast, optimal,

With a smaller corpus, we could use fancy visualisations, like multidimensional scaling and Ward clustering, to represent the relationships between documents. However, with over 2000 documents, that is not meaningful. Below are examples of both (not related to our analysis), followed by a sketch of how such figures could be produced for a small subset.

[Image: mds]

[Image: ward]
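
The example figures above were made elsewhere, but as a sketch of how similar figures could be produced for a small subset of our documents (assuming the tfidf_matrix and files objects defined above), one could do something like the following.

from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances
from scipy.cluster.hierarchy import ward, dendrogram

subset = tfidf_matrix[:50]      # a small subset keeps the figures readable
dist = cosine_distances(subset) # pairwise cosine distances between the documents

# Two-dimensional multidimensional scaling (MDS) of the distance matrix.
coords = MDS(n_components=2, dissimilarity='precomputed', random_state=1).fit_transform(dist)
plt.figure(figsize=[8, 6])
plt.scatter(coords[:, 0], coords[:, 1])
plt.show()

# Ward hierarchical clustering of the same documents (Euclidean distances between TF-IDF vectors).
linkage_matrix = ward(subset)
plt.figure(figsize=[8, 10])
dendrogram(linkage_matrix, orientation='left', labels=files[:50])
plt.show()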

With pandas crosstab, we can easily check the connection between the LDA topics and the K-means clusters.

pd.crosstab(master_df['Top topic'],master_df['Tf_idf_clusters'])
Tf_idf_clusters 0 1 2 3 4 5 6 7 8 9
Top topic
1 14 151 91 10 175 157 28 77 32 59
2 10 122 44 13 138 124 21 37 39 56
3 7 79 21 4 76 114 15 25 18 43
4 5 30 6 1 31 84 16 8 9 20
5 1 8 0 2 12 34 6 2 7 9
6 1 1 0 1 2 13 0 0 2 3
7 0 0 0 0 2 6 0 0 0 1
8 0 1 0 0 0 0 0 0 0 0
9 0 0 0 0 1 0 0 0 0 1