Topic Modeling

Topic modeling is an unsupervised text mining approach.
Input: A corpus of unstructured text documents with no labels (e.g. reviews, news)
Output: Multiple topics for a single document or for the corpus as a whole

Steps:

Tokenize: split raw text into individual tokens
Bag-of-Words Model:
- Method to move from tokens to numeric features
- Each document is represented by an m-dimensional vector, where m is the number of unique terms across all documents.
Create document-term matrix:
- Each document can be represented as a term vector, with each entry indicating the number of times a term appears in the document. Stacking these vectors, one row per document, gives the document-term matrix.
Additional Preprocessing steps:
- Minimum token length, like excluding tokens of length < 2
- Converting all words to lowercase.
- Filter stop words
- Stemming
Reweight the document-term matrix:
- Convert the raw counts to TF-IDF in order to give higher weights to more “important” terms.
- TF-IDF: A common approach for weighting the score of a term in a document.
  - Term Frequency: The number of times a given term appears in a single document.
  - Inverse Document Frequency: Penalizes common terms that appear in almost every document, typically via the logarithm of (number of documents / number of documents containing the term).
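
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy corpus, the stop-word list, and the helper name preprocess are made up for the example, stemming is left out, and the IDF uses the plain log(N / df) form, whereas libraries often apply smoothed variants.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, chosen only for illustration.
docs = [
    "The car engine is powerful",
    "The engine drives the car",
    "Stocks fell as the market closed",
]
STOP_WORDS = {"the", "is", "as", "a", "an"}


def preprocess(text, min_len=2):
    """Tokenize, lowercase, then drop short tokens and stop words (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in STOP_WORDS]


tokenized = [preprocess(d) for d in docs]

# Vocabulary: the m unique terms across all documents, in a fixed order.
vocab = sorted({t for doc in tokenized for t in doc})

# Document-term matrix: one row per document, one column per term, raw counts.
counts = [Counter(doc) for doc in tokenized]
dtm = [[c[term] for term in vocab] for c in counts]

# Inverse document frequency, using the unsmoothed log(N / df) form.
n_docs = len(docs)
df = {term: sum(1 for c in counts if term in c) for term in vocab}
idf = {term: math.log(n_docs / df[term]) for term in vocab}

# TF-IDF matrix: raw term frequency multiplied by inverse document frequency.
tfidf = [[row[j] * idf[term] for j, term in enumerate(vocab)] for row in dtm]

for term, weight in zip(vocab, tfidf[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "powerful" ends up with a higher weight than "car" even though both appear once, because "car" also appears in another document. With this unsmoothed form, a term that appears in every document gets a weight of zero.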

Two Topic Modeling Approaches:

Probabilistic - view each document as a mixture of a small number of topics where words and documents get probability scores for each topic.
- Latent Dirichlet Allocation (LDA)
Matrix Factorisation - apply methods from linear algebra to decompose a single matrix into a set of smaller matrices.
- Non-negative Matrix Factorisation (NMF)

Evaluate number of topics:

Perplexity: How surprised a model is by new data it has not seen before. It is computed as the exponential of the negative normalized (per-word) log-likelihood of a held-out test set, so lower is better.
Coherence: Scores a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic, so higher is better.
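
Both measures can be sketched in a few lines of Python. The function names here are hypothetical. The coherence function is only one possible choice, a document co-occurrence score in the style of the UMass measure; it assumes every top word occurs in at least one document, and other coherence measures exist.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative average log-likelihood per held-out token."""
    avg_log_likelihood = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-avg_log_likelihood)


def cooccurrence_coherence(top_words, tokenized_docs):
    """Sum of log((docs containing both words + 1) / docs containing the earlier word)."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            # Assumes top_words[j] occurs in at least one document.
            both = doc_freq(top_words[i], top_words[j])
            score += math.log((both + 1) / doc_freq(top_words[j]))
    return score


# A model that gives every held-out token probability 0.25 is as "surprised"
# as a uniform guess among four options.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))

# Hypothetical tokenized documents for comparing two candidate topics.
docs = [["car", "engine"], ["car", "engine", "power"], ["stock", "market"]]
print(cooccurrence_coherence(["car", "engine"], docs))
print(cooccurrence_coherence(["car", "market"], docs))
```

To choose the number of topics, one would typically fit models for several values of k and compare these scores across them.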

NMF Model

NMF can be applied for topic modeling, where the input is a document-term matrix
- Input: Document-term matrix A; Number of topics k.
- Output: Two k-dimensional factors W and H approximating A (A ≈ WH).
H Factor:
- Contains term weights relative to each of the k topics.
- Each row corresponds to a topic, and each column corresponds to a unique term in the corpus vocabulary.
- Sorting the values in each row gives us a ranking of terms for that topic.
W Factor:
- Contains document membership weights across the k topics.
- Each row corresponds to a different document, and each column corresponds to a topic.
- Sorting the values in each column gives us a ranking of the most relevant documents for each topic.
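
A minimal sketch of the factorization, assuming NumPy is available. The notes do not say which algorithm is used to fit NMF; this example swaps in the classic multiplicative update rules for the squared-error objective. The function name nmf, the toy matrix, and the term list are all made up for illustration.

```python
import numpy as np


def nmf(A, k, n_iter=500, seed=0, eps=1e-9):
    """Factor A (docs x terms) into non-negative W (docs x k) and H (k x terms)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = A.shape
    W = rng.random((n_docs, k)) + eps
    H = rng.random((k, n_terms)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: entries are only rescaled, so they stay non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical document-term matrix: 4 documents, 5 terms, two blocks of terms.
terms = ["car", "engine", "drive", "stock", "market"]
A = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

W, H = nmf(A, k=2)

# Rows of H: sort each row to rank the terms for that topic.
for topic, row in enumerate(H):
    top_terms = [terms[j] for j in np.argsort(row)[::-1][:3]]
    print(f"Topic {topic} top terms: {top_terms}")

# Columns of W: sort each column to rank the documents for that topic.
for topic in range(W.shape[1]):
    ranked_docs = np.argsort(W[:, topic])[::-1].tolist()
    print(f"Topic {topic} documents by relevance: {ranked_docs}")
```

The order of the topics depends on the random initialization, so topic 0 in one run may correspond to topic 1 in another.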

Latent Dirichlet Allocation

LDA assumes that all words in the document can be assigned a probability of belonging to a topic.
The topics in a document are unknown (latent), but the text is assumed to have been generated from a distribution over topics and, for each topic, a distribution over words.

Goal

Determine the mix of topics that a document contains.
Each document is modeled as a multinomial distribution over topics.
Each topic is also modeled as a multinomial distribution over words.
LDA assumes the text fed into the model will contain words that are related.
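
The generative view behind these assumptions can be sketched with the standard library alone. This only shows how a document would be generated once the distributions are known; it does not fit an LDA model. The vocabulary, the topic-word probabilities, the alpha values, and the function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Pick a topic mixture for the document, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # this document's distribution over topics
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words


vocab = ["car", "engine", "drive", "stock", "market"]
# Each row is one topic's distribution over the vocabulary.
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5],
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=10, rng=rng)
print("Topic mixture:", [round(t, 3) for t in theta])
print("Generated words:", words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixture of each document and the word distribution of each topic.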

Steps

Tokenization: Split the text into sentences and the sentences into words, then lowercase the words and remove punctuation.
Remove stopwords
Lemmatize: Words in third person are changed to first person and verbs in past and future tenses are changed into present.
Stemming: Words are reduced to their root form
Create Bag-of-Words Model
- Dictionary containing the number of times a word appears in the training set.
Can also create an n-gram model:
- Bigrams, which are two words frequently occurring together in the document
- Trigrams, which are three words frequently occurring together. N-grams can provide more contextual information than a plain bag of words.
Apply TF-IDF to the document set
Fit the topic model (e.g. LDA) and evaluate it
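
The n-gram step can be illustrated with a small bigram sketch. It assumes a simple rule, keeping adjacent pairs seen at least min_count times; real phrase detectors usually use a statistical score instead. The function names, the threshold, and the sample documents are made up for the example.

```python
from collections import Counter


def frequent_bigrams(tokenized_docs, min_count=2):
    """Count adjacent word pairs and keep those seen at least min_count times."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams):
    """Replace each frequent adjacent pair with a single joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical documents that are already tokenized and lowercased.
docs = [
    ["new", "york", "traffic", "report"],
    ["flights", "to", "new", "york"],
    ["new", "traffic", "rules"],
]

bigrams = frequent_bigrams(docs)
merged = [merge_bigrams(d, bigrams) for d in docs]
print(bigrams)
print(merged)
```

The merged tokens can then go through the bag-of-words and TF-IDF steps like any other token, so "new_york" is counted as one term rather than two unrelated words.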

Example Output:

Topic 1:
0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + 0.007*"mount" + 0.007*"controller" + 0.007*"cool" + 0.007*"engine" + 0.007*"back" + 0.006*"turn"

This means the top 10 keywords that contribute to this topic are ‘car’, ‘power’, ‘light’, and so on, and the weight of ‘car’ in topic 1 is 0.016.
The weights reflect how important a keyword is to that topic.
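
To work with such output programmatically, the keywords and weights can be pulled out of the string. This sketch assumes each term is written as weight*"keyword" and that terms are joined by " + ", a format some libraries (gensim, for example) use when printing topics. The variable names are illustrative.

```python
import re

# Topic 1 from the example output, assumed to be in weight*"keyword" form.
topic = (
    '0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + '
    '0.007*"mount" + 0.007*"controller" + 0.007*"cool" + 0.007*"engine" + '
    '0.007*"back" + 0.006*"turn"'
)

# Each match is a (weight, keyword) pair; flip it to (keyword, weight).
pairs = [
    (word, float(weight))
    for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic)
]

keywords = [word for word, _ in pairs]
weights = dict(pairs)

print(keywords)
print(weights["car"])
```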