# Topic Modeling ______ - Topic modelling is an unsupervised text mining approach. **Input**: corpus of unstructured text documents with no labels (reviews, news) **Output**: Multiple topics for a single document or corpus **Steps:** - Tokenize: split raw text into individual tokens - **Bag-of-Words Model:** - Method to move from tokens to numeric features - Each document is represented by a vector where m is the number of unique terms across all documents. - **Create document-term matrix:** - Each document can be represented as a term vector, with an entry indicating the number of time a term appears in the document: - **Additional Preprocessing steps:** - Minimum token length, like excluding tokens of length < 2 - Converting all words to lowercase. - Filter stop words - Stemming - **Translate Document Term Matrix to:** - TF-IDF in order to give higher weights to more "important" terms. - TF-IDF: Common approach for weighting the score for a term in a document. - Term Frequency: Number of times a given term appears in a single document. - Inverse Document Frequency: basically penalizes common terms that appear in almost every document. **Two Topic Modeling Approaches:** 1. **Probabilistic** - view each document as a mixture of a small number of topics where words and documents get probability scores for each topic. - Latent Dirichlet Allocation (LDA) 2. **Matrix Factorisation** - apply methods from linear algebra to decompose a single matrix into a set of smaller matrices. - Non-negative Matrix Factorisation (NMF) **Evaluate number of topics:** 1. **Perplexity**: how surprised a model is of new data it has not seen before, and is measured as the normalized log-likelihood of a held-out test set. 2. **Coherence**: Measures score a single topic by measuring the degree of semantic similarity between high scoring words in the topic. ![image](../assets/topic_model1.png) ## NFM Model - NMF can be applied for topic modeling, where the input is a document-term matrix - **Input**: Document-term matrix A; Number of topics k. - **Output**: Two k-dimensional factors W and H approximating A **H Factor:** - Contains term weights relative to each of the k topics. - Each row corresponds to a topic, and each column corresponds to a unique term in the corpus vocabulary. - Sorting the values in each row gives us a ranking of terms **W Factor** - Contains document membership weights across the k topics. - Each row corresponds to a different document, and each column corresponds to a topic. - Sorting the values gives us a ranking of the most relevant documents for each topic. ![image](../assets/nfm_model.png) ## Latent Dirichlet Allocation - LDA assumes that all words in the document can be assigned a probability of belonging to a topic. - Topics in a document are unknown, but the idea is topics are present as the text is generated based on a distribution of topics and distribution of words in that topic. **Goal** - Determine the number of topics that a document contains. - Each document is modeled as a multinomial distribution of topics - Each topic is also modeled as a multinomial distribution of words. - LDA assumes the text feed into the model will contain words that are related. **Steps** - Tokenization: Split the text into sentences and the sentences into words, and lowercase words and remove punctuation. - Remove stopwords - Lemmatize: Words in third person are changed to first person and verbs in past and future tenses are changed into present. - Stemming: Words are reduced to their root form - Create Bag-of-Words Model - Dictionary containing the number of times a word appears in the training set. - Can also create a N_Gram Model: - **bigrams** which are two words frequently occurring together in the document - **trigrams**, which are 3 words frequently occurring - for example. This can provide more contextual information than just bag of words. - Perform TF-IDF on document set - Perform Topic Model (e.g. LDA) + Evaluate **Example Output:** ``` Topic 1: _0.016“car” + 0.014“power” + 0.010“light” + 0.009“drive” + 0.007“mount” + 0.007“controller” + 0.007“cool” + 0.007“engine” + 0.007“back” + ‘0.006“turn”. ``` - It means the top 10 keywords that contribute to this topic are: ‘car’, ‘power’, ‘light’.. and so on and the weight of ‘car’ on topic 1 is 0.016. - The weights reflect how important a keyword is to that topic