
LDA topic modeling-based industry classification and its interpretation

Link to my writing

We explore industry classification using Natural Language Processing (NLP). We propose a text-based industry classification that applies Latent Dirichlet Allocation (LDA) topic modeling to extract distinguishing features from the business descriptions in financial reports. The method generates topics from the full corpus of financial-report text and assigns each company the topic that best matches it, so companies that share a topic can be clustered together. The method also supports human interpretation of the clusters: each topic is returned as a set of terms with associated probabilities, and those terms let us describe the new industry classes more specifically and efficiently.
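To make the "fit topics, then assign each company its best topic" idea concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is an illustration under stated assumptions, not the implementation used in the paper: the toy documents, the hyperparameters `alpha` and `beta`, the iteration count, and the helper names `fit_lda` and `top_terms` are all hypothetical.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling on tokenized documents.

    Returns the vocabulary, the topic-term probabilities, and the
    dominant topic of each document. Hyperparameters are illustrative.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens of word v in topic k
    topic_total = [0] * n_topics                          # tokens in topic k

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed probability of each term within each topic.
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + n_words * beta)
         for v in range(n_words)]
        for k in range(n_topics)
    ]
    # Assign each document the topic that holds most of its tokens.
    best = [max(range(n_topics), key=lambda t: doc_topic[d][t])
            for d in range(len(docs))]
    return vocab, phi, best


def top_terms(vocab, phi, topic, n=3):
    """Return the n most probable (term, probability) pairs of a topic."""
    ranked = sorted(range(len(vocab)), key=lambda v: phi[topic][v], reverse=True)
    return [(vocab[v], phi[topic][v]) for v in ranked[:n]]


# Hypothetical, heavily shortened "business descriptions".
docs = [
    "bank loan deposit credit bank interest".split(),
    "loan credit mortgage bank deposit".split(),
    "drug clinical trial patient drug therapy".split(),
    "patient therapy clinical drug trial".split(),
]
vocab, phi, best = fit_lda(docs, n_topics=2)
for company, topic in enumerate(best):
    print(company, topic, top_terms(vocab, phi, topic))
```

Companies with the same value in `best` fall into the same text-based class, and `top_terms` exposes the probabilistic terms that a reader can use to name or interpret that class. A production pipeline would more likely rely on an established LDA library and proper text preprocessing.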

Figure 5: The scatter plot of text-based classifications.


The coordinates of each dot represent the text-based classification after the autoencoder reduces it to a two-dimensional code. In panel (A), the color of each dot represents its classification under the Fama-French 12 industry classifications. Panel (B) shows the LDA topic modeling clustering result. Panel (C) shows the spherical k-means clustering result using the reduced features of the coded layer.
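For readers unfamiliar with spherical k-means, the sketch below shows the basic procedure: project every point onto the unit sphere, assign it to the centroid with the highest cosine similarity, and renormalize each centroid after averaging its members. It is a minimal illustration, not the code behind panel (C); the two-dimensional points stand in for autoencoder codes, and the initial centroids and the function names `normalize` and `spherical_kmeans` are assumptions.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def spherical_kmeans(points, init_centroids, n_iter=20):
    """Cluster points by cosine similarity to unit-length centroids."""
    unit = [normalize(p) for p in points]
    centroids = [normalize(c) for c in init_centroids]
    labels = [0] * len(unit)
    for _ in range(n_iter):
        # Assignment step: for unit vectors, the dot product is the cosine.
        labels = [
            max(range(len(centroids)),
                key=lambda k: sum(a * b for a, b in zip(p, centroids[k])))
            for p in unit
        ]
        # Update step: average the members, then project back to the sphere.
        for k in range(len(centroids)):
            members = [p for p, label in zip(unit, labels) if label == k]
            if members:
                mean = [sum(column) / len(members) for column in zip(*members)]
                centroids[k] = normalize(mean)
    return labels, centroids


# Hypothetical two-dimensional codes and starting centroids.
codes = [(1.0, 0.1), (2.0, 0.3), (0.2, 1.0), (0.1, 3.0)]
labels, centroids = spherical_kmeans(codes, init_centroids=[(1.0, 0.0), (0.0, 1.0)])
print(labels)  # [0, 0, 1, 1]
```

Because only direction matters, `(1.0, 0.1)` and `(2.0, 0.3)` land in the same cluster even though their lengths differ, which is the property that distinguishes spherical k-means from the Euclidean variant.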