LDA topic modeling-based industry classification and its interpretation

We explore industry classification using Natural Language Processing (NLP). We propose a text-based industry classification that uses Latent Dirichlet Allocation (LDA) topic modeling, extracting distinguishing features from the business descriptions in financial reports. The proposed method generates topics from the entire corpus of financial report text and assigns each company the topic that best matches it, so that companies sharing a topic can be clustered together. The method also makes the clustered result open to human interpretation: the terms that compose each topic are returned with probabilities, and through those terms we can define the new industry classes more specifically and efficiently.

Figure 5: The scatter plot of text-based classifications.

The coordinates of each dot are the two-dimensional codes extracted from the autoencoder for the text-based classification. In panel (A), the color of each dot represents its Fama-French 12 industry classification. Panel (B) shows the LDA topic modeling clustering result. Panel (C) shows the spherical k-means clustering result using the reduced features of the coded layer.
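For readers unfamiliar with the spherical k-means used in panel (C), here is a minimal NumPy sketch of the general technique: rows are projected onto the unit sphere and assigned to the centroid with the highest cosine similarity. The function name `spherical_kmeans`, the toy two-dimensional codes, and the iteration settings are assumptions for illustration, not the configuration used in the study.

```python
import numpy as np


def spherical_kmeans(X, n_clusters, n_iter=50, seed=0):
    """Cluster rows of X by direction (cosine similarity), not magnitude."""
    rng = np.random.default_rng(seed)
    # Project every row onto the unit sphere so only direction matters.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialize centroids from randomly chosen data points.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # For unit vectors, the dot product equals cosine similarity.
        labels = (X @ centroids.T).argmax(axis=1)
        new_centroids = centroids.copy()
        for k in range(n_clusters):
            total = X[labels == k].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # keep the old centroid if the cluster is empty
                new_centroids[k] = total / norm
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute labels so they match the final centroids.
    labels = (X @ centroids.T).argmax(axis=1)
    return labels, centroids


# Hypothetical 2-D codes: three points near the x-axis, three near the y-axis.
codes = np.array([
    [1.0, 0.1], [2.0, 0.1], [3.0, 0.4],
    [0.1, 1.0], [0.3, 2.0], [0.2, 3.0],
])
labels, centroids = spherical_kmeans(codes, n_clusters=2)
print(labels)
```

Because each row is normalized first, points that lie in the same direction but at different distances from the origin end up in the same cluster, which is the main difference from ordinary k-means.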

Jonghyuk Lee
