
The package can be installed from CRAN:

install.packages("ldatuning")

Alternatively, the development version can be installed from the GitHub repository:

install.packages("devtools")
devtools::install_github("nikita-moor/ldatuning")

The ldatuning package implements four metrics for selecting a suitable number of topics for an LDA model.

library("ldatuning")

Load the “AssociatedPress” dataset from the topicmodels package and keep the first 10 documents as a small sample.

library("topicmodels")
data("AssociatedPress", package="topicmodels")
dtm <- AssociatedPress[1:10, ]

The easiest way is to calculate all metrics at once. All existing methods require training multiple LDA models in order to select the one with the best performance. This is a computationally intensive procedure, and ldatuning runs it in parallel, so remember to set the mc.cores parameter to an appropriate number of CPU cores to achieve the best performance.
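One way to pick a value for mc.cores is to ask R how many cores the machine has. The following is a minimal sketch using the base parallel package; the variable name n_cores and the choice to leave one core free are illustrative assumptions, not recommendations made by ldatuning.

```r
library("parallel")

# Number of physical cores; detectCores() may return NA on some platforms.
n_cores <- detectCores(logical = FALSE)
if (is.na(n_cores)) n_cores <- 1L

# Leave one core free for the rest of the system (an arbitrary choice),
# but always use at least one.
n_cores <- max(1L, as.integer(n_cores) - 1L)
n_cores
```

The resulting value could then be passed to FindTopicsNumber as mc.cores = n_cores.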

All standard LDA methods and parameters from the topicmodels package can be set with the method and control arguments.

result <- FindTopicsNumber(
  dtm,
  topics = seq(from = 2, to = 15, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
  verbose = TRUE
)

## fit models... done.
## calculate metrics:
##   Griffiths2004... done.
##   CaoJuan2009... done.
##   Arun2010... done.
##   Deveaud2014... done.

The result is a data frame with the number of topics and the corresponding values of the metrics:

topics	Griffiths2004	CaoJuan2009	Arun2010	Deveaud2014
15	-15297.82	0.5047240	15.92711	0.1362596
14	-15338.24	0.4927860	15.36552	0.1406462
13	-15319.82	0.4944709	15.80569	0.1504368
12	-15326.94	0.4756351	15.81278	0.1594651
11	-15293.55	0.4347111	15.23313	0.1770861
10	-15291.00	0.3829542	14.93706	0.1969989
9	-15303.87	0.3379840	14.71664	0.2181424
8	-15256.30	0.3061726	14.78140	0.2435689
7	-15259.80	0.2746812	14.82908	0.2746203
6	-15251.04	0.2612029	15.28425	0.3101625
5	-15226.91	0.1875260	15.34470	0.3718687
4	-15242.86	0.1779016	16.29708	0.4323482
3	-15266.66	0.1600736	16.97832	0.5318997
2	-15349.79	0.1169522	18.47430	0.6989189
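As a variation on the call above, the method and control arguments can be changed to use a different fitting method from topicmodels. The sketch below assumes the VEM method with an illustrative control list (the name vem_control and the specific values are assumptions). Griffiths2004 is left out here because that metric is based on Gibbs sampling. The call is guarded so that the sketch only runs when both packages are installed.

```r
# Control list for the VEM method; `seed` and `em` are control options of
# the VEM implementation in topicmodels. The values are illustrative only.
vem_control <- list(seed = 77, em = list(iter.max = 100, tol = 1e-4))

# Run only when both packages are installed, so the sketch is safe to source.
if (requireNamespace("ldatuning", quietly = TRUE) &&
    requireNamespace("topicmodels", quietly = TRUE)) {
  library("topicmodels")
  library("ldatuning")

  data("AssociatedPress", package = "topicmodels")
  dtm <- AssociatedPress[1:10, ]

  # A short range of topics keeps this example quick.
  result_vem <- FindTopicsNumber(
    dtm,
    topics = 2:5,
    metrics = c("CaoJuan2009", "Arun2010", "Deveaud2014"),
    method = "VEM",
    control = vem_control,
    mc.cores = 1L
  )
  print(result_vem)
}
```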

A simple approach to analyzing the metrics is to find their extrema; a more complete description is given in the corresponding papers:

minimization:
- Arun2010 [1]
- CaoJuan2009 [2]
maximization:
- Deveaud2014 [3]
- Griffiths2004 [4,5]
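The extremum search described above can be sketched in base R. The data frame below simply re-enters the values from the result table for the 10-document sample; the names minimize, maximize, and best are illustrative and not part of ldatuning.

```r
# Values copied from the result table for the 10-document sample.
result <- data.frame(
  topics = 15:2,
  Griffiths2004 = c(-15297.82, -15338.24, -15319.82, -15326.94, -15293.55,
                    -15291.00, -15303.87, -15256.30, -15259.80, -15251.04,
                    -15226.91, -15242.86, -15266.66, -15349.79),
  CaoJuan2009 = c(0.5047240, 0.4927860, 0.4944709, 0.4756351, 0.4347111,
                  0.3829542, 0.3379840, 0.3061726, 0.2746812, 0.2612029,
                  0.1875260, 0.1779016, 0.1600736, 0.1169522),
  Arun2010 = c(15.92711, 15.36552, 15.80569, 15.81278, 15.23313,
               14.93706, 14.71664, 14.78140, 14.82908, 15.28425,
               15.34470, 16.29708, 16.97832, 18.47430),
  Deveaud2014 = c(0.1362596, 0.1406462, 0.1504368, 0.1594651, 0.1770861,
                  0.1969989, 0.2181424, 0.2435689, 0.2746203, 0.3101625,
                  0.3718687, 0.4323482, 0.5318997, 0.6989189)
)

# Metrics to minimize and to maximize, following the grouping in the list.
minimize <- c("Arun2010", "CaoJuan2009")
maximize <- c("Deveaud2014", "Griffiths2004")

# Number of topics at which each metric reaches its extremum.
best <- c(
  sapply(minimize, function(m) result$topics[which.min(result[[m]])]),
  sapply(maximize, function(m) result$topics[which.max(result[[m]])])
)
best
```

On this small sample the extrema fall at 9 (Arun2010), 2 (CaoJuan2009), 2 (Deveaud2014), and 5 (Griffiths2004) topics, so the metrics do not agree here.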

The helper function FindTopicsNumber_plot can be used to analyze the results easily:

FindTopicsNumber_plot(result)
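The metrics live on very different scales (compare Griffiths2004 with CaoJuan2009 in the result table), so when inspecting them side by side it can help to rescale each one to the interval [0, 1]. The following is a minimal base-R sketch of such a rescaling; rescale01 is a hypothetical helper, and the sketch is not claimed to reproduce what FindTopicsNumber_plot does internally.

```r
# A few rows copied from the result table (topics 5 down to 2).
result <- data.frame(
  topics = c(5, 4, 3, 2),
  Griffiths2004 = c(-15226.91, -15242.86, -15266.66, -15349.79),
  CaoJuan2009 = c(0.1875260, 0.1779016, 0.1600736, 0.1169522)
)

# Hypothetical helper: linear min-max rescaling of a numeric vector to [0, 1].
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Rescale every metric column, leaving the topics column untouched.
scaled <- result
scaled[-1] <- lapply(result[-1], rescale01)
scaled
```

After rescaling, the best value of a maximized metric is 1 and the best value of a minimized metric is 0.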

Results calculated on the whole dataset (about 10 hours on a quad-core computer) look like this:

From this plot one can conclude that the optimal number of topics is in the range 90-140. The Deveaud2014 metric is not informative in this situation.

References

1. Rajkumar Arun, V. Suresh, C. E. Veni Madhavan, and M. N. Narasimha Murthy. 2010. On finding the natural number of topics with latent Dirichlet allocation: Some observations. In Advances in knowledge discovery and data mining, Mohammed J. Zaki, Jeffrey Xu Yu, Balaraman Ravindran and Vikram Pudi (eds.). Springer Berlin Heidelberg, 391–402. http://doi.org/10.1007/978-3-642-13657-3_43

2. Cao Juan, Xia Tian, Li Jintao, Zhang Yongdong, and Tang Sheng. 2009. A density-based method for adaptive LDA model selection. Neurocomputing (16th European Symposium on Artificial Neural Networks 2008) 72, 7–9: 1775–1781. http://doi.org/10.1016/j.neucom.2008.06.011

3. Romain Deveaud, Éric SanJuan, and Patrice Bellot. 2014. Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique 17, 1: 61–84. http://doi.org/10.3166/dn.17.1.61-84

4. Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101, suppl 1: 5228–5235. http://doi.org/10.1073/pnas.0307752101

5. Martin Ponweiser. 2012. Latent Dirichlet allocation in R. Retrieved from http://epub.wu.ac.at/id/eprint/3558

Select number of topics for LDA model

Murzintcev Nikita

2020-04-20
