LDAShiny
is a user-friendly shiny (Chang et al. 2017) web application for carrying out exploratory reviews of scientific literature, and of textual information in general, which implements the generative probabilistic model of Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003). The motivation for creating LDAShiny
is to streamline the routine LDA workflow so that users unfamiliar with R can perform the analysis interactively in a web browser.
Data may be obtained by querying the SCOPUS or Clarivate Analytics Web of Science (WoS) databases by diverse fields, such as topic, author, journal, and timespan.
For the demonstration, an exploratory literature review of the species Oreochromis niloticus was carried out, taking into account documents in which the name of the species was mentioned in the title, abstract, or keywords, ensuring that as many potentially relevant documents as possible were included (Figure 1).
A total of 6,196 article abstracts were found, covering the last three decades (1991 to June 2020). Reviewing this many documents individually would take too long, so the set of articles considered provides a good case for testing the application.
You can download the references (up to 2,000 full records) by checking the “Select All” box and clicking on the “Export” link. Choose the file type “CSV export” and the fields “Document Title”, “Year”, and “Abstract” (Figure 2).
The SCOPUS export tool creates an export file with the default name “scopus.csv”. You can download the file that contains the data for this example (O.niloticus.csv) at https://github.com/JavierDeLaHoz/o.niloticus/blob/main/O.niloticus.csv. For CSV files, accented characters do not display correctly in Excel unless you save the file and then use Excel’s data import functionality to view it. You can also download and save the file from R.
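A minimal sketch of how this could look is shown below; note that the raw-file URL is inferred from the repository link above, since GitHub “blob” pages serve an HTML page rather than the CSV itself.

```r
# Download the example data (raw-file URL inferred from the repository link)
url <- "https://raw.githubusercontent.com/JavierDeLaHoz/o.niloticus/main/O.niloticus.csv"
download.file(url, destfile = "O.niloticus.csv")

# Re-import with an explicit encoding so accented characters display correctly
dat <- read.csv("O.niloticus.csv", fileEncoding = "UTF-8")
```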
Pre-processing seeks to normalize the text, converting it to a more convenient standard form, and also reduces the dimensionality of the data matrix by eliminating noise and meaningless terms. The dataset must be in a wide format (one article or abstract per row). Upload the O.niloticus.csv
data file to LDAShiny from the Upload Data panel. Next, on the Data cleaning panel, click the Incorporate information
button, then specify the columns for id document
(Title in our case), Select document vector
(Abstract), and Select publish year
(Year). Next, click the checkbox to select ngram
(bigrams for this demonstration), remove the numbers, select the language for the stopwords, and include the words you want to delete. In our case, in addition to the default list, we used a pre-compiled list called SMART (System for the Mechanical Analysis and Retrieval of Text) from the stopwords
package.
package. The complete list of words used for the stopword option can be downloaded.(https://github.com/JavierDeLaHoz/stopword/blob/main/stopword.csv) For this example,no stemming was performed and the Sparcity slider used was 99.5%, that is, the terms that appear in more than 0.5%. Finally, the Create DTM (Document Term Matrix) button is clicked and a spinner is displayed when it is being calculated. At the end of the process, a summary is shown like the one in the Video 2.
The cleaning process should be viewed as iterative, since no single cleaning procedure can be guaranteed to suit every exploratory review. Before cleaning, there were 530,143 (Original) terms in the corpus; the procedure reduced the number of unique terms to 3,268 (Final), greatly reducing computational needs.
The resulting DTM matrix can be previewed in the Document Term Matrix Visualizations
panel. The most frequent terms found across all analyzed documents are shown in both tabular and graphical form.
The View Data
button shows some basic corpus statistics: term_freq gives term frequencies, doc_freq the number of documents in which each term appears, and idf the inverse document frequency, which measures the importance of a term. Also shown are a series of buttons for downloading in CSV, Excel, or PDF format, printing the file, copying it to the clipboard, and configuring the number of rows shown in the summary (Video 3).
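These same statistics can be reproduced with textmineR; a sketch assuming the dtm built above:

```r
# Term-level statistics as shown in View Data: term frequency,
# document frequency, and inverse document frequency
tf <- textmineR::TermDocFreq(dtm)
head(tf[order(tf$term_freq, decreasing = TRUE), ])
```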
Clicking the View barplot
button displays a barplot. The number of bars can be configured with the slider in the Select number of term dropdown. Clicking the export button in the upper right part of the graph lets you download it in different formats (png, jpeg, svg, and pdf) (Video 3).
Clicking the View wordcloud button displays a wordcloud. The number of words can be configured with the slider in the Select number of term dropdown. Clicking the export button in the upper right part of the graph lets you download it in different formats (png, jpeg, svg, and pdf) (Video 3).
Once the DTM matrix has been obtained, the next step is to determine the optimal number of topics.
Note that finding the number of topics is a computationally intensive procedure; it may take from a few minutes to a couple of days depending on the size of the DTM, the number of models (the number of topic counts to evaluate), the number of iterations, and the number of cores in your computer (LDAShiny
works with detectCores() - 1 cores).
Video 4 shows the configuration options for each of the metrics used to calculate the number of topics.
The processing time for each of the four approaches was 10,408, 1,987, 6,018, and 2,755 seconds for “coherence”, “4-metrics”, “perplexity”, and “harmonic mean”, respectively. Regarding the number of topics, the metrics Griffiths2004 (Griffiths and Steyvers 2004), CaoJuan2009 (Cao, Xia, Li, Zhang, and Tang 2009), Arun2010 (Arun, Suresh, Madhavan, and Murthy 2010), and Deveaud2014 (Deveaud, SanJuan, and Bellot 2014), as well as perplexity and the harmonic mean, agreed that a suitable number of topics lies between 35 and 50, while coherence suggested 24 topics. The lowest number of topics was preferred in this case, as the intention is to provide an overview.
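The four metrics correspond to those implemented in the ldatuning package, so a comparable search can be sketched outside the app. Note that ldatuning::FindTopicsNumber expects a tm-style DocumentTermMatrix, so a DTM built with textmineR would first need conversion; the topic range below is illustrative.

```r
library(ldatuning)

# Evaluate candidate numbers of topics with the four metrics,
# using all but one of the available cores
result <- FindTopicsNumber(
  dtm,  # assumed here to be a tm-style DocumentTermMatrix
  topics   = seq(5, 60, by = 5),
  metrics  = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method   = "Gibbs",
  mc.cores = parallel::detectCores() - 1
)
FindTopicsNumber_plot(result)
```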
Once a decision has been made on the number of topics, the LDA model is fitted; the input is a document term matrix. LDAShiny
implements Gibbs sampling. The inference parameters should be used as a guide; some may be changed, for example the number of iterations may be increased, and for alpha we can follow the recommendation of Griffiths and Steyvers (2004) to use a value of 50/k. In our case we used 1000 iterations, a burn-in of 100, and an alpha value of 3.57 as input parameters (Video 5).
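In script form, a comparable fit could be obtained with textmineR; the parameter values below are those entered in the app, and the seed is an addition for reproducibility.

```r
library(textmineR)

set.seed(123)  # Gibbs sampling is stochastic; fixing a seed aids reproducibility

model <- FitLdaModel(
  dtm        = dtm,
  k          = 24,    # number of topics chosen above
  iterations = 1000,  # Gibbs iterations
  burnin     = 100,   # iterations discarded before averaging
  alpha      = 3.57   # prior entered in the app, in the spirit of 50/k
)
```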
The output from the model contains several objects; the DTM is reduced to two matrices. The first one, theta (\(\theta\)), has rows representing the distribution of topics over documents, \(P\left ( topic_{k}\mid document_{d} \right )\). The second one, phi (\(\phi\)), has rows representing the distribution of words over topics, \(P\left ( token_{v}\mid topic_{k} \right )\).
These matrices can be viewed and downloaded in the Download tabular results
menu. In this menu we can also see and configure a summary of the model in the Summary LDA
tab, where three sliders shown at the top configure the summary: Select number of labels
, Select number top terms
, and Select assignments
; the latter is a documents-by-topics matrix similar to theta (Video 6).
LDAShiny
provides topic labels (the same as those used by the textmineR package), based on a naive n-gram labeling algorithm. Such algorithms have a limited capacity; nonetheless, they can serve as a guide.
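A sketch of this labeling step with textmineR, assuming the fitted model and dtm from above; the 0.05 threshold used to turn theta into hard assignments is illustrative.

```r
# Naive n-gram topic labels: documents are "assigned" to topics where
# theta exceeds a threshold, and labels are mined from those documents
labels <- textmineR::LabelTopics(
  assignments = model$theta > 0.05,  # illustrative threshold
  dtm         = dtm,
  M           = 2                    # candidate labels per topic
)
```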
In the Download tabular results
menu, clicking on the Allocation button
displays a table in which the user can find the documents organized by topic. The slider located at the top lets us choose the number of documents per topic to be displayed (Video 6).
To facilitate the characterization of the topics in terms of their trend, LDAShiny
fits a simple regression slope for each topic, with the year as the independent variable and the mean proportion of the topic in the corresponding year as the response (Griffiths and Steyvers 2004):
\[\theta _{k}^{y}=\frac{\sum_{m\in y}\theta _{mk}}{n^{y}}\] where \(m\in y\) denotes the articles published in year \(y\), \(\theta_{mk}\) is the proportion of the \(k\)-th topic in article \(m\), and \(n^{y}\) is the total number of articles published in year \(y\). Topics whose regression slopes are significant (at a given significance level) are considered to have an increasing trend if the slope is positive, and a decreasing trend otherwise. If the slope is not significant, the trend of the topic is said to be fluctuating.
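A minimal sketch of this calculation in R, assuming the theta matrix from the fitted model and a Year column aligned with its rows:

```r
# Mean topic proportion per publication year (rows: years, cols: topics)
theta_by_year <- apply(model$theta, 2,
                       function(tk) tapply(tk, dat$Year, mean))
years <- as.numeric(rownames(theta_by_year))

# Regress each topic's yearly mean proportion on year;
# keep slope, standard error, test statistic, and p-value
trends <- apply(theta_by_year, 2, function(prop) {
  summary(lm(prop ~ years))$coefficients[2, ]
})
```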
Clicking on the trend
button, a table is shown with the results of the simple linear regression (intercept, slope, test statistic, standard error, and p-value) (Video 6).
In addition to the tabular options mentioned, in the Download graphic results
menu we can find three buttons: trend
, View wordcloud by topic
, and heatmap
. Clicking on the trend
button displays a line graph (one line for each topic) in which time trends can be visualized. The graph is interactive: clicking on a line removes or displays it as the user decides (Video 7).
Clicking the View wordcloud by topic
button displays a wordcloud. In the drop-down button you can select the topic for which to generate the wordcloud, and with the slider you can select the number of words to show (Video 7).
Clicking the heatmap
button displays a heatmap. The years are shown on the x-axis, the topics on the y-axis, and the color variation represents the probabilities (Video 7).
Arun R, Suresh V, Madhavan CV, Murthy MN (2010). On finding the natural number of topics with Latent Dirichlet Allocation: Some observations. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 391-402. Springer.
Blei DM, Ng AY, Jordan MI (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
Cao J, Xia T, Li J, Zhang Y, Tang S (2009). A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9), 1775-1781.
Chang W, Cheng J, Allaire JJ, Xie Y, McPherson J (2017). Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Deveaud R, SanJuan E, Bellot P (2014). Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique, 17(1), 61-84.
Griffiths TL, Steyvers M (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.