Wordcloud2 - Top Three Budget Speech Analysis
Wordcloud2 has been generated for three high-profile budget speeches from the past three decades, starting with 1991-92. That was a historic budget speech in our recent political history, delivered by Dr. Manmohan Singh. Next came another high-profile budget speech in 2004-05 by Shri P. Chidambaram. Most recently, the current Honourable Finance Minister Nirmala Sitharaman delivered the 2020-21 budget speech, for which the people of India had high expectations. A word cloud is too basic a tool for analysing a budget speech. The intent of this blog is to demonstrate wordcloud2 generation; the budget analysis is only a byproduct.
#-------------------------------------------------------------------------
#R Code:
#step-by-step approach to wordcloud2 for budget speech in pdf
#step1: save budget speech 1991-92, 2004-05, and 2020-21 as pdf in working directory
#step2: load NLP, tm, RColorBrewer, SnowballC, wordcloud2, pdftools, stringr
#step3: input the pdf file as text using pdftools package
#step4: clean the Text data
#step5: make corpus and dtm of cleaned text data
#step6: visualize the text data and generate word cloud
library(NLP)
library(tm)
library(RColorBrewer)
library(SnowballC)
library(pdftools)
library(stringr)
library(wordcloud2)
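# budget speech in pdf as input data for wordcloud
txt <- pdf_text("bs199192.pdf") #pdf file must be in the working directory
abc <- character() #vector initialization
for(i in 2:length(txt)){ #start at page 2, skipping the first page
  xyz <- strsplit(txt[i], '\r?\n') #split the page into lines; accepts \r\n and \n
  for(line in xyz){
    abc <- c(abc, line)
  }
}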
abc <- as.matrix(abc)
head(abc)
tail(abc)
#text data cleaning
# stringr function for removing dashes and curly quotes
abc <- str_remove_all(abc, "[–—’“”]")
# tm functions for text cleaning
abc<-removeNumbers(abc)
abc<-removePunctuation(abc)
abc<-tolower(abc)
custom_stopwords<-c("crores", "per", "also", "can", stopwords("en")) #custom words plus tm's English list
abc<-removeWords(abc, custom_stopwords)
abc<-stripWhitespace(abc)
abc<-wordStem(abc) #function from SnowballC; it treats each element as one word, so whole lines stay largely unstemmed
review_text<-abc
head(review_text)
tail(review_text)
#A vector source interprets each element of the vector as a document
review_source<-VectorSource(review_text)
#Next step: to make a corpus
corpus<-Corpus(review_source)
inspect(corpus[1:10])
#Next step: Document-term matrix
dtm<-DocumentTermMatrix(corpus)
dtm #displays meta data
inspect(dtm[1:9, 101:108])
dtm2<-as.matrix(dtm)
frequency<-colSums(dtm2)
frequency<-sort(frequency, decreasing = TRUE)
head(frequency,20)
df<- data.frame(word = names(frequency), freq = frequency)
#visualization via wordcloud2 package
wordcloud2(df, size=0.5, shape="star", widgetsize=c(1000,1000))
#-------------------------------------------------------------------------
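As a dependency-free illustration of what steps 3 to 5 compute, the sketch below splits page text into lines, cleans it, and counts words using only base R. The helper name word_frequencies, the made-up sample text, and the very short stop-word list are assumptions for illustration; the code above uses pdftools and tm for these steps, and this is a stand-in rather than a replacement.

```r
# Minimal base-R sketch of steps 3-5; 'word_frequencies' is a hypothetical helper
word_frequencies <- function(pages, stop_words = c("the", "and", "of", "to", "is")) {
  # Split every page into lines; "\r?\n" accepts Windows and Unix line endings
  lines <- unlist(strsplit(pages, "\r?\n"))
  # Lower-case, then replace digits and punctuation with spaces
  lines <- tolower(lines)
  lines <- gsub("[[:digit:][:punct:]]", " ", lines)
  # Split lines into words on whitespace; drop empty strings and stop words
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words) & !(words %in% stop_words)]
  # Count each word and sort from most to least frequent
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(counts), freq = as.vector(counts),
             stringsAsFactors = FALSE)
}

# Made-up sample text standing in for pdf_text() output (one string per page)
pages <- c("The tax on excise duty.\r\nTax is revenue.",
           "Duty and tax.\nRevenue 1991")
word_freq <- word_frequencies(pages)
print(word_freq)
```

The result has a word column followed by a frequency column, which is the layout wordcloud2 reads from its input data frame.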
The first wordcloud2 has been generated for the 1991-92 budget speech by Dr. Manmohan Singh. The high-frequency words are: will, propose, duty, tax, excise, government, revenue, increase, fiscal, expenditure, interest, economy, development, sector, financial. These frequently used words express concepts related to budget planning in general. The star shape needs a larger widget size in pixels to display properly. The wordcloud2 with argument shape="star" is given below.
Figure 1 Wordcloud2 with 'star'
The second wordcloud2 has been generated with argument shape="diamond", size=1, and a relatively smaller display size. The R code changes to be made for the second wordcloud2 are given below.
#-------------------------------------------------------------------------
#R Code:
# budget speech in pdf as input data for wordcloud
txt <- pdf_text("bs200405.pdf")
#visualization via wordcloud2 package
wordcloud2(df, size=1, shape="diamond", widgetsize=c(900,600))
#-------------------------------------------------------------------------
The wordcloud2 with shape="diamond" is shown in Figure 2 below.
Figure 2 Wordcloud2 with 'diamond'
The high-frequency words in the 2004-05 budget speech made by Shri P. Chidambaram are: will, propose, tax, government, scheme, duty, sector, states, excise, water, capital, public, investment, rural. Words such as sector, states, water, public, and rural suggest that this budget speech was more focused on specific areas.
The third wordcloud2 has been made with shape="triangle", size=0.7, and a slightly larger display size than the second wordcloud2. The changes to be made in the R code are given below for ready reference.
#-------------------------------------------------------------------------
#R Code:
# budget speech in pdf as input data for wordcloud
txt <- pdf_text("bs202021.pdf")
#visualization via wordcloud2 package
wordcloud2(df, size=0.7, shape="triangle", widgetsize=c(1000,800))
#-------------------------------------------------------------------------
The wordcloud2 with shape="triangle" is shown below in Figure 3.
Figure 3 Wordcloud2 with 'triangle'
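The three clouds differ only in the input file and three wordcloud2 arguments, so those settings can be kept together in one table. The sketch below collects the values quoted in this post and builds the argument list for one cloud. The helper cloud_args, the column names, and the two-row placeholder frequency table are assumptions for illustration, and the cloud is only created when the wordcloud2 package is installed.

```r
# Settings quoted in this post for the three word clouds
cloud_settings <- data.frame(
  file   = c("bs199192.pdf", "bs200405.pdf", "bs202021.pdf"),
  size   = c(0.5, 1, 0.7),
  shape  = c("star", "diamond", "triangle"),
  width  = c(1000, 900, 1000),
  height = c(1000, 600, 800),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the wordcloud2() argument list for one settings row
cloud_args <- function(freq_df, settings_row) {
  list(data = freq_df,
       size = settings_row$size,
       shape = settings_row$shape,
       widgetsize = c(settings_row$width, settings_row$height))
}

# Placeholder frequency table; in practice this is the df built from the speech
sample_df <- data.frame(word = c("tax", "duty"), freq = c(3, 2))
args_2004 <- cloud_args(sample_df, cloud_settings[2, ])
str(args_2004[-1])

# Create the cloud only if wordcloud2 is available; print(cloud) would display it
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  cloud <- do.call(wordcloud2::wordcloud2, args_2004)
}
```

Keeping the settings in one place makes it harder for a caption or argument to drift out of step with the cloud it describes.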
The budget speech 2020-21 is next in line for our analysis with wordcloud2. The high-frequency words in this budget speech made by Honourable Nirmala Sitharaman are: will, tax, proposed, infrastructure, duty, order, income, government, provide, customs, india, sector, capital, budget, act, health.
In terms of the high frequency words in the wordcloud2 analysis, we could possibly rank the three budget speeches as given below:
1. Budget Speech 2004-05 by Shri P. Chidambaram (Former Finance Minister)
2. Budget Speech 2020-21 by Honourable Nirmala Sitharaman (Finance Minister)
3. Budget Speech 1991-92 by Dr. Manmohan Singh (Former Finance Minister & PM)
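The high-frequency lists quoted for the three speeches can also be compared directly with base R set operations. The sketch below is only an illustration: the variable names are assumptions, the word lists are copied from this post, and the output shows which words are shared or distinctive rather than reproducing the ranking.

```r
# High-frequency words as listed in this post for each speech
top_1991 <- c("will", "propose", "duty", "tax", "excise", "government",
              "revenue", "increase", "fiscal", "expenditure", "interest",
              "economy", "development", "sector", "financial")
top_2004 <- c("will", "propose", "tax", "government", "scheme", "duty",
              "sector", "states", "excise", "water", "capital", "public",
              "investment", "rural")
top_2020 <- c("will", "tax", "proposed", "infrastructure", "duty", "order",
              "income", "government", "provide", "customs", "india", "sector",
              "capital", "budget", "act", "health")

# Words appearing in all three lists
shared <- Reduce(intersect, list(top_1991, top_2004, top_2020))
# Words appearing in one list only
only_1991 <- setdiff(top_1991, union(top_2004, top_2020))
only_2004 <- setdiff(top_2004, union(top_1991, top_2020))
only_2020 <- setdiff(top_2020, union(top_1991, top_2004))

print(shared)     # will, duty, tax, government, sector
print(only_2004)  # scheme, states, water, public, investment, rural
print(only_2020)
```

Note that "propose" and "proposed" count as different words here because the lists are compared as exact strings.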
We conclude the wordcloud2 generation and its analysis with high regard for the New Economic Policy 1991 and the great contribution made by Dr. Manmohan Singh to the economic development of India. Wordcloud2 is a simple tool for text document analysis and offers insights into the document being analyzed. Jai Hind!