Introduction to Natural Language Processing - Part 2

Step1. Load all necessary libraries and remove previously stored data from R-Studio

 rm(list = ls()) #Remove data from previous session in R-Studio
 #Load all libraries and dependencies; install a package first if it is missing
 load_libry <- function(x){
   for(i in x){
     if(!require(i, character.only = TRUE)){
       install.packages(i, dependencies = TRUE)
       library(i, character.only = TRUE) #load the package right after installing it
     }
   }
 }

 pkgs <- c("dplyr","tidyverse","tidytext","tm","SnowballC","reshape2","e1071","caret","textstem",
           "doSNOW","parallel","randomForest","janeaustenr","readxl","rlist","tokenizers","stopwords",
           "wordcloud2","wordcloud","doParallel")

 load_libry(pkgs)
 #set working directory
 setwd("D:/Praveen/MachineLearning")

Step2. Load data and stop words

#Load data
paragraphs<- read_excel("paragraphs.xlsx")

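#load stop words (stopwords:: makes sure the stopwords package is used, not tm::stopwords)
stop_words <- as.data.frame(stopwords::stopwords(source = "smart"), stringsAsFactors = FALSE)
colnames(stop_words) <- "word"
#load custom stop words (the CSV needs a single column named "word" for rbind to work)
stop_words_custom <- read.csv("stop_words_custom.csv", stringsAsFactors = FALSE)
#Combine both kinds of stop words
stop_words_updated <- rbind(stop_words, stop_words_custom)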

Step3. Tokenization

Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens.

Next, split the text into tokens so it can be processed further. The function below uses the tidytext library to create tokens and then lemmatizes them, so that each token is reduced to its root form, e.g. “functions” and “functionality” become “function”.

Note: If the corpus is very large, use stemming instead, since lemmatization can take a long time.

unnestDF <- function(data, col) {
   #split the text column into one token (word) per row
   data <- as.data.frame(data %>%
                         unnest_tokens(word, !! rlang::sym(col)))
   #reduce each token to its lemma
   data$word <- lemmatize_strings(data$word, dictionary = lexicon::hash_lemmas)
   return (data)
 }
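
To make the idea of tokenization concrete, here is a minimal sketch in base R only. The helper simple_tokenize is a hypothetical name used for illustration and is not part of tidytext; unnest_tokens does considerably more (for example, it keeps the other columns of the data frame).

```r
# Hypothetical helper (not from tidytext): lowercase the text, then split it
# on runs of non-alphanumeric characters and drop any empty pieces.
simple_tokenize <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^[:alnum:]]+"))
  tokens[nzchar(tokens)]
}

tokens <- simple_tokenize("Tokenization splits text into smaller pieces, called tokens!")
print(tokens)
```

Each element of tokens is one lowercase word, with punctuation already removed.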

Step4. Write function to remove numbers and all special characters from corpus

We can filter out tokens that we are not interested in, such as standalone punctuation and numbers. The function below strips digits and special characters from every token and then drops the tokens that end up empty.

#Remove Numbers & all special characters (vectorized, so no row-by-row loop is needed)
RmvNumber <- function(data, col){
  words <- removeNumbers(data[[col]])         #remove numbers
  words <- gsub("[^[:alnum:]]", "", words)    #remove special characters
  words[!nzchar(words)] <- NA                 #Replace blanks with NA
  data[[col]] <- words
  data <- data[!is.na(data[[col]]), ]         #drop rows whose token is now empty
  return (data)
}
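
To see what this kind of cleanup does on a tiny input, here is a minimal base R sketch. The sample vector is made up for illustration, and gsub("[0-9]+", ...) stands in for tm's removeNumbers so the snippet needs no packages.

```r
# Made-up sample tokens for illustration
words <- c("model2", "42", "e-mail", "--", "text")

cleaned <- gsub("[0-9]+", "", words)          # strip digits
cleaned <- gsub("[^[:alnum:]]", "", cleaned)  # strip special characters
cleaned <- cleaned[nzchar(cleaned)]           # drop tokens that became empty

print(cleaned)
```

Tokens that consist only of digits or punctuation disappear completely, while mixed tokens such as "model2" keep their alphabetic part.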

Step5. Remove stop words

Stop words are the most common words in a language, like “the”, “a”, “on”, “is”, “all”. These words do not carry important meaning and are usually removed from texts. For some applications, like document classification, it may make sense to remove stop words.

Note: There will be times when you need to add custom words depending on your use case, so feel free to expand the stop word list provided by R.

#Soft removal of stop words
sft_Remv_StpW <- function(data, cols){
   col1 <- cols[1]; col2 <- cols[2]
   data <- data %>%
     anti_join(stop_words_updated, by = col2) %>%    #drop tokens found in the stop word list
     group_by(across(all_of(c(col1, col2)))) %>%     #group by document id and token
     summarise(Count = n()) %>%                      #count each token per document
     arrange(-Count)
   data <- na.omit(data) #Remove NA
   return(data)
 }
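
The same filter-then-count logic can be sketched in base R on a toy example. The data frame and the short stop word list below are made up for illustration; the function above uses dplyr and the full stop word list instead.

```r
# Toy tokens: one row per token, with the paragraph it came from
tokens <- data.frame(ParaId = c(1, 1, 1, 2, 2),
                     word   = c("the", "model", "model", "a", "token"),
                     stringsAsFactors = FALSE)
stop_list <- c("the", "a", "on", "is", "all")   # illustrative stop words only

# Keep rows whose word is NOT a stop word (what anti_join does above)
kept <- tokens[!tokens$word %in% stop_list, ]

# Count occurrences per paragraph and word, most frequent first
counts <- aggregate(Count ~ ParaId + word,
                    data = transform(kept, Count = 1L), FUN = sum)
counts <- counts[order(-counts$Count), ]
print(counts)
```

The result has one row per paragraph and word, which is the shape the later tf-idf step expects.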

Step6. Enable Parallel processing

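If a task finishes faster with parallel processing, why not use it and save time?  😛 Keep in mind that registering a cluster only speeds up code that actually dispatches work to it, for example through foreach with %dopar% or through packages such as caret that support a registered backend.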

#Create a parallel socket (PSOCK) cluster
 cores <- detectCores() - 1
 myCluster <- makeCluster(cores, type = "PSOCK")
 registerDoParallel(myCluster)

start.time <- Sys.time()

#columns used for grouping and stop word removal: paragraph id and token
 cols <- c("ParaId", "word")

#convert text into tokens (assumes the paragraphs data loaded in Step2)
 t1_lm <- unnestDF(paragraphs, 'ParaText')

#Remove Numbers
 t1_lm_temp <- RmvNumber(t1_lm, 'word')

#Remove stop words
 unigram <- sft_Remv_StpW(t1_lm_temp, cols)

 total.time <- Sys.time() - start.time
 total.time
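
A cluster only helps when work is actually sent to its workers. As a minimal sketch, assuming two workers are available on the machine, the snippet below counts tokens per paragraph in parallel with parLapply from the parallel package. The sample paragraphs are made up for illustration.

```r
library(parallel)

# Made-up paragraphs for illustration
paras <- c("one two three", "four five", "six", "seven eight nine ten")

cl <- makeCluster(2, type = "PSOCK")   # assumption: 2 workers are enough here

# Each worker receives some of the paragraphs and counts their tokens
n_tokens <- unlist(parLapply(cl, paras, function(p) {
  length(strsplit(p, " +")[[1]])
}))

stopCluster(cl)                        # always release the workers when done
print(n_tokens)
```

For inputs this small the overhead of starting workers outweighs the gain; parallel processing pays off on large corpora or expensive per-document work.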

Step7. Store unigrams

Simply put, an N-gram is a sequence of N consecutive words. A unigram is a single word, a bigram is two consecutive words, and so on.

unigram<-unigram[with(unigram, order( ParaId)),] #sort dataframe
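
To illustrate the N-gram definition, here is a minimal base R sketch. The helper make_ngrams is a hypothetical name introduced for this example; in practice tidytext's unnest_tokens with token = "ngrams" does this for you.

```r
# Hypothetical helper: slide a window of n tokens over the token vector
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens   <- c("natural", "language", "processing", "rocks")
unigrams <- make_ngrams(tokens, 1)
bigrams  <- make_ngrams(tokens, 2)
print(bigrams)
```

With n = 1 the function simply returns the tokens themselves, which is why unigrams and tokens are the same thing in this post.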

Step8. Calculate tf-idf score

The idea behind tf-idf is to give more weight to a term that is common in a specific document but uncommon across all documents.

#########################Calculate tf-idf score

df<-unigram %>%
  bind_tf_idf(word, ParaId,Count)%>%
  arrange(desc(tf_idf))

########check tf-idf score frequency
score<-df %>% group_by(tf_idf) %>%summarise('Score'=n()) %>% arrange(desc(tf_idf))
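
To show where the score comes from, here is a small worked example in base R. It assumes the definition used by bind_tf_idf: tf is a term's count divided by the total count in its document, and idf is the natural log of the number of documents divided by the number of documents containing the term. The counts are made up for illustration.

```r
# Made-up counts: one row per (paragraph, word) pair
counts <- data.frame(ParaId = c(1, 1, 2, 2),
                     word   = c("model", "token", "model", "cloud"),
                     Count  = c(2, 1, 1, 3),
                     stringsAsFactors = FALSE)

# Term frequency: count divided by the total count of its paragraph
doc_totals <- tapply(counts$Count, counts$ParaId, sum)
counts$tf  <- counts$Count / as.numeric(doc_totals[as.character(counts$ParaId)])

# Inverse document frequency: ln(number of docs / docs containing the word)
n_docs     <- length(unique(counts$ParaId))
doc_freq   <- table(counts$word)   # valid because each (doc, word) pair appears once
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

The word "model" appears in both paragraphs, so its idf and tf-idf are 0, while "cloud" appears often in a single paragraph and gets the highest score.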

Step9. Plot Histogram

Plot a histogram to check the distribution of token frequencies.

##Plot histogram to check distribution of word's frequency
oopt <- options(scipen=100) ## avoid scientific notation on the axes
opar <- par(bg="gray")       ## set plot background color
hist(df$Count,
     main="Distribution of words in data",
     xlab="Count",
     breaks="FD",
     xlim=c(0,20),col="black")
grid(col="white") ## plots on top of histogram; re-plot
## histogram if you like ...
par(opar) ## restore original graphical settings
options(oopt) ## restore original options

Step10. Plot wordCloud

Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud.

Simply put, a word cloud is a quick way to collect and display the most frequent words in a text.

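###Plot word cloud
set.seed(1234)
df_temp <- as.data.frame(df) #assumption: df_temp is the tf-idf data frame from Step8 (optionally filtered)
wordcloud2(df_temp[c('word','Count')], shape='triangle', backgroundColor='black')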

Step11. Build Document-Term Matrix

Simply put, the document-term matrix is a two-dimensional matrix whose rows are the documents and whose columns are the terms, so each entry (i, j) represents the frequency of term j in document i.

##############################Build DTM(Document Term Matrix) for Unigrams
Uni_dtm<- as.data.frame(dcast(df_temp, ParaId ~ word, value.var = "Count",fun.aggregate = sum))
dim(Uni_dtm) #660* 941
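
For a quick feel of the resulting shape, here is a minimal base R sketch that builds a tiny document-term matrix with xtabs instead of dcast. The counts are made up for illustration.

```r
# Made-up counts: one row per (paragraph, word) pair
counts <- data.frame(ParaId = c(1, 1, 2, 2),
                     word   = c("model", "token", "model", "cloud"),
                     Count  = c(2, 1, 1, 3),
                     stringsAsFactors = FALSE)

# Rows are documents (ParaId), columns are terms (word);
# pairs that never occur are filled with 0
dtm <- xtabs(Count ~ ParaId + word, data = counts)
print(dtm)
```

Here dtm["1", "model"] is the frequency of the term "model" in paragraph 1, and most cells are 0 because each paragraph uses only a small part of the vocabulary.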

“All these pre-processing steps aim to reduce the vocabulary size without removing any important content (which in some cases may not be true when you lowercase certain words, ie. ‘Bush’ is different than ‘bush’, while ‘Another’ has usually the same sense as ‘another’). The smaller the vocabulary is, the lower is the memory complexity, and the more robustly are the parameters for the words estimated. You also have to pre-process the test data in the same way.”

In short, you will understand all of this much better once you work on real NLP use cases.

Please feel free to check out my previous post on NLP to understand NLP nuances like DTM, N-grams, TF-IDF score in more detail here.
