Introduction to Natural Language Processing-Part 1

What is Natural Language Processing?

Natural language refers to the way we humans communicate with each other, namely speech and text. We are surrounded by text. Think about how much text you see each day: signs, menus, email, SMS, web pages, logs, chat messages, and so much more. The list is endless, and this is where natural language processing comes into the picture.

The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.
Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.
In this post, you will discover what natural language processing is and other nuances of NLP.

Why use Natural Language Processing:-

  • By some estimates, as much as 80% of the data in this world is unstructured in nature, and much of it is text. You need text mining and Natural Language Processing (NLP) to make sense of this data.
  • Natural Language Processing (NLP) helps you extract insights from customer emails, tweets, reviews and text messages.
  • Natural Language Processing (NLP) can power many applications, such as language translation, question answering systems, chatbots and document summarizers.

Unstructured data:-

In simple words, the phrase unstructured data usually refers to information that doesn’t reside in a traditional row-column database.

Computer World magazine states that unstructured information might account for 70%–80% of all data in organizations, which means that only 20%–30% of the available data is present in structured form. The amount of unstructured data in enterprises is also growing significantly, often many times faster than structured databases are growing.

Unstructured data files often include text, which may itself contain data such as dates and numbers, as well as multimedia content. Examples include e-mail messages, word processing documents, logs, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered “unstructured” because the data they contain doesn’t fit neatly in a database.

DTM (Document-Term Matrix):-

What is DTM:-
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
In simple words, the document-term matrix is a two-dimensional matrix whose rows are the documents and whose columns are the terms, so each entry (i, j) represents the frequency of term j in document i. In the case of tf-idf, each entry (i, j) represents the tf-idf score of term j in document i.

Figure: Term-document Matrix
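
To make the layout concrete, here is a minimal sketch of building a document-term matrix in plain Python. The function name build_dtm, the whitespace tokenization and the two sample documents are assumptions made for illustration only; they are not part of any particular library.

```python
from collections import Counter

def build_dtm(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of
    times vocabulary[j] occurs in documents[i]."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # Columns: every distinct term in the collection, in sorted order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document; Counter returns 0 for absent terms.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical sample documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
vocabulary, dtm = build_dtm(docs)
print(vocabulary)
for row in dtm:
    print(row)
```

Each printed row is one document and each position in the row is one term of the vocabulary, which is exactly the rows-are-documents, columns-are-terms layout described above.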

Why do we need DTM:-
The DTM representation is a fairly simple way to represent the documents as a numeric structure. Representing text as a numerical structure is a common starting point for text mining and analytics such as search and ranking, creating taxonomies, categorization, document similarity, and text-based machine learning. If you want to compare two documents for similarity you will usually start with numeric representation of the documents. If you want to do machine learning magic on your documents you may start by creating a DTM representation on the documents and using data derived from the representation as features.
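
As an illustration of comparing two documents through their numeric representation, the sketch below computes cosine similarity between term-count vectors. Cosine similarity is one common choice and is an assumption here, as are the helper names term_counts and cosine_similarity and the sample sentences.

```python
import math
from collections import Counter

def term_counts(text):
    """Term-frequency vector of one document, stored as a Counter."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    return Counter(text.lower().split())

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical sample documents.
doc_a = term_counts("the cat sat on the mat")
doc_b = term_counts("the dog sat on the log")
doc_c = term_counts("stock prices fell sharply")

print(cosine_similarity(doc_a, doc_b))  # shares several terms with doc_a
print(cosine_similarity(doc_a, doc_c))  # shares no terms with doc_a
```

Documents that share many terms score close to 1, and documents with no terms in common score 0.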

TF-IDF (Term Frequency – Inverse Document Frequency):-

The idea of using tf-idf is to give more weight to a term that is common in a specific document but uncommon across all documents. The reasons behind this are:

  • Words that are very common in a specific document are probably important to the topic of that document.
  • Words that are very common in all documents probably aren’t important to the topic of any of them.

The tf-idf score combines two components:

  • TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

    TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
  • IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as “is”, “of”, and “that”, may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

    IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

TF-IDF(t) = TF(t) * IDF(t)
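
The formulas above can be sketched directly in Python. This is a minimal illustration that follows the TF and IDF definitions given here (natural logarithm, no smoothing); the function names and the three sample documents are assumptions, and library implementations often use slightly different variants of these formulas.

```python
import math

def tf(term, tokens):
    """Occurrences of term in the document divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, tokenized_docs):
    """Natural log of (total documents / documents containing term)."""
    docs_with_term = sum(1 for tokens in tokenized_docs if term in tokens)
    if docs_with_term == 0:
        # The formula is undefined for a term that occurs in no document.
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(tokenized_docs) / docs_with_term)

def tf_idf(term, tokens, tokenized_docs):
    return tf(term, tokens) * idf(term, tokenized_docs)

# Hypothetical sample documents, tokenized by splitting on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird sang".split(),
]

# "the" occurs in every document, so its IDF is log(3/3) = 0.
print(tf_idf("the", docs[0], docs))
# "cat" occurs in only one document, so it gets a positive score.
print(tf_idf("cat", docs[0], docs))
```

Even though "the" is the most frequent word in the first document, its score drops to zero because it appears everywhere, while the rarer "cat" is weighted up.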


N-grams:-

In simple words, an N-gram is a sequence of N words. For instance, take a look at the following examples.

  • Krishna (a unigram)
  • Thank you (a bigram)
  • The Three Musketeers (a trigram)
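
As a small illustration, the sketch below lists the N-grams of a piece of text by sliding a window of N words over it. The function name ngrams and the whitespace tokenization are assumptions made for this example.

```python
def ngrams(text, n):
    """Return all runs of n consecutive words in text, as tuples."""
    # Naive tokenization (an assumption): split on whitespace.
    tokens = text.split()
    # If the text has fewer than n words, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The Three Musketeers"
print(ngrams(sentence, 1))  # unigrams
print(ngrams(sentence, 2))  # bigrams
print(ngrams(sentence, 3))  # trigrams
```

A three-word phrase therefore contains three unigrams, two bigrams and one trigram.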

Now that we understand this concept, we can build on it with the N-gram model. Basically, an N-gram model predicts the occurrence of a word based on the occurrence of its N – 1 previous words. So here we are answering the question: how far back in the history of a sequence of words should we go to predict the next word? For instance, a bigram model (N = 2) predicts the occurrence of a word given only its previous word (as N – 1 = 1 in this case). Similarly, a trigram model (N = 3) predicts the occurrence of a word based on its previous two words (as N – 1 = 2 in this case).

For the purpose of our example, we’ll consider a very small sample of sentences, but in reality, a corpus will be extremely large. Say our corpus contains the following sentences:

  • Krishna said thank you.
  • Edappadi K.Palanisamy is the Chief Minister of Tamil Nadu
  • Tamil is official language of the Indian state of Tamil Nadu.
  • It’s raining in Rajasthan.

Let’s assume a bigram model. So we are going to find the probability of a word based only on its previous word. In general, we can say that this probability is (the number of times the previous word ‘wp’ occurs before the word ‘wn’) / (the total number of times the previous word ‘wp’ occurs in the corpus) =

(Count (wp wn))/(Count (wp))
Let’s work this out with an example.
To find the probability of the word “you” following the word “thank”, we can write this as P (you | thank) which is a conditional probability.
This becomes equal to:

=(No. of times “Thank You” occurs) / (No. of times “Thank” occurs)
= 1/1
= 1
We can say with certainty that whenever “thank” occurs, it will be followed by “you” (this is because we have trained on a set of only four sentences and “thank” occurred only once, in the context of “thank you”). Let’s see an example of a case when the preceding word occurs in different contexts.
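
The same calculation can be sketched in Python by counting words and word pairs over the sample corpus. The tokenizer (lowercasing and keeping only runs of letters) and the names tokenize and bigram_probability are assumptions made for illustration.

```python
import re
from collections import Counter

corpus = [
    "Krishna said thank you.",
    "Edappadi K.Palanisamy is the Chief Minister of Tamil Nadu",
    "Tamil is official language of the Indian state of Tamil Nadu.",
    "It's raining in Rajasthan.",
]

def tokenize(sentence):
    # Assumption: lowercase the text and keep only runs of letters.
    return re.findall(r"[a-z]+", sentence.lower())

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = tokenize(sentence)
    unigram_counts.update(tokens)
    # Pair each word with the word that follows it in the same sentence.
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_probability(previous_word, word):
    """Count(previous_word word) / Count(previous_word)."""
    if unigram_counts[previous_word] == 0:
        return 0.0  # the previous word was never seen in the corpus
    return bigram_counts[(previous_word, word)] / unigram_counts[previous_word]

print(bigram_probability("thank", "you"))
```

Because "thank" appears once and is followed by "you" that one time, the estimate is 1/1.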

Let’s calculate the probability of the word “Nadu” coming after “Tamil”. We want to find P(Nadu | Tamil), that is, the probability that the next word will be “Nadu” given the word “Tamil”. We can do this by:

=(No. of times “Tamil Nadu” occurs) / (No. of times “Tamil” occurs)
= 2/3
≈ 0.67

This is because in our corpus, one of the three occurrences of “Tamil” was followed by “language” (assuming “is” and “official” are removed in the text pre-processing phase). So, P(language | Tamil) = 1 / 3.
In our corpus, only “Nadu” and “language” occur after “Tamil”, with probabilities 2 / 3 and 1 / 3 respectively. So if we want to create next-word prediction software based on our corpus, and a user types in “Tamil”, we will give two options: “Nadu” ranked most likely and “language” ranked less likely.
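
A next-word suggester along these lines can be sketched as follows. The stop-word list contains only “is” and “official” so that the pre-processing matches the example above; that list, the tokenizer and the names preprocess and suggest_next are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

corpus = [
    "Krishna said thank you.",
    "Edappadi K.Palanisamy is the Chief Minister of Tamil Nadu",
    "Tamil is official language of the Indian state of Tamil Nadu.",
    "It's raining in Rajasthan.",
]

# Hypothetical stop-word list, chosen only to mirror the example above.
STOP_WORDS = {"is", "official"}

def preprocess(sentence):
    # Assumption: lowercase, keep runs of letters, then drop stop words.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

word_counts = Counter()
followers = defaultdict(Counter)  # previous word -> counts of the next word
for sentence in corpus:
    tokens = preprocess(sentence)
    word_counts.update(tokens)
    for previous_word, word in zip(tokens, tokens[1:]):
        followers[previous_word][word] += 1

def suggest_next(previous_word):
    """Return (word, probability) pairs, most likely first."""
    previous_word = previous_word.lower()
    total = word_counts[previous_word]
    next_words = followers.get(previous_word, Counter())
    # Probability = Count(previous_word word) / Count(previous_word).
    return [(word, count / total) for word, count in next_words.most_common()]

for word, probability in suggest_next("Tamil"):
    print(word, round(probability, 2))
```

For the input "Tamil" this ranks "nadu" ahead of "language", and a word that never appears in the corpus simply yields no suggestions.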

Generally, a bigram model works reasonably well for simple next-word prediction like this, and it may not be necessary to use trigram or higher-order N-gram models, which need much more text to estimate reliably.

