The Naïve Bayes Algorithm Explained

1. Overview

Naive Bayes is a very simple algorithm based on conditional probability and counting. Essentially, the model is a probability table built up from the training data. To predict a new observation, you simply look up the class probabilities in that table based on the observation’s feature values.

It’s called “naive” because its core assumption of conditional independence (i.e. all input features are independent of one another) rarely holds in the real world.

Strengths: Even though the conditional independence assumption rarely holds, NB models perform surprisingly well in practice, especially for their simplicity. They are easy to implement and can scale with your dataset.

Weaknesses: Because of their sheer simplicity, Naïve Bayes models are often outperformed by more flexible algorithms that have been adequately trained and tuned.

2. Introduction

Machine learning and artificial intelligence now surround almost everything around us, and classification and prediction are among the most important aspects of machine learning. Naive Bayes is a simple but surprisingly powerful algorithm for predictive modelling. In this article, I’ll cover some beginner-friendly topics about it.

3.  What is the Naïve Bayes algorithm?

Naive Bayes is a simple supervised machine learning algorithm that uses Bayes’ theorem with strong independence assumptions between the features to make predictions. That is, the algorithm assumes that each input variable is independent of the others. It is a naive assumption to make about real-world data. For example, if you use Naive Bayes for sentiment analysis, given the sentence ‘I like Harry Potter’, the algorithm will look at the individual words and not the sentence as a whole. In a sentence, words that stand next to each other influence each other’s meaning, and the position of words in a sentence matters. To the algorithm, however, the phrases ‘I like Harry Potter’, ‘Harry Potter I like’, and ‘Potter I like Harry’ are all the same.

It turns out that the algorithm can effectively solve many complex problems. For example, building a text classifier with Naive Bayes is much easier than with more exciting algorithms such as neural networks. The model works well even with insufficient or mislabeled data, so you don’t have to ‘feed’ it hundreds of thousands of examples before you can get something reasonable out of it. A working implementation can fit in around 50 lines of code, yet it is very effective.

It is called naive because it assumes that each input variable is independent. This is a strong and unrealistic assumption for real data; nevertheless, the technique is very effective on many complex problems. The idea behind naive Bayes classification is to classify the data by choosing the class Ci that maximizes P(O | Ci) * P(Ci), using Bayes’ theorem of posterior probability (where O is the object or tuple in the dataset and i is the index of the class).
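This maximization can be sketched in a few lines of Python. The class priors and word likelihoods below are made-up numbers, purely for illustration; in practice they would be estimated from training counts.

```python
# Minimal sketch of naive Bayes classification: pick the class Ci that
# maximizes P(O | Ci) * P(Ci), where P(O | Ci) factorizes over the features
# under the conditional-independence assumption.

# Hypothetical learned probabilities (these numbers are invented):
priors = {"spam": 0.4, "ham": 0.6}           # P(Ci)
likelihoods = {                              # P(feature | Ci)
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham":  {"free": 0.05, "meeting": 0.20},
}

def classify(features):
    # Score each class by P(Ci) times the product of P(feature | Ci).
    scores = {}
    for c, prior in priors.items():
        score = prior
        for f in features:
            score *= likelihoods[c].get(f, 1e-6)  # tiny floor for unseen features
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(["free"]))     # spam scores 0.4*0.30=0.12 vs ham 0.6*0.05=0.03
print(classify(["meeting"]))  # spam scores 0.4*0.02=0.008 vs ham 0.6*0.20=0.12
```

Note that only the relative scores matter for picking the winning class, which is why the denominator P(O) from Bayes’ theorem can be dropped.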

4.  Bayes Theorem

Bayes’ theorem is a way to figure out Conditional Probability. Conditional probability is the probability of an event happening, given that it has some relationship to one or more other events. For example, your probability of getting a parking space is connected to the time of day you park, where you park, and what conventions are going on at any time. Bayes’ theorem is slightly more nuanced. In a nutshell, it gives you the actual probability of an event given information about tests.

  • “Events” are different from “tests.” For example, there is a test for liver disease, but that’s separate from the event of actually having liver disease.
  • Tests are flawed: just because you have a positive test does not mean you actually have the disease; many tests have a high false-positive rate. When an event is rare, a positive test is more likely to be a false alarm, because the false positives outnumber the few true positives. And we’re not just talking about medical tests here; spam filtering, for example, can have high false-positive rates. Bayes’ theorem takes the test result and calculates the real probability that the test has identified the event.

Bayes’ theorem (also known as Bayes’ rule) is a deceptively simple formula used to calculate conditional probability. The theorem was named after English mathematician Thomas Bayes (1701-1761). The formal definition for the rule is:

P(A|B) = P(B|A) × P(A) / P(B)

Where A and B are events, and P(B) ≠ 0.

  • P(A|B) is a conditional probability: The probability of event A occurring given that B is true. It is also called the posterior probability of A given B.
  • P(B|A) is also a conditional probability: The probability of event B occurring given that A is true.
  • P(A) and P(B) are the probabilities of observing A and B, respectively, without any given conditions.
  • A and B are different events.

In most cases, you can’t just plug numbers into an equation; you have to figure out what your “tests” and “events” are first. For two events, A and B, Bayes’ theorem lets you work out P(A|B) (the probability that event A happened, given that test B was positive) from P(B|A) (the probability that test B was positive, given that event A happened). This can be a little tricky to wrap your head around because, technically, you’re working backwards; you may have to switch your tests and events around, which can get confusing. An example should clarify what I mean by “switch the tests and events around.”

You might be interested in finding out a patient’s probability of having liver disease if they are an alcoholic. “Being an alcoholic” is the test (kind of like a litmus test) for liver disease.

  • A could mean the event “Patient has liver disease.” Past data tells you that 10% of patients entering your clinic have liver disease. P(A) = 0.10.
  • B could mean the litmus test that “Patient is an alcoholic.” Five per cent of the clinic’s patients are alcoholics. P(B) = 0.05.
  • You might also know that among those patients diagnosed with liver disease, 7% are alcoholics. This is your P(B|A): the probability that a patient is alcoholic, given that they have liver disease, is 7%.

Bayes’ theorem tells us:

P(A|B) = (0.07 * 0.1)/0.05 = 0.14

In other words, if the patient is an alcoholic, their chance of having liver disease is 0.14 (14%). This is a significant increase from the 10% baseline suggested by past data, but it is still unlikely that any particular patient has liver disease.
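The calculation above is easy to reproduce in code:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

p_disease = 0.10             # P(A): 10% of patients have liver disease
p_alcoholic = 0.05           # P(B): 5% of patients are alcoholics
p_alc_given_disease = 0.07   # P(B|A): 7% of liver-disease patients are alcoholics

p_disease_given_alc = bayes(p_alc_given_disease, p_disease, p_alcoholic)
print(round(p_disease_given_alc, 2))  # 0.14
```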

5.  Working of the algorithm

Let’s understand it using an example. Below is a training data set of weather conditions and the corresponding target variable ‘Play’ (indicating whether a game was played). We need to classify whether players will play or not based on the weather conditions. Let’s follow the steps below.

So here we have our data, which comprises the Day, Outlook, Humidity and Wind conditions, with the final column, Play, being what we have to predict.

  • We will create a frequency table using each attribute of the dataset.
  • For each frequency table, we will generate a likelihood table.
  • The likelihood of ‘Yes’ given ‘Sunny’ is:

P(c|x) = P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny) = (0.33 x 0.64) / 0.36  ≈  0.60

  • Similarly, the likelihood of ‘No’ given ‘Sunny’ is:

P(c|x) = P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny) = (0.4 x 0.36) / 0.36  =  0.40

  • In the same way, we need to create the Likelihood Table for other attributes.

Suppose we have a Day with the following values:

  • Outlook   =  Rain 
  • Humidity   =  High
  • Wind  =  Weak
  • Play = ?

So, with these data, we have to predict whether we can play on that day or not.

Likelihood of ‘Yes’ on that day = P(Outlook = Rain|Yes) * P(Humidity = High|Yes) * P(Wind = Weak|Yes) * P(Yes)

=  2/9 * 3/9 * 6/9 * 9/14  =  0.0317

Likelihood of ‘No’ on that day = P(Outlook = Rain|No) * P(Humidity = High|No) * P(Wind = Weak|No) * P(No)

=  2/5 * 4/5 * 2/5 * 5/14  =  0.0457

Now we normalize the values so they sum to one:

P(Yes) =  0.0317 / (0.0317 + 0.0457)  =  0.41

P(No) =  0.0457 / (0.0317 + 0.0457)  =  0.59

So our model predicts a 59% chance that there will be no game on that day.
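The arithmetic can be checked directly in Python by computing the two likelihood scores from the conditional fractions in the steps above and then normalizing them:

```python
# Class-conditional likelihoods and priors, read off the frequency tables:
# P(Rain|Yes) * P(High|Yes) * P(Weak|Yes) * P(Yes), and likewise for 'No'.
score_yes = (2/9) * (3/9) * (6/9) * (9/14)
score_no  = (2/5) * (4/5) * (2/5) * (5/14)

# Normalize so the two posteriors sum to 1.
p_yes = score_yes / (score_yes + score_no)
p_no  = score_no  / (score_yes + score_no)

print(round(score_yes, 4), round(score_no, 4))  # 0.0317 0.0457
print(round(p_yes, 2), round(p_no, 2))          # 0.41 0.59
```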

6.  Advantages and Disadvantages of the Algorithm

  • Advantages
  • Simple
    • Almost no hyperparameters and great usability out of the box.
  • Fast
    • The way Naive Bayes is implemented means fast training and fast predictions.
  • Large Data Friendly
    • The linear time complexity of Naive Bayes means it remains efficient even when the data gets really big.
  • Accurate
    • If you’re sure Naive Bayes is appropriate for your data and tasks at hand, Naive Bayes can be surprisingly accurate for its simplicity and efficiency.
  • Lightweight
    • Naive Bayes doesn’t clog the RAM like random forests or require high computation resources like SVMs or Neural Networks. Even though some of these models can be considered lightweight, we can probably say Naive Bayes is ultra-lightweight.
  • Disadvantages
  • Bias
    • Naive Bayes learns fast and easily, but if your training set is not ideal, NB can be highly biased, and results will be garbage. There aren’t many parameters to fix things either.
  • Applicability
    • Lots of real-world problems have co-dependent features, meaning the features are highly correlated with each other. In these cases, bias and inaccuracy come into the picture. Don’t you wish we could apply Naive Bayes to every problem out there?
  • Not So Open-Minded
    • Suppose you have complex problems with complex, interacting features. Naive Bayes will gloss over those interactions, since it treats every feature independently; it’s kind of a ‘my way or the highway’ algorithm in that sense.

7.  Practical Application

Here are some areas where this algorithm finds applications:

  • Text Classification
    • Most of the time, Naive Bayes finds use in text classification due to its independence assumption and its high performance on multi-class problems. It enjoys a higher rate of success than many other algorithms thanks to its speed and efficiency.
  • Sentiment Analysis
    • One of the most prominent areas of machine learning is sentiment analysis, and this algorithm is quite useful there as well. Sentiment analysis focuses on identifying whether the customers think positively or negatively about a certain topic (product or service).
  • Recommender Systems
    • With the help of collaborative filtering, a Naive Bayes classifier can build a powerful recommender system that predicts whether a user will like a particular product (or resource) or not. Amazon, Netflix, and Flipkart are prominent companies that use recommender systems to suggest products to their customers.
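To tie the applications back to the algorithm, here is a compact from-scratch sketch of a Naive Bayes text classifier for sentiment analysis. The four training sentences are made up purely for illustration; the classifier counts words per class, applies Laplace smoothing, and scores in log space to avoid underflow.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up sentiment training set (purely illustrative).
train = [
    ("i like harry potter", "pos"),
    ("what a great film", "pos"),
    ("i hate this boring film", "neg"),
    ("what a terrible plot", "neg"),
]

# Build the "probability tables": word counts per class and class counts.
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    best, best_lp = None, float("-inf")
    for label in class_counts:
        # log P(label) + sum of log P(word | label), Laplace-smoothed
        lp = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("i like this film"))      # pos
print(predict("boring terrible plot"))  # neg
```

Logs are used because multiplying many small probabilities together quickly underflows to zero; adding their logarithms gives the same ranking of classes.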

Article Credit:-

Name: Vivek Kumar
Designation: Int. MSc. in Mathematics, IIT Roorkee
Research area: Machine Learning
