
Text Classification using Naive Bayes Classifier

The Naive Bayes algorithm is commonly used for text classification. In this blog, we will take a deep dive into the concepts behind the Naive Bayes Classifier (NBC) and its implementation.


A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The crux of the classifier is based on Bayes' theorem. [1]


The Bayesian approach considers both prior belief and evidence, whereas the frequentist method considers only the evidence. The frequentist approach assigns probabilities to data, not to hypotheses, whereas the Bayesian approach assigns probabilities to hypotheses. [2]


Naive Assumption:


The word "Naive" in the NBC, comes from the fact that - it makes a naive assumption of conditional independence of the events. For example, if A and B are two independent events, they will not have any link with each other. This Naive assumption plays a very important role in NBC.


Bayes' Theorem:


Bayes' theorem computes the conditional probability of a label given the features, using the joint probability of the features and that label together with the marginal probability of the features across all possible labels. Naive Bayes makes use of Bayes' theorem to compute this conditional probability. [3]



Bayes' theorem (the handwritten formula as taught in the class):

P(class | data) = P(data | class) × P(class) / P(data)

i.e. posterior = ( likelihood × prior ) / evidence


Applications of NBC:


  • Social Media text classification - sentiment analysis

  • Spam mail detector

  • It is widely used for multi-class predictions.


Now, let's go through the implementation of the Naive Bayes Classifier.



1 - Import statements:


Below are the imports I've used in my code. Along with numpy and pandas, I'm also using gensim and nltk.


gensim and re - for effective pre-processing of the text


nltk - to fetch the English stop words (the most commonly used words in the language)

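A minimal sketch of these imports:

    import re

    import numpy as np
    import pandas as pd
    import gensim
    import nltk
    from nltk.corpus import stopwords

    # nltk.download('stopwords')   # only needed once
    stop_words = set(stopwords.words('english'))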

2 - Reading the data file


Note: In this assignment, I'm considering only 8,000 rows of the dataframe out of the entire dataset, as my notebook crashed on very large data. So here we will be dealing with 8k rows.

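A minimal sketch of this step (the file name 'data.csv' is a placeholder for the actual dataset file):

    # Read the dataset (placeholder file name)
    df = pd.read_csv('data.csv')

    # Keep only the first 8,000 rows to avoid memory issues
    df = df.head(8000)
    print(df.shape)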

3 - Dividing the data into 3 parts - train, dev and test


  • train - 80 %

  • dev - 10 %

  • test - 10 %

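A minimal sketch of the 80/10/10 split (a simple positional split is assumed here):

    # 80% train, 10% dev, 10% test
    n = len(df)
    train_data = df.iloc[:int(0.8 * n)].reset_index(drop=True)
    dev_data = df.iloc[int(0.8 * n):int(0.9 * n)].reset_index(drop=True)
    test_data = df.iloc[int(0.9 * n):].reset_index(drop=True)

    print(len(train_data), len(dev_data), len(test_data))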

4 - Displaying the train data



5 - Defining the functions to preprocess the train data


  • I've written two functions with slight variations to pre-process the train_data.

  • The first function, preprocess, processes each row in the data individually and returns a list of words for that row.

  • I'm using regular expressions to remove special characters, digits, extra spaces, and leading and trailing spaces from the sentence.

  • Also, for easier processing, I'm using gensim.utils.simple_preprocess to convert the sentence into a list of words.

  • As a last step, I'm excluding all the stop words from that sentence's list of words (a minimal sketch of this function is shown below).


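A minimal sketch of the preprocess function described above (the exact regular expressions may differ):

    def preprocess(sentence):
        # Remove special characters and digits, then collapse extra spaces
        # and strip leading/trailing whitespace
        sentence = re.sub(r'[^a-zA-Z\s]', ' ', str(sentence))
        sentence = re.sub(r'\s+', ' ', sentence).strip()
        # Tokenize and lowercase the sentence into a list of words
        words = gensim.utils.simple_preprocess(sentence)
        # Exclude English stop words
        return [word for word in words if word not in stop_words]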

6 - Building Vocabulary dictionary

  • In this step, I am building the vocabulary for the training data ( {word : occurrence} ).

  • Words whose occurrence is less than 5 will be omitted from the vocab.

  • To omit the low-occurrence words, I first build a separate dictionary of rare words.

  • Then I compare the vocab against the rare words and remove every rare_word from the vocab (see the sketch below).


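A minimal sketch of this step (the column name 'text' is an assumed placeholder):

    # Preprocess every training row into a list of words
    train_docs = [preprocess(s) for s in train_data['text']]

    # Build the {word: occurrence} vocabulary
    vocab = {}
    for words in train_docs:
        for word in words:
            vocab[word] = vocab.get(word, 0) + 1

    # Separate dictionary of rare words (occurrence < 5)
    rare_words = {w: c for w, c in vocab.items() if c < 5}

    # Remove every rare word from the vocabulary
    for w in rare_words:
        del vocab[w]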

  • Printing the initial vocabulary, the rare_words, and then the processed vocabulary (excluding rare words)


7 - Fetching the unique class labels in the data



8 - Dividing the training_data into subsets, each containing only one unique class label


  • Based on the unique class labels obtained in the above step, divide the data into subsets. Since we have 6 unique class labels here, I'm dividing the data into 6 different subsets, one for each class type (see the sketch after these examples).


  • Printing a sample subset of responsibility df


  • Data which contains only the class type - "Experience"


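A minimal sketch of this step (the column name 'label' is an assumed placeholder):

    # Unique class labels from the training data (step 7)
    class_labels = train_data['label'].unique()

    # One sub-dataframe per class label, e.g. subsets['Experience']
    subsets = {c: train_data[train_data['label'] == c] for c in class_labels}

    print(subsets['Responsibility'].head())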

9 - Creating the processed list of rows for each subset data that was created in the previous step.



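A minimal sketch, reusing the preprocess function and the per-class subsets from the previous step (the 'text' column name is again a placeholder):

    # For each class subset, preprocess every row into a list of words
    class_docs = {c: [preprocess(s) for s in subsets[c]['text']]
                  for c in class_labels}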

  • Printing the processed documents in each class subset



10 - Calculating the probability of occurrence of each word present in the processed vocabulary dictionary


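A minimal sketch of this computation, using the vocabulary counts from step 6:

    # P(word) = occurrences of the word / total occurrences of all vocabulary words
    total_count = sum(vocab.values())
    word_prob = {word: count / total_count for word, count in vocab.items()}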

  • Displaying the calculated occurrence probability for each word


11 - Calculating the conditional probabilities of each word w.r.t. each unique class type


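A minimal sketch of the per-class conditional probabilities, using the class_docs built in step 9 (variable names are assumptions):

    # P(word | class) for every vocabulary word and every class
    cond_prob = {}
    for c in class_labels:
        # All vocabulary words appearing in documents of this class
        class_words = [w for doc in class_docs[c] for w in doc if w in vocab]
        class_total = len(class_words)
        counts = {}
        for w in class_words:
            counts[w] = counts.get(w, 0) + 1
        cond_prob[c] = {w: counts.get(w, 0) / class_total for w in vocab}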

  • P( {word} | Responsibility )

  • P( {word} | Education )

  • P( {word} | Requirement )

  • P( {word} | Skill )

  • P( {word} | Soft Skill )

  • P( {word} | Experience )

12 - Defining the Naive Bayes function to compute the probability of each class.

  • This function will compute the probability of each class using Bayes' theorem.

  • Once the probability of each class is calculated, it will be stored in a dictionary.

  • In the end, the class type with the maximum probability value will be the predicted class type for the given sentence (a sketch is shown below).


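A minimal sketch of this function; the class priors are assumed to be estimated from the class subset sizes, and words not seen with a class are simply skipped here (Laplace smoothing comes later):

    # Class priors P(class), estimated from the subset sizes
    class_priors = {c: len(subsets[c]) / len(train_data) for c in class_labels}

    def naive_bayes(words, class_priors, cond_prob):
        # Score each class as P(class) * product of P(word | class).
        # The evidence P(words) is the same for every class, so it can be
        # ignored when picking the maximum.
        scores = {}
        for c in class_priors:
            prob = class_priors[c]
            for w in words:
                if w in cond_prob[c] and cond_prob[c][w] > 0:
                    prob *= cond_prob[c][w]
            scores[c] = prob
        # The class type with the maximum probability is the prediction
        return max(scores, key=scores.get)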

13 - Now, use the dev data to calculate the initial accuracy



  • Preprocess the dev data:



14 - Defining a function to predict the class type of each sentence.


  • This function predicts the class type of each sentence (row) in the given dev data.

  • For each row in the data, it applies Bayes' theorem to compute the probability of each unique class.

  • Once the Naive Bayes function (defined above) returns the predicted class type, it is appended to the list of predicted labels.

  • Detailed comments on how the code works are provided in the sketch below.

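A minimal sketch of the prediction function (the 'text' column name is a placeholder):

    def predict_labels(data, class_priors, cond_prob):
        predicted_labels = []
        for sentence in data['text']:
            # Preprocess the row into a list of words
            words = preprocess(sentence)
            # Apply the Naive Bayes function and collect the predicted class
            predicted_labels.append(naive_bayes(words, class_priors, cond_prob))
        return predicted_labels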

15 - Making a function call to predict the class type of each document in dev data and printing the results



16 - Defining a function to calculate the accuracy

  • Calculating the initial accuracy on the dev data (a sketch of the accuracy function is shown below)

  • The initial accuracy obtained is 59.62 %


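A minimal sketch of the accuracy function:

    def accuracy(predicted_labels, true_labels):
        # Percentage of predictions that match the true labels
        correct = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
        return 100.0 * correct / len(true_labels)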

17 - Laplace Smoothing


  • After obtaining the initial accuracy on dev data, in this step I implemented Laplace Smoothing to check how the accuracy would change


  • Before diving into the code, let's see what Laplace smoothing is and why we need it. [5]

  • Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naïve Bayes.

  • If the word is absent in the training dataset, then we don’t have its likelihood.

  • When computing probability, since we multiply all the likelihoods, the probability will become zero. This is the zero probability problem [5]


  • Conditional probabilities for a word w.r.t. a class type are calculated using the Laplace technique below (image source: [5]):

    P(w | positive) = ( number of reviews containing w with y = positive + α ) / ( N + α·K )

  • Here, α (alpha) represents the smoothing parameter, K represents the number of dimensions (features) in the data, and N represents the number of reviews with y = positive.


  • Defining a function to implement Laplace smoothing:

  • In the code below, I'm calculating the conditional probabilities using the Laplace technique mentioned above.

  • The smoothing_param is given as an input; the accuracy varies with the value of smoothing_param, as we will see in a later section.

  • I've also defined another function that calculates the conditional probability of a word missing from the training dataset, using the same Laplace technique (a sketch of both functions is shown below).


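A minimal sketch of both functions, assuming word_counts holds the per-class word counts, class_total is the total word count for the class (N), and vocab_size is the number of features (K); vocab is the dictionary built in step 6:

    def laplace_cond_prob(word_counts, class_total, vocab_size, smoothing_param):
        # Laplace-smoothed P(word | class) for every vocabulary word
        return {w: (word_counts.get(w, 0) + smoothing_param) /
                   (class_total + smoothing_param * vocab_size)
                for w in vocab}

    def laplace_missing_word_prob(class_total, vocab_size, smoothing_param):
        # Probability assigned to a word never seen with this class in training
        return smoothing_param / (class_total + smoothing_param * vocab_size)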


18 - Defining a function to calculate conditional probability using the given smoothing_param



19 - Displaying the conditional probabilities after performing Laplace smoothing


  • Conditional probability of each word w.r.t. the Experience class type


20 - Defining a function to predict the class type using Laplace smoothing technique


  • This function is similar to the earlier prediction function, but the difference is that it uses the Laplace-smoothed probabilities.




21 - Making a function call to predict the class type using the Laplace smoothing technique, and displaying the results



22 - Calculating the accuracy for dev data - using Laplace smoothing

  • smoothing_param = 1

  • Accuracy = 64.62 %

  • As we observe here, the accuracy increased after applying Laplace smoothing.



23 - Experimenting with different smoothing parameters


  • In this step, I am computing a set of accuracies by experimenting with different smoothing parameters.


  • Defining a function to experiment with different values of the smoothing parameter (a sketch is shown below)

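A minimal sketch of the experiment loop; class_counts, class_totals, class_priors and dev_labels are assumed to come from the earlier steps:

    def experiment_smoothing(params, dev_data, dev_labels):
        for alpha in params:
            # Rebuild the smoothed conditional probabilities for this alpha
            smoothed = {c: laplace_cond_prob(class_counts[c], class_totals[c],
                                             len(vocab), alpha)
                        for c in class_labels}
            preds = predict_labels(dev_data, class_priors, smoothed)
            print(f"smoothing_param = {alpha}, "
                  f"accuracy = {accuracy(preds, dev_labels):.2f} %")

    experiment_smoothing([1, 100, 200], dev_data, dev_labels)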


  • I experimented with smoothing_param = 1, 100 and 200.

  • I observed that as I increased the value of smoothing_param, the accuracy decreased.

  • Experiment results:

  • smoothing param = 1, accuracy 64.62 %

  • smoothing param = 100, accuracy 57.25 %

  • smoothing param = 200, accuracy 55.12 %

  • I got the highest accuracy for smoothing_param = 1.




24 - Calculate the final accuracy


  • As the final step, calculating the final accuracy with the test data




  • Calculating the final accuracy with test data

  • For the test data, I obtained an accuracy of 63.75%.



Please refer to the link below, which leads to my detailed implementation of the above-mentioned code: https://github.com/harshithaharshi/datamining/blob/main/Data_Mining_Assignment2.ipynb


CONTRIBUTIONS:


  • Implemented the code to perform Naive Bayes classification from scratch.

  • As part of the experiment, I tried different values of the smoothing parameter (as part of Laplace smoothing) to find the best smoothing parameter to use for the test data.

  • Below are the accuracy results for the dev data from my experiments:

  • smoothing param = 1, accuracy 64.62 %

  • smoothing param = 100, accuracy 57.25 %

  • smoothing param = 200, accuracy 55.12 %

  • As the final result, I obtained 63.75% accuracy on the test data.

  • In the process of this assignment, I was able to relate all the concepts that were taught in class. It was a great experience to implement the Naive Bayes classifier from scratch.

CHALLENGES:

  • The initial struggle was writing the logic for the data preprocessing. I explored various approaches with regular expressions, gensim, and stop-word removal to be able to write the logic to process the data.

  • The next challenge was deciding on the appropriate data structures to store and retrieve the data at each processing stage.

  • Choosing an appropriate data structure is crucial for the efficiency of the code.

  • Implementing Laplace smoothing was a little challenging for me. I read the concepts from the source mentioned in the references below to understand the logic and implement it on my own.


References:















 
 
 
