NLP: Bag of words and TF-IDF explained!
In the previous article, we went through tokenization, stop words, stemming and lemmatization: processing the text while it is still readable. To give this data as input to any model, we need to transform it into some numerical format, i.e. vectors. Let us go through a few ways this can be done.
- Bag of words
- TF-IDF
- Word2Vec
We already know how a document or paragraph of text can be tokenized, breaking it down into a list of sentences or words. We will now take these tokenized units, map them to vectors and send those vectors as input to the model.
Bag of words
In bag of words, we take all the unique words in the corpus, note their frequency of occurrence and sort them in descending order. Each sentence is then represented as a vector over this vocabulary. Let us take it in steps, with an example, to get a clear picture.
Paragraph: “The news mentioned here is fake. Audience do not encourage fake news. Fake news is false or misleading”
Step 1: Tokenize the data, remove stop words and perform stemming or lemmatization.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

# the NLTK 'punkt', 'stopwords' and 'wordnet' resources must be downloaded first (via nltk.download)
paragraph = """The news mentioned here is fake. Audience do not encourage fake news. Fake news is false or misleading"""
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
corpus = []
for i in range(len(sentences)):
    # keep only letters, lowercase the sentence and split it into words
    sent = re.sub('[^a-zA-Z]', ' ', sentences[i])
    sent = sent.lower()
    sent = sent.split()
    # drop stop words and lemmatize what remains
    sent = [lemmatizer.lemmatize(word) for word in sent if word not in set(stopwords.words('english'))]
    sent = ' '.join(sent)
    corpus.append(sent)

print(corpus)
Output:
['news mentioned fake', 'audience encourage fake news', 'fake news false misleading']
Step 2: List all unique words
Unique words: [‘news’, ‘mentioned’, ‘fake’, ‘audience’, ‘encourage’, ‘false’, ‘misleading’]
Step 3: Create a dictionary mapping each word to a number, with the words sorted by frequency of occurrence in descending order.
Step 4: Now, for each sentence, create a vector: assign ‘1’ if the dictionary word is present in the sentence, else assign ‘0’.
The data we had is now transformed as shown below, where the columns follow the dictionary order (fake and news first, since they occur most often):
- news mentioned fake — [1 1 0 0 0 1 0]
- audience encourage fake news — [1 1 1 1 0 0 0]
- fake news false misleading — [1 1 0 0 1 0 1]
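If it helps to see steps 2 to 4 spelled out in code, here is a minimal sketch in plain Python (my own illustration, not part of the original pipeline); note that words sharing the same frequency may come out in a slightly different order than in the table above.
from collections import Counter

corpus = ['news mentioned fake',
          'audience encourage fake news',
          'fake news false misleading']

# Steps 2 and 3: unique words, sorted by frequency of occurrence in descending order
word_counts = Counter(word for sentence in corpus for word in sentence.split())
vocabulary = [word for word, count in word_counts.most_common()]
print(vocabulary)

# Step 4: one binary vector per sentence, 1 if the vocabulary word is present, else 0
for sentence in corpus:
    words = sentence.split()
    vector = [1 if word in words else 0 for word in vocabulary]
    print(sentence, '->', vector)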
These vectors are sent as input to the model as independent features. We can use ‘CountVectorizer’ from scikit-learn to perform steps 2, 3 and 4:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# each row corresponds to a sentence, each column to a word in the vocabulary
independentFeatures = cv.fit_transform(corpus).toarray()
Output:
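CountVectorizer builds its vocabulary in alphabetical order, i.e. [‘audience’, ‘encourage’, ‘fake’, ‘false’, ‘mentioned’, ‘misleading’, ‘news’], so the columns come out in a different order than the hand-built table above, but each row still marks the words present in that sentence:
[[0 0 1 0 1 0 1]
 [1 1 1 0 0 0 1]
 [0 0 1 1 0 1 1]]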
Bag of words is really helpful in prediction problems like language modeling and document classification. Bag of words does have a few shortcomings, though. Mentioning a few of them below:
- Vocabulary: The vocabulary requires careful design to manage its size, which in turn impacts the sparsity of the document representations.
- Meaning: The values here are either 1 or 0. Consider the words ‘news’ and ‘fake’: both get exactly the same representation even though they do not carry the same importance. This makes it difficult to identify how important a word is to a document.
To overcome these shortcomings, TF-IDF can be used.
Term Frequency — Inverse Document Frequency (TF-IDF)
We calculate the term frequency and inverse document frequency for every word in the corpus, and multiplying TF by IDF gives the document vectors. So, how do we calculate TF-IDF?
Term Frequency (TF): (No. of times the word appears in the sentence) / (Total no. of words in the sentence)
Inverse Document Frequency (IDF): log[ (No. of sentences) / (No. of sentences containing the word) ]
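Written as code, the two formulas look roughly like this (a minimal sketch; the helper names term_frequency and inverse_document_frequency are my own, and the natural log is used, matching the calculations below):
from math import log

def term_frequency(word, sentence):
    # (no. of times the word appears in the sentence) / (no. of words in the sentence)
    words = sentence.split()
    return words.count(word) / len(words)

def inverse_document_frequency(word, sentences):
    # log[(no. of sentences) / (no. of sentences containing the word)]
    containing = sum(1 for sentence in sentences if word in sentence.split())
    return log(len(sentences) / containing)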
Let us take the same example — “The news mentioned here is fake. Audience do not encourage fake news. Fake news is false or misleading”
Step 1: Pass the data through stemming or lemmatization, take all the unique words and sort them by frequency of occurrence. These are steps 1, 2 and 3 that we saw in Bag of Words (BOW).
Step 2: Calculate Term Frequency
Let us calculate TF for sentence 1 (‘news mentioned fake’):
- news — 1 / 3 = 0.33 {‘news’ appears once in the sentence, and the sentence has 3 words in total, giving 1/3}
- mentioned — 1 / 3 = 0.33
- fake — 1 / 3 = 0.33
- audience, encourage, false, misleading — 0 / 3 = 0 {these words do not occur in the sentence, hence zero}
Let us calculate TF for sentence 2 (‘audience encourage fake news’):
- audience — 1 / 4 = 0.25 (‘audience’ appears once in the sentence, and the sentence has 4 words in total, giving 1/4)
- encourage — 1 / 4 = 0.25
- fake — 1 / 4 = 0.25
- news — 1 / 4 = 0.25
- false, mentioned, misleading — 0 / 4 = 0
Similarly, we calculate the term frequency for the remaining sentence.
Step 3: Calculate IDF
Let us calculate IDF for all the words:
- news — log_e(3/3) = 0 {we have 3 sentences, and the word ‘news’ is present in all three, hence log(3/3)}
- mentioned — log_e(3/1) = 1.0986
- fake — log_e(3/3) = 0
- audience — log_e(3/1) = 1.0986
- encourage — log_e(3/1) = 1.0986
- false — log_e(3/1) = 1.0986
- misleading — log_e(3/1) = 1.0986
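Multiplying each sentence’s TF value by the corresponding IDF value gives the document vectors of step 4 below. Here is a minimal end-to-end sketch of that calculation in plain Python (my own illustration, following the formulas stated above rather than scikit-learn’s variant):
from math import log

corpus = ['news mentioned fake',
          'audience encourage fake news',
          'fake news false misleading']
vocabulary = ['fake', 'news', 'audience', 'encourage', 'false', 'mentioned', 'misleading']

# IDF per word: log[(no. of sentences) / (no. of sentences containing the word)]
idf = {word: log(len(corpus) / sum(1 for sentence in corpus if word in sentence.split()))
       for word in vocabulary}

# document vector per sentence: TF(word, sentence) * IDF(word) for every vocabulary word
for sentence in corpus:
    words = sentence.split()
    vector = [round(words.count(word) / len(words) * idf[word], 4) for word in vocabulary]
    print(sentence, '->', vector)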
Step 4: Calculate the document vectors by multiplying the TF and IDF values, exactly as the sketch above does. Steps 2, 3 and 4 can also be achieved through TfidfVectorizer from scikit-learn:
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
independentFeatures_tfIDF = tfidf.fit_transform(corpus).toarray()
Output:
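Note that the numbers from TfidfVectorizer will not match the hand calculation exactly: by default scikit-learn uses a smoothed IDF, log[ (1 + no. of sentences) / (1 + no. of sentences containing the word) ] + 1, and then normalizes each row to unit length. With the columns again in alphabetical order (audience, encourage, fake, false, mentioned, misleading, news), the output is approximately:
[[0.     0.     0.4533 0.     0.7675 0.     0.4533]
 [0.6088 0.6088 0.3596 0.     0.     0.     0.3596]
 [0.     0.     0.3596 0.6088 0.     0.6088 0.3596]]
Either way, ‘fake’ and ‘news’ get the lowest weights because they appear in every sentence, while rarer words such as ‘mentioned’ or ‘misleading’ get the highest.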
You can access the code snippet on GitHub and give it a try. Contrary to bag of words, the vectors here have different values, giving more importance to the more informative words. Though this model solves the issues observed with BOW, it has shortcomings of its own, such as:
- TF-IDF does not capture position in text, semantics, co-occurrences
- TF-IDF computes document similarity directly in the word-count space, making it slow for large documents
Bag of words or TF-IDF features can be used as inputs to a Naive Bayes model to classify spam and ham. The upcoming blogs will be on the classification of spam and ham, and on word2vec. Happy learning :)