In Natural Language Processing (NLP), we seek to give machines the ability to understand human language. That’s interesting, isn’t it? But there is a big problem: humans communicate with sentences and words, while machines only understand numbers.
It then becomes important to be able to translate a text written in a human language into the machine language.
In the previous episode, we discussed the pre-processing step, which is essential before the translation can even start.
In this episode, we will focus on transforming the words contained in a text into numbers so that they can be interpreted by the computer.
There are several approaches to this task; we will focus on the simplest ones, One Hot Encoding and Bag Of Words. In future episodes, we will touch on more recent and more effective techniques.
One Hot Encoding
One Hot Encoding refers to the process by which categorical variables are converted into binary vectors of 0s and 1s, in which each vector contains exactly one 1.
When it comes to NLP, why would we need one hot encoding? We raised earlier the problem of translating a text written in human language into machine language. Since the machine only knows binary (0 and 1), it makes sense to use one hot encoding to represent our words.
How exactly does it work? Let us first define in a clear and concise way some terms that will be useful in the future.
Let’s say we have a book written in English. We define the vocabulary V as the set of distinct words contained in the book. The text contained in the book is called a corpus; a corpus is therefore a list of words. A one hot vector, in our context, is a set of 0s and 1s that represents a word of our corpus. Each word wi of our vocabulary is represented by a vector of size N, N being the number of words in our vocabulary V: [0, 0, 0, …, 1, …, 0, 0], where the i-th element is 1 and all the others are 0. Each sentence of our corpus is then represented by the one hot vectors of all of its words.
To explain one hot encoding in a simple way, let’s take a trivial example:
NLP is a branch of artificial intelligence
We have a sentence composed of 7 distinct words, so our vocabulary V includes 7 words: { “NLP”, “is”, “a”, “branch”, “of”, “artificial”, “intelligence” }
To represent each word of our vocabulary, we will have a vector composed of 0 except for the ith element which will be 1.
So NLP will have as one hot encoding vector: [ 1 0 0 0 0 0 0 ]. is will have the vector [ 0 1 0 0 0 0 0 ], and so on. Our sentence can then be represented by the matrix contained in the table below:
word | NLP | is | a | branch | of | artificial | intelligence |
NLP | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
is | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
a | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
branch | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
of | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
artificial | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
intelligence | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
As you can see, it’s quite trivial.
Thus we can represent the sentence as a 7×7 matrix where each row is the one hot encoding vector of a word. The size of a vector for a given word therefore depends on the size of our vocabulary. This is the main disadvantage of this technique: the larger the corpus becomes, the larger the vocabulary is likely to be, and a language generally has several thousand distinct words. We can then quickly end up with matrices of enormous size.
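To make this concrete, here is a minimal Python sketch of the idea applied to our toy sentence (the variable names are mine, chosen for illustration):

```python
# One hot encoding of the toy sentence, step by step.
sentence = "NLP is a branch of artificial intelligence"
words = sentence.lower().split()

# The vocabulary is the list of distinct words, in order of first appearance.
vocabulary = list(dict.fromkeys(words))

# For each word, build a vector full of 0s with a single 1 at the word's index.
for word in words:
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    print(f"{word:<13} {vector}")

# nlp           [1, 0, 0, 0, 0, 0, 0]
# is            [0, 1, 0, 0, 0, 0, 0]
# ...
# intelligence  [0, 0, 0, 0, 0, 0, 1]
```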
The other drawback of one hot encoding is that it does not provide any information about the semantics or the context of a word; its only purpose is to transform a categorical value into a numerical one. There are other techniques that are more recent and better suited to the NLP field.
Bag Of Words
The Bag Of Words model is a simplified representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order, but keeping multiplicity. (Wikipedia)
Just like One Hot Encoding, the Bag of Words model lets us represent a text numerically so that it can be understood by the machine.
In this case, the vector representation of the text describes the occurrences of the words present in an input (a document, a sentence).
The idea behind this approach is very simple as you will see.
With the terms defined in the previous section, let us assume that we have a corpus composed of N distinct words. The size of our vocabulary V is therefore N.
To represent a sentence, we define a vector of fixed length N; each element i of the vector corresponds to a word of our vocabulary.
How do we determine the values of the vector representing a sentence? Each element i takes as its value the number of occurrences of the corresponding word in the sentence.
Let’s take a simple example to understand the concept. We have the following corpus:
Life is short but life can seem long. The night is near.
Our vocabulary V is therefore composed of the following words: { “life”, “is”, “short”, “but”, “can”, “seem”, “long”, “the”, “night”, “near” }. To represent a sentence, we will need a vector of size 10 (the number of words in our vocabulary).
The first sentence will be represented as follows: [ 2 1 1 1 1 1 1 0 0 0 ]
We see from this representation that life is present twice in the sentence, the words is, short, but, can, seem and long are present only once, while the words the, night and near do not appear in the sentence.
The same process is applied to the second sentence to obtain its BOW vector.
The second sentence will be: [ 0 1 0 0 0 0 0 1 1 1 ]
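Here is a minimal sketch of this counting step on the toy corpus (the hard-coded vocabulary and variable names are just for illustration):

```python
# Bag Of Words vectors for the two sentences of the toy corpus.
vocabulary = ["life", "is", "short", "but", "can", "seem", "long", "the", "night", "near"]
sentences = ["Life is short but life can seem long", "The night is near"]

for sentence in sentences:
    tokens = sentence.lower().split()
    # Each element counts how many times the corresponding vocabulary word occurs.
    vector = [tokens.count(word) for word in vocabulary]
    print(vector)

# [2, 1, 1, 1, 1, 1, 1, 0, 0, 0]
# [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
```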
The problem with the BOW method is that it does not provide a way to determine the meaning of the text or to extract the context in which the words appear. The only information it gives us is the occurrence of words in a sentence.
Despite this, BOW remains a simple way to extract features from a text to use as input to Machine Learning algorithms, for document classification for example.
To obtain a good representation with either of the two techniques presented in this episode, do not forget the preliminary pre-processing of the data mentioned in the previous episode.
Implementation
In this part we will write a small Python implementation of the two techniques presented above. As always, if you don’t like code, you can jump to the conclusion :).
Let’s start with One Hot Encoding. We will use the same techniques discussed in the previous episode: take a book, choose a portion of it, and build the OHE of its sentences.
In this implementation we will use the book Siddhartha, downloaded from Project Gutenberg:
The book is too long, so we will just take 5 sentences; they will be our corpus. Next, we pre-process the data and create our vocabulary.
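The exact excerpt is not reproduced here, but the step could look like the following sketch, assuming the five sentences are stored in a single string and using a simple regex tokenizer as a stand-in for the full pre-processing pipeline of the previous episode:

```python
import re

# Assumption: the five sentences chosen from the book live in this string
# (the actual excerpt from Siddhartha is not reproduced here).
corpus_text = "..."

def preprocess(text):
    """Lowercase the text and keep only alphabetic tokens
    (a simplified stand-in for the pre-processing of the previous episode)."""
    return re.findall(r"[a-z]+", text.lower())

# Split the excerpt into sentences, then tokenize each of them.
sentences = [s for s in re.split(r"[.!?]", corpus_text) if s.strip()]
tokens = [preprocess(s) for s in sentences]

# The vocabulary is the list of distinct words, in order of first appearance.
vocabulary = list(dict.fromkeys(word for sent in tokens for word in sent))
print(len(vocabulary))  # the size of the one hot vectors shown below
```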
Now let’s do the one hot encoding of a phrase of this text using our vocabulary:
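One possible implementation, continuing from the previous sketch (one_hot_encode is a helper name I chose, not a standard function):

```python
def one_hot_encode(word, vocabulary):
    """Return the one hot vector of a word given a vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

# One hot encoding of every word of the chosen phrase,
# reusing the tokens and vocabulary built in the previous snippet.
phrase = tokens[0]
one_hot_vectors = [one_hot_encode(word, vocabulary) for word in phrase]
print(one_hot_vectors[0])  # vector of the first word of the phrase
```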
Now we print the OHE vector of the first word of the phrase, which is day, so the 1 of its vector is at index 30 (counting from 0):
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
The same thing with the word siddhartha, which has its 1 at index 10:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
For the implementation of the bag of words, we first create a function that returns a vocabulary from an input corpus. Next, we create another function that takes a sentence and a corpus as input and returns the sentence’s bag of words vector.
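A minimal sketch of these two functions could look like this (the regex tokenizer is again a simplification of the pre-processing from the previous episode):

```python
import re

def build_vocabulary(corpus):
    """Return the list of distinct words found in a corpus (a list of sentences)."""
    vocabulary = []
    for sentence in corpus:
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def bag_of_words(sentence, corpus):
    """Return the BOW vector of a sentence: one count per word of the corpus vocabulary."""
    vocabulary = build_vocabulary(corpus)
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [tokens.count(word) for word in vocabulary]
```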
Finally, we test our bag_of_words function on two sentences and check the results.
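For instance, applied to the two sentences of the toy corpus from the Bag Of Words section, it gives back the vectors we computed by hand:

```python
corpus = ["Life is short but life can seem long.", "The night is near."]

print(bag_of_words(corpus[0], corpus))  # [2, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(bag_of_words(corpus[1], corpus))  # [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
```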
Conclusion
In this episode, we introduced two approaches to translate textual data into a form that the computer can understand. One Hot Encoding and Bag Of Words are two simple techniques, but they can still be useful in the NLP Kingdom.