One Hot Encoding and Bag of Words: Understanding the Differences.
In this article, we look at two popular methods for representing text numerically in NLP.
You will discover how one hot encoding and bag of words can transform text data into a form that machines can use. We will discuss their specificities, advantages and limitations, illustrating everything with concrete Python implementations to support your natural language processing projects.

In natural language processing (NLP), we try to give machines the ability to understand human language. Interesting, isn't it? But a big problem arises: humans communicate with sentences and words, while machines only understand numbers. It therefore becomes important to translate a text written in a human language into a machine-readable form.
In the previous episode, we covered the preprocessing step, which is essential before tackling this translation and moving on to natural language processing (NLP).
In this episode, our goal is to transform the words in a text into numbers so that they can be interpreted by the computer.
Several approaches exist to achieve this; we will focus on the simplest ones, namely one hot encoding and bag of words. In the next episodes, we will cover more recent and more effective techniques.
I. One Hot Encoding
One Hot Encoding refers to the process by which categorical variables are converted into binary vectors (0s and 1s) in which each vector contains a single 1.
In natural language processing (NLP), why would we need one hot encoding?
We raised earlier the problem of translating a text written in human language into machine language. Since the machine only knows binary (0 and 1), it makes sense to use one hot encoding to represent our words.
How exactly does it work? Let’s first define clearly and concisely some terms that will be useful to us later.
Let's say we have a book written in French.
Vocabulary: We define the vocabulary V as the set of distinct words contained in the book.
Corpus: We call corpus the text contained in the book. A corpus is therefore a set of words.
Vector: A vector (one hot encoding) in our context is a sequence of 0s and 1s that represents a word of our corpus. Each word wi of our vocabulary can be represented by a vector of size N, N being the number of words in our vocabulary V, [0,0,0,…,1,…,0,0], such that the i-th element is 1 and all the others are 0. Each sentence in our corpus is then represented by the set of one hot encoding vectors of its words.
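Written more compactly, this is simply the following mapping (just a restatement of the definition above, with N the size of the vocabulary V):

w_i \longmapsto e_i = (0, \dots, 0, 1, 0, \dots, 0) \in \{0,1\}^N, \quad \text{where only the } i\text{-th component equals } 1.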
To explain one hot encoding simply, let's take a trivial example:
Le NLP est une branche de l’intelligence artificielle.
We have a sentence composed of 8 distinct words, so our vocabulary V includes 8 words: { "Le", "NLP", "est", "une", "branche", "de", "intelligence", "artificielle" } (here we consider that "l'" has been changed to "le").
So to represent each word in our vocabulary, we will have a vector composed of 0s, except for the i-th element, which will be 1.
So Le will have the one hot encoding vector [ 1 0 0 0 0 0 0 0 ], NLP will have the vector [ 0 1 0 0 0 0 0 0 ], and so on. Our sentence can then be represented by the matrix contained in the table below:
Le | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
est | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
une | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
branche | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
de | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
le | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
intelligence | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
artificielle | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
As you can see, it's pretty trivial.
We can therefore represent the sentence as a 9×8 matrix in which each row is the one hot encoding vector of a word. The size of a vector for a given word thus depends on the size of our vocabulary. This is the main drawback of the technique: the larger the corpus, the larger the vocabulary is likely to be, and a language generally contains several thousand distinct words. We can then quickly end up with enormous matrices; for example, a document of 1,000 words over a vocabulary of 50,000 words already gives a 1,000 × 50,000 matrix that is almost entirely made of zeros.
The other disadvantage of one hot encoding is that it does not really capture the semantics or even the context of a word; its only purpose is to turn a categorical value into a numerical one. Other techniques, far more recent and better suited to natural language processing (NLP), exist for that.
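To make this concrete, here is a minimal pure-Python sketch that rebuilds the matrix above (it assumes the sentence has already been lowercased, tokenized, and that "l'" has been replaced by "le", as in our example):

tokens = ["le", "nlp", "est", "une", "branche", "de", "le", "intelligence", "artificielle"]
# the vocabulary keeps the 8 distinct words, in order of first appearance
vocab = sorted(set(tokens), key=tokens.index)
# build one one-hot vector per token: a 9 x 8 matrix
one_hot_matrix = []
for word in tokens:
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    one_hot_matrix.append(vector)
for word, vector in zip(tokens, one_hot_matrix):
    print(word, vector)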
II. Bag Of Words
The Bag of Words model is a simplified representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping track of multiplicity (Wikipedia).
Like one hot encoding, bag of words therefore provides a numerical representation of a text so that it can be understood by the machine.
Here, the vector representing the text describes the occurrences of the words present in an input (a document, a sentence).
The idea behind this approach is very simple, you will see.
Using the terms defined in the previous section, let us assume that we have a corpus composed of N distinct words. The size of our vocabulary V is therefore N.
To represent a sentence, we define a vector of fixed length N, where each element i of the vector corresponds to a word of our vocabulary.
How do we determine the values of the vector representing a sentence? Each element i takes as its value the number of occurrences of the corresponding word in the sentence.
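In the same spirit as before, a sentence s is mapped to a vector of counts (again just a restatement of the rule above):

s \longmapsto x(s) = (x_1, \dots, x_N), \quad x_i = \text{number of occurrences of the word } w_i \text{ in } s.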
Let's take a simple example to understand the concept. We have the following corpus:
La vie est courte mais la vie peut paraître longue.
La nuit est proche.
Our vocabulary V is therefore composed of the following words: { "la", "vie", "est", "courte", "mais", "peut", "paraître", "longue", "nuit", "proche" }. To represent a sentence, we will therefore need a vector of size 10 (the number of words in our vocabulary).
The first sentence will be represented as follows: [ 2 2 1 1 1 1 1 1 0 0 ]
This representation shows that la and vie appear twice in the sentence, the other words est, courte, mais, peut, paraître and longue appear only once, while words such as nuit and proche do not appear in the sentence at all.
The same process is applied to the second sentence to obtain its bag of words vector.
The second sentence: [ 1 0 1 0 0 0 0 0 1 1 ]
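As a quick sanity check, here is a small sketch using Python's collections.Counter that reproduces these two vectors (it assumes punctuation has already been removed and words are separated by spaces):

from collections import Counter

vocab = ["la", "vie", "est", "courte", "mais", "peut", "paraître", "longue", "nuit", "proche"]
sentences = ["La vie est courte mais la vie peut paraître longue", "La nuit est proche"]
for sentence in sentences:
    counts = Counter(sentence.lower().split())
    # read off the count of each vocabulary word, in vocabulary order
    print([counts[word] for word in vocab])
# expected output:
# [2, 2, 1, 1, 1, 1, 1, 1, 0, 0]
# [1, 0, 1, 0, 0, 0, 0, 0, 1, 1]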
The problem with this method is that it does not allow us to determine the meaning of the text or to extract the context in which the words appear; it only gives us information about the occurrences of the words in a sentence.
Nevertheless, bag of words remains a handy way to extract features from text and feed them as input to machine learning algorithms, for instance for document classification.
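If scikit-learn is available (it is not used elsewhere in this series, so take this as an optional sketch rather than the method used here), its CountVectorizer builds the same kind of count vectors in a few lines:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["La vie est courte mais la vie peut paraître longue", "La nuit est proche"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# note: CountVectorizer sorts its vocabulary alphabetically, not by order of first appearance
print(vectorizer.get_feature_names_out())
print(X.toarray())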
To get a good representation with the two techniques presented in this episode, do not forget that the data preprocessing covered in the previous episode must be done first.
III. Implementing One Hot Encoding and Bag Of Words in NLP
In this part, we will build a small Python implementation of the two techniques presented above. As always, if you don't like code, you can skip straight to the conclusion :).
Let's start with One Hot Encoding. We will reuse the same techniques as in the previous episode: take a book, choose a part of it and compute the OHE of its sentences.
In this implementation we will use a book by Homer, downloaded from Project Gutenberg:
import requests

# download the book and keep only the body of the text (the part between the Gutenberg markers)
homer_response = requests.get("https://web.archive.org/web/20211128034110/https://www.gutenberg.org/files/52927/52927-0.txt")
homer_data = homer_response.text
homer_data = homer_data.split("***")[2]
The book is far too long, so we will just take 3 sentences and that will be our corpus. We then apply the data preprocessing and create our vocabulary:
# process_data is the preprocessing function from the previous episode; it returns stems and lemmas
text_stems_sid, text_lems_sid = process_data(" ".join(homer_data.split(".")[10:13]))
vocab = list(set(text_stems_sid))
print(" ".join(homer_data.split(".")[10:13]))
Now let's compute the one hot encoding of the last sentence of this text, using our vocabulary:
stems, lems = process_data(homer_data.split(".")[12])
print(homer_data.split(".")[12])
onehot_encoded = list()
for word in stems:
    # start from a vector of zeros, one entry per word in the vocabulary
    letter = [0 for _ in range(len(vocab))]
    print(word, vocab.index(word))
    # put a 1 at the position of the current word in the vocabulary
    letter[vocab.index(word)] = 1
    onehot_encoded.append(letter)
As you might guess, the OHE of the first word of this sentence (minerv) will be composed of 0s, except for the 29th value, which will be 1:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Now let's tackle Bag Of Words. For this implementation, we first create a function that returns a vocabulary from a corpus given as input. Then we create another function that takes a sentence and a corpus as input and returns the bag of words of the sentence.
import numpy as np
from nltk import word_tokenize

corpus = ["La vie est courte mais la vie peut paraître longue", "La nuit est proche"]
# define two sentences of the corpus
phrase_1 = "La vie est courte mais la vie peut paraître longue"
phrase_2 = "La nuit est proche"

# function returning a vocabulary
def vocabulary(corpus):
    voc = []
    for sentence in corpus:
        words = word_tokenize(sentence.lower())
        voc.extend(words)

    # keep only the first occurrence of each word, preserving order
    voc_clean = []
    for w in voc:
        if w not in voc_clean:
            voc_clean.append(w)
    return voc_clean

# function returning a bag of words
def bagofwords(sentence, corpus):
    vocab = vocabulary(corpus)
    sentence_words = word_tokenize(sentence.lower())
    bag_of_words = np.zeros(len(vocab))
    for w_in_sentence in sentence_words:
        for i, w in enumerate(vocab):
            if w == w_in_sentence:
                bag_of_words[i] += 1
    return bag_of_words
After testing the bagofwords function on both sentences, we get back the vectors computed by hand earlier:

print(bagofwords(phrase_1, corpus))
# [2. 2. 1. 1. 1. 1. 1. 1. 0. 0.]
print(bagofwords(phrase_2, corpus))
# [1. 0. 1. 0. 0. 0. 0. 0. 1. 1.]
In this episode, we presented two approaches for translating text data into a form that a computer can understand. One Hot Encoding and Bag Of Words are two simple techniques, but they can still prove useful in the realm of natural language processing (NLP).
Understanding the differences between One Hot Encoding and Bag of Words allows you to choose the technique best suited to your NLP needs.
Do not hesitate to read our article on preprocessing techniques in Natural Language Processing.
To make sure you don't miss our upcoming publications on the subject, click here!