NLP applications such as sentiment analysis, spam detection or question answering require text to be encoded into a form computers can process before any algorithm is applied to it. This is done with encoding methods such as word embeddings. Word embedding algorithms convert words or sentences into numerical vectors. Depending on the model, the embedding vectors carry information about the semantics, the syntax and the context in which a word was used. There are many word embedding techniques out there, and this post aims to introduce three of them. However, before addressing them, I believe it is important to first introduce the concept of transfer learning via pre-trained models.
The huge amount of data necessary for many deep learning applications, together with the long computing time and the expensive resources such training requires, has encouraged researchers and data scientists to make use of pre-trained models. This approach is called transfer learning. It consists of reusing a model that was trained for one task on another, similar task. Let’s say you have successfully trained a model to classify cats and you now need a model to classify tigers. Instead of training the new model from scratch, knowing that cats and tigers have similar characteristics, you can reuse your model trained on cats and apply just a few changes so that it fits your new dataset. In fact, at Baamtu, for one of our NLP projects, we used pre-trained Camembert and Universal Sentence Encoder (USE) models to embed our dataset. While we could have used a pre-trained Word2vec as well, we decided to train a variant of it ourselves.
Word2vec is a popular word embedding model created by Mikolov et al. at Google in 2013. It uses a neural network with a single hidden layer to predict either a target word from its neighbors (context), for a CBOW (continuous bag of words) model, or the neighbors of a target word, for a skip-gram model. The inputs and outputs of word2vec are one-hot vectors over the dataset vocabulary. During training, a window slides across the corpus so that, for each word, the network learns to assign its neighboring words a higher probability than the other words in the vocabulary. The word vectors are the weights of the network’s hidden layer.
Continuous bag of words (CBOW)
Cbow model: Lorenzo Ferrone, Fabio Massimo Zanzotto, Symbolic, Distributed and Distributional Representations for Natural Language Processing in the Era of Deep Learning: a Survey
Skip-gram model : Manish Nayak, an intuitive introduction of word2vec by building a word2vec from scratch.
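To make the sliding window concrete, here is a minimal sketch of how training examples could be built for the two variants. The function names and the window size are my own choices for illustration, not part of any word2vec library, and the real implementation adds details (subsampling, negative sampling) that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Build (target, context) pairs: skip-gram predicts each neighbor from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """Build (context_words, target) examples: CBOW predicts the target from its neighbors."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1)[:4])
print(cbow_pairs(sentence, window=1)[:2])
```

Both functions walk the same window over the sentence; the only difference is which side of the pair the network is asked to predict.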
Word2vec is known as a “context-free” model in the sense that it produces a single vector for a word, no matter the context in which the word is used. In fact, the model only takes word context into account during training; once the embeddings have been produced, they are used as is on the test data (it does not learn from the context of the test data). Besides, since word2vec produces word-level embeddings during training, a new word that was not in the training vocabulary has no embedding. In other words, word2vec works best for in-vocabulary words used in contexts similar to those of the training data.
Due to these limitations, at Baamtu we instead trained FastText, a variant of Word2vec, for word encoding. Like Word2vec, FastText supports skip-gram and CBOW, but in addition to word embeddings it can also produce embeddings for character n-grams. In fact, if n-gram embeddings are turned on, FastText represents each word as a set of character n-grams and encodes them. For example, for n = 3 (trigrams) the word “matter” is represented as <ma, mat, att, tte, ter, er>, where < and > mark the beginning and end of the word. This feature is useful because it enables the model to build embeddings for out-of-vocabulary words from the n-grams they share with known words.
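The n-gram decomposition is easy to reproduce by hand. The sketch below is a simplified illustration, not FastText's actual code: `char_ngrams` is a hypothetical helper, and it uses a single value of n, whereas FastText typically combines several n-gram lengths.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("matter", n=3))
# ['<ma', 'mat', 'att', 'tte', 'ter', 'er>']

# An unseen word still shares most of its n-grams with a known one,
# which is what lets the model build a vector for it.
shared = set(char_ngrams("matter")) & set(char_ngrams("matters"))
print(sorted(shared))
# ['<ma', 'att', 'mat', 'ter', 'tte']
```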
Camembert, a French delicacy!
Camembert is a language model released in 2019 by researchers at Inria and Facebook AI. It is based on the RoBERTa architecture, a variant of BERT. Camembert differs from other BERT-based models in that it was trained on a French corpus. There is not much documentation about Camembert online but, fortunately for us, understanding how BERT works is enough to understand all the other models based on the same architecture. So let us dig into BERT and the ingenuity behind it.
BERT : Bidirectional Encoder Representations from Transformers
BERT is a bidirectional pre-trained language model based on the transformer architecture. I will not attempt to explain what transformers are because there are plenty of amazing blog posts about them. I especially recommend this article by Jay Alammar: http://jalammar.github.io/illustrated-transformer/. However, keep in mind that BERT (in its base version) is made of 12 stacked encoder layers, each with multi-head self-attention. It was pre-trained on unlabeled data with two objectives: masked language modelling and next sentence prediction. The first task consists of predicting the original input from a corrupted input in which a few tokens (15%) have been randomly masked. The second objective is, given two concatenated sentences, to get the network to determine whether the second sentence should come after the first one.
Masked language modelling
Bert masked language model : http://jalammar.github.io/illustrated-bert/.
Of the 15% of tokens randomly selected for masking, 80% are replaced by the [MASK] token, 10% are replaced by a random word from the vocabulary and 10% are left unchanged. Masked language modelling lets BERT train a transformer that learns both the left and the right context of a word.
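The 80/10/10 rule can be sketched in a few lines. This is a simplified illustration under my own assumptions: `mask_tokens` is a hypothetical helper, each position is selected independently with probability 15%, and the toy vocabulary stands in for BERT's real WordPiece vocabulary.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Corrupt `tokens` with the 80/10/10 rule and return (corrupted, labels).

    labels[i] holds the original token at positions the model must predict,
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original token
    return corrupted, labels


tokens = "the cat sat on the mat and looked at the dog".split()
toy_vocab = ["tree", "car", "blue", "run"]
corrupted, labels = mask_tokens(tokens, toy_vocab, seed=0)
print(corrupted)
print(labels)
```

Keeping some selected tokens unchanged or swapping them for random words means the model cannot rely on seeing [MASK] to know which positions it will be asked about.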
Next sentence prediction
Bert Next sentence prediction : http://jalammar.github.io/illustrated-bert/
In trying to learn the relationship between two sentences, the next sentence prediction task encodes a representation of the input in the final hidden state of the classification token [CLS]. However, the sentence embedding in [CLS] is meant to be used for classification only. For other tasks, the embeddings of the other tokens, each of length 768, should be combined into the sentence vector, for example by averaging them. Which layer to take the embeddings from depends on the task; I advise trying different layers and choosing the one that performs best.
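One common way to turn token embeddings into a single sentence vector is to average them while ignoring padding. The sketch below assumes that choice; `mean_pool` is a hypothetical helper, and the toy vectors have dimension 3 instead of 768 so they stay readable.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy hidden states for one layer: [CLS], two word tokens, one padding position.
hidden = [
    [0.9, 0.1, 0.0],  # [CLS]
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [0.0, 0.0, 0.0],  # padding
]
mask = [1, 1, 1, 0]

cls_vector = hidden[0]                                # what a classifier head would use
sentence_vector = mean_pool(hidden[1:], mask[1:])     # pooled over the word tokens
print(sentence_vector)  # [2.0, 3.0, 4.0]
```

The same function can be applied to the hidden states of any layer, which makes it easy to compare layers on a downstream task.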
RoBERTa: Robustly optimized BERT approach
RoBERTa is a BERT model with a different training approach. RoBERTa removes the next sentence prediction (NSP) task and adds dynamic masking, larger mini-batches and a larger byte-pair encoding vocabulary. In BERT the input is masked only once, so the same words are masked in every epoch, whereas with RoBERTa the masked words change from one epoch to another. We should also note that RoBERTa and Camembert do not use the literal [CLS] and [SEP] tokens: their tokenizers mark the start and end of a sequence with <s> and </s> instead.
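The difference between static and dynamic masking comes down to when the masked positions are sampled. Here is a minimal sketch of that idea only; the function names are hypothetical and the sampling is simplified to an independent 15% draw per position.

```python
import random


def sample_mask_positions(num_tokens, mask_prob, rng):
    """Pick the set of positions to mask for one pass over a sequence."""
    return {i for i in range(num_tokens) if rng.random() < mask_prob}


def static_masks(num_tokens, epochs, mask_prob=0.15, seed=0):
    """Static masking: sample once during preprocessing, reuse for every epoch."""
    rng = random.Random(seed)
    positions = sample_mask_positions(num_tokens, mask_prob, rng)
    return [positions for _ in range(epochs)]


def dynamic_masks(num_tokens, epochs, mask_prob=0.15, seed=0):
    """Dynamic masking: sample a fresh set of positions each epoch."""
    rng = random.Random(seed)
    return [sample_mask_positions(num_tokens, mask_prob, rng) for _ in range(epochs)]


for name, masks in [("static", static_masks(20, 3)), ("dynamic", dynamic_masks(20, 3))]:
    print(name, [sorted(m) for m in masks])
```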
Universal sentence encoder (USE)
Developed by Google AI, USE produces a vector representation of a sentence. The authors released two USE architectures: one based on the transformer and the other on a deep averaging network (DAN). According to the authors, the transformer-based USE performs slightly better than the DAN, but it is also more computationally expensive.
Transformer encoder: the model first computes context-aware word embeddings, then obtains the sentence vector by computing the element-wise sum of the word vectors. The encoder input is a tokenized string and the output is a vector of length 512.
Deep averaging network: the input embeddings of words and bi-grams are first averaged together, then passed through a deep neural network to produce the sentence embedding. The encoder input is also a tokenized string and the output is a vector of length 512.
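The DAN idea, averaging first and then applying feed-forward layers, can be sketched with plain Python lists. Everything here is a toy assumption made for illustration: the 2-dimensional vectors, the tanh activation, the identity weights and the helper names are mine, not the actual USE architecture or weights.

```python
import math


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dense_tanh(x, weights, bias):
    """One feed-forward layer: tanh(W x + b)."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def dan_encode(word_vecs, bigram_vecs, layers):
    """Average word and bi-gram embeddings, then pass the result through the layers."""
    h = average(word_vecs + bigram_vecs)
    for weights, bias in layers:
        h = dense_tanh(h, weights, bias)
    return h


word_vecs = [[1.0, 0.0], [0.0, 1.0]]          # toy embeddings for two words
bigram_vecs = [[1.0, 1.0]]                    # toy embedding for their bi-gram
layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]  # a single identity-weight layer
print(dan_encode(word_vecs, bigram_vecs, layers))
```

Because the first step is an average, the cost grows only linearly with sentence length, which is consistent with the DAN being the cheaper of the two encoders.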
Universal sentence encoder : https://tfhub.dev/google/universal-sentence-encoder/1
Camembert and USE are two contextual models: they use transformers and attention (for USE, in its transformer variant) rather than an RNN or LSTM to incorporate sequential information and produce embeddings of sequences of words. Besides, in contrast to word2vec, they generate dynamic embeddings. Let’s say you have a pre-trained Camembert or USE and you want to encode a sentence. Unlike Word2vec, these models won’t just return the embeddings they learned during training; for each token in your sentence, they compute its embedding using the vector representations of neighboring tokens and the network weights learned during training.
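The contrast between a static lookup and a context-dependent embedding can be shown with a toy example. This is a deliberately stripped-down sketch: the vectors are made up, and `contextual_embed` is a hypothetical single-head self-attention with no learned projections and no scaling, far simpler than what Camembert or USE actually compute.

```python
import math

# Made-up static vectors, standing in for a trained word2vec table.
static_table = {
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "money": [0.0, 1.0],
}


def static_embed(tokens):
    """Context-free lookup: a word always maps to the same vector."""
    return [static_table[t] for t in tokens]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextual_embed(tokens):
    """Toy self-attention: each output is a weighted mix of all token vectors."""
    x = static_embed(tokens)
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) for k in x]  # dot-product similarity
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, x)) for d in range(len(q))])
    return out


s1 = ["river", "bank"]
s2 = ["money", "bank"]
print(static_embed(s1)[1] == static_embed(s2)[1])  # True: same vector for "bank"
print(contextual_embed(s1)[1], contextual_embed(s2)[1])  # different vectors for "bank"
```

In the contextual version, the vector for "bank" is pulled toward whichever word sits next to it, which is the behavior the paragraph above describes.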
Camembert and USE also deal better with out-of-vocabulary words, since they tokenize data in such a way that new words can be reconstructed from tokens in the vocabulary. However, they differ in that USE was originally trained on English corpora and also exists in a multilingual version, while Camembert is exclusively for French text. Besides, USE is a sentence encoder while Camembert is a token encoder. We might prefer Camembert for a French dataset since, according to the experiments of Louis Martin et al., it performs better than multilingual models in some cases. However, keep in mind that it might be hard to get a good sentence representation from Camembert.
In this article we introduced Word2vec, Camembert and USE. I hope you now have a better understanding of these models. If you want to learn how to use them for a classification task in Python, stay tuned for the next post.