The main objective of NLP (Natural Language Processing) is to give computers the ability to understand, process and analyze texts written in human languages.
The way computers understand language is quite different from ours. A machine knows neither French, English, nor Wolof; it only understands binary (the digits 1 and 0).
To “give” a computer the ability to “understand” texts written in human languages, it is first necessary to “translate” those texts into “machine language“.
Before starting this translation step, however, it is necessary to “pre-process” this textual data.
Why this pre-processing?
Indeed, when we write texts (like this one), we rely on various elements to specify the different things that are happening, but also to relay other information, which may not always be useful to the machine.
For example, a “.” to specify the end of a sentence, or a capital letter to mark the beginning of another. Conjugating a verb in a specific tense to express the time of the event, using articles for gender and/or plurality, etc. You get the idea: we use a lot of details to make ourselves more understandable to the reader.
However, in order to extract information from a text, a computer does not (always) need all these details. To the machine, they are often just noise, and they can make it harder to really grasp the meaning of a text.
So, to simplify things for the machine, there are a few pre-processing steps that are essential, depending on the objective.
In this episode, we will be exploring the realm of text pre-processing, by explaining and implementing its different concepts with Python code. Without further ado, let’s begin the journey.
Before starting, it is important to specify that there is no consensus on what should be done in this pre-processing step. It usually depends on the task at hand. Some techniques can be useful, for example, when classifying text but problematic when doing sentiment analysis.
Lowercase, accents, special characters
In this part we will cover the basics of text pre-processing: lowercasing words, removing accents, and removing special characters.
- Capital letters are usually pointless and can create misunderstandings for the computer. Let’s say, for example, that a text contains the word “sunshine” written in two different ways: SunShine and sunshine. The machine can and will interpret these as two distinct words, because computers are case-sensitive (meaning they make a difference between A and a). Therefore, it is often preferable to have all your words in lowercase.
- Accents can also cause confusion. If we take a similar example, création and creation represent two different words for the computer.
- Some languages like English have contractions. Contractions, which are sometimes called ‘short forms’, commonly combine a pronoun or noun and a verb, or a verb and not, in a shorter form (source). They can become problematic: for example, the negation of a verb can be interpreted as a completely different word. To avoid this kind of problem, it is better to expand them. There are several ways to handle contractions; the most common one is to define your own dictionary and map each contraction to its expanded form, for example “haven’t” is mapped to “have not”.
- Special characters are often superfluous; even if in some cases they may be needed (e.g. for sentiment analysis), it is better to remove them when they are not.
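The steps above can be sketched in Python using only the standard library; the contraction dictionary below is a tiny illustrative sample (a real one would map many more forms):

```python
import re
import unicodedata

# A tiny sample contraction map -- a real dictionary would cover many more forms.
CONTRACTIONS = {"haven't": "have not", "don't": "do not", "it's": "it is"}

def normalize(text):
    # 1. Lowercase so that "SunShine" and "sunshine" become the same token.
    text = text.lower()
    # 2. Strip accents: "création" -> "creation".
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    # 3. Expand contractions using the dictionary.
    for short, long in CONTRACTIONS.items():
        text = text.replace(short, long)
    # 4. Drop special characters, keeping only letters, digits and whitespace.
    return re.sub(r'[^a-z0-9\s]', '', text)

print(normalize("I haven't seen the SunShine since its création!"))
# -> i have not seen the sunshine since its creation
```

Note that the order matters: lowercasing comes first so that the contraction lookup is case-insensitive, and contractions are expanded before the apostrophes are stripped away.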
Stop words
In computing, stop words are words that are filtered out before or after natural language processing (NLP). Although the term generally refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools. (Wikipedia)
Indeed, depending on the NLP tasks, some words and expressions are useless in the context of the work to be done.
Suppose we wanted to do a simple similarity test between documents written in English. To do this, we have the idea of counting the 15 most frequent words for each document. If two documents have more than 7 words in common in their most frequent words, we will assume that they are similar, otherwise they aren’t.
This is a fairly simple and trivial process; real document similarity requires much more than these steps, but let’s keep it simple for the example.
If we use texts in their raw form without removing stop words, there is a risk that we will conclude that all the texts are similar. Why?
It’s simple: ten words, listed in order of frequency, comprise around 25% of the recorded English language, according to an ambitious project at Oxford University.
Going further, the top 100 words comprise about 50% of our language, while 50,000 words comprise 95% of our language. (source)
You can now see the usefulness, in this case, of removing stop words that could distort similarity.
There are lists (not very extensive) of stop words in English in libraries (such as NLTK for Python) or on the Internet. As they differ according to the objectives and languages, it is preferable to use these lists as a starting point and to add or delete words depending on your context.
Stemming
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. (Wikipedia)
In simpler terms, stemming is the process of reducing a word to its ‘root’. For example: walk, walks, walked and walking may all be reduced to walk, and they will then share the same meaning in the text.
Search engines use it when you make a request, to show more results and/or correct mistakes on your query (query expansion).
Let’s say you lived in a cave for the past decades, like one of my friends, and now you want to watch the Star Wars series (there are so many of them) and, like me, you are not very good at English. You search: “what ordering watching stars war”.
As you can see, the first answer is the one I was looking for, even though I didn’t enter a grammatically correct query.
Search engines use different techniques to “expand” and make your query better and one of them is Stemming.
If you want to read more on it, I found this article from 2003 stating that Google had started using stemming.
Now that you get what stemming essentially is and what it is used for (I hope!), let’s dive into more details.
There are different algorithms that implement stemming: the Lovins stemmer, the Porter stemmer, the Paice stemmer, Snowball, etc. They each have their own way of retrieving the stem of a word.
For the Porter stemmer, for example, these are the rules of the first step (Step 1a) of the algorithm:
- SSES → SS (caresses → caress)
- IES → I (ponies → poni)
- SS → SS (caress → caress)
- S → (removed) (cats → cat)
As you can see, there are many rules, and it would take too much time to cover them all in this article. If you want to read more about this particular stemmer, you can read the original paper, written in 1980, here.
Let’s take a concrete example here :
They walked through the rainy dark like gaunt ghosts, and Garraty didn’t like to look at them. They were the walking dead.
This is an extract from “The Long Walk”, passing it through Porter Stemmer would give the following results :
they walk through the raini dark like gaunt ghosts, and garrati didn’t like to look at them. they were the walk dead.
By reading the results after stemming, you can clearly see that some of the reduced words don’t really exist: ‘raini’, ‘garrati’ (which is a name, so it doesn’t really matter). As to why this can happen, this is the best answer that I could find online:
It is often taken to be a crude error that a stemming algorithm does not leave a real word after removing the stem. But the purpose of stemming is to bring variant forms of a word together, not to map a word onto its ‘paradigm’ form. Source
In short, stemming is used to group together, in a “raw” way, several words sharing the same meaning by removing marks of gender, conjugation, etc.
The algorithms are not perfect, however. They can work in some cases and, in others, group words that do not share the same meaning.
PS: Stemming is not a concept applicable to all languages. It is not, for example, applicable to Chinese. But in languages of the Indo-European group, a common pattern of word structure does emerge. Assuming words are written left to right, the stem, or root, of a word is on the left, and zero or more suffixes may be added on the right. (Source)
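The Long Walk example above can be reproduced with NLTK’s implementation of the Porter stemmer:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# stem() lowercases its input by default, which is why "Garraty" comes out lowercase.
words = ['walk', 'walks', 'walked', 'walking', 'rainy', 'Garraty']
print([ps.stem(w) for w in words])
# -> ['walk', 'walk', 'walk', 'walk', 'raini', 'garrati']
```

All four forms of “walk” collapse to the same stem, while “rainy” and “Garraty” illustrate the non-words a stemmer can produce.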
Lemmatization
In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. (Wikipedia)
In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’ , ‘walker’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word.
Here the main purpose is to group different words of a text that share the same “meaning” into one word, the lemma, without sometimes creating “new” words as in stemming.
However, lemmatizers are expensive and much more difficult to create, because you need a dictionary that contains most of the words in your language.
For example the phrase :
They walked through the rainy dark like gaunt ghosts, and Garraty didn’t like to look at them. They were the walking dead.
After lemmatization we will have :
They walk through the rainy dark like gaunt ghosts, and Garraty didn’t like to look at them. They be the walk dead.
As you can see, in the case of lemmatization, all of the words here can be found in a dictionary.
Search engines can also use lemmatization instead of stemming; in some cases it provides more accurate results but is more difficult to implement.
Tokenization
Processing a big chunk of text at once is not usually the best way to go. As we always say, “divide and conquer”. The same concept applies to NLP tasks too. When we have a text, we separate it into different tokens; most of the time, each token represents a word. This makes it easier to process the text and filter out useless tokens (like special characters and stop words).
To define it simply, tokenization is the process of dividing a text into a list of tokens (usually words).
N-grams
An n-gram is a contiguous sequence of n items from a given sample of text or speech. (Wikipedia)
So n-grams are sequences of words formed from a text. Here the N describes the number of words combined together.
If you had the phrase :
It is over Anakin! I have the high ground!
A 1-gram of this phrase is simply the list of its individual words, which is basically the tokenization of the phrase. So tokenization can be seen as a special case of n-grams where N = 1.
A 2-gram, also called a bigram, of the same phrase pairs consecutive words: “It is”, “is over”, “over Anakin”, and so on.
N-grams can be used to find out which sequences of words are most common (language modeling), like this website that calculates the most common trigrams starting with the word “Like” in the Corpus of Contemporary American English.
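NLTK provides an `ngrams` helper that slides a window of size N over a token list; here is a quick sketch on the phrase above (punctuation left out for readability):

```python
from nltk import ngrams

tokens = "It is over Anakin I have the high ground".split()
bigrams = list(ngrams(tokens, 2))
print(bigrams[:3])
# -> [('It', 'is'), ('is', 'over'), ('over', 'Anakin')]
```

A list of n tokens always yields n − 1 bigrams, since each bigram starts one token later than the previous one.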
POS (Part Of Speech) tagging
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. (Wikipedia)
So POS tagging is essentially identifying the grammatical nature of each word in a text.
It is very useful, especially in lemmatization, because you need to know what kind of word you have before trying to lemmatize it. For example, nouns and verbs are lemmatized differently because they express plurality or gender in different ways.
Let’s take our same example sentence; after passing it through a Python POS tagger, we get:
It is also useful in translation tasks. Let’s take this simple example that I found online:
I fish a fish.
Translated in french you have :
Je pêche un poisson.
So here, fish has a different meaning when it is employed as a verb and as a noun. It is then necessary to have a tool that differentiates the two.
It would be very tedious to do this task on a very long text, though. Fortunately, there is a tool that can do it for you in Python, in the NLTK library.
You can also use a machine learning approach to train your own POS tagging model, but we won’t get into that here.
Now that you know the essentials of text pre-processing, let’s do a quick recap and implement them with some Python code.
Implementation in Python
We will now try to implement some of these techniques in the Python language.
To do this, let’s try to do a simple exercise: take a book and study its most important words to briefly understand what the book is about.
If you don’t like code, you can skip this part and go to the conclusion 🙂
Let’s first download the Frankenstein book from the Project Gutenberg library.

```python
import requests

frankenstein_response = requests.get('https://www.gutenberg.org/files/84/84-0.txt')
frankenstein_data = frankenstein_response.text
# Keep only the body of the book, between the "***" START and END markers.
frankenstein_data = frankenstein_data.split("***")[2]
```
We then define a function to perform some of the pre-processing tasks:
```python
import re
import unicodedata

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

# It's for the POS tagging: map Penn Treebank tags to WordNet POS constants
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def process_data(data):
    # Put all the words in the book in lowercase
    data = data.lower()
    # Remove all the accents if they exist
    data = unicodedata.normalize('NFKD', data).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # Only keep letters, digits and whitespace; remove all the special characters
    pattern = r'[^a-zA-Z0-9\s]'
    data = re.sub(pattern, '', data)
    # Remove all the stop words like 'i', 'the', ... and tokenize the text
    stop_words = set(stopwords.words('english'))
    word_tokens = nltk.word_tokenize(data)
    words = [w for w in word_tokens if w not in stop_words]
    # Create a stemmer and stem our words
    ps = nltk.porter.PorterStemmer()
    text_stems = [ps.stem(word) for word in words]
    # Create a lemmatizer, POS-tag our words and lemmatize them
    wordnet_lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(words)
    text_lemms = [wordnet_lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
    return (text_stems, text_lemms)
```
Then we count the most frequent words in the text, first for the text passed through a stemmer:

```python
from collections import Counter

# Now let's count the most common words of Frankenstein with the stems
text_stems, text_lems = process_data(frankenstein_data)
count = Counter(text_stems)
print('Most common words of Frankenstein with the stems:')
for word in count.most_common(15):
    print(word)
```
For the words passed through a Lemmatizer:
```python
# Now let's count the most common words of Frankenstein with the lemmas
count = Counter(text_lems)
print('Most common words of Frankenstein with the lemmas:')
for word in count.most_common(15):
    print(word)
```
You can see that we have almost the same list: some word counts have decreased, and some new words have appeared in the top 15.
By looking at these words, we can see that the most common word in Frankenstein is “one”; I don’t really know why, maybe it’s just how the book was written. If we look through the list, we can see other words that kind of describe what the book is about: words like “feel”, “man”, “father”, “friend”, “love”, “live”. The book is about a “man” creating a monster that wants to “feel” “human” and “love”, and who hates his creator, whom he calls “father”. It’s a bit of a stretch, but if you read the book you can see why most of these words are here.
Just for fun, let’s count the most frequent bigrams:
```python
from nltk import ngrams

ngram_counts = Counter(ngrams(text_lems, 2))
print('10 most common bigrams of Frankenstein:')
for bigram in ngram_counts.most_common(10):
    print(bigram)
```
We see that the most common bigrams are “old man” and “take place”; those words are pretty common together in real life too.
“Native country” is also very common: in the book, Frankenstein is away from home, so it’s pretty plausible.
The book consists of letters, so we have “dear victor” too. Well, you see, it’s pretty fun to play with.
Conclusion
We reviewed the main methods of text data pre-processing, which are used to facilitate the translation of a text written in human language into machine language.
The biggest takeaway from the research for this article is that most of these techniques exist only for the major languages and are really polished only for English. For less-resourced languages, such as Wolof, it becomes essential to implement all these techniques in order to effectively process texts written in those languages.
To read the next episode “From Words To Numbers”