{"id":3093,"date":"2025-03-06T07:50:51","date_gmt":"2025-03-06T06:50:51","guid":{"rendered":"https:\/\/blog.baamtu.com\/?p=3093"},"modified":"2025-03-07T05:16:48","modified_gmt":"2025-03-07T04:16:48","slug":"techniques-de-pretraitement-en-traitement-du-langage-naturel","status":"publish","type":"post","link":"https:\/\/blog.baamtu.com\/en\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/","title":{"rendered":"Preprocessing techniques in Natural Language Processing"},"content":{"rendered":"<div data-elementor-type=\"wp-post\" data-elementor-id=\"3093\" class=\"elementor elementor-3093\">\n\t\t\t\t<div class=\"elementor-element elementor-element-572b07e e-flex e-con-boxed e-con e-parent\" data-id=\"572b07e\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-98d6576 elementor-widget elementor-widget-text-editor\" data-id=\"98d6576\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<style>\/*! elementor - v3.23.0 - 05-08-2024 *\/\n.elementor-widget-text-editor.elementor-drop-cap-view-stacked .elementor-drop-cap{background-color:#69727d;color:#fff}.elementor-widget-text-editor.elementor-drop-cap-view-framed .elementor-drop-cap{color:#69727d;border:3px solid;background-color:transparent}.elementor-widget-text-editor:not(.elementor-drop-cap-view-default) .elementor-drop-cap{margin-top:8px}.elementor-widget-text-editor:not(.elementor-drop-cap-view-default) .elementor-drop-cap-letter{width:1em;height:1em}.elementor-widget-text-editor .elementor-drop-cap{float:left;text-align:center;line-height:1;font-size:50px}.elementor-widget-text-editor .elementor-drop-cap-letter{display:inline-block}<\/style>\t\t\t\t<p data-start=\"94\" data-end=\"438\"><span style=\"color: #000000;\">Preprocessing techniques in Natural Language Processing (NLP) help prepare and clean raw data, making it usable for analysis models. 
These preliminary steps, often invisible but crucial, are key to transforming vast amounts of unstructured data into actionable information. Whether it's cleaning raw text, normalizing formats, or removing unnecessary noise, preprocessing ensures the optimal performance of NLP models. <\/span><\/p><p data-start=\"94\" data-end=\"438\"><span style=\"color: #000000;\">In this article, we demystify these techniques and their impact on the quality of results. Mastering them allows for optimal performance in your projects.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"1_Techniques_de_pretraitement_en_traitement_du_langage_naturel_NLP_Une_etape_cle_pour_la_performance_des_modeles\"><\/span><span style=\"color: #000000;\"><strong>Preprocessing techniques in Natural Language Processing 
(NLP): A key step for model performance<\/strong><\/span><span class=\"ez-toc-section-end\"><\/span><\/h2><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The main objective of <\/span><b>NLP<\/b><span style=\"font-weight: 400;\"> (Natural Language Processing) is to give computers the ability to understand, process, and analyze texts written in human languages.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The way computers <\/span><b>understand<\/b><span style=\"font-weight: 400;\"> language is quite different from ours. A machine doesn't know French, English, or even Wolof; it only understands the <\/span><b>binary system<\/b><span style=\"font-weight: 400;\"> (the digits 1 and 0).<\/span><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-741fcc8 e-flex e-con-boxed e-con e-parent\" data-id=\"741fcc8\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-34ba5fd elementor-widget elementor-widget-image\" data-id=\"34ba5fd\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<style>\/*! 
elementor - v3.23.0 - 05-08-2024 *\/\n.elementor-widget-image{text-align:center}.elementor-widget-image a{display:inline-block}.elementor-widget-image a img[src$=\".svg\"]{width:48px}.elementor-widget-image img{vertical-align:middle;display:inline-block}<\/style>\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"497\" height=\"280\" src=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg\" class=\"attachment-medium_large size-medium_large wp-image-3240\" alt=\"\" srcset=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg 497w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1-300x169.jpg 300w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1-18x10.jpg 18w\" sizes=\"(max-width: 497px) 100vw, 497px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-3fe0702 e-flex e-con-boxed e-con e-parent\" data-id=\"3fe0702\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-46eb49f elementor-widget elementor-widget-text-editor\" data-id=\"46eb49f\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">To give a computer the ability to <\/span><b>understand<\/b><span style=\"font-weight: 400;\"> texts written in human languages, we must first <\/span><b>translate<\/b><span style=\"font-weight: 400;\"> these texts into <\/span><b>machine language<\/b><span style=\"font-weight: 400;\">.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Before addressing this translation step, it is necessary, 
however, to perform the <\/span><b>preprocessing<\/b><span style=\"font-weight: 400;\"> of this textual data.<\/span><\/span><\/p><h2 data-pm-slice=\"1 1 []\"><span class=\"ez-toc-section\" id=\"2_Pourquoi_le_Pretraitement_est-il_important\"><\/span><span style=\"color: #000000;\">Why is preprocessing important?<\/span><span class=\"ez-toc-section-end\"><\/span><\/h2><p><span style=\"font-weight: 400; color: #000000;\">Indeed, when we write texts (like this one), we use various devices to mark what is happening, but also to convey other information, which is not always useful to the machine.<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">For example, a <\/span><b>\u00ab . \u00bb<\/b><span style=\"font-weight: 400;\"> to mark the end of a sentence, or a capital letter to mark the beginning of another; the conjugation of a verb to specify the time of the event; the use of articles to specify gender; the use of conjunctions; etc.<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">You see what I mean! We use a set of details to make ourselves more understandable to the reader.<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">However, to extract information from a text, a computer doesn't (always) need all these details. 
They generally represent noise for it and can considerably complicate the processing and <\/span><b>understanding<\/b><span style=\"font-weight: 400;\"> of a text.<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">So, to simplify things, there are a few preprocessing steps that are necessary.<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">In this article, we will explore some of them, try to explain them, and implement a few.<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Before we begin, it is important to clarify that there is no consensus regarding what should be done in this preprocessing step. It generally depends on the task at hand. Some techniques may be useful in text classification, for example, but problematic when we want to do sentiment analysis.<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Without further ado, let's start with the simplest preprocessing methods:<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Lowercase, accents, contractions, special characters.<\/span><\/p><h2 data-pm-slice=\"1 1 []\"><span class=\"ez-toc-section\" id=\"3Les_principales_techniques_de_pretraitement_en_NLP\"><\/span><span style=\"color: #000000;\">The main preprocessing techniques in NLP<\/span><span class=\"ez-toc-section-end\"><\/span><\/h2><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The <\/span><b>capital letters<\/b><span style=\"font-weight: 400;\"> are usually unnecessary and can create confusion. For example, let's say that in a text we find the word <\/span><b><i>\"soleil\"<\/i><\/b><span style=\"font-weight: 400;\"> written in two different ways: <\/span><b><i>\"soleil\"<\/i><\/b><span style=\"font-weight: 400;\"> and <\/span><b><i>\"Soleil\"<\/i><\/b><span style=\"font-weight: 400;\">. The machine may think that these are two different words because computers are case sensitive. 
This basically means that A and a are not represented the same way at the computer level (<\/span><a style=\"color: #000000;\" href=\"https:\/\/web.archive.org\/web\/20230528153051\/https:\/\/fr.wikipedia.org\/wiki\/American_Standard_Code_for_Information_Interchange\"><span style=\"font-weight: 400;\">ASCII<\/span><\/a><span style=\"font-weight: 400;\">). In such cases, it is therefore preferable to put all your words in lowercase.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The <\/span><b>accents<\/b><span style=\"font-weight: 400;\"> can also cause trouble. To take a similar example, <\/span><b><i>\"cr\u00e9ation\"<\/i><\/b><span style=\"font-weight: 400;\"> and <\/span><b><i>\"creation\"<\/i><\/b><span style=\"font-weight: 400;\"> represent two different words to the computer.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Some languages (like English) also contain <\/span><b>contractions<\/b><span style=\"font-weight: 400;\">: \u00ab <\/span><b><i>He is<\/i><\/b><span style=\"font-weight: 400;\"> \u00bb, which counts as two words, becomes a single word for the machine once contracted into \u00ab <\/span><b><i>he\u2019s<\/i><\/b><span style=\"font-weight: 400;\"> \u00bb. To remedy this problem, it is often useful to have a dictionary containing the different contractions of a language and their expanded forms. 
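In Python, these first normalization steps (lowercasing, contraction expansion, accent stripping, special-character removal) can be sketched as follows; the contraction dictionary here is a tiny illustrative sample, not a complete resource:

```python
import re
import unicodedata

# Tiny illustrative contraction dictionary -- an assumed sample, not a
# complete resource; real projects use much larger lists.
CONTRACTIONS = {"he's": "he is", "don't": "do not", "c'est": "ce est"}

def normalize(text):
    # Lowercase so that "Soleil" and "soleil" become the same token.
    text = text.lower()
    # Expand contractions using the dictionary.
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    # Strip accents: "création" -> "creation".
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Replace special characters with spaces (keep letters, digits, spaces).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Le Soleil brille, c'est l'été !"))
```

Whether each of these steps helps depends on the task; for sentiment analysis, for instance, you might deliberately keep some punctuation.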
<\/span><span style=\"font-weight: 400;\">The<\/span><b> special characters<\/b><span style=\"font-weight: 400;\"> are often superfluous \u2013 even if in some cases they may be needed (sentiment analysis, for example) \u2013 and it is otherwise better to remove them.<\/span><\/span><\/p><ul><li><h4><span class=\"ez-toc-section\" id=\"Les_Stopwords_mots_vides\"><\/span><span style=\"color: #000000;\"><strong>Stopwords<\/strong><\/span><span class=\"ez-toc-section-end\"><\/span><\/h4><\/li><\/ul><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">In computing, stop words (<\/span><i><span style=\"font-weight: 400;\">stopwords<\/span><\/i><span style=\"font-weight: 400;\">) are words that are filtered before or after natural language data processing (<\/span><b>NLP<\/b><span style=\"font-weight: 400;\">). Although the term \"stop words\" generally refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools (<\/span><b>Wikipedia<\/b><span style=\"font-weight: 400;\">).<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Indeed, depending on the NLP task, certain words and expressions are useless in the context of the work to be carried out.<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Suppose we want to do <\/span><span style=\"font-weight: 400;\">a simple similarity test between documents written in French. To do this, the idea is to count, for each document, the 15 most frequent words. 
If two documents have more than 7 of these frequent words in common, we will assume that they are similar; otherwise, they are different.<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">This is a fairly simple, even simplistic process \u2013 document similarity requires much more than these steps \u2013 but as an example, let's keep it simple.<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">If we use texts in their raw form without removing the <\/span><i><span style=\"font-weight: 400;\">stopwords<\/span><\/i><span style=\"font-weight: 400;\">, we would risk ending up with only similar texts. Why?<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">It\u2019s simple: the 45 to 50 most frequent words in the French language represent, on average, 50% of a text.<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">And if we go further, the 600 most frequent words in the French language represent 90% of a text.<\/span><\/p><p><span style=\"color: #000000;\"><strong><a style=\"color: #000000;\" href=\"http:\/\/www.maitresseuh.fr\/aider-les-eleves-a-lire-rapidement-les-mots-frequents-a112961886\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/strong><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-5755859 e-flex e-con-boxed e-con e-parent\" data-id=\"5755859\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-24b419f elementor-widget elementor-widget-image\" data-id=\"24b419f\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"500\" height=\"288\" 
src=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/pasted-image-0.png\" class=\"attachment-medium_large size-medium_large wp-image-3250\" alt=\"\" srcset=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/pasted-image-0.png 500w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/pasted-image-0-300x173.png 300w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/pasted-image-0-18x10.png 18w\" sizes=\"(max-width: 500px) 100vw, 500px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-009cf45 e-flex e-con-boxed e-con e-parent\" data-id=\"009cf45\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-8028a6c elementor-widget elementor-widget-text-editor\" data-id=\"8028a6c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-weight: 400; color: #000000;\">You can now see the usefulness of removing, in this case, the stop words that could distort the similarity.<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">There are lists (not very extensive) of <\/span><i><span style=\"font-weight: 400;\">stop words<\/span><\/i><span style=\"font-weight: 400;\"> in French within libraries (such as <\/span><i><span style=\"font-weight: 400;\">NLTK<\/span><\/i><span style=\"font-weight: 400;\"> for Python) or on the Internet. 
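The toy similarity test described above can be sketched in a few lines of Python. The stopword list below is a deliberately tiny illustrative sample; in practice you would start from a library list such as NLTK's French stopwords:

```python
from collections import Counter

# Deliberately tiny illustrative French stopword list; in practice, start
# from a library list (e.g. NLTK's French stopwords) and adapt it.
STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une",
             "et", "en", "est", "dans", "que", "qui", "pour"}

def top_words(text, n=15):
    """Return the set of the n most frequent non-stopword tokens."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return {word for word, _ in Counter(tokens).most_common(n)}

def similar(doc_a, doc_b, threshold=7):
    """Toy test: documents are 'similar' when their frequent-word sets
    share more than `threshold` words."""
    return len(top_words(doc_a) & top_words(doc_b)) > threshold

doc1 = "le chat dort dans la maison et le chat mange"
doc2 = "le chien dort dans le jardin et le chien court"
print(similar(doc1, doc2))
```

Without the stopword filter, words like "le", "dans", and "et" would dominate the frequent-word sets and almost any pair of French documents would look similar.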
As the list differs depending on the purpose, it is best to use these lists as a starting point and add or remove words as needed.<\/span><\/span><\/p><ul><li><h4><span class=\"ez-toc-section\" id=\"Stemming_racinisation_ou_desuffixation\"><\/span><span style=\"color: #000000;\"><strong>Stemming (desuffixation)<\/strong><\/span><span class=\"ez-toc-section-end\"><\/span><\/h4><\/li><\/ul><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">In linguistic morphology and information retrieval, <\/span><i><span style=\"font-weight: 400;\">stemming<\/span><\/i><span style=\"font-weight: 400;\"> is the process of reducing inflected (or sometimes derived) words to their stem, base, or root \u2013 usually a written form (<\/span><a style=\"color: #000000;\" href=\"https:\/\/fr.wikipedia.org\/wiki\/Racinisation\" target=\"_blank\" rel=\"noopener\"><b>Wikipedia<\/b><\/a><span style=\"font-weight: 400;\">).<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">In simpler terms, <\/span><i><span style=\"font-weight: 400;\">stemming<\/span><\/i><span style=\"font-weight: 400;\"> is the process of reducing a word to its \u201croot.\u201d<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">For example, in French: <\/span><b>marcher<\/b><span style=\"font-weight: 400;\">, <\/span><b>marches<\/b><span style=\"font-weight: 400;\">, <\/span><b>marchons<\/b><span style=\"font-weight: 400;\">, <\/span><b>marcheur<\/b><span style=\"font-weight: 400;\">, \u2026 will all be reduced to <\/span><b>march<\/b><span style=\"font-weight: 400;\"> and will thus share the same meaning in the text.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Search engines use it when you make a query, to display more results and\/or correct errors in your query (<\/span><a style=\"color: #000000;\" href=\"https:\/\/en.wikipedia.org\/wiki\/Query_expansion\" 
target=\"_blank\" rel=\"noopener\"><b>query expansion<\/b><\/a><span style=\"font-weight: 400;\">).<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Let\u2019s say you\u2019ve been living in a cave for the last few decades and now you want to watch Star Wars (there are so many) and you\u2019re not very good at French grammar like me. You search for \u201c <\/span><i><span style=\"font-weight: 400;\">quelles ordre regardait stars war<\/span><\/i><span style=\"font-weight: 400;\"> \u00bb<\/span><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-decb8dd e-flex e-con-boxed e-con e-parent\" data-id=\"decb8dd\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-7df2b98 elementor-widget elementor-widget-image\" data-id=\"7df2b98\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"768\" height=\"462\" src=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-34-57-768x462.png\" class=\"attachment-medium_large size-medium_large wp-image-3260\" alt=\"\" srcset=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-34-57-768x462.png 768w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-34-57-300x180.png 300w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-34-57-18x12.png 18w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-34-57.png 805w\" sizes=\"(max-width: 768px) 100vw, 768px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-ba3d45b e-flex e-con-boxed e-con e-parent\" data-id=\"ba3d45b\" 
data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-c6c9ba9 elementor-widget elementor-widget-text-editor\" data-id=\"c6c9ba9\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">As you can see, the first answer is the one I was looking for even though I didn't enter a grammatically correct query.<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">Search engines use different techniques to \u201cexpand\u201d and make your query better and one of them is <\/span><i><span style=\"font-weight: 400;\"><a href=\"https:\/\/botpress.com\/fr\/blog\/natural-language-processing-nlp\">Stemming<\/a>.<\/span><\/i><span style=\"font-weight: 400;\"><br \/><\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">There are different algorithms that implement the <\/span><i><span style=\"font-weight: 400;\">stemming<\/span><\/i><span style=\"font-weight: 400;\"> : Lovins Stemmer, Porter Stemmer, Paice Stemmer, etc. 
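NLTK's French Snowball stemmer can be tried in a few lines (this assumes the nltk package is installed; the stemmer is rule-based, so no corpus download is required):

```python
# Requires the third-party NLTK package (pip install nltk); the Snowball
# stemmer is purely rule-based, so no corpus download is needed.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("french")

for word in ["marcher", "marches", "marchons", "marchaient", "marcheur"]:
    print(word, "->", stemmer.stem(word))
```

The inflected verb forms collapse to a common stem, which is exactly what makes the word-counting tricks above more robust.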
Each one has its own way of extracting the <\/span><i><span style=\"font-weight: 400;\">stem<\/span><\/i><i><span style=\"font-weight: 400;\"> (root)<\/span><\/i><span style=\"font-weight: 400;\"> of a word.<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">Most of these algorithms work with the English language.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">However, <\/span><i><span style=\"font-weight: 400;\">stemming<\/span><\/i><span style=\"font-weight: 400;\"> algorithms have been implemented for <\/span><a style=\"color: #000000;\" href=\"https:\/\/web.archive.org\/web\/20230528153051\/http:\/\/snowball.tartarus.org\/algorithms\/french\/stemmer.html)\"><span style=\"font-weight: 400;\">French<\/span><\/a><span style=\"font-weight: 400;\">, and the Python library <\/span><i><span style=\"font-weight: 400;\">NLTK<\/span><\/i><span style=\"font-weight: 400;\"> has a French <\/span><i><span style=\"font-weight: 400;\">Stemmer<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Let's test this <\/span><i><span style=\"font-weight: 400;\">Stemmer<\/span><\/i><span style=\"font-weight: 400;\"> with a sentence from Stephen King's The Long Walk:<\/span><\/span><\/p><p><span style=\"color: #000000;\"><b><i>Ils marchaient dans l\u2019obscurit\u00e9 pluvieuse comme des fant\u00f4mes d\u00e9charn\u00e9s, et Garraty n\u2019aimait pas les regarder. 
C\u2019\u00e9taient des morts-vivants.<\/i><\/b><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">After passing it to the <\/span><i><span style=\"font-weight: 400;\">Stemmer<\/span><\/i><span style=\"font-weight: 400;\"> of <\/span><i><span style=\"font-weight: 400;\">NLTK<\/span><\/i><span style=\"font-weight: 400;\">, we get the following result:<\/span><\/span><\/p><p><span style=\"color: #000000;\"><b><i>il march dan l\u2019obscur pluvieux comm de fant\u00f4m d\u00e9charnes et garraty n\u2019aim pas le regarder c\u2019et de morts-vivants<\/i><\/b><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Reading the output, you can clearly see that several of the reduced words do not exist in the French dictionary: <\/span><b>fant\u00f4m, aim,<\/b><span style=\"font-weight: 400;\"> \u2026<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">As to why this might happen, this is the best answer I could find online:<\/span><\/span><\/p><p><span style=\"color: #000000;\"><b><i>It is often considered a gross error that a stemming algorithm does not leave a real word after removing the stem. 
But the goal of stemming is to gather the different forms of a word, not to match a word to its \u2018paradigmatic\u2019 form.<\/i><\/b><a style=\"color: #000000;\" href=\"https:\/\/tartarus.org\/martin\/PorterStemmer\/\" target=\"_blank\" rel=\"noopener\"><b> Source<\/b><\/a><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">In summary, <\/span><i><span style=\"font-weight: 400;\">stemming<\/span><\/i><span style=\"font-weight: 400;\"> serves to group together, under a single form, several words sharing the same meaning, by removing gender, number, conjugation, etc.<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">These algorithms are not perfect, however: they work well in some cases and, in others, group together words that do not share the same meaning.<\/span><\/p><p><span style=\"color: #000000;\"><b><i>PS :<\/i><\/b><i><span style=\"font-weight: 400;\"> Stemming is not a concept that applies to all languages. It is not, for example, applicable in Chinese. But for languages of the Indo-European group, a common pattern of word structure emerges. Assuming that words are written from left to right, the stem or root of a word is on the left, and zero or more suffixes may be added on the right. 
<\/span><\/i><a style=\"color: #000000;\" href=\"http:\/\/snowball.tartarus.org\/texts\/introduction.html\"><i><span style=\"font-weight: 400;\">Source<\/span><\/i><\/a><\/span><\/p><ul><li><h4><span class=\"ez-toc-section\" id=\"Lemmatisation\"><\/span><span style=\"color: #000000;\"><strong>Lemmatization<\/strong><\/span><span class=\"ez-toc-section-end\"><\/span><\/h4><\/li><\/ul><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning (<\/span><a style=\"color: #000000;\" href=\"https:\/\/fr.wikipedia.org\/wiki\/Lemmatisation\" target=\"_blank\" rel=\"noopener\"><b>Wikipedia<\/b><\/a><span style=\"font-weight: 400;\">).<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">In many languages, words appear in multiple inflected forms. For example, in <\/span><span style=\"font-weight: 400;\">French<\/span><span style=\"font-weight: 400;\">, the verb <\/span><b>marcher<\/b><span style=\"font-weight: 400;\"> may appear as <\/span><b>marchera<\/b><span style=\"font-weight: 400;\">, <\/span><b>march\u00e9<\/b><span style=\"font-weight: 400;\">, <\/span><b>marcheront<\/b><span style=\"font-weight: 400;\">, etc. 
The basic form \u00ab <\/span><b>marcher<\/b><span style=\"font-weight: 400;\"> \u00bb, which one might find in a dictionary, is called the <\/span><i><span style=\"font-weight: 400;\">lemma<\/span><\/i><span style=\"font-weight: 400;\"> of the word.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Here, the main goal of <\/span><i><span style=\"font-weight: 400;\">lemmatization<\/span><\/i><span style=\"font-weight: 400;\"> is to group together the different words of a text that share the same \u201cmeaning\u201d into a single word, which is the <\/span><i><span style=\"font-weight: 400;\">lemma<\/span><\/i><span style=\"font-weight: 400;\">, without creating \u00ab <\/span><i><span style=\"font-weight: 400;\">new<\/span><\/i><span style=\"font-weight: 400;\"> \u00bb words as is done in the case of <\/span><i><span style=\"font-weight: 400;\">stemming<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">However, <\/span><i><span style=\"font-weight: 400;\">lemmatizers<\/span><\/i><span style=\"font-weight: 400;\"> are much more difficult to build: you need a dictionary containing most of the words of your language, and you also need to know the part of speech of each word, since a verb and a noun are <\/span><i><span style=\"font-weight: 400;\">lemmatized<\/span><\/i><span style=\"font-weight: 400;\"> differently.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Search engines can also use lemmatization instead of <\/span><i><span style=\"font-weight: 400;\">stemming<\/span><\/i><span style=\"font-weight: 400;\">; it provides more precise results in some cases, but is more difficult to implement.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 
Stemming<\/span>">
400;\">stemming<\/span><\/i><span style=\"font-weight: 400;\">, <\/span><i><span style=\"font-weight: 400;\">NLTK<\/span><\/i><span style=\"font-weight: 400;\"> also has a <\/span><i><span style=\"font-weight: 400;\">lemmatizer<\/span><\/i><span style=\"font-weight: 400;\"> for French.<\/span><\/span><\/p><ul><li><h4><span class=\"ez-toc-section\" id=\"Tokenisation\"><\/span><span style=\"color: #000000;\">Tokenization<\/span><span class=\"ez-toc-section-end\"><\/span><\/h4><\/li><\/ul><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Processing one large chunk of text is usually not the best way to go about it. As the saying goes, <\/span><b>\"divide and conquer\"<\/b><span style=\"font-weight: 400;\">. The same concept applies to NLP tasks as well.<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">When we have a text, we split it into tokens, each token representing a word. This makes it easier to process the text (apply stemming or lemmatization) and to filter out unnecessary tokens (such as special characters or stop words).<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">N-gram<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">An <\/span><i><span style=\"font-weight: 400;\">n-gram<\/span><\/i><span style=\"font-weight: 400;\"> is a contiguous sequence of n elements of a given sample of text. (<\/span><a style=\"color: #000000;\" href=\"https:\/\/fr.wikipedia.org\/wiki\/N-gramme\" target=\"_blank\" rel=\"noopener\"><b>Wikipedia<\/b><\/a><span style=\"font-weight: 400;\">)<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">N-grams<\/span><i><span style=\"font-weight: 400;\"><\/span><\/i><span style=\"font-weight: 400;\"> are therefore sequences of words formed from a text. 
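Stemming<\/span>">
The tokenization step described above can be sketched in a few lines of plain Python (a minimal illustration using only the standard `re` module; the code later in this post uses `nltk.word_tokenize` instead):

```python
import re

def tokenize(text):
    # Lowercase the text, then keep runs of letters (including accented
    # ones) and digits; punctuation and special characters act as separators.
    return re.findall(r"[a-zà-ÿ0-9]+", text.lower())

print(tokenize("c'est fini Anakin, j'ai l'avantage sur toi !"))
# ['c', 'est', 'fini', 'anakin', 'j', 'ai', 'l', 'avantage', 'sur', 'toi']
```

Each element of the returned list is one token; filtering stop words then becomes a simple membership test on this list.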
Here N describes the number of words combined together.<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">If you had the sentence:<\/span><\/p><p><span style=\"color: #000000;\"><b><i>\u00ab c\u2019est fini Anakin, j\u2019ai l\u2019avantage sur toi ! \u00bb<\/i><\/b><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Each small square represents a token (word\/1-gram), which is practically the tokenization of the sentence. Thus, <\/span><b>tokenization<\/b><span style=\"font-weight: 400;\"> can be considered a special case of <\/span><i><span style=\"font-weight: 400;\">n-gram <\/span><\/i><span style=\"font-weight: 400;\">where N=1<\/span><b>.<\/b><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">A <\/span><i><span style=\"font-weight: 400;\">2-gram<\/span><\/i><span style=\"font-weight: 400;\">, also called a <\/span><i><span style=\"font-weight: 400;\">bigram<\/span><\/i><span style=\"font-weight: 400;\">, from the same sentence would produce this:<\/span><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-e75b563 e-flex e-con-boxed e-con e-parent\" data-id=\"e75b563\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-022e922 elementor-widget elementor-widget-image\" data-id=\"022e922\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"524\" height=\"108\" src=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-05-02.png\" class=\"attachment-medium_large size-medium_large wp-image-3270\" alt=\"\" srcset=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-05-02.png 524w, 
https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-05-02-300x62.png 300w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-05-02-18x4.png 18w\" sizes=\"(max-width: 524px) 100vw, 524px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-d64b1b7 e-flex e-con-boxed e-con e-parent\" data-id=\"d64b1b7\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-8ad7193 elementor-widget elementor-widget-text-editor\" data-id=\"8ad7193\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The <\/span><i><span style=\"font-weight: 400;\">n-gram<\/span><\/i><span style=\"font-weight: 400;\"> can be used to find out which sequence of words is most common. 
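The sliding-window construction of n-grams can be sketched in plain Python (an illustrative helper; `nltk.util.ngrams`, used later in this post, provides the same behaviour):

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list;
    # each window position yields one n-gram.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["c'est", "fini", "Anakin", "j'ai", "l'avantage", "sur", "toi"]
print(ngrams(tokens, 1))  # N = 1: one tuple per token (tokenization)
print(ngrams(tokens, 2))  # N = 2: the bigrams of the sentence
```

Counting the resulting tuples (for example with `collections.Counter`) is exactly how the frequency tables below are produced.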
Like this <\/span><a style=\"color: #000000;\" href=\"https:\/\/web.archive.org\/web\/20230528153051\/http:\/\/www.lexique.org\/listes\/liste_trigrammes.php\"><span style=\"font-weight: 400;\">website<\/span><\/a><span style=\"font-weight: 400;\"> that calculates the most common trigrams in French:<\/span><\/span><\/p><table><tbody><tr><td><p><span style=\"color: #000000;\"><b>1-Gram<\/b><\/span><\/p><\/td><td><p><span style=\"color: #000000;\"><b>Occurrence<\/b><\/span><\/p><\/td><td><p><span style=\"color: #000000;\"><b>2-Grams<\/b><\/span><\/p><\/td><td><p><span style=\"color: #000000;\"><b>Occurrence<\/b><\/span><\/p><\/td><td><p><span style=\"color: #000000;\"><b>3-Grams<\/b><\/span><\/p><\/td><td><p><span style=\"color: #000000;\"><b>Occurrence<\/b><\/span><\/p><\/td><\/tr><tr><td><p><span style=\"font-weight: 400; color: #000000;\">de<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">1024824<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">de la<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">132940<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">il y a<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">8903<\/span><\/p><\/td><\/tr><tr><td><p><span style=\"font-weight: 400; color: #000000;\">la<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">602084<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">\u00e0 la<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">56794<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">et de la<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">4796<\/span><\/p><\/td><\/tr><tr><td><p><span style=\"font-weight: 400; color: #000000;\">et<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">563643<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: 
et de<\/sp">
#000000;\">et de<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">37743<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">il y avait<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">4397<\/span><\/p><\/td><\/tr><tr><td><p><span style=\"font-weight: 400; color: #000000;\">le<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">411923<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">dans la<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">30090<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">que je ne<\/span><\/p><\/td><td><p><span style=\"font-weight: 400; color: #000000;\">3894<\/span><\/p><\/td><\/tr><\/tbody><\/table><h4><span class=\"ez-toc-section\" id=\"_Tableau_Analyse_des_n-grammes_et_de_leur_occurrence_dans_un_corpus_textuel\"><\/span><strong><span style=\"color: #000000;\">\u00a0Table: Analysis of n-grams and their occurrence in a text corpus<\/span><\/strong><span class=\"ez-toc-section-end\"><\/span><\/h4><p><span style=\"font-weight: 400; color: #000000;\">POS tagging<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">In corpus linguistics,<\/span><b> part-of-speech tagging<\/b><span style=\"font-weight: 400;\"> (POS tagging), also called grammatical tagging or word-category disambiguation, is the process of marking a word in a text (corpus) as corresponding to a particular part of speech, according to both its definition and its context, that is, its relationship to neighboring and related words in a sentence, paragraph, or phrase. A simplified form of this notion is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. 
(<\/span><a style=\"color: #000000;\" href=\"https:\/\/fr.wikipedia.org\/wiki\/%C3%89tiquetage_morpho-syntaxique\" target=\"_blank\" rel=\"noopener\"><b>Wikipedia<\/b><\/a><span style=\"font-weight: 400;\">)<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">POS tagging, then, amounts to identifying the grammatical nature of each word in a text.<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">It is very useful, especially for <\/span><b>lemmatization<\/b><span style=\"font-weight: 400;\">, because you need to know the nature of a word before you try to lemmatize it. For example, the way you <\/span><b>lemmatize<\/b><span style=\"font-weight: 400;\"> nouns and verbs may differ, because they express plurality or gender in different ways.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Let's take our same example sentence; after passing it through a Python <\/span><b>POS tagger<\/b><span style=\"font-weight: 400;\">, we obtain:<\/span><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-6b5f5bc e-flex e-con-boxed e-con e-parent\" data-id=\"6b5f5bc\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-05b47f7 elementor-widget elementor-widget-image\" data-id=\"05b47f7\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"603\" height=\"136\" src=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-09-11.png\" class=\"attachment-medium_large size-medium_large wp-image-3284\" alt=\"\" srcset=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-09-11.png 603w, 
https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-09-11-300x68.png 300w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-07-12-09-11-18x4.png 18w\" sizes=\"(max-width: 603px) 100vw, 603px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-4de1588 e-flex e-con-boxed e-con e-parent\" data-id=\"4de1588\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-79d9f19 elementor-widget elementor-widget-text-editor\" data-id=\"79d9f19\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-weight: 400; color: #000000;\">It is also useful in translation tasks. Let\u2019s take this simple example I found online:<\/span><\/p><p><span style=\"color: #000000;\"><b>I fish a fish.<\/b><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Translated into French, you have:<\/span><\/p><p><span style=\"color: #000000;\"><b>je p\u00eache un poisson.<\/b><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The word \u201c<\/span><b>fish\u201d<\/b><span style=\"font-weight: 400;\"> has two meanings here. It refers to the verb <\/span><b>to fish (p\u00eacher)<\/b><span style=\"font-weight: 400;\"> when it follows a subject, and to the noun <\/span><b>fish <\/b><span style=\"font-weight: 400;\">(poisson) when it follows an article. A tool is therefore needed to differentiate the two.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">However, doing this by hand would be very tedious in a long text. There are libraries that do this work for languages such as French or English. 
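To make the "fish" disambiguation concrete, here is a deliberately tiny hand-written tagger (a hypothetical toy, not a real POS-tagging library; real taggers such as NLTK's are statistical): it tags "fish" as a verb after a subject pronoun and as a noun otherwise.

```python
PRONOUNS = {"i", "you", "we", "they"}  # toy set of subject pronouns
ARTICLES = {"a", "an", "the"}          # toy set of articles

def tag(tokens):
    tags = []
    for i, word in enumerate(tokens):
        prev = tokens[i - 1].lower() if i > 0 else None
        w = word.lower()
        if w in PRONOUNS:
            tags.append("PRON")
        elif w in ARTICLES:
            tags.append("DET")
        elif w == "fish":
            # Context decides: after a pronoun it is the verb (pêcher),
            # otherwise we treat it as the noun (poisson).
            tags.append("VERB" if prev in PRONOUNS else "NOUN")
        else:
            tags.append("X")
    return list(zip(tokens, tags))

print(tag(["I", "fish", "a", "fish"]))
# [('I', 'PRON'), ('fish', 'VERB'), ('a', 'DET'), ('fish', 'NOUN')]
```

Hard-coded rules like these obviously do not scale, which is exactly why the libraries and deep-learning techniques mentioned here are used instead.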
It is also possible to use techniques of <\/span><b>Deep<\/b> <b>Learning<\/b><span style=\"font-weight: 400;\"> to carry out this work.<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Implementation in Python<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">We will now try to implement some of these techniques in Python.<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">To do this, let's try a simple exercise: take a book and study its most important words and bigrams to briefly grasp what the book is about.<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">If you are not interested in the code, you can skip this part and go straight to the conclusion. <\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Let's first download Voltaire's Zadig from the Project Gutenberg library.<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">[python]<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">import requests<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">zadig_response = requests.get('https:\/\/www.gutenberg.org\/cache\/epub\/4647\/pg4647.txt')<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">zadig_data = zadig_response.text<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"># Each book comes with license text that we are not really interested in, so we remove that part<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">zadig_data = zadig_data.split('*******')[2]<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">[\/python]<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Now let's define a function that cleans, tokenizes, stems and lemmatizes the text:<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">[python]<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">import re<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">import unicodedata<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">import nltk<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">from collections import Counter<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">from nltk.corpus import stopwords<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">from nltk.stem.snowball import FrenchStemmer<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">from nltk.util import ngrams<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">from unidecode import unidecode<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\"># FrenchLefffLemmatizer comes from the third-party french-lefff-lemmatizer package<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">from french_lefff_lemmatizer.french_lefff_lemmatizer import FrenchLefffLemmatizer<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">def process_data(data):<\/span><span 
style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">#Let's put all the words in the book in lowercase<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">data = data.lower()<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">#Take only letters and numbers and remove all special characters<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">pattern = r'[^a-zA-z0-9\\s]\u2019<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">data = re.sub(pattern,\u00a0 \u00bb, data)<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">#Remove all stop words <\/span><span style=\"font-weight: 400;\">(stopwords)<\/span><span style=\"font-weight: 400;\">\u00a0 like \u2018les\u2019, \u2018du\u2019, \u2026\u2026. and tokenize the text<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">stop_words = set(stopwords.words(\u2018french\u2019))<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">stop_words.add(\u2018[\u2018)<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">stop_words.add(\u2018]\u2019)<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">stop_words.add(\u2018les\u2019)<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">stop_words.add(\u2018a\u2019)<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">word_tokens = nltk.word_tokenize(data)<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">words = [w for w in word_tokens if not unidecode(w) in stop_words]<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">#remove all accents if they exist<\/span><span 
style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">data = unicodedata.normalize(\u2018NFKD\u2019,data).encode(\u2018ascii\u2019, \u2018ignore\u2019).decode(\u2018utf-8\u2019, \u2018ignore\u2019)<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">#Create a stemmer in French<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">fs = FrenchStemmer()<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">text_stems = [fs.stem(word) for word in words]<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">#Create a lemmatizer in French<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">lemmatizer = FrenchLefffLemmatizer()<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">text_lemms = [lemmatizer.lemmatize(word,\u2019v\u2019) for word in words]<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">return (text_stems, text_lemms)<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">[\/python]<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Then we count the most frequent words in the text first for the text passed through a Stemmer:<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">[python]<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">#Now let's count the words for lemmas and stems<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">text_stems,text_lems = process_data(zadig_data)<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">count = Counter(text_stems)<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">print(\u2018Most used words in Zadig with stems:\u2019)<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 
400;\">for word in count.most_common(15):<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">print (word)<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">[\/python]<\/span><span style=\"font-weight: 400;\">\u00a0<\/span><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-fd7e110 e-flex e-con-boxed e-con e-parent\" data-id=\"fd7e110\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-d504da9 elementor-widget elementor-widget-image\" data-id=\"d504da9\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"510\" height=\"301\" src=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-52-19.png\" class=\"attachment-medium_large size-medium_large wp-image-3297\" alt=\"\" srcset=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-52-19.png 510w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-52-19-300x177.png 300w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-52-19-18x12.png 18w\" sizes=\"(max-width: 510px) 100vw, 510px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-640880c e-flex e-con-boxed e-con e-parent\" data-id=\"640880c\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-cd0c94c elementor-widget elementor-widget-text-editor\" data-id=\"cd0c94c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-weight: 400; color: #000000;\">For words passed through a Lemmatizer:<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">[python]<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">count = Counter(text_lems)<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">print(\u2018Most used words in Zadig with lemmas:\u2019)<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">for word in count.most_common(15):<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">print (word)<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">[\/python]<\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-ae53ba1 e-flex e-con-boxed e-con e-parent\" data-id=\"ae53ba1\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-4f218c4 elementor-widget elementor-widget-image\" data-id=\"4f218c4\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"500\" height=\"298\" src=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-55-03.png\" class=\"attachment-medium_large size-medium_large wp-image-3304\" alt=\"\" srcset=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-55-03.png 500w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-55-03-300x179.png 300w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-55-03-18x12.png 18w\" sizes=\"(max-width: 500px) 100vw, 500px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-eeb5a7d e-flex e-con-boxed e-con e-parent\" data-id=\"eeb5a7d\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-22286b7 elementor-widget elementor-widget-text-editor\" data-id=\"22286b7\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-weight: 400; color: #000000;\">Just for fun, let's count the most common bigrams:<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">[python]<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">ngram_counts = Counter(ngrams(text_lems, 2))<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">print('The 10 most frequent bigrams:')<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">for word in ngram_counts.most_common(10):<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">print(word)<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">[\/python]<\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-cf7db60 e-flex e-con-boxed e-con e-parent\" data-id=\"cf7db60\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-d17333f elementor-widget elementor-widget-image\" data-id=\"d17333f\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"341\" height=\"212\" 
src=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-56-37.png\" class=\"attachment-medium_large size-medium_large wp-image-3311\" alt=\"\" srcset=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-56-37.png 341w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-56-37-300x187.png 300w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/Capture-du-2019-01-11-09-56-37-18x12.png 18w\" sizes=\"(max-width: 341px) 100vw, 341px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-c5d1407 e-flex e-con-boxed e-con e-parent\" data-id=\"c5d1407\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t<div class=\"elementor-element elementor-element-efbdcba e-con-full e-flex e-con e-child\" data-id=\"efbdcba\" data-element_type=\"container\">\n\t\t\t\t<div class=\"elementor-element elementor-element-6335445 elementor-widget elementor-widget-text-editor\" data-id=\"6335445\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-weight: 400; color: #000000;\">In conclusion<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">We have reviewed the main methods of text data pre-processing, they are used to facilitate the translation of a text written in human language into machine language.<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">La plus grande remarque faite lors des recherches pour cet article est que la plupart de ces techniques n\u2019existent que pour les langues <\/span><b>main<\/b><span style=\"font-weight: 400;\"> and are only \"perfect\" for English in general. 
For less-resourced languages, such as Wolof, it becomes essential to implement all these techniques in order to process texts written in them effectively.<\/span><\/span><\/p><p data-start=\"0\" data-end=\"335\"><span style=\"color: #000000;\">Do not hesitate to read our article on<a href=\"https:\/\/blog.baamtu.com\/en\/les-fondements-du-nlp-et-du-deep-learning-guide-complet\/\" target=\"_blank\" rel=\"noopener\"> The foundations of NLP and deep learning<\/a>.<\/span><\/p><p data-start=\"0\" data-end=\"335\"><a href=\"https:\/\/tally.so\/r\/3j7r1Q?AT=techNLP\">If you want our next articles on the subject, click here!<\/a><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>","protected":false},"excerpt":{"rendered":"<p>Les techniques de pr\u00e9traitement en traitement du langage naturel permettent de pr\u00e9parer et d\u2019assainir les donn\u00e9es brutes pour les rendre [&hellip;]<\/p>","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default",
"ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3093","post","type-post","status-publish","format-standard","hentry","category-blog"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Techniques de pr\u00e9traitement en traitement du langage naturel - Blog de Baamtu<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.baamtu.com\/en\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Techniques de pr\u00e9traitement en traitement du langage naturel - Blog de Baamtu\" \/>\n<meta property=\"og:description\" content=\"Les techniques de pr\u00e9traitement en traitement du langage naturel permettent de pr\u00e9parer et d\u2019assainir les donn\u00e9es brutes pour les rendre [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.baamtu.com\/en\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/\" \/>\n<meta property=\"og:site_name\" content=\"Blog de Baamtu\" \/>\n<meta property=\"article:published_time\" content=\"2025-03-06T06:50:51+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-03-07T04:16:48+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg\" \/>\n<meta name=\"author\" content=\"Baamtu\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Baamtu\" 
\/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"16 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/\"},\"author\":{\"name\":\"Baamtu\",\"@id\":\"https:\/\/blog.baamtu.com\/#\/schema\/person\/13e4a1b7b2e6f6435ae42c1469d97aa7\"},\"headline\":\"Techniques de pr\u00e9traitement en traitement du langage naturel\",\"datePublished\":\"2025-03-06T06:50:51+00:00\",\"dateModified\":\"2025-03-07T04:16:48+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/\"},\"wordCount\":3120,\"commentCount\":1,\"publisher\":{\"@id\":\"https:\/\/blog.baamtu.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg\",\"articleSection\":[\"Blog\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/\",\"url\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/\",\"name\":\"Techniques de pr\u00e9traitement en traitement du langage naturel - Blog de 
Baamtu\",\"isPartOf\":{\"@id\":\"https:\/\/blog.baamtu.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg\",\"datePublished\":\"2025-03-06T06:50:51+00:00\",\"dateModified\":\"2025-03-07T04:16:48+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#primaryimage\",\"url\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg\",\"contentUrl\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.baamtu.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Techniques de pr\u00e9traitement en traitement du langage naturel\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.baamtu.com\/#website\",\"url\":\"https:\/\/blog.baamtu.com\/\",\"name\":\"Blog de 
Baamtu\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/blog.baamtu.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.baamtu.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/blog.baamtu.com\/#organization\",\"name\":\"Blog de Baamtu\",\"url\":\"https:\/\/blog.baamtu.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.baamtu.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2024\/04\/cropped-logo_baamtu.png\",\"contentUrl\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2024\/04\/cropped-logo_baamtu.png\",\"width\":674,\"height\":158,\"caption\":\"Blog de Baamtu\"},\"image\":{\"@id\":\"https:\/\/blog.baamtu.com\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.baamtu.com\/#\/schema\/person\/13e4a1b7b2e6f6435ae42c1469d97aa7\",\"name\":\"Baamtu\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.baamtu.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5732172cc11a69b9c95d9d577e1fad6f?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5732172cc11a69b9c95d9d577e1fad6f?s=96&d=mm&r=g\",\"caption\":\"Baamtu\"},\"url\":\"https:\/\/blog.baamtu.com\/en\/author\/baamtu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Techniques de pr\u00e9traitement en traitement du langage naturel - Blog de Baamtu","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.baamtu.com\/en\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/","og_locale":"en_US","og_type":"article","og_title":"Techniques de pr\u00e9traitement en traitement du langage naturel - Blog de Baamtu","og_description":"Les techniques de pr\u00e9traitement en traitement du langage naturel permettent de pr\u00e9parer et d\u2019assainir les donn\u00e9es brutes pour les rendre [&hellip;]","og_url":"https:\/\/blog.baamtu.com\/en\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/","og_site_name":"Blog de Baamtu","article_published_time":"2025-03-06T06:50:51+00:00","article_modified_time":"2025-03-07T04:16:48+00:00","og_image":[{"url":"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg"}],"author":"Baamtu","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Baamtu","Est. 
reading time":"16 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#article","isPartOf":{"@id":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/"},"author":{"name":"Baamtu","@id":"https:\/\/blog.baamtu.com\/#\/schema\/person\/13e4a1b7b2e6f6435ae42c1469d97aa7"},"headline":"Techniques de pr\u00e9traitement en traitement du langage naturel","datePublished":"2025-03-06T06:50:51+00:00","dateModified":"2025-03-07T04:16:48+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/"},"wordCount":3120,"commentCount":1,"publisher":{"@id":"https:\/\/blog.baamtu.com\/#organization"},"image":{"@id":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg","articleSection":["Blog"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/","url":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/","name":"Techniques de pr\u00e9traitement en traitement du langage naturel - Blog de 
Baamtu","isPartOf":{"@id":"https:\/\/blog.baamtu.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#primaryimage"},"image":{"@id":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg","datePublished":"2025-03-06T06:50:51+00:00","dateModified":"2025-03-07T04:16:48+00:00","breadcrumb":{"@id":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#primaryimage","url":"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg","contentUrl":"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/computer_understand_011_language-1.jpg"},{"@type":"BreadcrumbList","@id":"https:\/\/blog.baamtu.com\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.baamtu.com\/"},{"@type":"ListItem","position":2,"name":"Techniques de pr\u00e9traitement en traitement du langage naturel"}]},{"@type":"WebSite","@id":"https:\/\/blog.baamtu.com\/#website","url":"https:\/\/blog.baamtu.com\/","name":"Blog de 
Baamtu","description":"","publisher":{"@id":"https:\/\/blog.baamtu.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.baamtu.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/blog.baamtu.com\/#organization","name":"Blog de Baamtu","url":"https:\/\/blog.baamtu.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.baamtu.com\/#\/schema\/logo\/image\/","url":"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2024\/04\/cropped-logo_baamtu.png","contentUrl":"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2024\/04\/cropped-logo_baamtu.png","width":674,"height":158,"caption":"Blog de Baamtu"},"image":{"@id":"https:\/\/blog.baamtu.com\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/blog.baamtu.com\/#\/schema\/person\/13e4a1b7b2e6f6435ae42c1469d97aa7","name":"Baamtu","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.baamtu.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5732172cc11a69b9c95d9d577e1fad6f?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5732172cc11a69b9c95d9d577e1fad6f?s=96&d=mm&r=g","caption":"Baamtu"},"url":"https:\/\/blog.baamtu.com\/en\/author\/baamtu\/"}]}},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"trp-custom-language-flag":false},"uagb_author_info":{"display_name":"Baamtu","author_link":"https:\/\/blog.baamtu.com\/en\/author\/baamtu\/"},"uagb_comment_info":1,"uagb_excerpt":"Les techniques de pr\u00e9traitement en traitement du langage naturel permettent de pr\u00e9parer et d\u2019assainir les donn\u00e9es brutes pour les rendre 
[&hellip;]","_links":{"self":[{"href":"https:\/\/blog.baamtu.com\/en\/wp-json\/wp\/v2\/posts\/3093"}],"collection":[{"href":"https:\/\/blog.baamtu.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.baamtu.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.baamtu.com\/en\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.baamtu.com\/en\/wp-json\/wp\/v2\/comments?post=3093"}],"version-history":[{"count":179,"href":"https:\/\/blog.baamtu.com\/en\/wp-json\/wp\/v2\/posts\/3093\/revisions"}],"predecessor-version":[{"id":3505,"href":"https:\/\/blog.baamtu.com\/en\/wp-json\/wp\/v2\/posts\/3093\/revisions\/3505"}],"wp:attachment":[{"href":"https:\/\/blog.baamtu.com\/en\/wp-json\/wp\/v2\/media?parent=3093"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.baamtu.com\/en\/wp-json\/wp\/v2\/categories?post=3093"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.baamtu.com\/en\/wp-json\/wp\/v2\/tags?post=3093"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}