{"id":3324,"date":"2025-03-07T16:24:31","date_gmt":"2025-03-07T15:24:31","guid":{"rendered":"https:\/\/blog.baamtu.com\/?p=3324"},"modified":"2025-04-29T08:24:44","modified_gmt":"2025-04-29T06:24:44","slug":"one-hot-encoding-et-bag-of-words-comprendre-les-differences","status":"publish","type":"post","link":"https:\/\/blog.baamtu.com\/en\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/","title":{"rendered":"One Hot Encoding and Bag of Words: Understanding the Differences."},"content":{"rendered":"<div data-elementor-type=\"wp-post\" data-elementor-id=\"3324\" class=\"elementor elementor-3324\">\n\t\t\t\t<div class=\"elementor-element elementor-element-cec4be3 e-flex e-con-boxed e-con e-parent\" data-id=\"cec4be3\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-684e4ec elementor-widget elementor-widget-text-editor\" data-id=\"684e4ec\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<style>\/*! 
elementor - v3.23.0 - 05-08-2024 *\/\n.elementor-widget-text-editor.elementor-drop-cap-view-stacked .elementor-drop-cap{background-color:#69727d;color:#fff}.elementor-widget-text-editor.elementor-drop-cap-view-framed .elementor-drop-cap{color:#69727d;border:3px solid;background-color:transparent}.elementor-widget-text-editor:not(.elementor-drop-cap-view-default) .elementor-drop-cap{margin-top:8px}.elementor-widget-text-editor:not(.elementor-drop-cap-view-default) .elementor-drop-cap-letter{width:1em;height:1em}.elementor-widget-text-editor .elementor-drop-cap{float:left;text-align:center;line-height:1;font-size:50px}.elementor-widget-text-editor .elementor-drop-cap-letter{display:inline-block}<\/style>\t\t\t\t<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_69_1 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of content<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewbox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewbox=\"0 0 24 24\" version=\"1.2\" baseprofile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 
6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/blog.baamtu.com\/en\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#One_Hot_Encoding_et_Bag_of_Words_Comprendre_les_differences\" title=\"One Hot Encoding and Bag of Words: Understanding the Differences.\">One Hot Encoding and Bag of Words: Understanding the Differences.<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/blog.baamtu.com\/en\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#IOne_Hot_Encoding_Comprendre_les_differences\" title=\"I. One Hot Encoding: Understanding the Differences\">I. One Hot Encoding: Understanding the Differences<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/blog.baamtu.com\/en\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#II_Bag_Of_Words_Comprendre_les_differences\" title=\"II. Bag Of Words: Understanding the Differences\">II. Bag Of Words: Understanding the Differences<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"One_Hot_Encoding_et_Bag_of_Words_Comprendre_les_differences\"><\/span><span style=\"color: #000000;\"><strong data-start=\"0\" data-end=\"65\" data-is-only-node=\"\">One Hot Encoding and Bag of Words: Understanding the Differences.<\/strong><\/span><span class=\"ez-toc-section-end\"><\/span><\/h2><p><br data-start=\"65\" data-end=\"68\" \/><span style=\"color: #000000;\">In this article, we analyze two popular methods of numerical text representation in NLP.<\/span><\/p><p><span style=\"color: #000000;\">You will discover how one hot encoding and bag of words can transform text data to make it usable by machines. We will discuss their specificities, advantages, and limitations, illustrating everything with concrete Python implementation examples to optimize your natural language processing projects.<\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-9f6a9df e-flex e-con-boxed e-con e-parent\" data-id=\"9f6a9df\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-122180c elementor-widget elementor-widget-image\" data-id=\"122180c\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<style>\/*! 
elementor - v3.23.0 - 05-08-2024 *\/\n.elementor-widget-image{text-align:center}.elementor-widget-image a{display:inline-block}.elementor-widget-image a img[src$=\".svg\"]{width:48px}.elementor-widget-image img{vertical-align:middle;display:inline-block}<\/style>\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"768\" height=\"432\" src=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding-768x432.png\" class=\"attachment-medium_large size-medium_large wp-image-3657\" alt=\"\" srcset=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding-768x432.png 768w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding-300x169.png 300w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding-1024x576.png 1024w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding-1536x864.png 1536w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding-2048x1152.png 2048w, https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding-18x10.png 18w\" sizes=\"(max-width: 768px) 100vw, 768px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-ea09236 e-flex e-con-boxed e-con e-parent\" data-id=\"ea09236\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-2ee1b07 elementor-widget elementor-widget-text-editor\" data-id=\"2ee1b07\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">In natural language processing ( <\/span><b>NLP <\/b><span style=\"font-weight: 400;\">), we are trying to give machines the ability to understand human language. That's interesting, isn't it! 
But a big problem arises: humans communicate with sentences and words, while machines only understand numbers. It therefore becomes important to be able to translate a text written in a <\/span><b>human language<\/b><span style=\"font-weight: 400;\"> into a <\/span><b>machine<\/b><span style=\"font-weight: 400;\"> language. <\/span><\/span><\/p><p><span style=\"color: #000000;\"><strong>In the <a style=\"color: #000000;\" href=\"https:\/\/blog.baamtu.com\/en\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/\" target=\"_blank\" rel=\"noopener\"><strong>previous episode<\/strong><\/a><\/strong><span style=\"font-weight: 400;\">, we covered the preprocessing step, which is essential before tackling this translation and moving on to natural language processing (NLP).<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">In this episode, our goal is to transform the words of a text into numbers so that they can be interpreted by the computer.<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">To achieve this, there are several approaches; we will focus on the simplest ones, such as <\/span><b>one hot encoding<\/b><span style=\"font-weight: 400;\"> and the <\/span><b>bag of words<\/b><span style=\"font-weight: 400;\">. 
In the next episodes, we will touch on the latest and most effective techniques.<\/span><\/span><\/p><h2><span class=\"ez-toc-section\" id=\"IOne_Hot_Encoding_Comprendre_les_differences\"><\/span><span style=\"color: #000000;\"><b>I. One Hot Encoding: Understanding the Differences<\/b><\/span><span class=\"ez-toc-section-end\"><\/span><\/h2><p><span style=\"color: #000000;\"><b>One Hot Encoding<\/b><span style=\"font-weight: 400;\"> refers to the process by which categorical variables are converted into binary <\/span><b>vectors<\/b><span style=\"font-weight: 400;\"> (0,1) in which each vector contains a single <\/span><b>1<\/b><span style=\"font-weight: 400;\">.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">In Natural Language Processing (<\/span><b>NLP<\/b><span style=\"font-weight: 400;\">), why would we need <\/span><b>one hot encoding<\/b><span style=\"font-weight: 400;\">?<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Indeed, we previously raised the subject of translating a text written in human language into machine language. Since the machine only knows binary (0 and 1), it makes sense to use one hot encoding to represent our words.<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">How exactly does it work? 
Let\u2019s first define clearly and concisely some terms that will be useful to us later.<\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Let's say we have a book written in French.\u00a0<\/span><\/p><p><span style=\"color: #000000;\"><b>Vocabulary<\/b><span style=\"font-weight: 400;\">: We define as <\/span><b>vocabulary V<\/b><span style=\"font-weight: 400;\"> the set of distinct words contained in the book.\u00a0<\/span><\/span><\/p><p><span style=\"color: #000000;\"><b>Corpus<\/b><span style=\"font-weight: 400;\">: We call <\/span><b>corpus<\/b><span style=\"font-weight: 400;\"> the text contained in the book. A <\/span><b>corpus<\/b><span style=\"font-weight: 400;\"> is therefore a set of words.\u00a0<\/span><\/span><\/p><p><span style=\"color: #000000;\"><b>Vector<\/b><span style=\"font-weight: 400;\">: A <\/span><b>vector (one hot encoding)<\/b><span style=\"font-weight: 400;\"> in our context is a sequence of 0s and 1s that allows us to represent a word of our <\/span><b>corpus<\/b><span style=\"font-weight: 400;\">. Each word <\/span><b>wi<\/b><span style=\"font-weight: 400;\"> of our vocabulary can be represented by a vector of size <\/span><b>N<\/b><span style=\"font-weight: 400;\">, <\/span><b>N<\/b><span style=\"font-weight: 400;\"> being the number of words contained in our vocabulary <\/span><b>V<\/b><span style=\"font-weight: 400;\">: <\/span><b>[0,0,0,..,1,\u2026,0,0]<\/b><span style=\"font-weight: 400;\"> such that the i-th element is 1 and the others are 0. 
Each sentence in our corpus will then be represented by the set of <\/span><b>one hot encoding<\/b><span style=\"font-weight: 400;\"> vectors of each of its words.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">To explain <\/span><b>one hot encoding<\/b><span style=\"font-weight: 400;\"> simply, let's take a trivial example:<\/span><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-0df2fa7 e-flex e-con-boxed e-con e-parent\" data-id=\"0df2fa7\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-33b9870 elementor-widget elementor-widget-text-editor\" data-id=\"33b9870\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<p><span style=\"color: #000000;\"><b><i>Le NLP est une branche de l\u2019intelligence artificielle.<\/i><\/b><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">We have a sentence composed of 8 distinct words, so our vocabulary V includes 8 words: <\/span><b>{ \u00ab Le \u00bb, \u00ab NLP \u00bb, \u00ab est \u00bb, \u00ab une \u00bb, \u00ab branche \u00bb, \u00ab de \u00bb, \u00ab intelligence \u00bb, \u00ab artificielle \u00bb }<\/b><span style=\"font-weight: 400;\"> (here we consider that \"l'\" has been changed to \"le\")<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">So to represent each word in our vocabulary, we will have a vector composed of 0s except for the i-th element, which will be 1.<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">So <\/span><b>Le<\/b><span style=\"font-weight: 400;\"> will have as its <\/span><b>one hot encoding<\/b><span style=\"font-weight: 400;\"> vector: [ 1 0 0 0 0 0 0 0 ]. NLP will have as its vector [ 0 1 0 0 0 0 0 0 ]. 
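These vectors can be checked with a minimal sketch in plain Python (the token list and `one_hot` helper below are illustrative, not the article's code; "l'" is already normalized to "le" and everything is lowercased):

```python
# Toy one hot encoding of: "Le NLP est une branche de l'intelligence artificielle."
tokens = ["le", "nlp", "est", "une", "branche", "de", "le",
          "intelligence", "artificielle"]

# Build the vocabulary, keeping first-seen order
vocab = []
for token in tokens:
    if token not in vocab:
        vocab.append(token)

def one_hot(word, vocab):
    """Vector of size N with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# One row per token of the sentence
matrix = [one_hot(t, vocab) for t in tokens]
print(len(vocab))   # 8 distinct words
print(len(matrix))  # 9 tokens, hence a 9 x 8 matrix
print(matrix[0])    # [1, 0, 0, 0, 0, 0, 0, 0]
```

Note that both occurrences of « le » get exactly the same vector.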
And so on. Our sentence can then be represented by the matrix contained in the table below:<\/span><\/span><\/p><table><tbody><tr><td>Le<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>NLP<\/td><td>0<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>est<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>une<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>branche<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>de<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>le<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>intelligence<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>0<\/td><\/tr><tr><td>artificielle<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>0<\/td><td>1<\/td><\/tr><\/tbody><\/table><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">As you can see, it's pretty trivial.<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">We can represent the sentence as a 9 x 8 matrix where each row is the one hot encoding vector of a word. The size of a word's vector therefore depends on the size of our vocabulary. This is the main drawback of this technique: the larger the corpus becomes, the more the vocabulary is likely to grow, and a language generally has several thousand distinct words. We can then quickly end up with matrices of enormous size.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The other disadvantage of <\/span><b>one hot encoding<\/b><span style=\"font-weight: 400;\"> is that it does not really provide information on the semantics or even the context of a word; its only goal is to transform a categorical value into a numerical value. There are other techniques that are much more up to date and better adapted to the field of natural language processing (NLP).<\/span><\/span><\/p><h2><span class=\"ez-toc-section\" id=\"II_Bag_Of_Words_Comprendre_les_differences\"><\/span><span style=\"color: #000000;\"><b>II. Bag Of Words: Understanding the Differences<\/b><\/span><span class=\"ez-toc-section-end\"><\/span><\/h2><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The <\/span><b>Bag Of Words<\/b><span style=\"font-weight: 400;\"> model is a simplified representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity (see <\/span><a style=\"color: #000000;\" href=\"https:\/\/en.wikipedia.org\/wiki\/Bag-of-words_model\" target=\"_blank\" rel=\"noopener\"><b>Wikipedia<\/b><\/a><span style=\"font-weight: 400;\">).<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Like <\/span><b>One Hot Encoding<\/b><span style=\"font-weight: 400;\">, the <\/span><b>Bag of Words<\/b><span style=\"font-weight: 400;\"> provides a numerical representation of a text so that it can be understood by the machine.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Here the vector representation of the text describes the occurrence of the words present in an input (a document, a sentence).<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">The idea behind this approach is very simple, as you will see.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">With the terms defined in the previous section, let us assume that we have a corpus composed of <\/span><b>N <\/b><span style=\"font-weight: 400;\">distinct words. 
The size of our vocabulary <\/span><b>V <\/b><span style=\"font-weight: 400;\">is therefore <\/span><b>N.<\/b><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">To represent a sentence, we define a fixed-length vector of size <\/span><b>N<\/b><span style=\"font-weight: 400;\">; each element <\/span><b>i<\/b><span style=\"font-weight: 400;\"> of the vector represents a word from our vocabulary.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">How do we determine the values of a vector representing a sentence? Each element <\/span><b>i <\/b><span style=\"font-weight: 400;\">takes the value of the number of occurrences of the corresponding word in the sentence.<\/span><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">Let's take a simple example to understand the concept. We have the following corpus:<\/span><\/p><p><span style=\"color: #000000;\"><b><i>La vie est courte mais la vie peut para\u00eetre longue.\u00a0<\/i><\/b><\/span><\/p><p><span style=\"color: #000000;\"><b><i>La nuit est proche.<\/i><\/b><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Our vocabulary <\/span><b>V <\/b><span style=\"font-weight: 400;\">is therefore composed of the following words: <\/span><b>{ \u00ab la \u00bb, \u00ab vie \u00bb, \u00ab est \u00bb, \u00ab courte \u00bb, \u00ab mais \u00bb, \u00ab peut \u00bb, \u00ab para\u00eetre \u00bb, \u00ab longue \u00bb, \u00ab nuit \u00bb, \u00ab proche \u00bb }. 
<\/b><span style=\"font-weight: 400;\">To represent a sentence, we will therefore need a vector of size 10 (the number of words in our vocabulary).<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The first sentence will be represented as follows: <\/span><b>[ 2 2 1 1 1 1 1 1 0 0 ]<\/b><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">We see from this representation that <\/span><b>la <\/b><span style=\"font-weight: 400;\">and <\/span><b>vie <\/b><span style=\"font-weight: 400;\">are present twice in the sentence; the other words <\/span><b>est, courte, mais, peut, para\u00eetre<\/b><span style=\"font-weight: 400;\"> and<\/span><b> longue <\/b><span style=\"font-weight: 400;\">are present only once, while words like <\/span><b>nuit <\/b><span style=\"font-weight: 400;\">and <\/span><b>proche<\/b><span style=\"font-weight: 400;\"> do not appear in the sentence.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The same process is applied to the second sentence to obtain its <\/span><b>Bag of Words<\/b><span style=\"font-weight: 400;\"> vector.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The second sentence: <\/span><b>[ 1 0 1 0 0 0 0 0 1 1 ]<\/b><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">The problem with this method is that it does not allow us to determine the meaning of the text or extract the context in which the words appear. 
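The two vectors above can be reproduced with a minimal counting sketch in plain Python (illustrative, not the article's code; we assume the sentences have already been lowercased and stripped of punctuation during preprocessing):

```python
corpus = ["la vie est courte mais la vie peut paraître longue",
          "la nuit est proche"]

# Vocabulary in first-seen order, as listed in the text above
vocab = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

def bag_of_words(sentence, vocab):
    """Size-N vector where element i counts vocab[i] in the sentence."""
    words = sentence.split()
    return [words.count(v) for v in vocab]

print(bag_of_words(corpus[0], vocab))  # [2, 2, 1, 1, 1, 1, 1, 1, 0, 0]
print(bag_of_words(corpus[1], vocab))  # [1, 0, 1, 0, 0, 0, 0, 0, 1, 1]
```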
It only gives us information about the occurrence of the words in a sentence.<\/span><span style=\"font-weight: 400;\"><br \/><\/span><span style=\"font-weight: 400;\">Nevertheless, the <\/span><b>Bag of Words<\/b><span style=\"font-weight: 400;\"> remains a way to extract features from text to use as input to <\/span><b>Machine<\/b> <b>Learning<\/b><span style=\"font-weight: 400;\"> algorithms, for example for document classification.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">To get a good representation with the two techniques presented in this episode, we must not forget to first perform the data <\/span><b>preprocessing<\/b><span style=\"font-weight: 400;\"> described in the previous episode.<\/span><\/span><\/p><ul><li><span style=\"color: #000000;\"><b>Implement One Hot Encoding and Bag Of Words in NLP.<\/b><\/span><\/li><\/ul><p><span style=\"font-weight: 400; color: #000000;\">In this part we will write a small Python implementation of the two techniques presented above. As always, if you don't like code, you can skip to the conclusion :).<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Let's start with <\/span><b>One Hot Encoding<\/b><span style=\"font-weight: 400;\">. We will use the same techniques as in the previous episode. 
Let's take a book, choose a part of it, and compute the <\/span><b>OHE<\/b><span style=\"font-weight: 400;\"> of its sentences.<\/span><\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">In this implementation we will use a book by <\/span><b>Homer<\/b><span style=\"font-weight: 400;\">, downloaded from <\/span><b>Project Gutenberg<\/b><span style=\"font-weight: 400;\">:<\/span><\/span><\/p><pre><code># Download the book and keep only the body of the text\nimport requests\n\nhomer_response = requests.get(\"https:\/\/web.archive.org\/web\/20211128034110\/https:\/\/www.gutenberg.org\/files\/52927\/52927-0.txt\")\nhomer_data = homer_response.text\nhomer_data = homer_data.split(\"***\")[2]<\/code><\/pre><p><a style=\"color: #000000;\" href=\"https:\/\/gist.github.com\/moubaba\/312ecf51b9dd5eb4f5db51da256c4ba7#file-get_book-py\" target=\"_blank\" rel=\"noopener\">get_book.py<\/a><\/p><p><span style=\"font-weight: 400; color: #000000;\">The book is too long, so we'll just take 3 sentences and that will be our corpus. Then we do the data preprocessing and create our vocabulary.<\/span><\/p><pre><code># process_data comes from the previous episode's preprocessing code\ntext_stems_sid, text_lems_sid = process_data(\" \".join(homer_data.split(\".\")[10:13]))\nvocab = list(set(text_stems_sid))\nprint(\" \".join(homer_data.split(\".\")[10:13]))<\/code><\/pre><p>create_vocab.py<\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Now let's compute the <\/span><b>one hot encoding<\/b><span style=\"font-weight: 400;\"> of the last sentence of this text using our vocabulary:<\/span><\/span><\/p><pre><code>stems, lems = process_data(homer_data.split(\".\")[12])\nprint(homer_data.split(\".\")[12])\nonehot_encoded = list()\nfor word in stems:\n    # One vector of vocabulary size per word, with a single 1\n    letter = [0 for _ in range(len(set(text_stems_sid)))]\n    print(word, vocab.index(word))\n    letter[vocab.index(word)] = 1\n    onehot_encoded.append(letter)<\/code><\/pre><p><a style=\"color: #000000;\" href=\"https:\/\/gist.github.com\/moubaba\/312ecf51b9dd5eb4f5db51da256c4ba7#file-onehot-py\" target=\"_blank\" rel=\"noopener\">onehot.py<\/a><\/p><p><span style=\"font-weight: 400; color: #000000;\">One Hot Encoding<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">As you might guess, the <\/span><b>OHE<\/b><span style=\"font-weight: 400;\"> of the first word of this sentence (<\/span><b>minerv<\/b><span style=\"font-weight: 400;\">) will be composed of 0s except for the <\/span><b>29th<\/b><span style=\"font-weight: 400;\"> value, which will be <\/span><b>1:<\/b><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0]<\/span><\/p><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">Now let's tackle the <\/span><b>Bag Of Words. <\/b><span style=\"font-weight: 400;\">For the implementation of the <\/span><b>bag of words<\/b><span style=\"font-weight: 400;\">, we first create a function that returns a vocabulary by taking a corpus as input. Then, we create another function that takes a sentence and a corpus as input and returns the bag of words of the sentence.<\/span><\/span><\/p><table><tbody><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">import<\/span><span style=\"font-weight: 400;\"> numpy <\/span><span style=\"font-weight: 400;\">as<\/span><span style=\"font-weight: 400;\"> np<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">from<\/span><span style=\"font-weight: 400;\"> nltk <\/span><span style=\"font-weight: 400;\">import<\/span><span style=\"font-weight: 400;\"> word_tokenize<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td>\u00a0<\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">corpus <\/span><span style=\"font-weight: 400;\">=<\/span><span style=\"font-weight: 400;\"> [<\/span><span style=\"font-weight: 400;\">\u00ab\u00a0La vie est courte mais la vie peut para\u00eetre longue\u00a0\u00bb<\/span><span style=\"font-weight: 400;\">,<\/span><span style=\"font-weight: 400;\">\u00ab\u00a0La nuit est proche\u00a0\u00bb<\/span><span style=\"font-weight: 400;\">]<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td>\u00a0<\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"font-weight: 400; color: #000000;\">#definir deux phrases du corpus<\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">phrase_1 <\/span><span style=\"font-weight: 400;\">=<\/span> <span style=\"font-weight: 400;\">\u00ab\u00a0La vie 
est courte mais la vie peut para\u00eetre longue\u00a0\u00bb<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">phrase_2 <\/span><span style=\"font-weight: 400;\">=<\/span> <span style=\"font-weight: 400;\">\u00ab\u00a0La nuit est proche\u00a0\u00bb<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td>\u00a0<\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"font-weight: 400; color: #000000;\"># fonction retournant un vocabulaire<\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">def<\/span> <span style=\"font-weight: 400;\">vocabulary<\/span><span style=\"font-weight: 400;\">(corpus):<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 voc <\/span><span style=\"font-weight: 400;\">=<\/span><span style=\"font-weight: 400;\"> []<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 <\/span><span style=\"font-weight: 400;\">for<\/span><span style=\"font-weight: 400;\"> sentence <\/span><span style=\"font-weight: 400;\">in<\/span><span style=\"font-weight: 400;\"> corpus:<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 \u00a0 \u00a0 words <\/span><span style=\"font-weight: 400;\">=<\/span> <span style=\"font-weight: 400;\">word_tokenize<\/span><span style=\"font-weight: 400;\">(sentence.<\/span><span style=\"font-weight: 400;\">lower<\/span><span style=\"font-weight: 400;\">())<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 \u00a0 \u00a0 voc.<\/span><span style=\"font-weight: 400;\">extend<\/span><span style=\"font-weight: 400;\">(words)<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span 
style=\"font-weight: 400; color: #000000;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 voc_clean<\/span><span style=\"font-weight: 400;\">=<\/span><span style=\"font-weight: 400;\"> []<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 <\/span><span style=\"font-weight: 400;\">for<\/span><span style=\"font-weight: 400;\"> w <\/span><span style=\"font-weight: 400;\">in<\/span><span style=\"font-weight: 400;\"> voc:<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 \u00a0 \u00a0 <\/span><span style=\"font-weight: 400;\">if<\/span><span style=\"font-weight: 400;\"> w <\/span><span style=\"font-weight: 400;\">not<\/span> <span style=\"font-weight: 400;\">in<\/span><span style=\"font-weight: 400;\"> voc_clean:<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 voc_clean.<\/span><span style=\"font-weight: 400;\">append<\/span><span style=\"font-weight: 400;\">(w)<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 <\/span><span style=\"font-weight: 400;\">return<\/span><span style=\"font-weight: 400;\"> voc_clean<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td>\u00a0<\/td><\/tr><tr><td>\u00a0<\/td><td>\u00a0<\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"font-weight: 400; color: #000000;\"># fonction retournant un sac de mots<\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">def<\/span> <span style=\"font-weight: 400;\">bagofwords<\/span><span style=\"font-weight: 
400;\">(sentence,corpus):<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 vocab <\/span><span style=\"font-weight: 400;\">=<\/span> <span style=\"font-weight: 400;\">vocabulary<\/span><span style=\"font-weight: 400;\">(corpus)<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 sentence_words\u00a0 <\/span><span style=\"font-weight: 400;\">=<\/span><span style=\"font-weight: 400;\"> words <\/span><span style=\"font-weight: 400;\">=<\/span> <span style=\"font-weight: 400;\">word_tokenize<\/span><span style=\"font-weight: 400;\">(sentence.<\/span><span style=\"font-weight: 400;\">lower<\/span><span style=\"font-weight: 400;\">())<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 bag_of_words <\/span><span style=\"font-weight: 400;\">=<\/span><span style=\"font-weight: 400;\"> np.<\/span><span style=\"font-weight: 400;\">zeros<\/span><span style=\"font-weight: 400;\">(<\/span><span style=\"font-weight: 400;\">len<\/span><span style=\"font-weight: 400;\">(vocab))<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 <\/span><span style=\"font-weight: 400;\">for<\/span><span style=\"font-weight: 400;\"> w_in_sentence <\/span><span style=\"font-weight: 400;\">in<\/span><span style=\"font-weight: 400;\"> sentence_words :<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 \u00a0 \u00a0 <\/span><span style=\"font-weight: 400;\">for<\/span><span style=\"font-weight: 400;\"> i,w <\/span><span style=\"font-weight: 400;\">in<\/span> <span style=\"font-weight: 400;\">enumerate<\/span><span style=\"font-weight: 400;\">(vocab) 
:<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 <\/span><span style=\"font-weight: 400;\">if<\/span><span style=\"font-weight: 400;\"> w <\/span><span style=\"font-weight: 400;\">==<\/span><span style=\"font-weight: 400;\"> w_in_sentence :<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 bag_of_words[i] <\/span><span style=\"font-weight: 400;\">+=<\/span> <span style=\"font-weight: 400;\">1<\/span><\/span><\/p><\/td><\/tr><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\"> \u00a0 <\/span><span style=\"font-weight: 400;\">return<\/span><span style=\"font-weight: 400;\"> bag_of_words<\/span><\/span><\/p><\/td><\/tr><\/tbody><\/table><p><span style=\"color: #000000;\"><a style=\"color: #000000;\" href=\"https:\/\/gist.github.com\/Papaass\/25cfd5c3bbfca0c5dd68137e60c1dfb2#file-bag_of_words_fr-py\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">bag_of_words_fr.py <\/span><\/a><span style=\"font-weight: 400;\">hosted with <\/span><span style=\"font-weight: 400;\"> by <\/span><a style=\"color: #000000;\" href=\"https:\/\/github.com\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">GitHub<\/span><\/a><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">After testing the bagofwords function with both sentences, we got the following results:<\/span><\/p><table><tbody><tr><td>\u00a0<\/td><td><p><span style=\"color: #000000;\"><span style=\"font-weight: 400;\">print<\/span><span style=\"font-weight: 400;\">(<\/span><span style=\"font-weight: 400;\">bagofwords<\/span><span style=\"font-weight: 400;\">(phrase_1,corpus))<\/span><\/span><\/p><\/td><\/tr><\/tbody><\/table><p><span style=\"color: #000000;\"><a style=\"color: #000000;\" 
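The one-hot step above can be sketched in a self-contained way. The toy vocabulary and sentence below are assumptions standing in for the article's stems from the Homer corpus (which require the earlier process_data pipeline):

```python
# One-hot encode each word of a sentence against a fixed vocabulary.
# Toy vocabulary: an assumption, standing in for the Homer-corpus stems.
vocab = ["odysseus", "sing", "muse", "sea", "wine", "dark"]

def one_hot(word, vocab):
    # a vector of len(vocab) zeros with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

sentence = ["sing", "muse", "sea"]
encoded = [one_hot(w, vocab) for w in sentence]
for word, vec in zip(sentence, encoded):
    print(word, vec)
```

Each row contains exactly one 1, at the position of the word in the vocabulary, which is why a sentence of n words costs n × |vocab| values under this scheme.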
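For comparison, here is a minimal dependency-free sketch of the same bag-of-words logic over the article's French corpus. Plain str.split on lowercased text stands in for NLTK's word_tokenize, and the accent is dropped from "paraitre" to keep the example ASCII-only; both are simplifying assumptions:

```python
# Minimal bag-of-words: one count vector per sentence over a shared vocabulary.
# Tokenization is plain str.split on lowercased text, not NLTK's word_tokenize.
corpus = ["La vie est courte mais la vie peut paraitre longue",
          "La nuit est proche"]

def vocabulary(corpus):
    voc = []
    for sentence in corpus:
        for w in sentence.lower().split():
            if w not in voc:          # keep first-seen order, no duplicates
                voc.append(w)
    return voc

def bag_of_words(sentence, corpus):
    voc = vocabulary(corpus)
    counts = [0] * len(voc)
    for w in sentence.lower().split():
        if w in voc:
            counts[voc.index(w)] += 1
    return counts

print(bag_of_words(corpus[0], corpus))
print(bag_of_words(corpus[1], corpus))
```

Unlike one-hot encoding, each sentence maps to a single vector of word counts, so "la" and "vie" show up as 2 in the first sentence's vector while word order is discarded.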
href=\"https:\/\/gist.github.com\/Papaass\/25cfd5c3bbfca0c5dd68137e60c1dfb2#file-phrase1-py\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">phrase1.py<\/span><\/a><span style=\"font-weight: 400;\"> hosted by <\/span><a style=\"color: #000000;\" href=\"https:\/\/github.com\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">GitHub<\/span><\/a><\/span><\/p><pre><code>print(bagofwords(phrase_2, corpus))<\/code><\/pre><p><span style=\"color: #000000;\"><a style=\"color: #000000;\" href=\"https:\/\/gist.github.com\/Papaass\/25cfd5c3bbfca0c5dd68137e60c1dfb2#file-phrase_2-py\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">phrase_2.py<\/span><\/a><span style=\"font-weight: 400;\"> hosted by <\/span><a style=\"color: #000000;\" href=\"https:\/\/github.com\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">GitHub<\/span><\/a><\/span><\/p><p><span style=\"font-weight: 400; color: #000000;\">In this episode, we presented two approaches for translating text data into a form a computer can understand. <\/span><b>One Hot Encoding<\/b><span style=\"font-weight: 400; color: #000000;\"> and <\/span><b>Bag of Words<\/b><span style=\"font-weight: 400; color: #000000;\"> are two simple techniques, but they remain useful in the kingdom of <\/span><a style=\"color: #000000;\" href=\"https:\/\/www.ibm.com\/fr-fr\/think\/topics\/natural-language-processing\" target=\"_blank\" rel=\"noopener\"><b>Natural Language Processing (NLP)<\/b><\/a><span style=\"font-weight: 400; color: #000000;\">.<\/span><\/p><p data-start=\"1538\" data-end=\"1889\"><span style=\"color: #000000;\">This comparison of <strong data-start=\"1566\" data-end=\"1631\">One Hot Encoding and Bag of Words<\/strong> should help you choose the technique best suited to your NLP needs.<\/span><\/p><p data-start=\"1538\" data-end=\"1889\"><span style=\"color: #000000;\">Do not hesitate to read our article on <strong><a style=\"color: #000000;\" href=\"https:\/\/blog.baamtu.com\/en\/techniques-de-pretraitement-en-traitement-du-langage-naturel\/\" target=\"_blank\" rel=\"noopener\">Preprocessing techniques in Natural Language Processing<\/a>.<\/strong><\/span><\/p><p data-start=\"1538\" data-end=\"1889\"><span style=\"color: #000000;\">For more information on AI, see our <a style=\"color: #000000;\" href=\"https:\/\/blog.baamtu.com\/en\/foire-aux-questions-faq-ia\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-start=\"51\" data-end=\"91\">AI FAQ<\/a>.<\/span><\/p><p data-start=\"1538\" data-end=\"1889\"><span style=\"color: #000000;\"><strong><a style=\"color: #000000;\" href=\"https:\/\/tally.so\/r\/3j7r1Q?AT=Onehotencoding\/bagwords\" target=\"_blank\" rel=\"noopener\">To make sure you do not miss our next publications on this topic, click here!<\/a><\/strong><\/span><\/p><p data-start=\"1538\" data-end=\"1889\">\u00a0<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>","protected":false},"excerpt":{"rendered":"<p>One Hot Encoding et Bag of Words : Comprendre les diff\u00e9rences. 
Dans cet article, nous analysons deux m\u00e9thodes populaires de [&hellip;]<\/p>","protected":false},"author":4,"featured_media":3657,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"disabled","ast-breadcrumbs-content":"","ast-featured-img":"disabled","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3324","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>One Hot Encoding et Bag of Words : Comprendre les diff\u00e9rences - Blog de Baamtu<\/title>\n<meta name=\"description\" content=\"D\u00e9couvrez les diff\u00e9rences entre One Hot Encoding et Bag of Words en NLP, avec une impl\u00e9mentation en Python.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" 
href=\"https:\/\/blog.baamtu.com\/en\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"One Hot Encoding et Bag of Words : Comprendre les diff\u00e9rences - Blog de Baamtu\" \/>\n<meta property=\"og:description\" content=\"D\u00e9couvrez les diff\u00e9rences entre One Hot Encoding et Bag of Words en NLP, avec une impl\u00e9mentation en Python.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.baamtu.com\/en\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/\" \/>\n<meta property=\"og:site_name\" content=\"Blog de Baamtu\" \/>\n<meta property=\"article:published_time\" content=\"2025-03-07T15:24:31+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-29T06:24:44+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding.png\" \/>\n\t<meta property=\"og:image:width\" content=\"2240\" \/>\n\t<meta property=\"og:image:height\" content=\"1260\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Baamtu\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Baamtu\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/\"},\"author\":{\"name\":\"Baamtu\",\"@id\":\"https:\/\/blog.baamtu.com\/#\/schema\/person\/13e4a1b7b2e6f6435ae42c1469d97aa7\"},\"headline\":\"One Hot Encoding et Bag of Words : Comprendre les diff\u00e9rences\",\"datePublished\":\"2025-03-07T15:24:31+00:00\",\"dateModified\":\"2025-04-29T06:24:44+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/\"},\"wordCount\":1996,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/blog.baamtu.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding.png\",\"articleSection\":[\"Blog\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/\",\"url\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/\",\"name\":\"One Hot Encoding et Bag of Words : Comprendre les diff\u00e9rences - Blog de 
Baamtu\",\"isPartOf\":{\"@id\":\"https:\/\/blog.baamtu.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding.png\",\"datePublished\":\"2025-03-07T15:24:31+00:00\",\"dateModified\":\"2025-04-29T06:24:44+00:00\",\"description\":\"D\u00e9couvrez les diff\u00e9rences entre One Hot Encoding et Bag of Words en NLP, avec une impl\u00e9mentation en Python.\",\"breadcrumb\":{\"@id\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#primaryimage\",\"url\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding.png\",\"contentUrl\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2025\/03\/One-hot-encoding.png\",\"width\":2240,\"height\":1260},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.baamtu.com\/one-hot-encoding-et-bag-of-words-comprendre-les-differences\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.baamtu.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"One Hot Encoding et Bag of Words : Comprendre les diff\u00e9rences\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.baamtu.com\/#website\",\"url\":\"https:\/\/blog.baamtu.com\/\",\"name\":\"Blog de 
Baamtu\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/blog.baamtu.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.baamtu.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/blog.baamtu.com\/#organization\",\"name\":\"Blog de Baamtu\",\"url\":\"https:\/\/blog.baamtu.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.baamtu.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2024\/04\/cropped-logo_baamtu.png\",\"contentUrl\":\"https:\/\/blog.baamtu.com\/wp-content\/uploads\/2024\/04\/cropped-logo_baamtu.png\",\"width\":674,\"height\":158,\"caption\":\"Blog de Baamtu\"},\"image\":{\"@id\":\"https:\/\/blog.baamtu.com\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.baamtu.com\/#\/schema\/person\/13e4a1b7b2e6f6435ae42c1469d97aa7\",\"name\":\"Baamtu\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.baamtu.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5732172cc11a69b9c95d9d577e1fad6f?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5732172cc11a69b9c95d9d577e1fad6f?s=96&d=mm&r=g\",\"caption\":\"Baamtu\"},\"url\":\"https:\/\/blog.baamtu.com\/en\/author\/baamtu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
[Post metadata: "One Hot Encoding and Bag of Words: Understanding the Differences", by Baamtu, published 2025-03-07, last updated 2025-04-29. Description: Discover the differences between One Hot Encoding and Bag of Words in NLP, with a Python implementation.]