A Simple Explanation of the Bag-of-Words Model

A quick, easy introduction to the Bag-of-Words model and how to implement it in Python.

The bag-of-words (BOW) model is a representation that turns arbitrary text into fixed-length vectors by counting how many times each word appears. This process is often referred to as vectorization.

Let’s understand this with an example. Suppose we wanted to vectorize the following:

  • the cat sat
  • the cat sat in the hat
  • the cat with the hat

We’ll refer to each of these as a text document.

Step 1: Determine the Vocabulary

We first define our vocabulary, which is the set of all words found in our document set. The only words found in the 3 documents above are: the, cat, sat, in, hat, and with.
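
If you'd like to see this step in code, here's a minimal pure-Python sketch (the docs list below just mirrors the 3 example documents):

docs = [
  'the cat sat',
  'the cat sat in the hat',
  'the cat with the hat',
]

# Collect every unique word, in the order it first appears
vocab = []
for doc in docs:
  for word in doc.split():
    if word not in vocab:
      vocab.append(word)

print(vocab)  # ['the', 'cat', 'sat', 'in', 'hat', 'with']

Keeping first-seen order here gives the same column order as the table in Step 2.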

Step 2: Count

To vectorize our documents, all we have to do is count how many times each word appears:

Document                 the  cat  sat  in  hat  with
the cat sat               1    1    1   0    0    0
the cat sat in the hat    2    1    1   1    1    0
the cat with the hat      2    1    0   0    1    1

Now we have length-6 vectors for each document!

  • the cat sat: [1, 1, 1, 0, 0, 0]
  • the cat sat in the hat: [2, 1, 1, 1, 1, 0]
  • the cat with the hat: [2, 1, 0, 0, 1, 1]
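
The counting step above is just as short in plain Python. A minimal sketch, reusing the vocabulary from Step 1:

docs = [
  'the cat sat',
  'the cat sat in the hat',
  'the cat with the hat',
]
vocab = ['the', 'cat', 'sat', 'in', 'hat', 'with']  # from Step 1

for doc in docs:
  words = doc.split()
  # Count how many times each vocabulary word appears in this document
  print([words.count(word) for word in vocab])

# [1, 1, 1, 0, 0, 0]
# [2, 1, 1, 1, 1, 0]
# [2, 1, 0, 0, 1, 1]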

Notice that when we use BOW, we lose contextual information, e.g. where in the document each word appeared. It’s like a literal bag-of-words: it only tells you what words occur in the document, not where they occurred.
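
A quick way to see this loss: documents with the same words in a different order get identical vectors. For example:

from collections import Counter

# Same words, different order -> the same bag, and thus the same vector
print(Counter('the cat sat on the hat'.split()) ==
      Counter('the hat sat on the cat'.split()))  # True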

Implementing BOW in Python

Now that you know what BOW is, I’m guessing you’ll need to implement it at some point. Here’s my preferred way of doing it, which uses Keras’s Tokenizer class:

from keras.preprocessing.text import Tokenizer

docs = [
  'the cat sat',
  'the cat sat in the hat',
  'the cat with the hat',
]

## Step 1: Determine the Vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
print(f'Vocabulary: {list(tokenizer.word_index.keys())}')

## Step 2: Count
vectors = tokenizer.texts_to_matrix(docs, mode='count')
print(vectors)

Running that code gives us:

Vocabulary: ['the', 'cat', 'sat', 'hat', 'in', 'with']
[[0. 1. 1. 1. 0. 0. 0.]
 [0. 2. 1. 1. 1. 1. 0.]
 [0. 2. 1. 0. 1. 0. 1.]]

Notice that the vectors here have length 7 instead of 6 because of the extra 0 element at the beginning. This is an inconsequential detail - Keras reserves index 0 and never assigns it to any word.
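
If you’d rather not deal with the reserved index, scikit-learn’s CountVectorizer class is a common alternative that produces length-6 vectors directly. A minimal sketch, assuming you have scikit-learn installed:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
  'the cat sat',
  'the cat sat in the hat',
  'the cat with the hat',
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs)  # a sparse matrix of counts

print(vectorizer.get_feature_names_out())  # vocabulary, sorted alphabetically
print(vectors.toarray())

Note that CountVectorizer orders its columns alphabetically by word, so the column order won’t match Keras’s frequency-based ordering.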

How is BOW useful?

Despite being a relatively basic model, BOW is often used for Natural Language Processing (NLP) tasks like Text Classification. Its strengths lie in its simplicity: it’s inexpensive to compute, and sometimes simpler is better when positional or contextual information isn’t relevant.
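
To make the Text Classification point concrete, here’s a short sketch that feeds BOW counts into a Naive Bayes classifier via scikit-learn. The training documents and labels are made-up toy examples, purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy, made-up sentiment data - for illustration only
train_docs = ['great movie', 'loved this film', 'terrible movie', 'awful film']
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)  # BOW count vectors

model = MultinomialNB()
model.fit(X, train_labels)

print(model.predict(vectorizer.transform(['great film'])))  # [1]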

I’ve written a blog post that uses BOW for profanity detection - check it out if you’re curious to see BOW in action!
