Generating and Applying a Bag of Visual Words Model for Image Classification

Generating and Applying a Bag of Visual Words Model for Image Classification

Last year in June of 2016, I was hired at DigitalGlobe to provide data science support to a team located in Herndon, Virginia. The majority of my previous academic and work experience has been in the fields of image processing, computer vision, and remote sensing. I was excited and eager to make a career-transition as a full-time data scientist, particularly in the area of machine learning applied to Satellite imagery. So to ease my transition, I wanted to spend my first few months implementing a machine learning classifier that would classify DigitalGlobe Satellite imagery of Airplanes, and also make use of our company’s latest machine library software product called DeepCore. DeepCore is written in modern C++ (C++ 11), and it allows the user to perform either image classification or object detection, manipulate geospatial vector files, and download DigitalGlobe’s vast amounts of Satellite imagery, or even import and process your own imagery.

I had previously overheard a colleague talk about a computer vision algorithm called bag of words that could be used to classify images in conjunction with using a machine learning classifier, and so I decided that implementing the bag of words algorithm would be the first project that I would work on at DigitalGlobe. So in part 1 of a two-part series of blog posts, I wanted to first lay out the fundamental concepts and theory behind one of the most important computer vision classification algorithms, and in my second blog post (i.e., Part 2), I will cover how to implement the algorithm in OpenCV and C++. Furthermore, I will also highlight how to use the sliding window functionality and non-maximum suppression algorithm available in the DeepCore library to classify an input image with great success.

The notion of “bag of words” for document classification relates directly to the notion that documents can be searched for in a database by modeling the frequency of occurrence versus word “ordering”. Therefore, if all of the words in a document could somehow be poured into a bag, the bag of words model would be able to classify a document by its corresponding histogram of word counts, with no concern about word ordering (see Figure 1 below).

Figure 1. Bag of words is from the field of text analysis and text-based search engines.

Similarly, in computer vision, the goal is to represent each image with a bag of visual words, whereby instead of working with keywords, our “words” are key image patches and associated feature vectors as shown in Figure 2 below.

Figure 2. In Computer Vision, the bag of “words” model is analogous to a bag of “visual” words or key image patches.

Then, the next step in the bag of visual words modeling process is to create a histogram of the frequency of key image patches that represent the image (see Figure 3 below).

A bag of “visual” words for each of the images shown above of a woman, bicyle, and a violin, can be represented by a histogram of the “visual words” or key image patch occurrences using a given dictionary.

The histogram of visual words per image category shown above in Figure 3 relies on a given dictionary that was created from many training samples. So for a given image category (violin), we first apply various feature detectors (e.g., keypoints such as corner and edges) and descriptors (e.g., unique numeric encoding around each keypoint) such as SURF and SIFT to obtain the most pertinent image features. We then cluster the entire set of feature extractors using the clustering algorithm K-means. The center or mean of each cluster represents a visual word, the “K” in K-means represents the number of words in your dictionary, and the collection of visual words in turn represents our dictionary. Now that we have created our dictionary, also commonly referred to as your vocabulary or codebook, we can then classify an image. So given say an input image of a violin, we use our dictionary of visual words to encode all of the features extracted from the input image into a histogram of visual word occurrences (see Figure 4 below).

Figure 4. The input image of a violin has a bag of visual words histogram that closely resembles the category of class “Violin”, and is assigned the label “Violin” with a probability of 0.92.

In Figure 4, the bag of visual words (BoVW) modeling process consists of first extracting keypoint descriptors from the input image on a grid or around detected keypoints. Next, for each descriptor extracted compute its nearest neighbor in the dictionary. Then, we can build a histogram of length K, where the nth-bin represents the frequency of occurrence of the nth-dictionary word.

In the next few sections, I will delve more into the nuts and bolts on how to construct a bag of visual words model for image classification.

Creating and then classifying imagery using a BoVW classifier consists of the following three steps:

  • Create a BoVW Dictionary from training imagery
  • Train a Support Vector Machine using both positive and negative object instances
  • Classify test imagery

Creating a Bag of Visual Words Dictionary from Training Imagery

Since we are classifying images versus documents, our bag of (visual) words model needs to extract the most salient or key image patches within an images. These typically would include corner points, edges, and flat regions. Fortunately, OpenCV, an open-source computer vision library, contains several types of keypoint detectors and feature descriptors that we can use in either our C++ or python application. After some preliminary experiments, I settled on using the combination of the Fast Hessian (i.e., SURF) keypoint detector and the feature descriptor SIFT. Both the detector and the descriptor selected are robust to scale, rotation, and illuminations changes, which is why I chose to use them for detecting and describing key shape features in an image (see Figure 5 below).

Figure 5. In constructing a bag of “visual” words model, the first step is to extract many feature vectors for each input image.

Next, we construct our “visual word” dictionary by applying the K-means clustering algorithm from the entire collection of feature vectors that were extracted from the entire training set of images. Then, the cluster centers along with user-defined dictionary size “K” represent our generated dictionary of visual words.

Train a Support Vector Machine using Positive and Negative Object Instances

All of the training and testing objects that I used in constructing my BoVW model consisted of only two categories: 1) Fighter objects (see Figure 6) and 2) Clutter objects of both natural and man-made objects (see Figure 7). I trained on an augmented dataset consisting of 1,877 fighter objects and 1,877 clutter objects, and then tested the performance of the BoVW classifier on 892 fighter and 892 clutter objects that were not used in the training phase. The training chips varied in pixel dimensions, and I did not perform any resizing, normalizing, or demeaning of the original training images. I found the best success from my preliminary experiments using the keypoint detector/descriptor combinations of SURF/SIFT and BRISK/SIFT.

Figure 6. Some Fighter Aircraft examples that we want to detect represents the positive training image samples.

Below in Figure 7 are the negative training image samples consisting of both natural and man-made objects (e.g., cars and buildings).

Figure 7. The background clutter consisting of both natural and man-made objects represent the negative training image samples.

Finally, I used OpenCV’s support vector machine functionality to determine the best hyperplane that separated my two classes of fighter aircraft and clutter objects.

Classify the Test Imagery

Below are the Precision and Recall results for my binary classifier consisting of the only the two classes of fighter – clutter objects shown below in Table 1.

Table 1. The Precision-Recall results for the keypoint detector/descriptor combinations of SURF/SIFT and BRISK/SIFT.

The winning keypoint detection/descriptor was using the BRISK keypoint detector, and the SIFT keypoint descriptor, since they had the highest precision-recall values. Below in Figure 8 is a visualization of three words in our “Fighter” dictionary.

Figure 8. This is a visualization of 16 image patches of 3 different visual words in our “Fighter” dictionary that contains over 300 visual words. Note that each of these selected words represent diverse fighters in various orientations.


The Precision – Recall results shown above in Table 1 shows that the BoVW model is useful in classifying aircraft satellite imagery for detecting fighter aircraft. All of the training samples were extracted from DigitalGlobe imagery, and both the training and test sets contained objects of varying resolution and images size. I used OpenCV’s support vector machine’s hyperparameter tuning functionality to maximize the performance of the BoVW classifier. Future work will focus on using all of the eight bands available in World View-3 Imagery to see if using all of the multi-spectral bands will further improve the performance of the classifier.

Comments: 1

  1. Pretty cool article, thanks for sharing your experience. I’m also working on an algorithm based on the the bag of visual words paradigm. Was curious if you ever compared it with neural networks performance wise?

Add your comment