In this article I will give an introduction to the Bag of Words implementation in OpenCV 2.2.

The Bag of Words model for classification

The Bag of Words (BoW) model originated in natural language processing. It makes the simplifying assumption that the order of the words in a sentence or text document is of negligible importance for classifying it. To describe a document with the BoW model, the first step is to create a dictionary containing a large number of words relevant to the application domain. If the goal is to classify conference papers, this could be done by analyzing hundreds or thousands of conference papers (the training set) from the different classes the classifier should be able to distinguish, looking for relevant words. Relevant words are, of course, words which have a high probability of occurring in one class of texts and a low probability of occurring in the others. Once the dictionary has been built, a document can be described in terms of word counts: the vector describing the document has a length equal to the number of words in the dictionary, and each dimension holds the number of occurrences of a certain word in the document.

A similar simplifying assumption can be made for images: the spatial relationship of the “visual words” in an image is of negligible importance. The “visual words” are numeric descriptions of certain regions of the image. Such regions could be blobs, corners or T-junctions, basically anything that stands out and is unlikely to change too much under image transformations like scaling and rotation. We will call these highly structured regions “key points” and their numerical descriptions “features” or “visual words” in this article. A popular image feature detector and descriptor is SURF, which is also implemented in OpenCV. Since the features are vectors of real numbers, it does not make sense to construct a dictionary directly from all the features obtained from the training set (as would be done in NLP), because that dictionary would most likely be of overwhelming size. An intermediary step is needed: finding a limited number of feature vectors which represent the feature space well, and using those as the dictionary. This is often done by k-means clustering, an iterative algorithm for finding clusters in data. After the dictionary has been constructed, new images can be described by extracting features from them and matching each feature with the closest feature in the dictionary. The method has been successfully applied to object categorization. For a more scientific introduction please see “Visual Categorization with Bags of Keypoints” by Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski and Cédric Bray, 2004.

The BoW model in OpenCV

The first thing to do when using a BoW model to represent images is to construct a dictionary. The OpenCV class which takes care of this is BOWKMeansTrainer. To use it you first have to extract features from a set of images; please see the next section on key point identification and feature extraction for details on how to use the feature extractors implemented in OpenCV. Here is example code for using the class, in which we assume that the features have already been extracted.

Mat featuresUnclustered;
//extract feature row vectors from a set of images from your
//problem domain and collect them in featuresUnclustered

//Construct BOWKMeansTrainer
int dictionarySize=1000;
TermCriteria tc(CV_TERMCRIT_ITER,100,0.001);
int retries=1;
int flags=KMEANS_PP_CENTERS;

BOWKMeansTrainer bowTrainer(dictionarySize,tc,retries,flags);

//cluster the feature vectors
Mat dictionary=bowTrainer.cluster(featuresUnclustered);

The BOWKMeansTrainer class is basically just a wrapper around the k-means algorithm, and so its parameters determine the number of cluster centers (the dictionary size), the termination criterion for the algorithm, how often k-means is to be tried with different initial cluster centers, and which k-means variant to use. The default for the last parameter is KMEANS_PP_CENTERS, which refers to the kmeans++ algorithm.
The k-means clustering problem is to choose k cluster centers such that a cost function based on the distances of the data points in a cluster to their respective cluster center is minimized. Finding a global minimum is NP-hard and therefore infeasible in practice even for modest numbers of data points. Practical k-means algorithms instead try to find a local minimum by an iterative optimization procedure. The quality of the solution found by these algorithms depends on the initial cluster centers and can be arbitrarily bad for bad initial guesses. The kmeans++ algorithm addresses this problem by using a heuristic for choosing good initial cluster centers, which makes it a better choice than the standard k-means algorithm.

After creating the BOWKMeansTrainer object we call the cluster(const Mat& descriptors) method to run k-means on the descriptors and obtain the cluster centers.
This completes the first step: the dictionary is now ready to be used to encode images.

The OpenCV implementation of BoW separates the encoding task into three subtasks, represented by three interfaces: the FeatureDetector identifies key points, the DescriptorExtractor extracts features from the key points found by the FeatureDetector, and the DescriptorMatcher matches those features to the features in the dictionary to construct the BoW representation of the image.

OpenCV implements a couple of popular feature detectors and descriptors. For the detector, implementations of SIFT, SURF, FAST, GFTT (good features to track), MSER and HARRIS are available, and there are also a couple of adapters one can use to change the behavior of the key point detectors: for example the “Dynamic” adapter, which adjusts a detector-specific detection threshold until enough key points are found in an image, or the “Pyramid” adapter, which constructs a Gaussian pyramid to detect points on multiple scales. The “Pyramid” adapter is useful for feature descriptors which are not scale invariant. For the DescriptorExtractor interface, implementations of SIFT, SURF and BRIEF are available, as well as an adapter to run the descriptors on all the color channels of an image in opponent color space. The DescriptorMatcher comes in the varieties “FlannBased”, “BruteForce”, “BruteForce-L1” and “BruteForce-HammingLUT”. The “FlannBased” matcher uses the FLANN (Fast Library for Approximate Nearest Neighbors) library under the hood to perform faster but approximate matching; the “BruteForce-*” versions exhaustively search the dictionary to find the closest match for an image feature.

All these interfaces have static factory methods named create which take a string as an argument and return a smart pointer to one of their implementations. The string has to be of the format “[Adapter]FeatureName”, where “Adapter” stands for “Grid”, “Pyramid” or “Dynamic” for the FeatureDetector and “Opponent” for the DescriptorExtractor. “FeatureName” can be any of the methods listed above. The BOWImgDescriptorExtractor class ties it all together:

Ptr<DescriptorMatcher> matcher = DescriptorMatcher::create("FlannBased");
Ptr<DescriptorExtractor> extractor = DescriptorExtractor::create("OpponentSURF");
Ptr<FeatureDetector> detector = FeatureDetector::create("DynamicSURF");

BOWImgDescriptorExtractor bowDE(extractor,matcher);
//Set the dictionary we created in the first step
bowDE.setVocabulary(dictionary);

Now the BOWImgDescriptorExtractor is ready to encode images. To do so we use the FeatureDetector object to find key points in an image.

Mat img=imread("path/to/img/img.jpg");
vector<KeyPoint> keypoints;
//Detect key points in the image
detector->detect(img,keypoints);
//Create the BoW representation of the image
Mat bowDescriptor;
bowDE.compute(img,keypoints,bowDescriptor);

The code is pretty much self-explanatory. The bowDescriptor matrix contains the BoW representation of the image, which can now be used with a classification algorithm like the normal Bayes classifier to classify images; I will cover this in the next article.