
Before diving into feature extractors, we need to understand how to represent an image without losing the information inherent in it. One possible option is vectorizing the image pixels.

image (towels)

But the main flaw of this raw-pixel approach is that it is too sensitive to all kinds of variations, including shifting. Also, as the volume of the data space grows with the dimension of the data, the amount of data needed to cover this space in a statistically significant way increases exponentially, which is known as the curse of dimensionality.
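As a rough illustration (using a synthetic NumPy array in place of a real image), flattening the pixel grid already shows how large the vector gets and how strongly a one-pixel shift changes it:

```python
import numpy as np

# A hypothetical 256x256 grayscale image (synthetic values for illustration).
image = np.random.randint(0, 256, size=(256, 256), dtype=np.uint8)

# Vectorizing the pixels: flatten the 2-D grid into one long vector.
x = image.reshape(-1)
print(x.shape)  # (65536,) -- already tens of thousands of dimensions

# Shifting the image by a single pixel yields a very different vector,
# even though the content is essentially unchanged.
shifted = np.roll(image, shift=1, axis=1)
x_shifted = shifted.reshape(-1)
print((x != x_shifted).mean())  # large fraction of entries differ
```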

Another approach is to build a color histogram based on the color distribution. It divides the color space into disjoint bins, defined over the intensity values of the pixels, and counts the number of pixels falling into each bin.

This is better in terms of dimensionality: even if the image size increases, only the bin counts grow, since the intensity values lie in a fixed range. Another benefit is that even if we shift or rotate the image a little, the histogram representation remains the same.

But it is still sensitive to color changes, obviously. It also ignores the local structure of the image, because it discards the spatial arrangement of the pixels when counting colors. Hence, it may not be appropriate for handling partially observed data. It also cannot capture shape and texture, which are important for characterizing object appearance.
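A minimal sketch of the color (intensity) histogram idea, assuming a grayscale image stored as a NumPy array and 32 intensity bins:

```python
import numpy as np

def intensity_histogram(image, n_bins=32):
    """Count pixels falling into each disjoint intensity bin (normalized)."""
    # Pixel intensities lie in a fixed range [0, 255], so the histogram has
    # n_bins dimensions regardless of the image size.
    hist, _ = np.histogram(image, bins=n_bins, range=(0, 256))
    return hist / hist.sum()

image = np.random.randint(0, 256, size=(256, 256), dtype=np.uint8)
h = intensity_histogram(image)
print(h.shape)  # (32,) -- fixed-size representation

# Shifting the image does not change the histogram at all.
print(np.allclose(h, intensity_histogram(np.roll(image, 5, axis=1))))  # True
```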

Instead of these two approaches, let's consider local image features.

We represent the image with local parts of the image that are worth describing and that are sufficient to represent the whole image.

image -> BoW

This is because local image features are robust to pose variation and partial occlusion. One popular application of local image features is panorama image stitching.

So, to represent an image with local features, we need to answer: what is a good local feature?

  • ~~
  • ~~
  • saliency => be unambiguous for matching

Saliency is important: an interesting part of the image worth describing is called an interest point.

So, we will discuss:

  • How to detect good local features
  • How to describe the local features
  • How to represent an image using the local features
  • How to build the classification system using the features

In this post, we are going to discuss image representation using local features.

Image representation

One thing we have to consider is that we may discover a different number of blobs for different images. Also, even within a single image, we may find very few blobs depending on the viewpoint.

So, we eventually want to represent every image as a vector of the same size. In the end, we want to build a model that classifies images using these fixed-dimensional vectors.

Bag-of-visual-Words (BoW)

The idea of Bag-of-visual-Words is that, given an input image, we want to represent it as a sort of histogram of local features. In the case of face images, these can be facial features like eyes, and we represent the image as a histogram of how many of these features are discovered in the image.

Origin: texture recognition

This idea had already been explored long before in the texture classification domain. Texture is characterized by the repetition of basic elements called textons. Here are some examples of textures.

image

If you break down a texture, you can see the components that build it, and by exploiting this property, you may be able to find a representation of the texture that is distinguishable between different images.

image

The benefit of such an approach is that it is more robust against different variations compared to the raw pixel representation.

BoW: general image classification

BoW follows the same idea. We first find and collect components that compose the entire training data, called the visual words or the **code words**, and then represent each image as a histogram over these words. This representation can then serve downstream tasks like classification.

image

Quantization (codeword construction)

Hence, one of the important factors of BoW is building appropriate codewords and deciding how to represent the image with them.

In the last post, we learned about blobs and SIFT as local features for images. So, first, we can extract blob keypoints and SIFT descriptors from the given image.
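For example, with OpenCV (assuming a build where SIFT is available and a hypothetical image file `example.jpg`), blob keypoints and their SIFT descriptors can be extracted roughly like this:

```python
import cv2

# Load a grayscale image (hypothetical path).
gray = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)

# Detect blob-like interest points and compute a 128-D SIFT descriptor at each.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

print(len(keypoints))     # number of detected interest points (varies per image)
print(descriptors.shape)  # (num_keypoints, 128)
```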

From these features, we can represent each image as a collection of local descriptors and build the codewords over the entire set of training images.

Note that there are some properties we expect of the code words:

  1. The code words should be representative
  2. The code words should be discovered across many different images
  3. Each code word should capture a different aspect of the images

To construct the code words, we first collect all the SIFT features from the entire training set. Then, considering them as data points, we apply clustering to discover representative, repetitive, and dominant patterns across the training images. A simple choice for the clustering algorithm is K-means.

image
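A minimal sketch of codeword construction, assuming `all_descriptors` is the pool of SIFT descriptors collected from every training image (scikit-learn's K-means is used here just as one possible implementation; the codebook size `K` is a design choice):

```python
import numpy as np
from sklearn.cluster import KMeans

# Pool of SIFT descriptors from the whole training set, shape (N, 128).
# A random placeholder stands in for real descriptors here.
all_descriptors = np.random.rand(10000, 128).astype(np.float32)

K = 200  # codebook size: number of visual words
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_descriptors)

# Each cluster center is one codeword (visual word).
codewords = kmeans.cluster_centers_  # shape: (K, 128)
```

In practice the pooled descriptor set is huge, so a scalable variant such as `MiniBatchKMeans` is often used instead.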

Here is some visualization of the learned codewords.

image

Image representation based on BoW

Once we have these code words, we can represent the image as a BoW histogram. For each local feature detected in the image, we find the nearest codeword cluster in the dictionary constructed from the training dataset and assign that cluster to the local feature. After assigning all local features, we finally build a histogram of the assignments.

image
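Continuing the sketch above, a hypothetical `bow_histogram` helper assigns each descriptor of an image to its nearest codeword and counts the assignments:

```python
import numpy as np

def bow_histogram(descriptors, codewords):
    """Assign each local descriptor to its nearest codeword and count."""
    # Squared Euclidean distances between descriptors (M, 128) and codewords (K, 128).
    dists = ((descriptors[:, None, :] - codewords[None, :, :]) ** 2).sum(axis=2)
    assignments = dists.argmin(axis=1)  # index of the nearest codeword per descriptor
    K = codewords.shape[0]
    hist = np.bincount(assignments, minlength=K).astype(np.float32)
    return hist / max(hist.sum(), 1.0)  # normalized, fixed K-dimensional vector

# descriptors: SIFT descriptors of one image; codewords: cluster centers from the
# K-means step above. The result is the BoW representation of that image.
```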

Extending BoW: dense descriptors

However, there are some issues with this representation. Since blobs and SIFT features are robust against scale variation, rotation, illumination change, etc., they are also insensitive to deformation of the objects.

For instance, when we represent the pose of a human body, different poses arise from different positions of the arms and legs, but as a collection of SIFT features they are treated as basically the same.

So, robustness is usually a nice property for a representation, but if it is too insensitive, it may degrade the performance of the algorithm. Also, representing the image globally as a histogram removes all the geometric information originally present in the image.

To retain spatial structure when we convert the image into a histogram, we extract SIFT features at every grid location of the image, without blob detection, and build the histogram by calculating frequencies at multiple granularities.

image

At level 0, we calculate the histogram at the global level. At the next level, we divide the image into finer grids, such as 2 by 2, collect a histogram for each cell, and obtain a total of four histograms at this level. Concatenating all these histograms into one large vector gives the representation of the image.

Since the histograms come from different spatial positions and resolutions, if the object deforms, in other words if it has a different spatial configuration, the representation changes accordingly.
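A sketch of a two-level spatial histogram (levels 0 and 1), assuming we already have densely sampled keypoint positions and their nearest-codeword assignments as in the BoW step above; the tiling and normalization here are simplifications of the full spatial pyramid scheme:

```python
import numpy as np

def spatial_histogram(positions, assignments, image_size, K, levels=(0, 1)):
    """Concatenate per-tile BoW histograms computed at multiple grid granularities.

    positions:   (M, 2) array of (x, y) keypoint locations
    assignments: (M,) index of the nearest codeword for each keypoint
    image_size:  (width, height) of the image
    K:           codebook size
    """
    width, height = image_size
    hists = []
    for level in levels:
        n = 2 ** level  # level 0 -> 1x1 grid (global), level 1 -> 2x2 grid
        # Which tile each keypoint falls into at this granularity.
        tx = np.minimum((positions[:, 0] * n / width).astype(int), n - 1)
        ty = np.minimum((positions[:, 1] * n / height).astype(int), n - 1)
        for i in range(n):
            for j in range(n):
                in_tile = (tx == i) & (ty == j)
                h = np.bincount(assignments[in_tile], minlength=K).astype(np.float32)
                hists.append(h / max(h.sum(), 1.0))
    return np.concatenate(hists)  # K dims from level 0 + 4K dims from level 1
```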

One may extract features either sparsely or densely and then build a spatial histogram, but the problem is that with sparse features some of the tiles may not contain any interest points. So we usually extract dense keypoints for the spatial histogram. The following figure summarizes the building process of the spatial histogram:

image
