Weakly Supervised Learning of Visual Semantic Attributes

David Hanwell

Many of the adjectives we use in everyday language denote visual attributes of objects, for example colours (‘red’, ‘ruby’, ‘vermilion’), patterns (‘stripy’, ‘chequered’), and others (‘leopard skin’, ‘dark’). We refer to these as “visual semantic attributes”: they define associations between concepts of thought and language (semantic) and the visual appearance of objects in our physical world (visual).

Automating the recognition of such attributes in photos and video would allow rich textual descriptions to be generated, especially when combined with object detection. This would be useful for various applications: image archiving, generating descriptions for products from their photos (for example on eBay), and perhaps even surveillance (for example “find the man in the blue stripy shirt in these CCTV feeds”). Typically, models of such attributes are trained using image data which has been specifically chosen or acquired for the task, and has been manually labelled, annotated and segmented.

There are already billions of images available on the internet, with many more being added every day. Although these images have not been manually annotated or segmented, many of them have text associated with them; for example, items for sale on eBay have both an image and a text description. We call such images ‘weakly labelled’: each has a corresponding set of words, some of which may describe some, all or none of the image content.
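
As a minimal sketch of what ‘weakly labelled’ means in practice, the snippet below pairs each image with the set of words found in its accompanying text, and selects candidate training images for a term by simple word match. The `WeakImage` record, the helper names, the file names and the descriptions are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeakImage:
    """Hypothetical record: an image path plus the words from its associated text."""
    path: str
    words: frozenset


def make_weak_image(path, description):
    # Lower-case and split on whitespace; real text would need better tokenisation.
    return WeakImage(path, frozenset(description.lower().split()))


def candidates_for(term, images):
    """Images whose text mentions the term; the pixels may still not contain it."""
    return [img for img in images if term.lower() in img.words]


listings = [
    make_weak_image("item1.jpg", "Blue stripy cotton shirt"),
    make_weak_image("item2.jpg", "Leather sofa in dark brown"),
    # Mentions 'blue' in its text, yet the photo may show nothing blue at all.
    make_weak_image("item3.jpg", "Blue Moon vinyl record"),
]
blue_candidates = candidates_for("blue", listings)
```

The word match only yields candidates: nothing in the text says which pixels, if any, show the attribute.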

The question we are trying to answer is: Can we use weakly labelled images to train computer models to recognise visual semantic attributes?

There would be several potential benefits to this approach:

  • Without the need to acquire bespoke images, or to manually label, annotate or segment them, the process of learning an attribute would be faster, easier and cheaper.
  • Since the training data would be acquired from many different sources, and the corresponding text written by many different people, the resulting models would be unbiased by any one person’s notion of an attribute.
  • Due to the huge range of images available, with different objects, lighting, and variations on each attribute, each model may better capture possible variation of an attribute, and thus be better able to recognise it in novel images.

Weakly Supervised Learning of Semantic Colour Terms

In order to take advantage of such easily obtainable data, there are a number of challenges which must be overcome. If we acquire a set of images from a web search, using an attribute term, not all of the returned images will contain the attribute, and even in those which do, there will be regions which do not. Here are some examples of images returned from a search for ‘Blue’:

Training data for the word 'blue'

We have proposed a method of learning colour terms from weakly labelled data such as this. Below are some examples of images from the test set of [1], which have been segmented into regions labelled with semantic colour terms using the proposed method. This work has been accepted for publication in IET Computer Vision [2].

Images from dataset segmented into regions of colour by the system

QUAC: Quick Unsupervised Anisotropic Clustering

We would like to extend the above work on the weakly supervised learning of colour terms to more complex attributes using other features. To do this, we propose to first cluster the features extracted from each image, before performing weakly supervised learning using parameterisations of the clusters. This way the volume of data per image will be greatly reduced, allowing many more images to be used for learning.

To this end, we propose QUAC – Quick Unsupervised Anisotropic Clustering. The algorithm works by finding elliptical (or hyper-elliptical) clusters one at a time, removing the data corresponding to each cluster after it is found. It has several advantages over other clustering algorithms:

  • It does not assume isotropy, and is capable of finding highly elongated clusters.
  • It is unsupervised, having only a single parameter, and not requiring the number of clusters to be set.
  • Its complexity is linear in the number of data points, resulting in good performance on larger data sets.
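
The general idea of finding one elliptical cluster at a time and removing its data can be sketched as below. This is not the published QUAC algorithm: the seeding rule, the Mahalanobis gate `gate_sq`, the `min_size` threshold and all function names are assumptions made for illustration, and this simplified version has several parameters and is not linear in the number of data points.

```python
import math


def fit_gaussian(points):
    """Mean and covariance entries (sxx, sxy, syy) of a list of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, mean, cov, eps=1e-9):
    """Squared Mahalanobis distance; eps keeps the covariance invertible."""
    sxx, sxy, syy = cov[0] + eps, cov[1], cov[2] + eps
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    det = sxx * syy - sxy * sxy
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def greedy_elliptical_clusters(points, seed_radius, gate_sq=9.0,
                               min_size=5, max_iters=50):
    """Extract elliptical clusters one at a time (assumed, simplified scheme)."""
    remaining = list(points)
    clusters = []
    while len(remaining) >= min_size:
        seed = remaining[0]
        members = [p for p in remaining
                   if math.hypot(p[0] - seed[0], p[1] - seed[1]) <= seed_radius]
        if len(members) < min_size:
            remaining.pop(0)  # too isolated: treat the seed as noise
            continue
        # Alternate between fitting an ellipse and re-selecting the points inside it.
        for _ in range(max_iters):
            mean, cov = fit_gaussian(members)
            grown = [p for p in remaining
                     if mahalanobis_sq(p, mean, cov) <= gate_sq]
            if len(grown) < min_size or grown == members:
                break
            members = grown
        mean, cov = fit_gaussian(members)
        clusters.append((mean, cov, len(members)))
        found = set(members)
        remaining = [p for p in remaining if p not in found]  # remove the cluster's data
    return clusters


# Two well-separated, highly elongated synthetic clusters (hypothetical test data).
line_a = [(float(i), 0.5 * i + (0.2 if i % 2 == 0 else -0.2)) for i in range(50)]
line_b = [(100.0 + i, 80.0 - 0.5 * i + (0.2 if i % 2 == 0 else -0.2)) for i in range(50)]
clusters = greedy_elliptical_clusters(line_a + line_b, seed_radius=10.0)
```

Each returned entry holds a mean, a covariance and a member count, so an elongated cluster is summarised by a handful of numbers rather than by its raw points.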

Below is an image of five pairs of differently coloured trousers against a white background. We have taken the Green and Blue channels of this image to create a 2D data set, also shown below. The five elongated clusters corresponding to the five pairs of trousers are clearly visible, as is the less elongated cluster in the bottom right corner of the feature space, corresponding to the white background. An animation shows how QUAC finds the clusters. This work has been accepted for publication in Pattern Recognition [3]. Source code is available below.

Trouser test data

Trouser clustering result
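
Building the 2D data set described above amounts to keeping only the Green and Blue values of every pixel. The sketch below assumes the image is available as rows of (R, G, B) tuples; the tiny `image` is a hypothetical stand-in for the trouser photograph, and `green_blue_features` is an illustrative name.

```python
def green_blue_features(image):
    """Flatten an image (rows of (R, G, B) tuples) into a list of (G, B) points."""
    return [(g, b) for row in image for (_r, g, b) in row]


# Hypothetical 2x2 image: two white background pixels and two coloured ones.
image = [
    [(255, 255, 255), (200, 30, 40)],
    [(20, 60, 180), (255, 255, 255)],
]
points = green_blue_features(image)
```

White background pixels map to high values in both channels, which is why they form their own compact cluster in the feature space.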

Weakly Supervised Learning of Semantic Pattern Terms

To extend the colour learning work above to more complex attributes, there are several more challenges we must overcome. As well as not knowing which parts of the training images (if any) contain the attribute to be learned, the scale and orientation of the attribute are also unknown, and may vary throughout an image. This is the focus of our current research.


  1. J. van de Weijer, C. Schmid, and J. Verbeek. Learning Color Names from Real-World Images. In CVPR, Minneapolis, MN, 2007.
  2. D. Hanwell and M. Mirmehdi. Weakly Supervised Learning of Semantic Colour Terms. Accepted by IET Computer Vision, 22nd February 2013.
  3. D. Hanwell and M. Mirmehdi. QUAC: Quick Unsupervised Anisotropic Clustering. Accepted by Pattern Recognition, 7th May 2013.
