Automatic quality improvement of archived film and video

Xiaosong Wang, Majid Mirmehdi

This project is proposed by ITV to develop accurate restoration algorithms for archived film and video. The final aim is to produce an automated quality control and improvement system targeting dirt and blotches in digitized archive films. The system is composed mainly of two modules: defect detection and defect removal. In defect detection, we locate the defects by combining temporal and spatial information across a number of frames.

An HMM is trained for normal observation sequences and then applied within a framework to detect defective pixels. The resulting defect maps are refined in a two-stage false alarm elimination process and then passed over to the defect removal procedure. Each labelled (degraded) pixel is restored in a multiscale framework by first searching for the optimal replacement in its dynamically generated, random-walk-based region of candidate pixel exemplars and then updating all its features (intensity, motion and texture).
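The detection idea can be illustrated with a minimal sketch, assuming a discrete HMM over the quantised intensity of one pixel across frames. The model parameters, the threshold and the function names below are hypothetical and hand-set for illustration; they are not the trained model or the detection framework used in the project. A sequence that the "normal" model explains poorly receives a low likelihood and is flagged.

```python
import numpy as np

# Hypothetical hand-set model: 3 hidden intensity states (dark, mid, bright)
# and 3 quantised observation symbols. A real system would learn these
# parameters from defect-free sequences.
START = np.full(3, 1.0 / 3.0)
TRANS = np.array([[0.90, 0.05, 0.05],
                  [0.05, 0.90, 0.05],
                  [0.05, 0.05, 0.90]])
EMIT = np.array([[0.90, 0.05, 0.05],
                 [0.05, 0.90, 0.05],
                 [0.05, 0.05, 0.90]])


def log_likelihood(obs, start=START, trans=TRANS, emit=EMIT):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    alpha = start * emit[:, obs[0]]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = (alpha @ trans) * emit[:, obs[t]]
        scale = alpha.sum()
        total += np.log(scale)
        alpha = alpha / scale  # rescale to avoid numerical underflow
    return total


def is_defective(obs, threshold=-0.7):
    """Flag a pixel whose per-frame log-likelihood falls below a threshold.

    The threshold value is an assumption chosen for this toy model.
    """
    return log_likelihood(obs) / len(obs) < threshold


normal = [1, 1, 1, 1, 1]   # stable mid-grey pixel over five frames
blotch = [1, 1, 0, 1, 1]   # one-frame dark impulse, typical of dirt
print(is_defective(normal), is_defective(blotch))
```

In this toy setting the single-frame impulse lowers the sequence likelihood sharply, which is the property a likelihood-based detector relies on.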

Finally, the proposed system is compared against state-of-the-art methods to demonstrate improved accuracy in both detection and restoration using synthetic and real degraded image sequences.

Visual saliency

By predicting gaze from multiple cues for open signed video content, coding gains can be achieved without loss of perceived quality. We have developed a face orientation tracker based upon grid-based likelihood ratio trackers, using profile and frontal face detections. These cues are combined using a grid-based Bayesian state estimation algorithm to form a probability surface for each frame. This gaze predictor outperforms both static gaze prediction and prediction based on face locations within the frame.
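A grid-based Bayesian estimator of this kind can be sketched as a predict/update loop over a discretised frame. The sketch below is an assumption-laden illustration, not the tracker described above: the box-blur motion model, the Gaussian-shaped cue likelihoods and all function names are hypothetical.

```python
import numpy as np


def predict(prior, spread=1):
    """Diffuse the previous posterior with a box blur (assumed motion model)."""
    rows, cols = prior.shape
    padded = np.pad(prior, spread, mode="edge")
    out = np.zeros_like(prior)
    size = 2 * spread + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + rows, dx:dx + cols]
    return out / out.sum()


def update(predicted, cue_likelihoods):
    """Multiply in each cue's likelihood grid and renormalise (Bayes' rule)."""
    posterior = predicted.copy()
    for likelihood in cue_likelihoods:
        posterior *= likelihood
    return posterior / posterior.sum()


def bump(shape, centre, sigma=1.0):
    """Hypothetical cue likelihood: a strictly positive Gaussian-shaped bump."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (ys - centre[0]) ** 2 + (xs - centre[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


shape = (5, 5)
prior = np.full(shape, 1.0 / 25.0)      # no knowledge before the first frame
frontal_cue = bump(shape, (1, 1))       # e.g. evidence from a frontal detection
profile_cue = bump(shape, (1, 3))       # e.g. evidence from a profile detection
posterior = update(predict(prior), [frontal_cue, profile_cue])
peak = np.unravel_index(np.argmax(posterior), shape)
print(peak)
```

The posterior is the per-frame probability surface; here the two cues disagree slightly, and their product peaks between them.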

Weakly Supervised Learning of Visual Semantic Attributes

David Hanwell

Many of the adjectives we use in everyday language denote visual attributes of objects. Examples include colours (‘red’, ‘ruby’, ‘vermilion’), patterns (‘stripy’, ‘chequered’), and others (‘leopard skin’, ‘dark’). We refer to these as “visual semantic attributes” – they define associations between concepts of thought and language (semantic), and the visual appearance of objects in our physical world (visual).

Automating the recognition of such attributes in photos and video would allow rich textual descriptions to be generated, especially when combined with object detection. This would be useful for various applications: image archiving, generating descriptions for products from their photos (for example on eBay), and perhaps even surveillance (for example “find the man in the blue stripy shirt in these CCTV feeds”). Typically, models of such attributes are trained using image data which has been specifically chosen or acquired for the task, and has been manually labelled, annotated and segmented.

There are already billions of images available on the internet, with many more being added every day. Although these images have not been manually annotated or segmented, many of them have text associated with them. For example, items for sale on eBay have both an image and a text description. We call such images ‘weakly labelled’ – each has a corresponding set of words, some of which may describe some, all or none of the image content.

The question we are trying to answer is: Can we use weakly labelled images to train computer models to recognise visual semantic attributes?

There would be several potential benefits to this approach:

  • Without the need to acquire bespoke images, or to manually label, annotate or segment them, the process of learning an attribute would be faster, easier and cheaper.
  • Since the training data would be acquired from many different sources, and the corresponding text written by many different people, the resulting models would be unbiased by any one person’s notion of an attribute.
  • Due to the huge range of images available, with different objects, lighting, and variations on each attribute, each model may better capture the possible variation of an attribute, and thus be better able to recognise it in novel images.

Weakly Supervised Learning of Semantic Colour Terms

In order to take advantage of such easily obtainable data, there are a number of challenges which must be overcome. If we acquire a set of images from a web search, using an attribute term, not all of the returned images will contain the attribute, and even in those which do, there will be regions which do not. Here are some examples of images returned from a search for ‘Blue’:

Training data for the word 'blue'

We have proposed a method of learning colour terms from weakly labelled data such as this. Below are some examples of images from the test set of [1], which have been segmented into regions of semantic colour terms using the proposed method. This work has been accepted for publication in IET Computer Vision [2].

Images from dataset segmented into regions of colour by the system

QUAC: Quick Unsupervised Anisotropic Clustering

We would like to extend the above work on the weakly supervised learning of colour terms to more complex attributes using other features. To do this, we propose to first cluster the features extracted from each image, before performing weakly supervised learning using parameterisations of the clusters. This way the volume of data per image will be greatly reduced, allowing many more images to be used for learning.

To this end, we propose QUAC – Quick Unsupervised Anisotropic Clustering. The algorithm works by finding elliptical (or hyper-elliptical) clusters one at a time, removing the data corresponding to each cluster after it is found. It has several advantages over other clustering algorithms:

  • It does not assume isotropy, and is capable of finding highly elongated clusters.
  • It is unsupervised, having only a single parameter, and not requiring the number of clusters to be set.
  • Its complexity is linear in the number of data points, resulting in good performance on larger data sets.

Below is an image of five pairs of differently coloured trousers against a white background. We have taken the Green and Blue channels of this image to create a 2D data set, also shown below. The five elongated clusters corresponding to the five pairs of trousers are clearly visible, as is the less elongated cluster in the bottom right corner of the feature space, corresponding to the white background. An animation shows how QUAC finds the clusters. This work has been accepted for publication in Pattern Recognition [3]. Source code is available below.

Trouser test data Trouser clustering result
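As a rough illustration of how such a 2D data set can be built, the Green and Blue channels of an RGB image can be flattened into one point per pixel. This is a sketch only and is not taken from the QUAC source code; the function name and the small synthetic image are assumptions.

```python
import numpy as np


def green_blue_points(image):
    """Return an (N, 2) array holding the Green and Blue value of every pixel."""
    image = np.asarray(image)
    if image.ndim != 3 or image.shape[2] < 3:
        raise ValueError("expected an H x W x 3 RGB image")
    # Channel order is assumed to be R, G, B, so indices 1 and 2 are G and B.
    return image[:, :, 1:3].reshape(-1, 2).astype(float)


# Tiny synthetic stand-in for the photograph: white background, one blue pixel.
image = np.full((2, 2, 3), 255, dtype=np.uint8)
image[0, 0] = (10, 20, 200)
points = green_blue_points(image)
print(points.shape)
```

Each row of `points` is one sample in the Green/Blue feature space, which is the form of input a clustering algorithm would consume.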

Weakly Supervised Learning of Semantic Pattern Terms

To extend the colour learning work above to more complex attributes, there are several more challenges we must overcome. As well as not knowing which parts of the training images (if any) contain the attribute to be learned, the scale and orientation of the attribute are also unknown, and may vary throughout an image. This is the focus of our current research.


  1. J. van de Weijer, C. Schmid, and J. Verbeek. Learning Color Names from Real-World Images. In CVPR, Minneapolis, MN, 2007.
  2. D. Hanwell and M. Mirmehdi. Weakly Supervised Learning of Semantic Colour Terms. Accepted by IET Computer Vision, 22nd February 2013.
  3. D. Hanwell and M. Mirmehdi. QUAC: Quick Unsupervised Anisotropic Clustering. Accepted by Pattern Recognition, 7th May 2013.

Robust visual SLAM using higher-order structure

Jose Martinez Carranza, Andrew Calway

We are working on the problem of simultaneous localization and mapping using a single hand-held camera as the only sensor. Specifically, we are looking at how to extract a richer representation of the environment using the map generated by the system and the visual information obtained with the camera.

We have developed a method to identify planar structures in the environment, based on the cloud of points generated by the SLAM system and the appearance information contained in the images captured with the camera. Our method evaluates the validity of points in the map as part of a physical plane in the world using a statistical framework that incorporates the uncertainty of the estimates for both the camera and the map obtained with the typical EKF-based visual SLAM framework. Besides the benefits of a better visual representation of the map, we are investigating how to exploit these structures to improve the estimation.