Automatic quality improvement of archived film and video

Xiaosong Wang, Majid Mirmehdi

This project is proposed by ITV to develop accurate restoration algorithms for archived film and video. The final aim is to produce an automated quality control and improvement system targeting dirt and blotches in digitised archive films. The system is composed of two main modules: defect detection and defect removal. In defect detection, we locate defects by combining temporal and spatial information across a number of frames.

An HMM is trained on normal observation sequences and then applied within a detection framework to identify defective pixels. The resulting defect maps are refined in a two-stage false alarm elimination process and then passed to the defect removal procedure. Each labelled (degraded) pixel is restored in a multiscale framework by first searching for the optimal replacement in its dynamically generated, random-walk-based region of candidate pixel exemplars and then updating all of its features (intensity, motion and texture).
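
As a minimal sketch of the detection step (not the actual implementation), one can model each pixel's temporal intensity profile as an observation sequence and score it against an HMM trained on clean footage. The snippet below uses hmmlearn's GaussianHMM and a hand-chosen log-likelihood threshold in place of the real observation features, refinement stages and thresholds; all function names and parameters are illustrative.

    # Illustrative sketch only: per-pixel temporal intensity profiles are the
    # observations, and sequences that score poorly under an HMM trained on
    # clean footage are flagged as candidate defects.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_normal_hmm(clean_sequences, n_states=3):
        """Fit an HMM to per-pixel intensity sequences taken from clean footage."""
        X = np.concatenate(clean_sequences).reshape(-1, 1)
        lengths = [len(s) for s in clean_sequences]
        return GaussianHMM(n_components=n_states, covariance_type="diag",
                           n_iter=50).fit(X, lengths)

    def detect_defects(model, frames, threshold):
        """Boolean defect map for a short temporal window of frames (T, H, W)."""
        T, H, W = frames.shape
        defect_map = np.zeros((H, W), dtype=bool)
        for y in range(H):
            for x in range(W):
                seq = frames[:, y, x].reshape(-1, 1).astype(float)
                defect_map[y, x] = model.score(seq) < threshold
        return defect_map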

Finally, the proposed system is compared against state-of-the-art methods to demonstrate improved accuracy in both detection and restoration using synthetic and real degraded image sequences.

Visual saliency

Being able to predict multicue gaze for open signed video content enables coding gains without loss of perceived quality. We have developed a face orientation tracker based upon grid-based likelihood ratio trackers, using profile and frontal face detections. These cues are combined using a grid-based Bayesian state estimation algorithm to form a probability surface for each frame. This gaze predictor outperforms a static gaze prediction and one based on face locations within the frame.
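
A grid-based Bayesian gaze predictor of this kind can be pictured as a simple predict/update loop over a probability surface. The sketch below is an illustration only, assuming Gaussian likelihoods centred on weighted face detections rather than the project's actual cue models; function names and parameters are illustrative.

    # Illustrative sketch of grid-based Bayesian fusion of detection cues.
    # The grid holds the gaze probability per cell; predict() diffuses it with
    # a random-walk motion model and update() multiplies in a Gaussian
    # likelihood centred on each face detection.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def predict(grid, motion_sigma=2.0):
        """Diffuse the probability surface to model gaze motion between frames."""
        grid = gaussian_filter(grid, sigma=motion_sigma)
        return grid / grid.sum()

    def update(grid, detections, det_sigma=5.0):
        """Fuse face detections, assumed (row, col, weight) tuples, into the grid."""
        h, w = grid.shape
        rows, cols = np.mgrid[0:h, 0:w]
        likelihood = np.full((h, w), 1e-6)          # small floor avoids zeros
        for r, c, weight in detections:
            likelihood += weight * np.exp(-((rows - r) ** 2 + (cols - c) ** 2)
                                          / (2.0 * det_sigma ** 2))
        grid = grid * likelihood
        return grid / grid.sum()

    # Usage: start from a uniform surface and alternate predict/update per frame.
    grid = np.full((36, 64), 1.0 / (36 * 64))
    grid = predict(grid)
    grid = update(grid, detections=[(18, 40, 1.0)])   # one frontal-face cue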

Parametric Video Compression

This project presents a novel means of video compression based on texture warping and synthesis. Instead of encoding whole images or prediction residuals after translational motion estimation, our algorithm employs a perspective motion model to warp static textures and utilises texture synthesis to create dynamic textures. Texture regions are segmented using features derived from the complex wavelet transform and further classified according to their spatial and temporal characteristics. Moreover, a compatible artefact-based video metric (AVM) is proposed with which to evaluate the quality of the reconstructed video. This is also employed in-loop to prevent warping and synthesis artefacts. The proposed algorithm has been integrated into an H.264 video coding framework. The results show significant bitrate savings of up to 60% compared with H.264 at the same objective quality (based on AVM) and subjective scores.
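
As an illustration of the static-texture warping component only (not the H.264-integrated codec), a texture region can be propagated from a reference frame with a perspective (homography) motion model. The OpenCV-based helper below assumes point correspondences inside the region are already available; its name and signature are illustrative.

    # Illustrative sketch: warp a static texture region from a reference frame
    # into the current frame's geometry using a homography estimated from
    # point correspondences.
    import cv2
    import numpy as np

    def warp_static_texture(ref_frame, ref_pts, cur_pts, region_mask):
        """ref_pts / cur_pts: corresponding (N, 2) points inside the region;
        region_mask: binary mask of the texture region in the reference frame."""
        ref_pts = np.asarray(ref_pts, dtype=np.float32)
        cur_pts = np.asarray(cur_pts, dtype=np.float32)
        H, _ = cv2.findHomography(ref_pts, cur_pts, cv2.RANSAC, 3.0)
        h, w = ref_frame.shape[:2]
        warped = cv2.warpPerspective(ref_frame, H, (w, h))
        warped_mask = cv2.warpPerspective(region_mask, H, (w, h))
        return warped, warped_mask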

It is currently a very exciting and challenging time for video compression. The predicted growth in demand for bandwidth, especially for mobile services, will be driven by video applications and is probably greater now than it has ever been. David Bull (VI-Lab), Dimitris Agrafiotis (VI-Lab) and Roland Baddeley (Experimental Psychology) have won a new £600k EPSRC research grant to investigate perceptual redundancy in, and new representations for, digital video content. With EPSRC funding and in collaboration with the BBC and Fraunhofer HHI Berlin, the team will investigate video compression schemes in which an analysis/synthesis framework replaces the conventional energy minimisation approach. A preliminary coding framework of this type has been created by Zhang and Bull, where scene content is modelled using computer graphics techniques to replace target textures at the decoder. This approach is already producing world-leading results and has the potential to create a new content-driven framework for video compression, where region-based parameters are combined with perceptual quality metrics to inform and drive the coding processes.

Published Work

  1. P. Ndjiki-Nya, D. Doshkova, H. Kaprykowsky, F. Zhang, D. Bull and T. Wiegand, "Perception-oriented video coding based on image analysis and completion: A review", Signal Processing: Image Communication, Vol. 27, No. 6, July 2012, pp. 579–594.
  2. F. Zhang and D. R. Bull, "A Parametric Framework For Video Compression Using Region-based Texture Models", IEEE Journal of Selected Topics in Signal Processing (Special Issue), Vol. 5, No. 7, November 2011, pp. 1378–1392.
  3. S. Ierodiaconou, J. Byrne, D. R. Bull, D. Redmill and P. Hill, "Unsupervised image compression using graphcut texture synthesis", Proc. 16th IEEE International Conference on Image Processing (ICIP), 2009, pp. 2289–2292.
  4. J. Byrne, S. Ierodiaconou, D. Bull, D. Redmill and P. Hill, "Unsupervised image compression-by-synthesis within a JPEG framework", Proc. 15th IEEE International Conference on Image Processing (ICIP), 2008, pp. 2892–2895.

Bio-Inspired 3D Mapping

Geoffrey Daniels

Supervised by: David Bull, Walterio Mayol-Cuevas, J Burn

Using state-of-the-art computer vision techniques and insights into the biological processes animals use to traverse terrain, a system has been created that enables a robotic platform to gather the information required to move safely through an unknown environment. A goal of this project is to produce a system that can run in real time on an arbitrary locomotion platform and provide local route planning and hazard detection. With the real-time aim in mind, the core parts of the current algorithm have been developed using NVIDIA's CUDA language for general-purpose computing on GPUs, as the code is embarrassingly parallel and GPUs can provide a large speed increase for such workloads. Currently, without significant optimisation, the system is able to compute the 3D surface ahead of the camera in approximately 100 ms.
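
The per-pixel nature of the work is what makes it a good fit for CUDA. As a rough CPU-side sketch (vectorised NumPy rather than a GPU kernel, and not the project's actual pipeline), back-projecting a depth map into a 3D point cloud treats every pixel independently; the camera intrinsics fx, fy, cx, cy are assumed known.

    # Illustrative sketch of the kind of per-pixel work that maps well onto a
    # GPU: each depth pixel is back-projected independently through the
    # pinhole camera model.
    import numpy as np

    def depth_to_points(depth, fx, fy, cx, cy):
        """Convert a depth map (H x W, metres) to an (H*W, 3) point cloud."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        return np.stack([x, y, depth], axis=-1).reshape(-1, 3)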

This system will form one module of a larger grant-funded project to develop a bio-inspired concept system for overall terrestrial perception and safe locomotion.

Interesting Results

Some example footage of the system generating a virtual 3D world from a single camera in real time:
https://www.youtube.com/watch?v=h36hVOerMFU&list=PLJmQZRbc9yWWrg6A0R_NFYHl6WP4FoNe9#t=15

Weakly Supervised Learning of Visual Semantic Attributes

David Hanwell

Many of the adjectives we use in everyday language denote visual attributes of objects. For example colours (`red’, `ruby’, `vermilion’), patterns (`stripy’, `chequered’), and others (`leopard skin’, `dark’). We refer to these as “visual semantic attributes” – they define associations between concepts of thought and language (semantic), and the visual appearance of objects in our physical world (visual).

Automating the recognition of such attributes in photos and video would allow rich textual descriptions to be generated, especially when combined with object detection. This would be useful for various applications: image archiving, generating descriptions for products from their photos (for example on eBay), and perhaps even surveillance (for example "find the man in the blue stripy shirt in these CCTV feeds"). Typically, models of such attributes are trained using image data which has been specifically chosen or acquired for the task, and which has been manually labelled, annotated and segmented.

There are already billions of images available on the internet, with many more being added every day. Although these images have not been manually annotated or segmented, many of them have text associated with them; for example, items for sale on eBay have both an image and a text description. We call such images ‘weakly labelled’ – each has a corresponding set of words, some of which may describe some, all or none of the image content.

The question we are trying to answer is: Can we use weakly labelled images to train computer models to recognise visual semantic attributes?

There would be several potential benefits to this approach:

  • Without the need to acquire bespoke images, or to manually label, annotate or segment them, the process of learning an attribute would be faster, easier and cheaper.
  • Since the training data would be acquired from many different sources, and the corresponding text written by many different people, the resulting models would be unbiased by any one person’s notion of an attribute.
  • Due to the huge range of images available, with different objects, lighting, and variations on each attribute, each model may better capture the possible variation of an attribute, and thus be better able to recognise it in novel images.


Weakly Supervised Learning of Semantic Colour Terms

In order to take advantage of such easily obtainable data, there are a number of challenges which must be overcome. If we acquire a set of images from a web search using an attribute term, not all of the returned images will contain the attribute, and even in those that do, there will be regions which do not. Here are some examples of images returned from a search for ‘Blue’:

Training data for the word 'blue'

We have proposed a method of learning colour terms from weakly labelled data such as this. Below are some examples of images from the test set of [1], which have been segmented into regions of semantic colour terms using the proposed method. This work has been accepted for publication in IET Computer Vision [2].

Images from dataset segmented into regions of colour by the system
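
The full method is described in [2]. A greatly simplified sketch of the weakly supervised idea is given below, assuming a foreground mixture fitted to pixels pooled from weakly labelled images, a background mixture fitted to images lacking the label, and a per-pixel likelihood-ratio test; scikit-learn's GaussianMixture stands in for the actual model, and all names are illustrative.

    # Greatly simplified sketch of weakly supervised colour-term learning,
    # not the method of [2]: fit a foreground colour mixture from weakly
    # labelled images, a background mixture from unlabelled ones, and segment
    # new images by per-pixel likelihood ratio.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_colour_model(labelled_imgs, unlabelled_imgs, n_components=5):
        """labelled_imgs / unlabelled_imgs: lists of (H, W, 3) float RGB arrays."""
        fg_pixels = np.concatenate([im.reshape(-1, 3) for im in labelled_imgs])
        bg_pixels = np.concatenate([im.reshape(-1, 3) for im in unlabelled_imgs])
        fg = GaussianMixture(n_components=n_components).fit(fg_pixels)
        bg = GaussianMixture(n_components=n_components).fit(bg_pixels)
        return fg, bg

    def segment(img, fg, bg, margin=0.0):
        """Label pixels whose foreground log-likelihood beats the background's."""
        pixels = img.reshape(-1, 3)
        score = fg.score_samples(pixels) - bg.score_samples(pixels)
        return (score > margin).reshape(img.shape[:2])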


QUAC: Quick Unsupervised Anisotropic Clustering

We would like to extend the above work on the weakly supervised learning of colour terms to more complex attributes using other features. To do this, we propose to first cluster the features extracted from each image, before performing weakly supervised learning using parameterisations of the clusters. In this way, the volume of data per image is greatly reduced, allowing many more images to be used for learning.

To this end, we propose QUAC – Quick Unsupervised Anisotropic Clustering. The algorithm works by finding elliptical (or hyper-elliptical) clusters one at a time, removing the data corresponding to each cluster after it is found. It has several advantages over other clustering algorithms:

  • It does not assume isotropy, and is capable of finding highly elongated clusters.
  • It is unsupervised, having only a single parameter, and not requiring the number of clusters to be set.
  • Its complexity is linear in the number of data, resulting in good performance on larger data sets.

Below is an image of five pairs of differently coloured trousers against a white background. We have taken the Green and Blue channels of this image to create a 2D data set, also shown below. The five elongated clusters corresponding to the five pairs of trousers are clearly visible, as is the less elongated cluster in the bottom right corner of the feature space, corresponding to the white background. An animation shows how QUAC finds the clusters. This work has been accepted for publication in Pattern Recognition [3]. Source code is available below.

Trouser test data; QUAC clustering result
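
The snippet below is a simplified illustration of the one-cluster-at-a-time idea, not the released implementation of [3]: an elliptical cluster is grown by re-estimating its mean and covariance from the points within a Mahalanobis radius (the single threshold parameter), those points are removed, and the process repeats. Function names, seeding and stopping rules here are illustrative only.

    # Simplified illustration of one-at-a-time anisotropic clustering.
    import numpy as np

    def find_clusters(data, radius=3.0, min_size=20, max_iter=50, seed=0):
        """data: (N, D) array. radius is the single Mahalanobis threshold."""
        rng = np.random.default_rng(seed)
        remaining = data.copy()
        dim = data.shape[1]
        init_cov = np.cov(data.T) * 0.01 + 1e-6 * np.eye(dim)  # small local seed
        clusters = []
        while len(remaining) >= min_size:
            mean = remaining[rng.integers(len(remaining))]
            cov = init_cov.copy()
            members = np.zeros(len(remaining), dtype=bool)
            for _ in range(max_iter):
                diff = remaining - mean
                d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
                new_members = d2 < radius ** 2
                if new_members.sum() < dim + 1 or np.array_equal(new_members, members):
                    break
                members = new_members
                mean = remaining[members].mean(axis=0)
                cov = np.cov(remaining[members].T) + 1e-6 * np.eye(dim)
            if members.sum() < min_size:
                # unlucky seed: drop one point so the loop still makes progress
                remaining = np.delete(remaining, rng.integers(len(remaining)), axis=0)
                continue
            clusters.append((mean, cov))
            remaining = remaining[~members]
        return clusters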


Weakly Supervised Learning of Semantic Pattern Terms

To extend the colour learning work above to more complex attributes, there are several more challenges we must overcome. As well as not knowing which parts of the training images (if any) contain the attribute to be learned, the scale and orientation of the attribute are also unknown, and may vary throughout an image. This is the focus of our current research.


References

  1. J. van de Weijer, C. Schmid, and J. Verbeek. Learning Color Names from Real-World Images. In CVPR, Minneapolis, MN, 2007.
  2. D. Hanwell and M. Mirmehdi. Weakly Supervised Learning of Semantic Colour Terms. Accepted by IET Computer Vision, 22nd February 2013.
  3. D. Hanwell and M. Mirmehdi. QUAC: Quick Unsupervised Anisotropic Clustering. Accepted by Pattern Recognition, 7th May 2013.

Robust visual SLAM using higher-order structure

Jose Martinez Carranza, Andrew Calway

We are working on the problem of simultaneous localization and mapping using a single hand-held camera as the only sensor. Specifically, we are looking at how to extract a richer representation of the environment using the map generated by the system and the visual information obtained with the camera.

We have developed a method to identify planar structures in the environment, using as its basis the point cloud generated by the SLAM system and the appearance information contained in the images captured with the camera. Our method evaluates the validity of map points as part of a physical plane in the world using a statistical framework that incorporates the uncertainty of the estimates for both the camera and the map obtained with the typical EKF-based visual SLAM framework. Besides the benefit of a better visual representation of the map, we are investigating how to exploit these structures to improve the estimation.
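
As an illustrative sketch of the kind of uncertainty-aware test involved (not the project's actual formulation), the consistency of an EKF map point with a hypothesised plane can be checked by propagating the point's covariance into the point-to-plane distance and gating it with a chi-square threshold; the function name and parameters below are illustrative.

    # Illustrative sketch: is a map point, with covariance from the EKF,
    # statistically consistent with the plane n.x + d = 0?
    import numpy as np
    from scipy.stats import chi2

    def on_plane(point, point_cov, plane_normal, plane_d, prob=0.95):
        """point: (3,), point_cov: (3, 3), plane_normal: (3,), plane_d: scalar."""
        n = plane_normal / np.linalg.norm(plane_normal)
        residual = float(n @ point + plane_d)          # signed distance to plane
        var = float(n @ point_cov @ n)                 # variance of that distance
        mahal2 = residual ** 2 / var
        return mahal2 < chi2.ppf(prob, df=1)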