We are developing methods for the estimation of both temporal and spatial visual attention using a head-worn inertial measurement unit (IMU). Aimed at tasks where there is a wearer-object interaction, we estimate the when and the where the wearer is interested in. We evaluate various methods on a new egocentric dataset from 8 volunteers and compare our results with those achievable with a commercial gaze tracker used as ground-truth. Our approach is primarily geared for sensor-minimal EyeWear computing.
From the paper:
Teesid Leelasawassuk, Dima Damen, Walterio W Mayol-Cuevas, Estimating Visual Attention from a Head Mounted IMU. ISWC ’15 Proceedings of the 2015 ACM International Symposium on Wearable Computers. ISBN 978-1-4503-3578-2, pp. 147–150. September 2015.
The current system allows online building of 3D wireframe models through a combination of user interaction and automated methods from a handheld camera-mouse. Crucially, the model being built is used to concurrently compute camera pose, permitting extendable tracking while enabling the user to edit the model interactively. In contrast to other model building methods that are either off-line and/or automated but computationally intensive, the aim here is to have a system that has low computational requirements and that enables the user to define what is relevant (and what is not) at the time the model is being built. OutlinAR hardware is also developed which simply consists of the combination of a camera with a wide field of view lens and a wheeled computer mouse.
Activity recognition and event classification are of prime relevance to any intelligent system designed to assist on the move. There have been several systems aimed at the capturing of signals from a wearable computer with the aim of establishing a relationship between what is being perceived now and what should be happening. Assisting people is indeed one of the main championed potentials of wearable sensing and therefore of significant research interest.
Our work currently focuses on higher-level activity recognition that processes very low resolution motion images (160×120 pixels) to classify user manipulation activity. For this work, we base our test environment on supervised learning of the user’s behaviour from video sequences. The system observes interaction between the user’s hands and various objects, in various locations of the environment, from a wide angle shoulder-worn camera. The location and object being interacted with are indirectly deduced on the fly from the manipulation motions. Using this low-level visual information user activity is classified as one from a set of previously learned classes.
At the moment we are working on fast and robust local feature detectors and descriptors, supported by a HP Labs Innovation Research Award for Dr Walterio Mayol-Cuevas’ proposal titled ‘Anywhere Authoring using Interactive Computer Vision and Mobile Sensors’.
Our project investigates how to improve feature matching within a single camera Real-time Visual SLAM system. SLAM stands for Simultaneous Localisation and Mapping, when a camera position is estimated simultaneously with sparse point-wise representation of a surrounding environment. The camera is hand-held in our case, hence it is important to maintain camera track during or quickly recover after unpredicted and erratic motions. The range of scenarios we would like to deal with includes severe shake, partial or total occlusion and camera kidnapping.
One of the directions of our research is an adaptation of distinctive but in the same time robust image feature descriptors. These descriptors are the final stage of the Scale Invariant Feature Transform (SIFT). This descriptor forms a vector which describes a distribution of local image gradients through specially positioned orientation histograms. Such representation was inspired by advances in understanding the human vision system. In our implementation a scale selection is stochastically guided by the estimates from the SLAM filter. This allows to omit a relatively expensive scale invariant detector of the SIFT scheme.
When the camera is kidnapped or unable to perform any reliable measurement a special relocalisation mode kicks in. It attempts to find a new correct camera position by performing many-to-many feature search and use robust geometry verification procedure to ensure that a pose and found set of matches are in consensus. We investigate a way of speeding up the feature search by splitting the search space based on feature appearances.
The software based on our findings is incorporated into the Real-time Visual SLAM system which is used extensively within the Visual Information Laboratory.
Context enhanced networked services by fusing mobile vision and location
ViewNet is a 1.5M GBP project jointly funded by the UK Technology Strategy Board, the EPSRC and industrial partners. The aim is to develop the next generation of distributed localisation and user-assisted mapping systems, based on the fusion of multiple sensing technologies, including visual SLAM, inertial devices, UWB and GPS.