Dr Cees Snoek – Faculty of Science, University of Amsterdam
Abstract

Understanding what activity is happening where and when in video content is crucial for video computing, communication and intelligence. In the literature, the common tactic for action localization is to learn a deep classifier on hard-to-obtain spatiotemporal annotations and to apply it at test time to an exhaustive set of spatiotemporal candidate locations. Annotating the spatiotemporal extent of an action in training video is not only cumbersome, tedious, and error-prone, but it also does not scale beyond a handful of action categories. In this presentation, I will highlight recent work from my team at the University of Amsterdam addressing the challenging problem of action localization in video without the need for spatiotemporal supervision. We consider three possible solution paths: 1) the first relies on intuitive user interaction with points, 2) the second infers the relevant spatiotemporal location from an action class label, and finally, 3) the third derives a spatiotemporal action location from off-the-shelf object detectors and text corpora only. I will discuss the benefits and drawbacks of these three solutions on common action localization datasets, compare them with alternatives that depend on spatiotemporal supervision, and highlight the potential for future work.
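
To make the third solution path concrete, the following Python sketch scores per-frame object detections by the word-embedding similarity between the detector's class names and the action label, then greedily links the top-scoring boxes into a spatiotemporal tube. The detections, embeddings, and linking heuristic are illustrative placeholders under my own assumptions, not the exact published method.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def link_action_tube(frame_detections, embeddings, action):
    # frame_detections: per frame, a list of (class_name, box, confidence).
    # embeddings: word vectors from a text corpus (toy values below).
    # action: the target action name, e.g. "horse riding".
    act_vec = np.mean([embeddings[w] for w in action.split()], axis=0)
    tube, prev_box = [], None
    for dets in frame_detections:
        def score(d):
            name, box, conf = d
            # Detector confidence weighted by semantic relatedness to the
            # action, plus a spatial-continuity bonus with the previous box.
            s = conf * cosine(embeddings[name], act_vec)
            return s + (0.5 * iou(box, prev_box) if prev_box is not None else 0.0)
        best_box = max(dets, key=score)[1]
        tube.append(best_box)
        prev_box = best_box
    return tube

# Toy example: two frames, a 3-d embedding space, two detected objects.
emb = {"horse": np.array([1.0, 0.1, 0.0]),
       "riding": np.array([0.9, 0.2, 0.1]),
       "chair": np.array([0.0, 1.0, 0.2])}
frames = [[("horse", (10, 10, 60, 60), 0.90), ("chair", (80, 80, 120, 120), 0.80)],
          [("horse", (12, 11, 62, 63), 0.85), ("chair", (81, 80, 121, 119), 0.90)]]
print(link_action_tube(frames, emb, "horse riding"))  # follows the horse boxes

In practice the embeddings would come from a large text corpus and the detections from a pretrained object detector; the point of the sketch is only that no spatiotemporal action annotation enters the pipeline.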
Biography
Cees Snoek received the M.Sc. degree in business information systems in 2000 and the Ph.D. degree in computer science in 2005, both from the University of Amsterdam, The Netherlands. He is currently a director of the QUVA Lab, the joint research lab of Qualcomm and the University of Amsterdam on deep learning and computer vision. He is also a principal engineer/manager at Qualcomm Research Netherlands and an associate professor at the University of Amsterdam. His research interests focus on video and image recognition. He is the recipient of a Veni Talent Award, a Fulbright Junior Scholarship, a Vidi Talent Award, and The Netherlands Prize for Computer Science Research.
http://www.uva.nl/en/profile/s/n/c.g.m.snoek/c.g.m.snoek.html


What is the nature of the representations that sustain attention to colour? In other words, is attention to colour predominantly determined by the low-level, cone-opponent chromatic mechanisms established at subcortical processing stages, or by the multiple narrowly tuned, higher-level chromatic mechanisms established in the cortex? These questions remain unresolved in spite of decades of research. In an attempt to address this problem, we conducted a series of electroencephalographic (EEG) studies that examined cone-opponent and hue-based contributions to colour selection. We used a feature-based attention paradigm in which spatially overlapping, flickering random dot kinematograms (RDKs) of different colours are presented and participants are asked to selectively attend to one colour in order to detect brief, coherent-motion intervals, ignoring any such events in the unattended colours. Each flickering colour drives a separate steady-state visual evoked potential (SSVEP), a response whose amplitude increases when that colour is attended. In our studies, behavioural performance and SSVEPs are thus taken as indicators of selective attention. The first study demonstrated that at least some of the observed cone-opponent attentional effects can be explained by asymmetric summation of signals from different cone-opponent channels with luminance at early cortical sites (V1-V3). This indicates that there might be cone-mechanism-specific optimal contrast ranges for combining colour and luminance signals. The second study demonstrated that hue-based contributions can also be observed concurrently with the aforementioned low-level, cone-opponent effects. Proximity and linear separability of targets and distractors in a hue-based colour space were shown to be another determinant of effective colour selection. In conclusion, attention to colour should be examined across the full range of chromoluminance space, task space, and dependent-measure space. Current evidence indicates that multiple representations contribute to the selection of colour, and that depending on the stimulus attributes, task demands, and the attributes of the applied measures, it is possible to observe a spectrum of effects ranging from purely cone-opponent to largely hue-based.
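
For readers unfamiliar with frequency tagging, the minimal Python sketch below illustrates how attention to a colour can in principle be read out from the EEG spectrum: each colour flickers at its own frequency, so the FFT amplitude at that frequency indexes the corresponding SSVEP. The sampling rate, tagging frequencies, and synthetic signal are my assumptions for illustration, not the parameters of the studies above.

import numpy as np

fs = 500.0                                  # sampling rate in Hz (assumed)
t = np.arange(0, 4.0, 1 / fs)               # 4 s of steady-state stimulation
flicker = {"red": 8.0, "green": 10.0}       # tagging frequencies in Hz (assumed)

# Synthetic single-channel EEG in which attending "red" boosts its SSVEP.
rng = np.random.default_rng(0)
eeg = (1.5 * np.sin(2 * np.pi * flicker["red"] * t)
       + 0.7 * np.sin(2 * np.pi * flicker["green"] * t)
       + rng.normal(0.0, 1.0, t.size))

# Read out the SSVEP amplitude at each colour's tagging frequency.
spectrum = 2 * np.abs(np.fft.rfft(eeg)) / t.size
freqs = np.fft.rfftfreq(t.size, 1 / fs)
for colour, f in flicker.items():
    amp = spectrum[np.argmin(np.abs(freqs - f))]
    print(f"{colour}: SSVEP amplitude ~ {amp:.2f}")  # larger for the attended colour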
I will present two studies examining human eye movements and discuss my role at the University of Bristol. The first study concerns patients with central vision loss, who often adopt a preferred retinal locus (PRL), a region in peripheral vision used for fixation as an alternative to the damaged fovea. A common clinical approach to assessing the PRL is to record monocular fixation behavior on a small stimulus using a scanning laser ophthalmoscope. Using a combination of visual field tests and eye tracking, we tested how well this ‘fixational PRL’ generalizes to PRL use during a pointing task. Our results suggest that measures of the fixational PRL do not sufficiently capture the PRL variation exhibited while pointing, a finding that can inform patient therapy and future research. In the second study, eye movements from eight participants were recorded with a mobile eye tracker. Participants performed five everyday tasks: making a sandwich, transcribing a document, walking in an office and a city street, and playing catch with a flying disc. Using only saccadic direction and amplitude time-series data, we trained a hidden Markov model for each task and were then able to classify unlabeled data. Lastly, I will briefly describe my role in the GLANCE Project at the University of Bristol. We are an interdisciplinary group in the departments of Experimental Psychology and Computer Science, sponsored by the EPSRC to make a wearable assistive device that monitors behavior during tasks and presents short video clips to guide behavior in real time.
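
As an illustration of the second study's classification scheme, the Python sketch below trains one hidden Markov model per task on (saccade direction, amplitude) sequences and assigns an unlabeled sequence to the task whose model yields the highest log-likelihood. The choice of Gaussian-emission HMMs via the hmmlearn library, and the synthetic saccade data, are my assumptions rather than the study's actual pipeline.

import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed library choice, not from the study

rng = np.random.default_rng(1)

def fake_saccades(mean_dir, mean_amp, n=200):
    # Each row is one saccade: [direction in radians, amplitude in degrees].
    return np.column_stack([rng.normal(mean_dir, 0.4, n),
                            rng.normal(mean_amp, 1.0, n)])

# Two toy "tasks" with different saccade statistics, standing in for the
# five everyday tasks recorded in the study.
train = {"transcribing": fake_saccades(0.0, 2.0),
         "walking": fake_saccades(1.2, 8.0)}

# One HMM per task, fit on that task's labeled training sequence.
models = {task: GaussianHMM(n_components=3, covariance_type="diag",
                            n_iter=50, random_state=0).fit(X)
          for task, X in train.items()}

# An unlabeled sequence is assigned to the task whose model scores it highest.
test = fake_saccades(1.2, 8.0, n=50)
scores = {task: m.score(test) for task, m in models.items()}
print(max(scores, key=scores.get))  # expected: "walking"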