Timothy Hospedales PhD


Current Research Interests

Computer Vision: Recognition, retrieval, matching. Person re-identification. Zero-shot learning. Vision and language.

Robotics: Contextual policies, parameterised skills, reinforcement learning.

Machine Learning: Lifelong learning, transfer learning, multi-task learning, active learning, deep learning, probabilistic modeling.

Publications:
2017
  Learning to Generalize: Meta-Learning for Domain Generalization
Li, D, Yang, Y, Song, Y-Z & Hospedales, T 2017, Learning to Generalize: Meta-Learning for Domain Generalization. in AAAI Conference on Artificial Intelligence (AAAI 2018).
Domain shift refers to the well-known problem that a model trained in one source domain performs poorly when applied to a target domain with different statistics. Domain Generalization (DG) techniques attempt to alleviate this issue by producing models which by design generalize well to novel testing domains. We propose a novel meta-learning method for domain generalization. Rather than designing a specific model that is robust to domain shift as in most previous DG work, we propose a model-agnostic training procedure for DG. Our algorithm simulates train/test domain shift during training by synthesizing virtual testing domains within each mini-batch. The meta-optimization objective requires that steps to improve training domain performance should also improve testing domain performance. This meta-learning procedure trains models with good generalization ability to novel domains. We evaluate our method and achieve state-of-the-art results on a recent cross-domain image classification benchmark, as well as demonstrating its potential on two classic reinforcement learning tasks.
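As an illustration of the meta-optimization objective described above, the following minimal PyTorch sketch performs one MLDG-style update for a plain linear model; the inner/outer learning rates, the loss weighting and the use of vanilla SGD are illustrative assumptions rather than the authors' released implementation.

```python
import torch

def mldg_step(w, b, loss_fn, meta_train, meta_test,
              alpha=0.01, beta=1.0, lr=0.1):
    """One MLDG-style update for a simple linear model (w, b).

    meta_train / meta_test are lists of (x, y) batches, each drawn from a
    different source domain; the held-out domains act as virtual test domains.
    """
    def predict(x, w_, b_):
        return x @ w_ + b_

    # Loss on the virtual training domains.
    train_loss = sum(loss_fn(predict(x, w, b), y) for x, y in meta_train)

    # Inner gradient step, kept differentiable so the outer objective sees it.
    gw, gb = torch.autograd.grad(train_loss, (w, b), create_graph=True)
    w_adapted, b_adapted = w - alpha * gw, b - alpha * gb

    # Loss on the virtual testing domains, evaluated at the adapted parameters.
    test_loss = sum(loss_fn(predict(x, w_adapted, b_adapted), y)
                    for x, y in meta_test)

    # Meta-objective: improving the training domains should also improve
    # the (virtual) testing domains after the inner step.
    total = train_loss + beta * test_loss
    gw, gb = torch.autograd.grad(total, (w, b))
    with torch.no_grad():
        w -= lr * gw
        b -= lr * gb
    return float(total)
```
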
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Li, Da , Yang, Yongxin , Song, Yi-Zhe & Hospedales, Timothy.
Number of pages: 8
Publication Date: 7 Nov 2017
Publication Information
Category: Conference contribution
Original Language: English
  Frankenstein: Learning Deep Face Representations using Small Data
Hu, G, Peng, X, Yang, Y, Hospedales, T & Verbeek, J 2017, 'Frankenstein: Learning Deep Face Representations using Small Data' IEEE Transactions on Image Processing. DOI: 10.1109/TIP.2017.2756450
Deep convolutional neural networks have recently proven extremely effective for difficult face recognition problems in uncontrolled settings. To train such networks, very large training sets are needed with millions of labeled images. For some applications, such as near-infrared (NIR) face recognition, such large training datasets are, however, not publicly available and very difficult to collect. In this work, we propose a method to generate very large training datasets of synthetic images by compositing real face images in a given dataset. We show that this method enables learning models from as few as 10,000 training images, which perform on par with models trained from 500,000 images. Using our approach we also improve the state-of-the-art results on the CASIA NIR-VIS2.0 heterogeneous face recognition dataset.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Hu, Guosheng , Peng, Xiaojiang , Yang, Yongxin , Hospedales, Timothy & Verbeek, Jakob .
Number of pages: 11
Publication Date: 26 Sep 2017
Publication Information
Category: Article
Journal: IEEE Transactions on Image Processing
ISSN: 1057-7149
Original Language: English
DOIs: 10.1109/TIP.2017.2756450
  Synergistic Instance-Level Subspace Alignment for Fine-Grained Sketch-Based Image Retrieval
Li, K, Pang, K, Song, YZ, Hospedales, TM, Xiang, T & Zhang, H 2017, 'Synergistic Instance-Level Subspace Alignment for Fine-Grained Sketch-Based Image Retrieval' IEEE Transactions on Image Processing, vol 26, no. 12, pp. 5908 - 5921. DOI: 10.1109/TIP.2017.2745106
We study the problem of fine-grained sketch-based image retrieval. By performing instance-level (rather than category-level) retrieval, it embodies a timely and practical application, particularly with the ubiquitous availability of touchscreens. Three factors contribute to the challenging nature of the problem: (i) free-hand sketches are inherently abstract and iconic, making visual comparisons with photos difficult, (ii) sketches and photos are in two different visual domains, i.e. black and white lines vs. color pixels, and (iii) fine-grained distinctions are especially challenging when executed across domain and abstraction-level. To address these challenges, we propose to bridge the image-sketch gap both at the high-level via parts and attributes, as well as at the low-level, by introducing a new domain alignment method. More specifically, (i) we contribute a dataset with 304 photos and 912 sketches, where each sketch and image is annotated with its semantic parts and associated part-level attributes. With the help of this dataset, we investigate (ii) how strongly-supervised deformable part-based models can be learned that subsequently enable automatic detection of part-level attributes, and provide pose-aligned sketch-image comparisons. To reduce the sketch-image gap when comparing low-level features, we also (iii) propose a novel method for instance-level domain-alignment, that exploits both subspace and instance-level cues to better align the domains. Finally (iv) these are combined in a matching framework integrating aligned low-level features, mid-level geometric structure and high-level semantic attributes. Extensive experiments conducted on our new dataset demonstrate the effectiveness of the proposed method.
General Information
Organisations: Edinburgh College of Art.
Authors: Li, K., Pang, K., Song, Y. Z., Hospedales, T. M., Xiang, T. & Zhang, H..
Keywords: Bridges, Deformable models, Feature extraction, Footwear, Image retrieval, Semantics, Visualization, Cross-modal, Dataset, Fine-grained, Instance-level, Sketch-based Image Retrieval, Subspace alignment
Number of pages: 14
Pages: 5908 - 5921
Publication Date: 1 Dec 2017
Publication Information
Category: Article
Journal: IEEE Transactions on Image Processing
Volume: 26
Issue number: 12
ISSN: 1057-7149
Original Language: English
DOIs: 10.1109/TIP.2017.2745106
  Attribute-Enhanced Face Recognition with Neural Tensor Fusion Networks
Hu, G, Yang, H, Yuan, Y, Zhang, Z, Lu, Z, Mukherjee, SS, Hospedales, T, Robertson, NM & Yang, Y 2017, Attribute-Enhanced Face Recognition with Neural Tensor Fusion Networks. in The International Conference on Computer Vision (ICCV 2017).
Deep learning has achieved great success in face recognition; however, deep-learned features still have limited invariance to strong intra-personal variations such as large pose changes. It is observed that some facial attributes (e.g. eyebrow thickness, gender) are robust to such variations. We present the first work to systematically explore how the fusion of face recognition features (FRF) and facial attribute features (FAF) can enhance face recognition performance in various challenging scenarios. Despite the promise of FAF, we find that in practice existing fusion methods fail to leverage FAF to boost face recognition performance in some challenging scenarios. Thus, we develop a powerful tensor-based framework which formulates feature fusion as a tensor optimisation problem. It is nontrivial to directly optimise this tensor due to the large number of parameters to optimise. To solve this problem, we establish a theoretical equivalence between low-rank tensor optimisation and a two-stream gated neural network. This equivalence allows tractable learning using standard neural network optimisation tools, leading to accurate and stable optimisation. Experimental results show the fused feature works better than individual features, thus proving for the first time that facial attributes aid face recognition. We achieve state-of-the-art performance on three popular databases: MultiPIE (cross pose, lighting and expression), CASIA NIR-VIS2.0 (cross-modality environment) and LFW (uncontrolled environment).
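A minimal sketch of the kind of low-rank, two-stream gated fusion the abstract describes: each rank-1 component projects the face recognition and attribute streams and multiplies them elementwise, and the components are summed. The feature dimensions, rank and output size below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TwoStreamGatedFusion(nn.Module):
    """Illustrative low-rank bilinear fusion of two feature streams.

    Each rank-1 component projects both streams and multiplies them
    elementwise (the 'gating'); summing R such components approximates
    a low-rank fusion tensor.
    """
    def __init__(self, dim_frf=256, dim_faf=64, rank=16, out_dim=128):
        super().__init__()
        self.proj_frf = nn.Linear(dim_frf, rank * out_dim, bias=False)
        self.proj_faf = nn.Linear(dim_faf, rank * out_dim, bias=False)
        self.rank, self.out_dim = rank, out_dim

    def forward(self, frf, faf):
        a = self.proj_frf(frf).view(-1, self.rank, self.out_dim)
        b = self.proj_faf(faf).view(-1, self.rank, self.out_dim)
        return (a * b).sum(dim=1)   # sum of rank-1 stream interactions
```
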
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Hu, Guosheng , Yang, Hua, Yuan, Yang , Zhang, Zhihong , Lu, Zheng , Mukherjee, Sankha S. , Hospedales, Timothy, Robertson, Neil M & Yang, Yongxin .
Number of pages: 10
Publication Date: 17 Jul 2017
Publication Information
Category: Conference contribution
Original Language: English
  Deeper, Broader and Artier Domain Generalization
Li, D, Yang, Y, Song, Y-Z & Hospedales, T 2017, Deeper, Broader and Artier Domain Generalization. in The International Conference on Computer Vision (ICCV 2017).
The problem of domain generalization is to learn from multiple training domains, and extract a domain-agnostic model that can then be applied to an unseen domain. Domain generalization (DG) has a clear motivation in contexts where there are target domains with distinct characteristics, yet sparse data for training. For example, recognition in sketch images, which are distinctly more abstract and rarer than photos. Nevertheless, DG methods have primarily been evaluated on photo-only benchmarks focusing on alleviating the dataset bias, where both problems of domain distinctiveness and data sparsity can be minimal. We argue that these benchmarks are overly straightforward, and show that simple deep learning baselines perform surprisingly well on them. In this paper, we make two main contributions: Firstly, we build upon the favorable domain shift-robust properties of deep learning methods, and develop a low-rank parameterized CNN model for end-to-end DG learning. Secondly, we develop a DG benchmark dataset covering photo, sketch, cartoon and painting domains. This is both more practically relevant, and harder (bigger domain shift) than existing benchmarks. The results show that our method outperforms existing DG alternatives, and our dataset provides a more significant DG challenge to drive future research.

General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Li, Da , Yang, Yongxin , Song, Yi-Zhe & Hospedales, Timothy.
Number of pages: 9
Publication Date: 17 Jul 2017
Publication Information
Category: Conference contribution
Original Language: English
  Deep Spatial-Semantic Attention for Fine-Grained Sketch-Based Image Retrieval
Song, J, Yu, Q, Song, Y-Z, Xiang, T & Hospedales, T 2017, Deep Spatial-Semantic Attention for Fine-Grained Sketch-Based Image Retrieval. in The International Conference on Computer Vision (ICCV 2017).
Human sketches are unique in being able to capture both the spatial topology of a visual object, as well as its subtle appearance details. Fine-grained sketch-based image retrieval (FG-SBIR) importantly leverages such fine-grained characteristics of sketches to conduct instance-level retrieval of photos. Nevertheless, human sketches are often highly abstract and iconic, resulting in severe misalignments with candidate photos which in turn make subtle visual detail matching difficult. Existing FG-SBIR approaches focus only on coarse holistic matching via deep cross-domain representation learning, yet ignore explicitly accounting for fine-grained details and their spatial context. In this paper, a novel deep FG-SBIR model is proposed which differs significantly from the existing models in that: (1) It is spatially aware, achieved by introducing an attention module that is sensitive to the spatial position of visual details; (2) It combines coarse and fine semantic information via a shortcut connection fusion block; and (3) It models feature correlation and is robust to misalignments between the extracted features across the two domains by introducing a novel higher-order learnable energy function (HOLEF) based loss. Extensive experiments show that the proposed deep spatial-semantic attention model significantly outperforms the state-of-the-art.
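The spatial-attention idea in (1) above can be illustrated with a small module that scores every location of a convolutional feature map and pools features with the resulting attention weights; the paper's full architecture, shortcut fusion block and HOLEF loss are not reproduced here, and the channel size is an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttentionPool(nn.Module):
    """Illustrative spatially-aware pooling: a 1x1 conv scores every spatial
    location, a softmax over locations produces an attention map, and
    features are pooled as an attention-weighted sum."""
    def __init__(self, channels=512):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        attn = torch.softmax(self.score(feat).view(b, -1), dim=1)
        attn = attn.view(b, 1, h, w)
        attended = (feat * attn).sum(dim=(2, 3))  # (B, C) attention-weighted pool
        return attended, attn
```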

General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Song, Jifei, Yu, Qian, Song, Yi-Zhe, Xiang, Tao & Hospedales, Timothy.
Number of pages: 10
Publication Date: 17 Jul 2017
Publication Information
Category: Conference contribution
Original Language: English
  Cross-domain Generative Learning for Fine-Grained Sketch-Based Image Retrieval
Pang, K, Song, Y-Z, Xiang, T & Hospedales, T 2017, Cross-domain Generative Learning for Fine-Grained Sketch-Based Image Retrieval. in The British Machine Vision Conference (BMVC 2017).
The key challenge for learning a fine-grained sketch-based image retrieval (FG-SBIR) model is to bridge the domain gap between photo and sketch. Existing models learn a deep joint embedding space with discriminative losses where a photo and a sketch can be compared. In this paper, we propose a novel discriminative-generative hybrid model by introducing a generative task of cross-domain image synthesis. This task enforces the learned embedding space to preserve all the domain invariant information that is useful for cross-domain reconstruction, thus explicitly reducing the domain gap as opposed to existing models. Extensive experiments on the largest FG-SBIR dataset Sketchy [19] show that the proposed model significantly outperforms state-of-the-art discriminative FG-SBIR models.
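A hedged sketch of the discriminative-generative hybrid objective described above: a triplet ranking loss in the joint embedding plus a cross-domain reconstruction loss that decodes the sketch embedding back into its paired photo. The encoder/decoder callables, margin and weighting are placeholders, not the paper's exact architecture.

```python
import torch.nn.functional as F

def hybrid_fg_sbir_loss(enc_sketch, enc_photo, dec_photo,
                        sketch, photo_pos, photo_neg,
                        margin=0.3, rec_weight=1.0):
    """Illustrative hybrid loss: discriminative triplet ranking plus a
    generative cross-domain reconstruction term."""
    s = enc_sketch(sketch)
    p_pos, p_neg = enc_photo(photo_pos), enc_photo(photo_neg)
    rank = F.relu(F.pairwise_distance(s, p_pos)
                  - F.pairwise_distance(s, p_neg) + margin).mean()
    recon = F.mse_loss(dec_photo(s), photo_pos)   # cross-domain synthesis task
    return rank + rec_weight * recon
```
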
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Pang, Kaiyue , Song, Yi-Zhe , Xiang, Tao & Hospedales, Timothy.
Number of pages: 12
Publication Date: 4 Jul 2017
Publication Information
Category: Conference contribution
Original Language: English
  Fine-Grained Image Retrieval: the Text/Sketch Input Dilemma
Song, J, Song, Y-Z, Xiang, T & Hospedales, T 2017, Fine-Grained Image Retrieval: the Text/Sketch Input Dilemma. in The British Machine Vision Conference (BMVC 2017).
Fine-grained image retrieval (FGIR) enables a user to search for a photo of an object instance based on a mental picture. Depending on how the object is described by the user, two general approaches exist: sketch-based FGIR or text-based FGIR, each of which has its own pros and cons. However, no attempt has been made to systematically investigate how informative each of these two input modalities is, and more importantly whether they are complementary to each other and thus should be modelled jointly. In this work, for the first time we introduce a multi-modal FGIR dataset with both sketches and sentence descriptions provided as query modalities. A multi-modal quadruplet deep network is formulated to jointly model the sketch and text input modalities as well as the photo output modality. We show that on its own the sketch modality is much more informative than text and each modality can benefit the other when they are modelled jointly.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Song, Jifei, Song, Yi-Zhe , Xiang, Tao & Hospedales, Timothy.
Number of pages: 12
Publication Date: 17 Jul 2017
Publication Information
Category: Conference contribution
Original Language: English
  Now You See Me: Deep Face Hallucination for Unviewed Sketches
Hu, C, Li, D, Song, Y-Z & Hospedales, TM 2017, Now You See Me: Deep Face Hallucination for Unviewed Sketches. in The British Machine Vision Conference (BMVC 2017).
Face hallucination has been well studied in the last decade because of its useful applications in law enforcement and entertainment. Promising results on the problem of sketch-photo face hallucination have been achieved with classic, and increasingly deep learning-based methods. However, synthesized photos still lack the crisp fidelity of real photos. More importantly, good results have primarily been demonstrated on very constrained datasets where the style variability is very low, and crucially the sketches are perfectly align-able traces of the ground-truth photos. However, realistic applications in entertainment or law enforcement require working with more unconstrained sketches drawn from memory or description, which are not rigidly align-able. In this paper, we develop a new deep learning approach to address these settings. Our image-image regression network is trained with a combination of content and adversarial losses to generate crisp photorealistic images, and it contains an integrated spatial transformer network to deal with non-rigid alignment between the domains. We evaluate face synthesis on classic constrained, as well as unviewed, benchmarks, namely CUHK, MGDB, and FSMD. The results qualitatively and quantitatively outperform existing approaches.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Hu, Conghui , Li, Da , Song, Yi-Zhe & Hospedales, Timothy M..
Number of pages: 11
Publication Date: 4 Jul 2017
Publication Information
Category: Conference contribution
Original Language: English
  Transferring CNNs to Multi-instance Multi-label Classification on Small Datasets
Dong, M, Pang, K, Wu, Y, Xue, J-H, Hospedales, T & Ogasawara, T 2017, Transferring CNNs to Multi-instance Multi-label Classification on Small Datasets. in International Conference on Image Processing (ICIP 17).
Image tagging is a well-known challenge in image processing. It is typically addressed through multi-instance multi-label (MIML) classification methodologies. Convolutional Neural Networks (CNNs) possess great potential to perform well on MIML tasks, since multi-level convolution and max pooling coincide with the multi-instance setting and the sharing of hidden representation may benefit multi-label modeling. However, CNNs usually require a large amount of carefully labeled data for training, which is hard to obtain in many real applications. In this paper, we propose a new approach for transferring pre-trained deep networks such as VGG16 on Imagenet to small MIML tasks. We extract features from each group of the network layers and apply multiple binary classifiers to them for multi-label prediction. Moreover, we adopt an L1-norm regularized Logistic Regression (L1LR) to find the most effective features for learning the multi-label classifiers. The experimental results on the two most widely used and relatively small benchmark MIML image datasets demonstrate that the proposed approach can substantially outperform the state-of-the-art algorithms, in terms of all popular performance metrics.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Dong, Mingzhi , Pang, Kunkun, Wu, Yang, Xue, Jing-Hao, Hospedales, Timothy & Ogasawara, Tsukasa.
Number of pages: 5
Publication Date: 9 May 2017
Publication Information
Category: Conference contribution
Original Language: English
  A Dataset for Persistent Multi-Target Multi-Camera Tracking in RGB-D
Layne, R, Hannuna, S, Camplani, M, Hall, J, Hospedales, T, Xiang, T, Mirmehdi, M & Damen, D 2017, A Dataset for Persistent Multi-Target Multi-Camera Tracking in RGB-D. in CVPR workshop on Target Re-Identification and Multi-Target Multi-Camera Tracking 2017.
Video surveillance systems are now widely deployed to improve our lives by enhancing safety, security, health monitoring and business intelligence. This has motivated extensive research into automated video analysis. Nevertheless, there is a gap between the focus of contemporary research, and the needs of end users of video surveillance systems. Many existing benchmarks and methodologies focus on narrowly defined problems in detection, tracking, re-identification or recognition. In contrast, end users face higher-level problems such as long-term monitoring of identities in order to build a picture of a person’s activity across the course of a day, producing usage statistics of a particular area of space, and requiring that these capabilities be robust to challenges such as a change of clothing. To achieve this effectively requires less widely studied capabilities such as spatio-temporal reasoning about people identities and locations within a space partially observed by multiple cameras over an extended time period. To bridge this gap between research and required capabilities, we propose a new dataset, LIMA, that encompasses the challenges of monitoring a typical home / office environment. LIMA contains 4.5 hours of RGB-D video from three cameras monitoring a four-room house. To reflect the challenges of a realistic practical application, the dataset includes clothes changes and visitors to ensure the global reasoning is a realistic open-set problem. In addition to raw data, we provide identity annotation for benchmarking, and tracking results from a contemporary RGB-D tracker – thus allowing focus on the higher-level monitoring problems.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Layne, Ryan, Hannuna, Sion , Camplani, Massimo , Hall, Jake , Hospedales, Timothy, Xiang, Tao , Mirmehdi, Majid & Damen, Dima .
Number of pages: 9
Publication Date: 8 May 2017
Publication Information
Category: Conference contribution
Original Language: English
  Tensor Based Knowledge Transfer Across Skill Categories for Robot Control
Zhao, C, Hospedales, T, Stulp, F & Sigaud, O 2017, Tensor Based Knowledge Transfer Across Skill Categories for Robot Control. in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17). IJCAI Inc, pp. 3462-3468. DOI: 10.24963/ijcai.2017/484
Advances in hardware and learning for control are enabling robots to perform increasingly dextrous and dynamic control tasks. These skills typically require a prohibitive amount of exploration for reinforcement learning, and so are commonly achieved by imitation learning from manual demonstration. The costly non-scalable nature of manual demonstration has motivated work into skill generalisation, e.g., through contextual policies and options. Despite good results, existing work along these lines is limited to generalising across variants of one skill such as throwing an object to different locations. In this paper we go significantly further and investigate generalisation across qualitatively different classes of control skills. In particular, we introduce a class of neural network controllers that can realise four distinct skill classes: reaching, object throwing, casting, and ball-in-cup. By factorising the weights of the neural network, we are able to extract transferrable latent skills that enable dramatic acceleration of learning in cross-task transfer. With a suitable curriculum, this allows us to learn challenging dextrous control tasks like ball-in-cup from scratch with pure reinforcement learning.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Zhao, Chenyang, Hospedales, Timothy, Stulp, Freek & Sigaud, Olivier.
Number of pages: 7
Pages: 3462-3468
Publication Date: Aug 2017
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.24963/ijcai.2017/484
  Semantic Regularisation for Recurrent Image Annotation
Liu, F, Xiang, T, Hospedales, T, Yang, W & Sun, C 2017, Semantic Regularisation for Recurrent Image Annotation. in Computer Vision and Pattern Recognition (CVPR 2017).
The “CNN-RNN” design pattern is increasingly widely applied in a variety of image annotation tasks including multi-label classification and captioning. Existing models use the weakly semantic CNN hidden layer or its transform as the image embedding that provides the interface between the CNN and RNN. This leaves the RNN overstretched with two jobs: predicting the visual concepts and modelling their correlations for generating structured annotation output. Importantly, this makes the end-to-end training of the CNN and RNN slow and ineffective due to the difficulty of back-propagating gradients through the RNN to train the CNN. We propose a simple modification to the design pattern that makes learning more effective and efficient. Specifically, we propose to use a semantically regularised embedding layer as the interface between the CNN and RNN. Regularising the interface can partially or completely decouple the learning problems, allowing each to be trained more effectively and making joint training much more efficient. Extensive experiments show that state-of-the-art performance is achieved on multi-label classification as well as image captioning.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Liu, F., Xiang, T. , Hospedales, Timothy, Yang, W. & Sun, C..
Number of pages: 12
Publication Date: 27 Feb 2017
Publication Information
Category: Conference contribution
Original Language: English
  Deep Multi-task Representation Learning: A Tensor Factorisation Approach
Yang, Y & Hospedales, T 2017, Deep Multi-task Representation Learning: A Tensor Factorisation Approach. in International Conference on Learning Representations (ICLR 2017).
Most contemporary multi-task learning methods assume linear models. This setting is considered shallow in the era of deep learning. In this paper, we present a new deep multi-task representation learning framework that learns cross-task sharing structure at every layer in a deep network. Our approach is based on generalising the matrix factorisation techniques explicitly or implicitly used by many conventional MTL algorithms to tensor factorisation, to realise automatic learning of end-to-end knowledge sharing in deep networks. This is in contrast to existing deep learning approaches that need a user-defined multi-task sharing strategy. Our approach applies to both homogeneous and heterogeneous MTL. Experiments demonstrate the efficacy of our deep multi-task representation learning in terms of both higher accuracy and fewer design choices.
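One way to picture the cross-task sharing described above is a layer whose per-task weight matrix is generated from shared low-rank factors, with only a small task-specific scaling vector differing between tasks. This is a simplified sketch, not the paper's general tensor factorisation; the rank and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FactorisedMultiTaskLinear(nn.Module):
    """Illustrative shared layer for T tasks: task t's weight is
    U @ diag(s_t) @ V, so U and V are shared across tasks and only the
    scaling vector s_t is task-specific."""
    def __init__(self, in_dim, out_dim, num_tasks, rank=8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_dim, rank) * 0.01)
        self.V = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.task_scales = nn.Parameter(torch.ones(num_tasks, rank))

    def forward(self, x, task_id):
        # Assemble the task-specific weight from the shared factors.
        W = self.U @ torch.diag(self.task_scales[task_id]) @ self.V
        return x @ W.t()
```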

General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Yang, Yongxin & Hospedales, Timothy.
Number of pages: 12
Publication Date: 6 Feb 2017
Publication Information
Category: Conference contribution
Original Language: English
  Gated Neural Networks for Option Pricing: Rationality by Design
Yang, Y, Zheng, Y & Hospedales, T 2017, Gated Neural Networks for Option Pricing: Rationality by Design. in The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). pp. 52-58.
We propose a neural network approach to price European call options that significantly outperforms some existing pricing models and comes with guarantees that its predictions are economically reasonable. To achieve this, we introduce a class of gated neural networks that automatically learn to divide-and-conquer the problem space for robust and accurate pricing. We then derive instantiations of these networks that are ‘rational by design’ in terms of naturally encoding a valid call option surface that enforces no-arbitrage principles. This integration of human insight within data-driven learning provides significantly better generalisation in pricing performance due to the encoded inductive bias in the learning, guarantees sanity in the model’s predictions, and provides an econometrically useful byproduct such as the risk-neutral density.
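As a toy illustration of building rationality constraints into the architecture, the sketch below keeps the output weights non-negative (via a softplus reparameterisation) and uses hidden units that are convex in moneyness, so the predicted price is convex in moneyness. This captures only one of the no-arbitrage conditions and is not the paper's full gated network; the hidden size is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvexPriceNet(nn.Module):
    """Toy 'rational by design' pricing net: each hidden unit is a softplus
    of an affine map of moneyness (hence convex in moneyness), and the
    output weights are constrained non-negative, so the predicted price is
    convex in moneyness by construction."""
    def __init__(self, hidden=32):
        super().__init__()
        self.a = nn.Parameter(torch.randn(hidden))      # slope w.r.t. moneyness
        self.c = nn.Parameter(torch.zeros(hidden))      # bias
        self.w_raw = nn.Parameter(torch.zeros(hidden))  # softplus -> non-negative

    def forward(self, moneyness):                       # moneyness: (B, 1)
        h = F.softplus(moneyness * self.a + self.c)     # convex in moneyness
        w = F.softplus(self.w_raw)                      # non-negative mixing weights
        return (h * w).sum(dim=1, keepdim=True)         # (B, 1) predicted price
```
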
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Yang, Yongxin , Zheng, Yu & Hospedales, Timothy.
Number of pages: 7
Pages: 52-58
Publication Date: 10 Feb 2017
Publication Information
Category: Conference contribution
Original Language: English
  Discovery of Shared Semantic Spaces for Multi-Scene Video Query and Summarization
Xu, X, Hospedales, TM & Gong, S 2017, 'Discovery of Shared Semantic Spaces for Multi-Scene Video Query and Summarization' IEEE Transactions on Circuits and Systems for Video Technology, vol 27, no. 6, 7422088, pp. 1353-1367. DOI: 10.1109/TCSVT.2016.2532719
The growing rate of public space CCTV installations has generated a need for automated methods for exploiting video surveillance data including scene understanding, query, behaviour annotation and summarization. For this reason, extensive research has been performed on surveillance scene understanding and analysis. However, most studies have considered single scenes, or groups of adjacent scenes. The semantic similarity between different but related scenes (e.g., many different traffic scenes of similar layout) is not generally exploited to improve any automated surveillance tasks and reduce manual effort. Exploiting commonality, and sharing any supervised annotations, between different scenes is however challenging due to: Some scenes are totally unrelated – and thus any information sharing between them would be detrimental; while others may only share a subset of common activities – and thus information sharing is only useful if it is selective. Moreover, semantically similar activities which should be modelled together and shared across scenes may have quite different pixel-level appearance in each scene. To address these issues we develop a new framework for distributed multiple-scene global understanding that clusters surveillance scenes by their ability to explain each other’s behaviours; and further discovers which subset of activities are shared versus scene-specific within each cluster. We show how to use this structured representation of multiple scenes to improve common surveillance tasks including scene activity understanding, cross-scene query-by-example, behaviour classification with reduced supervised labelling requirements, and video summarization. In each case we demonstrate how our multi-scene model improves on a collection of standard single scene models and a flat model of all scenes.
General Information
Organisations: School of Informatics.
Authors: Xu, Xun, Hospedales, Timothy M. & Gong, Shaogang.
Keywords: Scene understanding, transfer learning, video summarization, visual surveillance
Number of pages: 15
Pages: 1353-1367
Publication Date: 1 Jun 2017
Publication Information
Category: Article
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 27
Issue number: 6
ISSN: 1051-8215
Original Language: English
DOIs: 10.1109/TCSVT.2016.2532719
  Transductive Zero-Shot Action Recognition by Word-Vector Embedding
Xu, X, Hospedales, T & Gong, S 2017, 'Transductive Zero-Shot Action Recognition by Word-Vector Embedding' International Journal of Computer Vision, vol 123, no. 3, pp. 309-333. DOI: 10.1007/s11263-016-0983-5
The number of categories for action recognition is growing rapidly and it has become increasingly hard to label sufficient training data for learning conventional models for all categories. Instead of collecting ever more data and labelling them exhaustively for all categories, an attractive alternative approach is “zero-shot learning” (ZSL). To that end, in this study we construct a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data. Existing ZSL studies focus primarily on still images, and attribute-based semantic representations. In this work, we explore word-vectors as the shared semantic space to embed videos and category labels for ZSL action recognition. This is a more challenging problem than existing ZSL of still images and/or attributes, because the mapping between video space-time features of actions and the semantic space is more complex and harder to learn for the purpose of generalising over any cross-category domain shift. To solve this generalisation problem in ZSL action recognition, we investigate a series of synergistic strategies to improve upon the standard ZSL pipeline. Most of these strategies are transductive in nature, which means they access testing data in the training phase. First, we enhance significantly the semantic space mapping by proposing manifold-regularized regression and data augmentation strategies. Second, we evaluate two existing post-processing strategies (transductive self-training and hubness correction), and show that they are complementary. We evaluate extensively our model on a wide range of human action datasets including HMDB51, UCF101, OlympicSports and event datasets including CCV and TRECVID MED 13. The results demonstrate that our approach achieves the state-of-the-art zero-shot action recognition performance with a simple and efficient pipeline, and without supervised annotation of attributes. Finally, we present in-depth analysis into why and when zero-shot works, including demonstrating the ability to predict cross-category transferability in advance.
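A minimal sketch of the basic visual-to-word-vector pipeline that the paper builds on: ridge regression from video features to the word vectors of seen classes, followed by nearest-neighbour matching against unseen class word vectors. The manifold regularisation, data augmentation and transductive post-processing discussed above are omitted, and the regularisation weight is an assumption.

```python
import numpy as np

def zsl_word_vector_classifier(X_train, y_train_vecs, X_test, candidate_vecs, lam=1.0):
    """Illustrative zero-shot pipeline: ridge-regress visual features onto the
    word vectors of their (seen) class labels, then label test videos by the
    nearest candidate (unseen) class word vector under cosine similarity."""
    d = X_train.shape[1]
    # W maps visual features -> word-vector space.
    W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d),
                        X_train.T @ y_train_vecs)
    Z = X_test @ W                                   # embed test videos
    Z /= np.linalg.norm(Z, axis=1, keepdims=True) + 1e-8
    C = candidate_vecs / (np.linalg.norm(candidate_vecs, axis=1, keepdims=True) + 1e-8)
    return (Z @ C.T).argmax(axis=1)                  # nearest word vector per video
```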

General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Xu, Xun , Hospedales, Timothy & Gong, Shaogang .
Number of pages: 25
Pages: 309-333
Publication Date: Jul 2017
Publication Information
Category: Article
Journal: International Journal of Computer Vision
Volume: 123
Issue number: 3
ISSN: 0920-5691
Original Language: English
DOIs: 10.1007/s11263-016-0983-5
  Zero-Shot Crowd Behavior Recognition
Xu, X, Gong, S & Hospedales, TM 2017, Zero-Shot Crowd Behavior Recognition. in Group and Crowd Behavior for Computer Vision. Academic Press, pp. 341-369. DOI: 10.1016/B978-0-12-809276-7.00018-7
Understanding crowd behaviour in video is challenging for computer vision. There have been increasing attempts at modelling crowded scenes by introducing ever larger property ontologies (attributes) and annotating ever larger training datasets. However, in contrast to still images, manually annotating video attributes needs to consider spatio-temporal evolution which is inherently much harder and more costly. Critically, the most interesting crowd behaviours captured in surveillance videos (e.g. street fighting, flash mobs) are either rare, thus have few examples for model training, or unseen previously. Existing crowd analysis techniques are not readily scalable to recognise novel (unseen) crowd behaviours. To address this problem, we investigate and develop methods for recognising visual crowd behavioural attributes without any training samples, i.e. zero-shot learning crowd behaviour recognition. To that end, we relax the common assumption that each individual crowd video instance is only associated with a single crowd attribute. Instead, our model learns to jointly recognise multiple crowd behavioural attributes in each video instance by exploring multi-attribute co-occurrence as contextual knowledge for optimising individual crowd attribute recognition. Joint multi-label attribute prediction in zero-shot learning is inherently non-trivial because co-occurrence statistics does not exist for unseen attributes. To solve this problem, we learn to predict cross-attribute co-occurrence from both online text corpus and multi-label annotation of videos with known attributes. Our experiments show that this approach to modelling multi-attribute context not only improves zero-shot crowd behaviour recognition on the WWW crowd video dataset, but also generalises to novel behaviour (violence) detection cross-domain in the Violence Flow video dataset.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Xu, X., Gong, S. & Hospedales, T. M..
Number of pages: 29
Pages: 341-369
Publication Date: 2017
Publication Information
Category: Chapter
Original Language: English
DOIs: 10.1016/B978-0-12-809276-7.00018-7
  Sketch-a-Net: A Deep Neural Network that Beats Humans
Yu, Q, Yang, Y, Liu, F, Song, Y-Z, Xiang, T & Hospedales, T 2017, 'Sketch-a-Net: A Deep Neural Network that Beats Humans' International Journal of Computer Vision, vol 122, no. 3, pp. 411–425. DOI: 10.1007/s11263-016-0932-3
We propose a deep learning approach to free-hand sketch recognition that achieves state-of-the-art performance, significantly surpassing that of humans. Our superior performance is a result of modelling and exploiting the unique characteristics of free-hand sketches, i.e., consisting of an ordered set of strokes but lacking visual cues such as colour and texture, being highly iconic and abstract, and exhibiting extremely large appearance variations due to different levels of abstraction and deformation. Specifically, our deep neural network, termed Sketch-a-Net has the following novel components: (i) we propose a network architecture designed for sketch rather than natural photo statistics. (ii) Two novel data augmentation strategies are developed which exploit the unique sketch-domain properties to modify and synthesise sketch training data at multiple abstraction levels. Based on this idea we are able to both significantly increase the volume and diversity of sketches for training, and address the challenge of varying levels of sketching detail commonplace in free-hand sketches. (iii) We explore different network ensemble fusion strategies, including a re-purposed joint Bayesian scheme, to further improve recognition performance. We show that state-of-the-art deep networks specifically engineered for photos of natural objects fail to perform well on sketch recognition, regardless of whether they are trained using photos or sketches. Furthermore, through visualising the learned filters, we offer useful insights into where the superior performance of our network comes from.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Yu, Qian , Yang, Yongxin , Liu, Feng , Song, Yi-Zhe , Xiang, Tao & Hospedales, Timothy.
Number of pages: 15
Pages: 411–425
Publication Date: May 2017
Publication Information
Category: Article
Journal: International Journal of Computer Vision
Volume: 122
Issue number: 3
ISSN: 0920-5691
Original Language: English
DOIs: 10.1007/s11263-016-0932-3
  Free-hand Sketch Synthesis with Deformable Stroke Models
Li, Y, Song, Y-Z, Hospedales, T & Gong, S 2017, 'Free-hand Sketch Synthesis with Deformable Stroke Models' International Journal of Computer Vision, pp. 1-22. DOI: 10.1007/s11263-016-0963-9
We present a generative model which can automatically summarize the stroke composition of freehand sketches of a given category. When our model is fit to a collection of sketches with similar poses, it discovers and learns the structure and appearance of a set of coherent parts, with each part represented by a group of strokes. It represents both consistent (topology) as well as diverse aspects (structure and appearance variations) of each sketch category. Key to the success of our model are important insights learned from a comprehensive study performed on human stroke data. By fitting this model to images, we are able to synthesize visually similar and pleasant free-hand sketches.

General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Li, Yi, Song, Yi-Zhe , Hospedales, Timothy & Gong, Shaogang .
Number of pages: 23
Pages: 1-22
Publication Date: Mar 2017
Publication Information
Category: Article
Journal: International Journal of Computer Vision
ISSN: 0920-5691
Original Language: English
DOIs: 10.1007/s11263-016-0963-9
2016
  Emerging Topics in Learning from Noisy and Missing Data
Alameda-Pineda, X, Hospedales, TM, Ricci, E, Sebe, N & Wang, X 2016, Emerging Topics in Learning from Noisy and Missing Data. in Proceedings of the 2016 ACM on Multimedia Conference. MM '16, ACM, New York, NY, USA, pp. 1469-1470. DOI: 10.1145/2964284.2986910
While vital for handling most multimedia and computer vision problems, collecting large scale fully annotated datasets is a resource-consuming, often unaffordable task. Indeed, on the one hand datasets need to be large and varied enough so that learning strategies can successfully exploit the variability inherently present in real data, but on the other hand they should be small enough so that they can be fully annotated at a reasonable cost. With the overwhelming success of (deep) learning methods, the traditional problem of balancing between dataset dimensions and resources needed for annotations became a full-fledged dilemma. In this context, methodological approaches able to deal with partially described data sets represent a one-of-a-kind opportunity to find the right balance between data variability and resource-consumption in annotation. These include methods able to deal with noisy, weak or partial annotations. In this tutorial we will present several recent methodologies addressing different visual tasks under the assumption of noisy, weakly annotated data sets.
General Information
Organisations: School of Informatics.
Authors: Alameda-Pineda, Xavier, Hospedales, Timothy M., Ricci, Elisa, Sebe, Nicu & Wang, Xiaogang.
Number of pages: 2
Pages: 1469-1470
Publication Date: Oct 2016
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1145/2964284.2986910
  A survey on heterogeneous face recognition: Sketch, infra-red, 3D and low-resolution
Ouyang, S, Hospedales, T, Song, Y-Z, Li, X, Loy, CC & Wang, X 2016, 'A survey on heterogeneous face recognition: Sketch, infra-red, 3D and low-resolution' Image and vision computing, vol 56, pp. 28-48. DOI: 10.1016/j.imavis.2016.09.001
Heterogeneous face recognition (HFR) refers to matching face imagery across different domains. It has received much interest from the research community as a result of its profound implications in law enforcement. A wide variety of new invariant features, cross-modality matching models and heterogeneous datasets are being established in recent years. This survey provides a comprehensive review of established techniques and recent developments in HFR. Moreover, we offer a detailed account of datasets and benchmarks commonly used for evaluation. We finish by assessing the state of the field and discussing promising directions for future research.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Ouyang, Shuxin , Hospedales, Timothy, Song, Yi-Zhe , Li, Xueming , Loy, Chen Change & Wang, Xiaogang.
Number of pages: 21
Pages: 28-48
Publication Date: Dec 2016
Publication Information
Category: Review article
Journal: Image and vision computing
Volume: 56
ISSN: 0262-8856
Original Language: English
DOIs: 10.1016/j.imavis.2016.09.001
  Towards Bottom-Up Analysis of Social Food
Rich, J, Haddad, H & Hospedales, T 2016, Towards Bottom-Up Analysis of Social Food. in DH '16 Proceedings of the 6th International Conference on Digital Health Conference. ACM, pp. 10. DOI: 10.1145/2896338.2897734
Social media provide a wealth of information for research into public health by providing a rich mix of personal data, location, hashtags, and social network information. Among these, Instagram has recently been the subject of many computational social science studies. However, despite Instagram’s focus on image sharing, most studies have exclusively focused on the hashtag and social network structure. In this paper we perform the first large-scale content analysis of Instagram posts, addressing both the image and the associated hashtags, aiming to understand the content of partially labelled images taken in-the-wild and the relationship with hashtags that individuals use as noisy labels. In particular, we explore the possibility of learning to recognise food image content in a data driven way, discovering both the categories of food, and how to recognise them, purely from social network data. Notably, we demonstrate that our approach to food recognition can often achieve accuracies greater than 70% in recognising popular food-related image categories, despite using no manual annotation. We highlight the current capabilities and future challenges and opportunities for such data-driven analysis of image content and the relation to hashtags.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Rich, Jaclyn , Haddad, Hamed & Hospedales, Timothy.
Number of pages: 111
Pages: 10
Publication Date: 13 Apr 2016
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1145/2896338.2897734
  L1 Graph Based Sparse Model for Label De-noising
Chang, X, Xiang, T & Hospedales, T 2016, L1 Graph Based Sparse Model for Label De-noising. in British Machine Vision Conference (BMVC 2016, Oral). pp. 1-12.
The abundant images and user-provided tags available on social media websites provide an intriguing opportunity to scale vision problems beyond the limits imposed by manual dataset collection and annotation. However, exploiting user-tagged data in practice is challenging since it contains many noisy (incorrect and missing) labels. In this work, we propose a novel robust graph-based approach for label de-noising. Specifically, the proposed model is built upon (i) label smoothing via a visual similarity graph in the form of an L1 graph regulariser, which is more robust against visual outliers than the conventional L2 regulariser, and (ii) explicitly modelling the label noise pattern, which helps to further improve de-noising performance. An efficient algorithm is formulated to optimise the proposed model, which contains multiple robust L1 terms in its objective function and is thus non-trivial to optimise. We demonstrate our model’s superior denoising performance across the spectrum of problems from multi-class with label noise to real social media data with more complex multi-label structured label noise patterns.
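The components of the objective described above can be written down directly; the sketch below just evaluates an illustrative version of it (data fidelity, L1 graph smoothing, sparse noise), leaving out the paper's exact formulation and its efficient optimiser. The weights alpha and beta are assumptions.

```python
import numpy as np

def l1_graph_denoise_objective(Z, Y, E, W, alpha=1.0, beta=1.0):
    """Illustrative objective for graph-based label de-noising:
       data term      ||Y - Z - E||_F^2        (observed tags = clean labels + noise)
       graph term     sum_ij W_ij |z_i - z_j|  (L1 smoothing over the visual graph)
       noise term     ||E||_1                  (label noise assumed sparse)
    Z, Y, E are (n_images, n_labels) arrays; W is an (n_images, n_images)
    visual-similarity graph."""
    data = np.sum((Y - Z - E) ** 2)
    diffs = np.abs(Z[:, None, :] - Z[None, :, :]).sum(axis=2)  # pairwise L1 distances
    graph = np.sum(W * diffs)
    noise = np.abs(E).sum()
    return data + alpha * graph + beta * noise
```
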
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Chang, Xiaobin , Xiang, Tao & Hospedales, Timothy.
Number of pages: 12
Pages: 1-12
Publication Date: 2016
Publication Information
Category: Conference contribution
Original Language: English
  Weakly-Supervised Image Annotation and Segmentation with Objects and Attributes
Shi, Z, Yang, Y, Hospedales, T & Xiang, T 2016, 'Weakly-Supervised Image Annotation and Segmentation with Objects and Attributes' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol PP, no. 99. DOI: 10.1109/TPAMI.2016.2645157
We propose to model complex visual scenes using a non-parametric Bayesian model learned from weakly labelled images abundant on media sharing sites such as Flickr. Given weak image-level annotations of objects and attributes without locations or associations between them, our model aims to learn the appearance of object and attribute classes as well as their association on each object instance. Once learned, given an image, our model can be deployed to tackle a number of vision problems in a joint and coherent manner, including recognising objects in the scene (automatic object annotation), describing objects using their attributes (attribute prediction and association), and localising and delineating the objects (object detection and semantic segmentation). This is achieved by developing a novel Weakly Supervised Markov Random Field Stacked Indian Buffet Process (WS-MRF-SIBP) that models objects and attributes as latent factors and explicitly captures their correlations within and across superpixels. Extensive experiments on benchmark datasets demonstrate that our weakly supervised model significantly outperforms weakly supervised alternatives and is often comparable with existing strongly supervised models on a variety of tasks including semantic segmentation, automatic image annotation and retrieval based on object-attribute associations.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Shi, Z., Yang, Y., Hospedales, T. & Xiang, T..
Number of pages: 14
Publication Date: 26 Dec 2016
Publication Information
Category: Article
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: PP
Issue number: 99
ISSN: 0162-8828
Original Language: English
DOIs: 10.1109/TPAMI.2016.2645157
  Deep Multi-task Attribute-driven Ranking for Fine-grained Sketch-based Image Retrieval
Song, J, Song, Y-Z, Xiang, T, Hospedales, T & Ruan, X 2016, Deep Multi-task Attribute-driven Ranking for Fine-grained Sketch-based Image Retrieval. in British Machine Vision Conference (BMVC 2016, Oral).
Fine-grained sketch-based image retrieval (SBIR) aims to go beyond conventional SBIR to perform instance-level cross-domain retrieval: finding the specific photo that matches an input sketch. Existing methods focus on designing/learning good features for cross-domain matching and/or learning cross-domain matching functions. However, they neglect the semantic aspect of retrieval, i.e., what meaningful object properties does a user try to encode in her/his sketch? We propose a fine-grained SBIR model that exploits semantic attributes and deep feature learning in a complementary way. Specifically, we perform multi-task deep learning with three objectives, including: retrieval by fine-grained ranking on a learned representation, attribute prediction, and attribute-level ranking. Simultaneously predicting semantic attributes and using such predictions in the ranking procedure help retrieval results to be more semantically relevant. Importantly, the introduction of semantic attribute learning in the model allows for the elimination of the otherwise prohibitive cost of human annotations required for training a fine-grained deep ranking model. Experimental results demonstrate that our method outperforms the state-of-the-art on challenging fine-grained SBIR benchmarks while requiring less annotation.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Song, Jifei, Song, Yi-Zhe , Xiang, Tao , Hospedales, Timothy & Ruan, Xiang .
Number of pages: 11
Publication Date: 2016
Publication Information
Category: Conference contribution
Original Language: English
  Robust Subjective Visual Property Prediction from Crowdsourced Pairwise Labels
Fu, Y, Hospedales, TM, Xiang, T, Xiong, J, Gong, S, Wang, Y & Yao, Y 2016, 'Robust Subjective Visual Property Prediction from Crowdsourced Pairwise Labels' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 38, no. 3, pp. 563-577. DOI: 10.1109/TPAMI.2015.2456887
The problem of estimating subjective visual properties from image and video has attracted increasing interest. A subjective visual property is useful either on its own (e.g. image and video interestingness) or as an intermediate representation for visual recognition (e.g. a relative attribute). Due to its ambiguous nature, annotating the value of a subjective visual property for learning a prediction model is challenging. To make the annotation more reliable, recent studies employ crowdsourcing tools to collect pairwise comparison labels. However, using crowdsourced data also introduces outliers. Existing methods rely on majority voting to prune the annotation outliers/errors. They thus require a large amount of pairwise labels to be collected. More importantly as a local outlier detection method, majority voting is ineffective in identifying outliers that can cause global ranking inconsistencies. In this paper, we propose a more principled way to identify annotation outliers by formulating the subjective visual property prediction task as a unified robust learning to rank problem, tackling both the outlier detection and learning to rank jointly. This differs from existing methods in that (1) the proposed method integrates local pairwise comparison labels together to minimise a cost that corresponds to global inconsistency of ranking order, and (2) the outlier detection and learning to rank problems are solved jointly. This not only leads to better detection of annotation outliers but also enables learning with extremely sparse annotations.
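A simplified sketch of the kind of unified objective described above: pairwise comparisons are explained by global score differences plus a sparse outlier term penalised with an L1 norm, so outlier detection and ranking are coupled in one problem. This is not the paper's exact formulation; the squared residual and the weight lam are assumptions.

```python
import numpy as np

def robust_rank_objective(scores, pairs, prefs, outliers, lam=1.0):
    """Illustrative robust learning-to-rank objective.

    scores   : (N,) global scores of the N items
    pairs    : (M, 2) int array of compared item indices (i, j)
    prefs    : (M,) observed preference strengths for each comparison
    outliers : (M,) per-comparison outlier variables (ideally sparse)
    """
    i, j = pairs[:, 0], pairs[:, 1]
    # Each comparison should be explained by the score difference plus an outlier.
    residual = prefs - (scores[i] - scores[j]) - outliers
    return np.sum(residual ** 2) + lam * np.abs(outliers).sum()
```
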
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Fu, Y., Hospedales, T. M., Xiang, T., Xiong, J., Gong, S., Wang, Y. & Yao, Y..
Number of pages: 15
Pages: 563-577
Publication Date: 1 Mar 2016
Publication Information
Category: Article
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 38
Issue number: 3
ISSN: 0162-8828
Original Language: English
DOIs: 10.1109/TPAMI.2015.2456887
  When and Where to Transfer for Bayes Net Parameter Learning
Zhou, Y, Hospedales, T & Fenton, N 2016, 'When and Where to Transfer for Bayes Net Parameter Learning' Expert Systems with Applications, vol 55, pp. 361-373. DOI: 10.1016/j.eswa.2016.02.011
Learning Bayesian networks from scarce data is a major challenge in real-world applications where data are hard to acquire. Transfer learning techniques attempt to address this by leveraging data from different but related problems. For example, it may be possible to exploit medical diagnosis data from a different country. A challenge with this approach is heterogeneous relatedness to the target, both within and across source networks. In this paper we introduce the Bayesian network parameter transfer learning (BNPTL) algorithm to reason about both network and fragment (sub-graph) relatedness. BNPTL addresses (i) how to find the most relevant source network and network fragments to transfer, and (ii) how to fuse source and target parameters in a robust way. In addition to improving target task performance, explicit reasoning allows us to diagnose network and fragment relatedness across Bayesian networks, even if latent variables are present, or if their state space is heterogeneous. This is important in some applications where relatedness itself is an output of interest. Experimental results demonstrate the superiority of BNPTL at various scarcities and source relevance levels compared to single task learning and other state-of-the-art parameter transfer methods. Moreover, we demonstrate successful application to real-world medical case studies.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Zhou, Yun , Hospedales, Timothy & Fenton, Norman .
Number of pages: 13
Pages: 361-373
Publication Date: 15 Aug 2016
Publication Information
Category: Article
Journal: Expert Systems with Applications
Volume: 55
ISSN: 0957-4174
Original Language: English
DOIs: 10.1016/j.eswa.2016.02.011
  Fine-grained sketch-based image retrieval: The role of part-aware attributes
Li, K, Pang, K, Song, Y-Z, Hospedales, T, Zhang, H & Hu, Y 2016, Fine-grained sketch-based image retrieval: The role of part-aware attributes. in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE. DOI: 10.1109/WACV.2016.7477615
We study the problem of fine-grained sketch-based image retrieval. By performing instance-level (rather than category-level) retrieval, it embodies a timely and practical application, particularly with the ubiquitous availability of touchscreens. Three factors contribute to the challenging nature of the problem: (i) free-hand sketches are inherently abstract and iconic, making visual comparisons with photos more difficult, (ii) sketches and photos are in two different visual domains, i.e. black and white lines vs. color pixels, and (iii) fine-grained distinctions are especially challenging when executed across domain and abstraction-level. To address this, we propose to detect visual attributes at part-level, in order to build a new representation that not only captures fine-grained characteristics but also traverses across visual domains. More specifically, (i) we propose a dataset with 304 photos and 912 sketches, where each sketch and photo is annotated with its semantic parts and associated part-level attributes, and with the help of this dataset, we investigate (ii) how strongly-supervised deformable part-based models can be learned that subsequently enable automatic detection of part-level attributes, and (iii) a novel matching framework that synergistically integrates low-level features, mid-level geometric structure and high-level semantic attributes to boost retrieval performance. Extensive experiments conducted on our new dataset demonstrate the value of the proposed method.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Li, Ki, Pang, Kaiyue , Song, Yi-Zhe , Hospedales, Timothy, Zhang, Honggang & Hu, Yichuan .
Number of pages: 9
Publication Date: 26 May 2016
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/WACV.2016.7477615
  ForgetMeNot: Memory-Aware Forensic Facial Sketch Matching
Ouyang, S, Hospedales, T, Song, Y-Z & Li, X 2016, ForgetMeNot: Memory-Aware Forensic Facial Sketch Matching. in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, pp. 5571-5579. DOI: 10.1109/CVPR.2016.601
We investigate whether it is possible to improve the performance of automated facial forensic sketch matching by learning from examples of facial forgetting over time. Forensic facial sketch recognition is a key capability for law enforcement, but remains an unsolved problem. It is extremely challenging because there are three distinct contributors to the domain gap between forensic sketches and photos: The well-studied sketch-photo modality gap, and the less studied gaps due to (i) the forgetting process of the eye-witness and (ii) their inability to elucidate their memory. In this paper, we address the memory problem head-on by introducing a database of 400 forensic sketches created at different time-delays. Based on this database we build a model to reverse the forgetting process. Surprisingly, we show that it is possible to systematically “un-forget” facial details. Moreover, it is possible to apply this model to dramatically improve forensic sketch recognition in practice: we achieve state-of-the-art results when matching 195 benchmark forensic sketches against corresponding photos and a 10,030 mugshot database.

General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Ouyang, Shuxin , Hospedales, Timothy, Song, Yi-Zhe & Li, Xueming .
Number of pages: 9
Pages: 5571-5579
Publication Date: 12 Dec 2016
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/CVPR.2016.601
  Sketch Me That Shoe
Yu, Q, Liu, F, Song, Y-Z, Xiang, T, Hospedales, T & Loy, CC 2016, Sketch Me That Shoe. in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, pp. 799-807. DOI: 10.1109/CVPR.2016.93
We investigate the problem of fine-grained sketch-based image retrieval (SBIR), where free-hand human sketches are used as queries to perform instance-level retrieval of images. This is an extremely challenging task because (i) visual comparisons not only need to be fine-grained but also executed cross-domain, (ii) free-hand (finger) sketches are highly abstract, making fine-grained matching harder, and most importantly (iii) annotated cross-domain sketch-photo datasets required for training are scarce, challenging many state-of-the-art machine learning techniques.
In this paper, for the first time, we address all these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based image retrieval application. We introduce a new database of 1,432 sketch-photo pairs from two categories with 32,000 fine-grained triplet ranking annotations. We then develop a deep triplet ranking model for instance-level SBIR with a novel data augmentation and staged pre-training strategy to alleviate the issue of insufficient fine-grained training data. Extensive experiments are carried out to contribute a variety of insights into the challenges of data sufficiency and over-fitting avoidance when training deep networks for fine-grained cross-domain ranking tasks.
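The core ranking objective behind such a deep triplet model can be illustrated with a few lines of numpy: a sketch (anchor) embedding should sit closer to its matching photo than to any non-matching photo by a margin. The embeddings, margin and data below are synthetic stand-ins for the learned network outputs.

```python
import numpy as np

def triplet_hinge_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet ranking loss: the matching photo should be closer to the
    sketch embedding than a non-matching one by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, margin + d_pos - d_neg)

rng = np.random.default_rng(1)
sketch_emb = rng.normal(size=64)                        # embedded sketch (anchor)
photo_match = sketch_emb + 0.05 * rng.normal(size=64)   # true photo, nearby
photo_other = rng.normal(size=64)                        # some other shoe
print(triplet_hinge_loss(sketch_emb, photo_match, photo_other))  # ~0: constraint satisfied
print(triplet_hinge_loss(sketch_emb, photo_other, photo_match))  # large: constraint violated
```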

General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Yu, Qian, Liu, Feng, Song, Yi-Zhe, Xiang, Tao, Hospedales, Timothy & Loy, Chen Change.
Number of pages: 9
Pages: 799-807
Publication Date: 12 Dec 2016
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/CVPR.2016.93
  Multivariate Regression on the Grassmannian for Predicting Novel Domains
Yang, Y & Hospedales, T 2016, Multivariate Regression on the Grassmannian for Predicting Novel Domains. in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, pp. 5071-5080. DOI: 10.1109/CVPR.2016.548
We study the problem of predicting how to recognise visual objects in novel domains with neither labelled nor unlabelled training data. Domain adaptation is now an established research area due to its value in ameliorating the issue of domain shift between train and test data. However, it is conventionally assumed that domains are discrete entities, and that at least unlabelled data is provided in testing domains. In this paper, we consider the case where domains are parametrised by a vector of continuous values (e.g., time, lighting or view angle). We aim to use such domain metadata to predict novel domains for recognition. This allows a recognition model to be pre-calibrated for a new domain in advance (e.g., future time or view angle) without waiting for data collection and re-training. We achieve this by posing the problem as one of multivariate regression on the Grassmannian, where we regress a domain’s subspace (point on the Grassmannian) against an independent vector of domain parameters. We derive two novel methodologies to achieve this challenging task: a direct kernel regression from R^M to the Grassmannian G, and an indirect method with better extrapolation properties. We evaluate our methods on two cross-domain visual recognition benchmarks, where they perform close to the upper bound of full data domain adaptation. This demonstrates that data is not necessary for domain adaptation if a domain can be parametrically described.
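A minimal sketch of the regression-to-a-subspace idea, under simplifying assumptions: known domains are represented by orthonormal subspace bases, a kernel-weighted blend of those bases is formed for the query domain parameters, and an SVD re-orthonormalisation maps the blend back onto the Grassmannian. This is an illustrative stand-in, not the paper's actual estimators.

```python
import numpy as np

def predict_subspace(theta_query, thetas, subspaces, bandwidth=1.0):
    """Kernel-weighted average of known domain subspaces, re-orthonormalised via
    SVD so the result is a valid k-dimensional subspace (point on the Grassmannian)."""
    w = np.exp(-np.sum((thetas - theta_query) ** 2, axis=1) / (2 * bandwidth ** 2))
    w /= w.sum()
    blended = sum(wi * Ui for wi, Ui in zip(w, subspaces))    # D x k, not orthonormal
    U, _, _ = np.linalg.svd(blended, full_matrices=False)
    return U[:, :subspaces[0].shape[1]]                       # orthonormal basis

rng = np.random.default_rng(2)
thetas = np.array([[0.0], [1.0], [2.0]])                      # e.g. camera-angle metadata
subspaces = [np.linalg.qr(rng.normal(size=(50, 5)))[0] for _ in thetas]
U_new = predict_subspace(np.array([1.5]), thetas, subspaces)
print(U_new.shape, np.allclose(U_new.T @ U_new, np.eye(5)))   # (50, 5) True
```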

General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Yang, Yongxin & Hospedales, Timothy.
Number of pages: 10
Pages: 5071-5080
Publication Date: 12 Dec 2016
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/CVPR.2016.548
  Gaussian Visual-Linguistic Embedding for Zero-Shot Recognition
Mukherjee, T & Hospedales, T 2016, Gaussian Visual-Linguistic Embedding for Zero-Shot Recognition. in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (ACL), pp. 912-918.
An exciting outcome of research at the intersection of language and vision is that of zero-shot learning (ZSL). ZSL promises to scale visual recognition by borrowing distributed semantic models learned from linguistic corpora and turning them into visual recognition models. However, the popular word-vector DSM embeddings are relatively impoverished in their expressivity as they model each word as a single vector point. In this paper we explore word-distribution embeddings for ZSL. We present a visual-linguistic mapping for ZSL in the case where words and visual categories are both represented by distributions. Experiments show improved results on ZSL benchmarks due to better exploitation of intra-concept variability in each modality.
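To illustrate the word-distribution idea: if each unseen class is a Gaussian in the shared embedding space rather than a single vector, a test image can be scored by its log-density under each class distribution. The class means, variances and the image embedding below are invented for illustration.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of a diagonal Gaussian; each class is a distribution, not a point."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Hypothetical word-distribution embeddings for two unseen classes (mean + diagonal variance)
classes = {
    "zebra": (np.array([0.9, 0.1, 0.5]), np.array([0.05, 0.02, 0.10])),
    "whale": (np.array([0.1, 0.8, 0.2]), np.array([0.10, 0.05, 0.03])),
}

image_embedding = np.array([0.85, 0.15, 0.45])   # image mapped into the same space
scores = {c: gaussian_logpdf(image_embedding, m, v) for c, (m, v) in classes.items()}
print(max(scores, key=scores.get))               # -> "zebra"
```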
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Mukherjee, Tanmoy & Hospedales, Timothy.
Number of pages: 7
Pages: 912-918
Publication Date: Nov 2016
Publication Information
Category: Conference contribution
Original Language: English
  Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation
Xu, X, Hospedales, T & Gong, S 2016, Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation. in Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II. Lecture Notes in Computer Science, vol. 9906, Springer International Publishing, pp. 343-359. DOI: 10.1007/978-3-319-46475-6_22
Zero-Shot Learning (ZSL) promises to scale visual recognition by bypassing the conventional model training requirement of annotated examples for every category. This is achieved by establishing a mapping connecting low-level features and a semantic description of the label space, referred to as the visual-semantic mapping, on auxiliary data. Re-using the learned mapping to project target videos into an embedding space thus allows novel classes to be recognised by nearest neighbour inference. However, existing ZSL methods suffer from auxiliary-target domain shift intrinsically induced by assuming the same mapping for the disjoint auxiliary and target classes. This compromises the generalisation accuracy of ZSL recognition on the target data. In this work, we improve the ability of ZSL to generalise across this domain shift in both model- and data-centric ways by formulating a visual-semantic mapping with better generalisation properties and a dynamic data re-weighting method to prioritise auxiliary data that are relevant to the target classes. Specifically: (1) We introduce a multi-task visual-semantic mapping to improve generalisation by constraining the semantic mapping parameters to lie on a low-dimensional manifold, (2) We explore prioritised data augmentation by expanding the pool of auxiliary data with additional instances weighted by relevance to the target domain. The proposed new model is applied to the challenging zero-shot action recognition problem to demonstrate its advantages over existing ZSL models.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Xu, Xun, Hospedales, Timothy & Gong, Shaogang.
Number of pages: 17
Pages: 343-359
Publication Date: Oct 2016
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-319-46475-6_22
2015
  Semantic embedding space for zero-shot action recognition
Xu, X, Hospedales, T & Gong, S 2015, Semantic embedding space for zero-shot action recognition. in 2015 IEEE International Conference on Image Processing (ICIP). IEEE, pp. 63-67. DOI: 10.1109/ICIP.2015.7350760
The number of categories for action recognition is growing rapidly. It is thus becoming increasingly hard to collect sufficient training data to learn conventional models for each category. This issue may be ameliorated by the increasingly popular “zero-shot learning” (ZSL) paradigm. In this framework a mapping is constructed between visual features and a human interpretable semantic description of each category, allowing categories to be recognised in the absence of any training data. Existing ZSL studies focus primarily on image data, and attribute-based semantic representations. In this paper, we address zero-shot recognition in contemporary video action recognition tasks, using semantic word vector space as the common space to embed videos and category labels. This is more challenging because the mapping between the semantic space and space-time features of videos containing complex actions is more complex and harder to learn. We demonstrate that a simple self-training and data augmentation strategy can significantly improve the efficacy of this mapping. Experiments on human action datasets including HMDB51 and UCF101 demonstrate that our approach achieves the state-of-the-art zero-shot action recognition performance.
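A toy numpy version of the embedding-space recipe (without the paper's self-training and data augmentation): learn a ridge-regression map from video features to class word vectors on seen classes, then label an unseen-class video by its nearest unseen word vector. All features and word vectors here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
D, E = 100, 50                                     # video feature dim, word-vector dim
names = [f"action_{i}" for i in range(12)]
word_vec = {c: rng.normal(size=E) for c in names}
seen, unseen = names[:10], names[10:]              # disjoint training / zero-shot classes

# Auxiliary videos from seen classes: noisy features with a planted linear signal
labels = rng.choice(seen, size=300)
Y = np.stack([word_vec[c] for c in labels])
X = rng.normal(size=(300, D))
X[:, :E] += 2.0 * Y

# Ridge-regression visual-to-semantic map W: argmin ||XW - Y||^2 + lam ||W||^2
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

# Zero-shot inference: embed an unseen-class video, match to the nearest unseen word vector
x_test = rng.normal(size=D)
x_test[:E] += 2.0 * word_vec["action_10"]
z = x_test @ W
print(min(unseen, key=lambda c: np.linalg.norm(z - word_vec[c])))   # typically 'action_10'
```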
General Information
Organisations: School of Informatics.
Authors: Xu, X., Hospedales, T. & Gong, S..
Number of pages: 5
Pages: 63-67
Publication Date: Sep 2015
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/ICIP.2015.7350760
  Transferring a semantic representation for person re-identification and search
Shi, Z, Hospedales, TM & Xiang, T 2015, Transferring a semantic representation for person re-identification and search. in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 4184-4193. DOI: 10.1109/CVPR.2015.7299046
Learning semantic attributes for person re-identification and description-based person search has gained increasing interest due to attributes' great potential as a pose and view-invariant representation. However, existing attribute-centric approaches have thus far underperformed state-of-the-art conventional approaches. This is due to their nonscalable need for extensive domain (camera) specific annotation. In this paper we present a new semantic attribute learning approach for person re-identification and search. Our model is trained on existing fashion photography datasets - either weakly or strongly labelled. It can then be transferred and adapted to provide a powerful semantic description of surveillance person detections, without requiring any surveillance domain supervision. The resulting representation is useful for both unsupervised and supervised person re-identification, achieving state-of-the-art and near state-of-the-art performance respectively. Furthermore, as a semantic representation it allows description-based person search to be integrated within the same framework.
General Information
Organisations: School of Informatics.
Authors: Shi, Z., Hospedales, T. M. & Xiang, T..
Number of pages: 10
Pages: 4184-4193
Publication Date: Jun 2015
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/CVPR.2015.7299046
  When Face Recognition Meets with Deep Learning: An Evaluation of Convolutional Neural Networks for Face Recognition
Hu, G, Yang, Y, Yi, D, Kittler, J, Christmas, W, Li, SZ & Hospedales, T 2015, When Face Recognition Meets with Deep Learning: An Evaluation of Convolutional Neural Networks for Face Recognition. in 2015 IEEE International Conference on Computer Vision Workshop (ICCVW). IEEE, pp. 384-392. DOI: 10.1109/ICCVW.2015.58
Deep learning, in particular Convolutional Neural Network (CNN), has achieved promising results in face recognition recently. However, it remains an open question: why CNNs work well and how to design a 'good' architecture. The existing works tend to focus on reporting CNN architectures that work well for face recognition rather than investigate the reason. In this work, we conduct an extensive evaluation of CNN-based face recognition systems (CNN-FRS) on a common ground to make our work easily reproducible. Specifically, we use public database LFW (Labeled Faces in the Wild) to train CNNs, unlike most existing CNNs trained on private databases. We propose three CNN architectures which are the first reported architectures trained using LFW data. This paper quantitatively compares the architectures of CNNs and evaluates the effect of different implementation choices. We identify several useful properties of CNN-FRS. For instance, the dimensionality of the learned features can be significantly reduced without adverse effect on face recognition accuracy. In addition, a traditional metric learning method exploiting CNN-learned features is evaluated. Experiments show two crucial factors to good CNN-FRS performance are the fusion of multiple CNNs and metric learning. To make our work reproducible, source code and models will be made publicly available.
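Two of the highlighted ingredients, multi-CNN fusion and dimensionality reduction, can be mocked up as follows: concatenate per-network descriptors, PCA-reduce them, and compare faces with cosine similarity as a simple stand-in for the learned metric. The feature matrices are placeholders for real CNN outputs.

```python
import numpy as np

def fuse_and_reduce(feature_sets, dim=8):
    """Concatenate per-network face descriptors and PCA-reduce them; a stand-in for
    multi-CNN fusion plus dimensionality reduction."""
    F = np.concatenate(feature_sets, axis=1)           # (n_faces, sum of CNN dims)
    F = F - F.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    return F @ Vt[:dim].T                              # (n_faces, dim)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

rng = np.random.default_rng(4)
cnn_a = rng.normal(size=(10, 512))                     # hypothetical outputs of two CNNs
cnn_b = rng.normal(size=(10, 256))
faces = fuse_and_reduce([cnn_a, cnn_b], dim=8)
print(cosine(faces[0], faces[1]))                      # a same-identity decision would threshold this
```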
General Information
Organisations: School of Informatics.
Authors: Hu, G., Yang, Y., Yi, D., Kittler, J., Christmas, W., Li, S. Z. & Hospedales, T..
Number of pages: 9
Pages: 384-392
Publication Date: Dec 2015
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/ICCVW.2015.58
  Making better use of edges via perceptual grouping
Qi, Y, Song, YZ, Xiang, T, Zhang, H, Hospedales, T, Li, Y & Guo, J 2015, Making better use of edges via perceptual grouping. in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 1856-1865. DOI: 10.1109/CVPR.2015.7298795
We propose a perceptual grouping framework that organizes image edges into meaningful structures and demonstrate its usefulness on various computer vision tasks. Our grouper formulates edge grouping as a graph partition problem, where a learning to rank method is developed to encode probabilities of candidate edge pairs. In particular, RankSVM is employed for the first time to combine multiple Gestalt principles as cue for edge grouping. Afterwards, an edge grouping based object proposal measure is introduced that yields proposals comparable to state-of-the-art alternatives. We further show how human-like sketches can be generated from edge groupings and consequently used to deliver state-of-the-art sketch-based image retrieval performance. Last but not least, we tackle the problem of freehand human sketch segmentation by utilizing the proposed grouper to cluster strokes into semantic object parts.
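A toy version of the grouping pipeline: score candidate edge-fragment pairs with a linear ranker over Gestalt-style cues, keep high-scoring pairs as graph links, and read groups off the connected components. The weights stand in for a learned RankSVM and the cue values are invented; the paper's actual graph partition is more sophisticated.

```python
import numpy as np
from collections import defaultdict

# Hand-set weights standing in for a learned RankSVM over Gestalt cues
W = np.array([0.7, 0.3])          # [proximity, continuity]

def pair_score(cues):
    return float(W @ cues)

def group_edges(n_edges, candidate_pairs, threshold=0.5):
    """Keep pairs scoring above threshold and return connected components as groups."""
    adj = defaultdict(set)
    for (i, j), cues in candidate_pairs.items():
        if pair_score(cues) > threshold:
            adj[i].add(j)
            adj[j].add(i)
    groups, seen = [], set()
    for start in range(n_edges):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups

# Cues in [0, 1]: higher means the two edge fragments more likely belong together
pairs = {(0, 1): np.array([0.9, 0.8]), (1, 2): np.array([0.8, 0.9]),
         (2, 3): np.array([0.1, 0.2]), (3, 4): np.array([0.9, 0.9])}
print(group_edges(5, pairs))      # [[0, 1, 2], [3, 4]]
```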
General Information
Organisations: School of Biological Sciences.
Authors: Qi, Y., Song, Y. Z., Xiang, T., Zhang, H., Hospedales, T., Li, Y. & Guo, J..
Number of pages: 10
Pages: 1856-1865
Publication Date: Jun 2015
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/CVPR.2015.7298795
  Sketch-a-Net that Beats Humans
Yu, Q, Yang, Y, Song, Y-Z, Xiang, T & Hospedales, T 2015, Sketch-a-Net that Beats Humans. in MWJ Xianghua Xie & GKL Tam (eds), Proceedings of the British Machine Vision Conference (BMVC)., 7, BMVA Press, pp. 1-12. DOI: 10.5244/C.29.7
We propose a multi-scale multi-channel deep neural network framework that, for the first time, yields sketch recognition performance surpassing that of humans. Our superior performance is a result of explicitly embedding the unique characteristics of sketches in our model: (i) a network architecture designed for sketch rather than natural photo statistics, (ii) a multi-channel generalisation that encodes sequential ordering in the sketching process, and (iii) a multi-scale network ensemble with joint Bayesian fusion that accounts for the different levels of abstraction exhibited in free-hand sketches. We show that state-of-the-art deep networks specifically engineered for photos of natural objects fail to perform well on sketch recognition, regardless of whether they are trained using photo or sketch. Our network on the other hand not only delivers the best performance on the largest human sketch dataset to date, but also is small in size making efficient training possible using just CPUs.
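The stroke-ordering channels can be approximated by rendering cumulative subsets of a sketch's strokes into separate image planes, so that early and late strokes are distinguishable to the network. The toy renderer below uses straight-line 'strokes' and three channels purely for illustration; it is not the paper's preprocessing.

```python
import numpy as np

def stroke_order_channels(strokes, size=64, n_channels=3):
    """Render cumulative subsets of strokes into separate channels, so stroke order
    is visible to a downstream network (a simplified multi-channel input)."""
    img = np.zeros((n_channels, size, size), dtype=np.float32)
    splits = np.array_split(strokes, n_channels)
    canvas = np.zeros((size, size), dtype=np.float32)
    for c, chunk in enumerate(splits):
        for (r0, c0, r1, c1) in chunk:                # toy strokes: straight segments
            for t in np.linspace(0.0, 1.0, num=50):
                r = int(round(r0 + t * (r1 - r0)))
                col = int(round(c0 + t * (c1 - c0)))
                canvas[r, col] = 1.0
        img[c] = canvas.copy()                        # channel c holds all strokes up to split c
    return img

toy_strokes = [(5, 5, 5, 58), (5, 58, 58, 58), (58, 58, 58, 5), (58, 5, 5, 5)]
x = stroke_order_channels(np.array(toy_strokes))
print(x.shape, x.sum(axis=(1, 2)))                    # later channels contain at least the earlier strokes
```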
General Information
Organisations: School of Informatics.
Authors: Yu, Qian, Yang, Yongxin, Song, Yi-Zhe, Xiang, Tao & Hospedales, Timothy.
Number of pages: 12
Pages: 1-12
Publication Date: 1 Sep 2015
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.5244/C.29.7
  Free-hand sketch recognition by multi-kernel feature learning
Li, Y, Hospedales, TM, Song, Y-Z & Gong, S 2015, 'Free-hand sketch recognition by multi-kernel feature learning' Computer Vision and Image Understanding, vol 137, pp. 1-11. DOI: 10.1016/j.cviu.2015.02.003
Free-hand sketch recognition has become increasingly popular due to the recent expansion of portable touchscreen devices. However, the problem is non-trivial due to the complexity of internal structures that leads to intra-class variations, coupled with the sparsity in visual cues that results in inter-class ambiguities. In order to address the structural complexity, a novel structured representation for sketches is proposed to capture the holistic structure of a sketch. Moreover, to overcome the visual cue sparsity problem and therefore achieve state-of-the-art recognition performance, we propose a Multiple Kernel Learning (MKL) framework for sketch recognition, fusing several features common to sketches. We evaluate the performance of all the proposed techniques on the most diverse sketch dataset to date (Mathias et al., 2012), and offer detailed and systematic analyses of the performance of different features and representations, including a breakdown by sketch-super-category. Finally, we investigate the use of attributes as a high-level feature for sketches and show how this complements low-level features for improving recognition performance under the MKL framework, and consequently explore novel applications such as attribute-based retrieval.
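The fusion step can be sketched with scikit-learn: several per-feature kernels are combined with non-negative weights and an SVM is trained on the combined precomputed Gram matrix. The features are synthetic and the weights are fixed by hand rather than learned by an MKL optimiser.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
n = 60
y = np.repeat([0, 1], 30)
feat_hog = rng.normal(size=(n, 40)) + y[:, None] * 0.8     # two hypothetical sketch features
feat_struct = rng.normal(size=(n, 20)) + y[:, None] * 0.3

def rbf(A, B, gamma=0.05):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

# Fixed kernel weights standing in for the MKL-learned combination
beta = [0.7, 0.3]
K = beta[0] * rbf(feat_hog, feat_hog) + beta[1] * rbf(feat_struct, feat_struct)

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))                                     # training accuracy with the fused kernel
```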
General Information
Organisations: School of Informatics.
Authors: Li, Yi, Hospedales, Timothy M., Song, Yi-Zhe & Gong, Shaogang.
Number of pages: 11
Pages: 1-11
Publication Date: 2015
Publication Information
Category: Article
Journal: Computer Vision and Image Understanding
Volume: 137
ISSN: 1077-3142
Original Language: English
DOIs: 10.1016/j.cviu.2015.02.003
  Transductive Multi-View Zero-Shot Learning
Fu, Y, Hospedales, TM, Xiang, T & Gong, S 2015, 'Transductive Multi-View Zero-Shot Learning' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 37, no. 11, pp. 2332-2345. DOI: 10.1109/TPAMI.2015.2408354
Most existing zero-shot learning approaches exploit transfer learning via an intermediate semantic representation shared between an annotated auxiliary dataset and a target dataset with different classes and no annotation. A projection from a low-level feature space to the semantic representation space is learned from the auxiliary dataset and applied without adaptation to the target dataset. In this paper we identify two inherent limitations with these approaches. First, due to having disjoint and potentially unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target dataset/ domain. We call this problem the projection domain shift problem and propose a novel framework, transductive multi-view embedding, to solve it. The second limitation is the prototype sparsity problem which refers to the fact that for each target class, only a single prototype is available for zero-shot learning given a semantic representation. To overcome this problem, a novel heterogeneous multi-view hypergraph label propagation method is formulated for zero-shot learning in the transductive embedding space. It effectively exploits the complementary information offered by different semantic representations and takes advantage of the manifold structures of multiple representation spaces in a coherent manner. We demonstrate through extensive experiments that the proposed approach (1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic representations, (3) significantly outperforms existing methods for both zero-shot and N-shot recognition on three image and video benchmark datasets, and (4) enables novel cross-view annotation tasks.
General Information
Organisations: School of Informatics.
Authors: Fu, Y., Hospedales, T. M., Xiang, T. & Gong, S..
Number of pages: 14
Pages: 2332-2345
Publication Date: 1 Nov 2015
Publication Information
Category: Article
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 37
Issue number: 11
ISSN: 0162-8828
Original Language: English
DOIs: 10.1109/TPAMI.2015.2408354
  Bayesian Joint Modelling for Object Localisation in Weakly Labelled Images
Shi, Z, Hospedales, TM & Xiang, T 2015, 'Bayesian Joint Modelling for Object Localisation in Weakly Labelled Images' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 37, no. 10, pp. 1959-1972. DOI: 10.1109/TPAMI.2015.2392769
We address the problem of localisation of objects as bounding boxes in images and videos with weak labels. This weakly supervised object localisation problem has been tackled in the past using discriminative models where each object class is localised independently from other classes. In this paper, a novel framework based on Bayesian joint topic modelling is proposed, which differs significantly from the existing ones in that: (1) All foreground object classes are modelled jointly in a single generative model that encodes multiple object co-existence so that “explaining away” inference can resolve ambiguity and lead to better learning and localisation. (2) Image backgrounds are shared across classes to better learn varying surroundings and “push out” objects of interest. (3) Our model can be learned with a mixture of weakly labelled and unlabelled data, allowing the large volume of unlabelled images on the Internet to be exploited for learning. Moreover, the Bayesian formulation enables the exploitation of various types of prior knowledge to compensate for the limited supervision offered by weakly labelled data, as well as Bayesian domain adaptation for transfer learning. Extensive experiments on the PASCAL VOC, ImageNet and YouTube-Object videos datasets demonstrate the effectiveness of our Bayesian joint model for weakly supervised object localisation.
General Information
Organisations: School of Informatics.
Authors: Shi, Z., Hospedales, T. M. & Xiang, T..
Number of pages: 14
Pages: 1959-1972
Publication Date: 1 Oct 2015
Publication Information
Category: Article
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 37
Issue number: 10
ISSN: 0162-8828
Original Language: English
DOIs: 10.1109/TPAMI.2015.2392769
2014
  Attributes-Based Re-identification
Layne, R, Hospedales, TM & Gong, S 2014, Attributes-Based Re-identification. in Person Re-Identification. Advances in Computer Vision and Pattern Recognition, Springer London, pp. 93-117. DOI: 10.1007/978-1-4471-6296-4_5
Automated person re-identification using only visual information from public-space CCTV video is challenging for many reasons, such as poor resolution or challenges involved in dealing with camera calibration. More critically still, the majority of clothing worn in public spaces tends to be non-discriminative and therefore of limited disambiguation value. Most re-identification techniques developed so far have relied on low-level visual-feature matching approaches that aim to return matching gallery detections earlier in the ranked list of results. However, for many applications an initial probe image may not be available, or a low-level feature representation may not be sufficiently invariant to viewing condition changes as well as being discriminative for re-identification. In this chapter, we show how mid-level “semantic attributes” can be computed for person description. We further show how this attribute-based description can be used in synergy with low-level feature descriptions to improve re-identification accuracy when an attribute-centric distance measure is employed. Moreover, we discuss a “zero-shot” scenario in which a visual probe is unavailable but re-identification can still be performed with user-provided semantic attribute description.
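A toy rendering of the two use cases: a re-identification distance that mixes a low-level feature distance with an attribute-profile distance, and a zero-shot search that ranks gallery detections directly against a user-supplied attribute description. Attribute names, weights and features are illustrative only.

```python
import numpy as np

ATTRS = ["male", "jeans", "backpack", "dark_hair"]          # illustrative attribute ontology

def fused_distance(p, q, w_attr=0.5):
    """Weighted combination of low-level feature distance and attribute-profile distance."""
    d_low = np.linalg.norm(p["colour_hist"] - q["colour_hist"])
    d_attr = np.mean(np.abs(p["attrs"] - q["attrs"]))       # attribute scores in [0, 1]
    return (1 - w_attr) * d_low + w_attr * d_attr

def zero_shot_search(description, gallery):
    """No probe image: rank gallery detections by agreement with a user-given attribute profile."""
    return sorted(gallery, key=lambda g: np.mean(np.abs(description - g["attrs"])))

rng = np.random.default_rng(6)
make = lambda attrs: {"colour_hist": rng.random(16), "attrs": np.array(attrs, dtype=float)}
gallery = [make([1, 1, 0, 1]), make([0, 0, 1, 0]), make([1, 0, 0, 1])]
probe = make([1, 1, 0, 1])

print(min(range(3), key=lambda i: fused_distance(probe, gallery[i])))   # best re-id match index
print(zero_shot_search(np.array([1, 1, 0, 1.0]), gallery)[0]["attrs"])  # best zero-shot match
```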
General Information
Organisations: School of Informatics.
Authors: Layne, Ryan, Hospedales, Timothy M. & Gong, Shaogang.
Number of pages: 25
Pages: 93-117
Publication Date: 2014
Publication Information
Category: Chapter
Original Language: English
DOIs: 10.1007/978-1-4471-6296-4_5
  Interestingness Prediction by Robust Learning to Rank
Fu, Y, Hospedales, TM, Xiang, T, Gong, S & Yao, Y 2014, Interestingness Prediction by Robust Learning to Rank. in Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II. Lecture Notes in Computer Science (LNCS), vol. 8690, Springer, Cham, pp. 488-503. DOI: 10.1007/978-3-319-10605-2_32
The problem of predicting image or video interestingness from their low-level feature representations has received increasing interest. As a highly subjective visual attribute, annotating the interestingness value of training data for learning a prediction model is challenging. To make the annotation less subjective and more reliable, recent studies employ crowdsourcing tools to collect pairwise comparisons – relying on majority voting to prune the annotation outliers/errors. In this paper, we propose a more principled way to identify annotation outliers by formulating the interestingness prediction task as a unified robust learning to rank problem, tackling both the outlier detection and interestingness prediction tasks jointly. Extensive experiments on both image and video interestingness benchmark datasets demonstrate that our new approach significantly outperforms state-of-the-art alternatives.

The research of Yuan Yao was supported in part by National Basic Research Program of China (973 Program 2012CB825501), NSFC grant 61071157, and a joint NSFC-Royal Society grant 61211130360, IE110976 with Tao Xiang.
General Information
Organisations: School of Informatics.
Authors: Fu, Yanwei, Hospedales, Timothy M., Xiang, Tao, Gong, Shaogang & Yao, Yuan.
Number of pages: 16
Pages: 488-503
Publication Date: 2014
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-319-10605-2_32
  The Re-identification Challenge
Gong, S, Cristani, M, Loy, CC & Hospedales, TM 2014, The Re-identification Challenge. in Person Re-Identification. Advances in Computer Vision and Pattern Recognition, Springer London, pp. 1-20. DOI: 10.1007/978-1-4471-6296-4_1
For making sense of the vast quantity of visual data generated by the rapid expansion of large-scale distributed multi-camera systems, automated person re-identification is essential. However, it poses a significant challenge to computer vision systems. Fundamentally, person re-identification requires solving two difficult problems of ‘finding needles in haystacks’ and ‘connecting the dots’ by identifying instances and associating the whereabouts of targeted people travelling across large distributed space–time locations in often crowded environments. This capability would enable the discovery of, and reasoning about, individual-specific long-term structured activities and behaviours. Whilst solving the person re-identification problem is inherently challenging, it also promises enormous potential for a wide range of practical applications, ranging from security and surveillance to retail and health care. As a result, the field has drawn growing and wide interest from academic researchers and industrial developers. This chapter introduces the re-identification problem, highlights the difficulties in building person re-identification systems, and presents an overview of recent progress and the state-of-the-art approaches to solving some of the fundamental challenges in person re-identification, benefiting from research in computer vision, pattern recognition and machine learning, and drawing insights from video analytics system design considerations for engineering practical solutions. It also provides an introduction of the contributing chapters of this book. The chapter ends by posing some open questions for the re-identification challenge arising from emerging and future applications.
General Information
Organisations: School of Informatics.
Authors: Gong, Shaogang, Cristani, Marco, Loy, Chen Change & Hospedales, Timothy M..
Number of pages: 20
Pages: 1-20
Publication Date: 2014
Publication Information
Category: Chapter
Original Language: English
DOIs: 10.1007/978-1-4471-6296-4_1
  Weakly Supervised Learning of Objects, Attributes and Their Associations
Shi, Z, Yang, Y, Hospedales, TM & Xiang, T 2014, Weakly Supervised Learning of Objects, Attributes and Their Associations. in Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II. Lecture Notes in Computer Science (LNCS), vol. 8690, Springer, Cham, pp. 472-487. DOI: 10.1007/978-3-319-10605-2_31
When humans describe images they tend to use combinations of nouns and adjectives, corresponding to objects and their associated attributes respectively. To generate such a description automatically, one needs to model objects, attributes and their associations. Conventional methods require strong annotation of object and attribute locations, making them less scalable. In this paper, we model object-attribute associations from weakly labelled images, such as those widely available on media sharing sites (e.g. Flickr), where only image-level labels (either object or attributes) are given, without their locations and associations. This is achieved by introducing a novel weakly supervised non-parametric Bayesian model. Once learned, given a new image, our model can describe the image, including objects, attributes and their associations, as well as their locations and segmentation. Extensive experiments on benchmark datasets demonstrate that our weakly supervised model performs on par with strongly supervised models on tasks such as image description and retrieval based on object-attribute associations.
General Information
Organisations: School of Informatics.
Authors: Shi, Zhiyuan, Yang, Yongxin, Hospedales, Timothy M. & Xiang, Tao.
Number of pages: 16
Pages: 472-487
Publication Date: 2014
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-319-10605-2_31
  Transductive Multi-view Embedding for Zero-Shot Recognition and Annotation
Fu, Y, Hospedales, TM, Xiang, T, Fu, Z-Y & Gong, S 2014, Transductive Multi-view Embedding for Zero-Shot Recognition and Annotation. in Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II. Lecture Notes in Computer Science (LNCS), vol. 8690, Springer, Cham, pp. 584-599. DOI: 10.1007/978-3-319-10605-2_38
Most existing zero-shot learning approaches exploit transfer learning via an intermediate-level semantic representation such as visual attributes or semantic word vectors. Such a semantic representation is shared between an annotated auxiliary dataset and a target dataset with no annotation. A projection from a low-level feature space to the semantic space is learned from the auxiliary dataset and is applied without adaptation to the target dataset. In this paper we identify an inherent limitation with this approach. That is, due to having disjoint and potentially unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target dataset/domain. We call this problem the projection domain shift problem and propose a novel framework, transductive multi-view embedding, to solve it. It is ‘transductive’ in that unlabelled target data points are explored for projection adaptation, and ‘multi-view’ in that both low-level feature (view) and multiple semantic representations (views) are embedded to rectify the projection shift. We demonstrate through extensive experiments that our framework (1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic representations, (3) achieves state-of-the-art recognition results on image and video benchmark datasets, and (4) enables novel cross-view annotation tasks.
General Information
Organisations: School of Informatics.
Authors: Fu, Yanwei, Hospedales, Timothy M., Xiang, Tao, Fu, Zhen-Yong & Gong, Shaogang.
Number of pages: 16
Pages: 584-599
Publication Date: 2014
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-319-10605-2_38
  Investigating Open-World Person Re-identification Using a Drone
Layne, R, Hospedales, TM & Gong, S 2014, Investigating Open-World Person Re-identification Using a Drone. in Computer Vision - ECCV 2014 Workshops - Zurich, Switzerland, September 6-7 and 12, 2014, Proceedings, Part III. Lecture Notes in Computer Science (LNCS), vol. 8927, Springer International Publishing, pp. 225-240. DOI: 10.1007/978-3-319-16199-0_16
Person re-identification is now one of the most topical and intensively studied problems in computer vision due to its challenging nature and its critical role in underpinning many multi-camera surveillance tasks. A fundamental assumption in almost all existing re-identification research is that cameras are in fixed emplacements, allowing the explicit modelling of camera and inter-camera properties in order to improve re-identification. In this paper, we present an introductory study pushing re-identification in a different direction: re-identification on a mobile platform, such as a drone. We formalise some variants of the standard formulation for re-identification that are more relevant for mobile re-identification. We introduce the first dataset for mobile re-identification, and we use this to elucidate the unique challenges of mobile re-identification. Finally, we re-evaluate some conventional wisdom about re-id models in the light of these challenges and suggest future avenues for research in this area.
General Information
Organisations: School of Informatics.
Authors: Layne, Ryan, Hospedales, Timothy M. & Gong, Shaogang.
Number of pages: 16
Pages: 225-240
Publication Date: 2014
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-319-16199-0_16
  Intra-category sketch-based image retrieval by matching deformable part models
Li, Y, Hospedales, TM, Song, Y-Z & Gong, S 2014, Intra-category sketch-based image retrieval by matching deformable part models. in British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014. BMVA Press. DOI: 10.5244/C.28.115
An important characteristic of sketches, compared with text, rests in their ability to intrinsically capture structure and appearance detail of objects. Nonetheless, akin to traditional text-based image retrieval, conventional sketch-based image retrieval (SBIR) principally focuses on retrieving photos of the same category, neglecting the fine-grained characteristics of sketches. In this paper, we further advocate the expressiveness of sketches and examine their efficacy under a novel intra-category SBIR framework. In particular, we study how sketches can be adopted to permit pose-specific retrieval within object categories. A key challenge in this problem is introducing a mid-level sketch representation that not only captures object pose, but also possesses the ability to traverse sketch and photo domains. More specifically, we learn a deformable part-based model (DPM) as a mid-level representation to discover and encode the various poses and parts in sketch and image domains independently, after which graph matching is utilized to establish component and part-level correspondences across the two domains. We further propose an SBIR dataset that covers the unique aspects of fine-grained SBIR. Through in-depth experiments, we demonstrate the superior performance of our proposed SBIR framework, and showcase its unique ability in pose-specific retrieval.
General Information
Organisations: School of Informatics.
Authors: Li, Yi, Hospedales, Timothy M., Song, Yi-Zhe & Gong, Shaogang.
Number of pages: 12
Publication Date: 2014
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.5244/C.28.115
  Re-id: Hunting Attributes in the Wild
Layne, R, Hospedales, TM & Gong, S 2014, Re-id: Hunting Attributes in the Wild. in British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014. BMVA Press. DOI: 10.5244/C.28.1
Person re-identification is a crucial capability underpinning many applications of public-space video surveillance. Recent studies have shown the value of learning semantic attributes as a discriminative representation for re-identification. However, existing attribute representations do not generalise across camera deployments. Thus, this strategy currently requires the prohibitive effort of annotating a vector of person attributes for each individual in a large training set -- for each given deployment/dataset. In this paper we take a different approach and automatically discover a semantic attribute ontology, and learn an effective associated representation by crawling large volumes of internet data. In addition to eliminating the necessity for per-dataset annotation, by training on a much larger and more diverse array of examples this representation is more view-invariant and generalisable than attributes trained at conventional small scales. We show that these automatically discovered attributes provide a valuable representation that significantly improves re-identification performance on a variety of challenging datasets.
General Information
Organisations: School of Informatics.
Authors: Layne, Ryan, Hospedales, Timothy M. & Gong, Shaogang.
Number of pages: 12
Publication Date: 2014
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.5244/C.28.1
  Transductive Multi-label Zero-shot Learning
Fu, Y, Yang, Y, Hospedales, TM, Xiang, T & Gong, S 2014, Transductive Multi-label Zero-shot Learning. in British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014. BMVA Press. DOI: 10.5244/C.28.7
Zero-shot learning has received increasing interest as a means to alleviate the often prohibitive expense of annotating training data for large scale recognition problems. These methods have achieved great success via learning intermediate semantic representations in the form of attributes and more recently, semantic word vectors. However, they have thus far been constrained to the single-label case, in contrast to the growing popularity and importance of more realistic multi-label data. In this paper, for the first time, we investigate and formalise a general framework for multi-label zero-shot learning, addressing the unique challenge therein: how to exploit multi-label correlation at test time with no training data for those classes? In particular, we propose (1) a multi-output deep regression model to project an image into a semantic word space, which explicitly exploits the correlations in the intermediate semantic layer of word vectors; (2) a novel zero-shot learning algorithm for multi-label data that exploits the unique compositionality property of semantic word vector representations; and (3) a transductive learning strategy to enable the regression model learned from seen classes to generalise well to unseen classes. Our zero-shot learning experiments on a number of standard multi-label datasets demonstrate that our method outperforms a variety of baselines.
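The compositionality property can be illustrated directly: a prototype for a multi-label set is synthesised by summing its constituent word vectors, and a test image's embedding is matched to the nearest candidate label-set prototype. The deep regression and transductive steps are omitted and all vectors are synthetic.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
word_vec = {w: rng.normal(size=50) for w in ["dog", "ball", "grass", "car"]}

def label_set_prototype(labels):
    """Compositional prototype: the sum of the word vectors of the labels in the set."""
    return sum(word_vec[w] for w in labels)

# Candidate multi-label sets for unseen combinations (all pairs here)
candidates = [frozenset(c) for c in combinations(word_vec, 2)]

# A test image whose (regressed) embedding roughly matches "dog" + "ball"
image_embedding = word_vec["dog"] + word_vec["ball"] + 0.1 * rng.normal(size=50)

best = min(candidates, key=lambda s: np.linalg.norm(image_embedding - label_set_prototype(s)))
print(sorted(best))                                        # ['ball', 'dog']
```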
General Information
Organisations: School of Informatics.
Authors: Fu, Yanwei, Yang, Yongxin, Hospedales, Timothy M., Xiang, Tao & Gong, Shaogang.
Number of pages: 11
Publication Date: 2014
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.5244/C.28.7
  Open-world Person Re-Identification by Multi-Label Assignment Inference
Cancela, B, Hospedales, TM & Gong, S 2014, Open-world Person Re-Identification by Multi-Label Assignment Inference. in British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014. BMVA Press. DOI: 10.5244/C.28.98
Person re-identification methods have recently made tremendous progress on maximizing re-identification accuracy between camera pairs. However, this line of work mostly shares a critical limitation - it assumes re-identification in a 'closed world'. That is, between a known set of people who all appear in both views of a single pair of cameras. This is clearly far from a realistic application scenario. In this study, we take a significant step toward a more realistic 'open world' scenario. We consider associating persons observed in more than two cameras where: multiple within-camera detections are possible; different people can transit between different cameras -- so that there is only partial and unknown overlap of identity between people observed by each camera; and the total number of unique people among all cameras is itself unknown. To address this significantly more challenging open world scenario, we propose a novel framework based on online Conditional Random Field (CRF) inference. Experiments demonstrate the robustness of our approach in contrast to the limitations of conventional approaches in the open world context.
General Information
Organisations: School of Informatics.
Authors: Cancela, Brais, Hospedales, Timothy M. & Gong, Shaogang.
Number of pages: 11
Publication Date: 2014
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.5244/C.28.98
  Cross-Modal Face Matching: Beyond Viewed Sketches
Ouyang, S, Hospedales, TM, Song, Y-Z & Li, X 2014, Cross-Modal Face Matching: Beyond Viewed Sketches. in Computer Vision - ACCV 2014 - 12th Asian Conference on Computer Vision, Singapore, Singapore, November 1-5, 2014, Revised Selected Papers, Part II. Lecture Notes in Computer Science (LNCS), vol. 9004, Springer International Publishing, pp. 210-225. DOI: 10.1007/978-3-319-16808-1_15
General Information
Organisations: School of Informatics.
Authors: Ouyang, Shuxin, Hospedales, Timothy M., Song, Yi-Zhe & Li, Xueming.
Number of pages: 16
Pages: 210-225
Publication Date: 2014
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-319-16808-1_15
  Learning Multimodal Latent Attributes
Fu, Y, Hospedales, TM, Xiang, T & Gong, S 2014, 'Learning Multimodal Latent Attributes' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 36, no. 2, pp. 303-316. DOI: 10.1109/TPAMI.2013.128
The rapid development of social media sharing has created a huge demand for automatic media classification and annotation techniques. Attribute learning has emerged as a promising paradigm for bridging the semantic gap and addressing data sparsity via transferring attribute knowledge in object recognition and relatively simple action classification. In this paper, we address the task of attribute learning for understanding multimedia data with sparse and incomplete labels. In particular, we focus on videos of social group activities, which are particularly challenging and topical examples of this task because of their multimodal content and complex and unstructured nature relative to the density of annotations. To solve this problem, we 1) introduce a concept of semilatent attribute space, expressing user-defined and latent attributes in a unified framework, and 2) propose a novel scalable probabilistic topic model for learning multimodal semilatent attributes, which dramatically reduces requirements for an exhaustive accurate attribute ontology and expensive annotation effort. We show that our framework is able to exploit latent attributes to outperform contemporary approaches for addressing a variety of realistic multimedia sparse data learning tasks including: multitask learning, learning with label noise, N-shot transfer learning, and importantly zero-shot learning.
General Information
Organisations: School of Informatics.
Authors: Fu, Y., Hospedales, T. M., Xiang, T. & Gong, S..
Number of pages: 14
Pages: 303-316
Publication Date: Feb 2014
Publication Information
Category: Article
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 36
Issue number: 2
ISSN: 0162-8828
Original Language: English
DOIs: 10.1109/TPAMI.2013.128
2013
  Cross-domain Traffic Scene Understanding by Motion Model Transfer
Xu, X, Gong, S & Hospedales, T 2013, Cross-domain Traffic Scene Understanding by Motion Model Transfer. in Proceedings of the 4th ACM/IEEE International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Stream. ARTEMIS '13, ACM, New York, NY, USA, pp. 77-86. DOI: 10.1145/2510650.2510657
This paper proposes a novel framework for cross-domain traffic scene understanding. Existing learning-based outdoor wide-area scene interpretation models suffer from requiring long term data collection in order to acquire statistically sufficient model training samples for every new scene. This makes installation costly, prevents models from being easily relocated, and from being used in UAVs with continuously changing scenes. In contrast, our method adopts a geometrical matching approach to relate motion models learned from a database of source scenes (source domains) with a handful of sparsely observed data in a new target scene (target domain). This framework is capable of online "sparse-shot" anomaly detection and motion event classification in the unseen target domain, without the need for extensive data collection, labelling and offline model training for each new target domain. That is, trained models in different source domains can be deployed to a new target domain with only a few unlabelled observations and without any training in the new target domain. Crucially, to provide cross-domain interpretation without risk of dramatic negative transfer, we introduce and formulate a scene association criterion to quantify transferability of motion models from one scene to another. Extensive experiments show the effectiveness of the proposed framework for cross-domain motion event classification, anomaly detection and scene association.
General Information
Organisations: School of Informatics.
Authors: Xu, Xun, Gong, Shaogang & Hospedales, Timothy.
Number of pages: 10
Pages: 77-86
Publication Date: 2013
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1145/2510650.2510657
  Domain Transfer for Person Re-identification
Layne, R, Hospedales, TM & Gong, S 2013, Domain Transfer for Person Re-identification. in Proceedings of the 4th ACM/IEEE International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Stream. ARTEMIS '13, ACM, New York, NY, USA, pp. 25-32. DOI: 10.1145/2510650.2510658
Automatic person re-identification is a crucial capability underpinning many applications in public space video surveillance. It is challenging due to intra-class variation in person appearance when observed in different views, together with limited inter-class variability. Various recent approaches have made great progress in re-identification performance using discriminative learning techniques. However, these approaches are fundamentally limited by the requirement of extensive annotated training data for every pair of views. For practical re-identification, this is an unreasonable assumption, as annotating extensive volumes of data for every pair of cameras to be re-identified may be impossible or prohibitively expensive.

In this paper we move toward relaxing this strong assumption by investigating flexible multi-source transfer of re-identification models across camera pairs. Specifically, we show how to leverage prior re-identification models learned for a set of source view pairs (domains), and flexibly combine these to obtain good re-identification performance in a target view pair (domain) with greatly reduced training data requirements in the target domain.
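In spirit, the transfer can be sketched as weighting each pre-learned source matcher by how well it separates a few labelled target pairs, then scoring target probe-gallery pairs with the weighted combination. The weighting rule and all data below are illustrative, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(8)

def make_source_matcher(M):
    """A pre-learned source-domain matcher: a Mahalanobis-style distance d(x, y)."""
    return lambda x, y: float((x - y) @ M @ (x - y))

# Three hypothetical source camera-pair models
sources = [make_source_matcher(np.diag(rng.uniform(0.5, 1.5, size=16))) for _ in range(3)]

# A handful of labelled target-domain pairs: (probe, gallery, same_person?)
def make_pair(same):
    x = rng.random(16)
    y = x + 0.05 * rng.random(16) if same else rng.random(16)
    return x, y, same

target_pairs = [make_pair(i % 2 == 0) for i in range(10)]

def source_weight(matcher, pairs):
    """Weight = how much smaller the matcher's distance is for true matches than non-matches."""
    pos = np.mean([matcher(x, y) for x, y, s in pairs if s])
    neg = np.mean([matcher(x, y) for x, y, s in pairs if not s])
    return max(neg - pos, 0.0)

w = np.array([source_weight(m, target_pairs) for m in sources])
w = w / w.sum() if w.sum() > 0 else np.ones(len(sources)) / len(sources)

def target_distance(x, y):
    """Target-domain matcher: convex combination of the source matchers."""
    return sum(wi * m(x, y) for wi, m in zip(w, sources))

print(w, target_distance(rng.random(16), rng.random(16)))
```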
General Information
Organisations: School of Informatics.
Authors: Layne, Ryan, Hospedales, Timothy M. & Gong, Shaogang.
Number of pages: 8
Pages: 25-32
Publication Date: 2013
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1145/2510650.2510658
  Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation
Shi, Z, Hospedales, TM & Xiang, T 2013, Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation. in 2013 IEEE International Conference on Computer Vision. IEEE, pp. 2984-2991. DOI: 10.1109/ICCV.2013.371
We address the problem of localisation of objects as bounding boxes in images with weak labels. This weakly supervised object localisation problem has been tackled in the past using discriminative models where each object class is localised independently from other classes. We propose a novel framework based on Bayesian joint topic modelling. Our framework has three distinctive advantages over previous works: (1) All object classes and image backgrounds are modelled jointly together in a single generative model so that "explaining away" inference can resolve ambiguity and lead to better learning and localisation. (2) The Bayesian formulation of the model enables easy integration of prior knowledge about object appearance to compensate for limited supervision. (3) Our model can be learned with a mixture of weakly labelled and unlabelled data, allowing the large volume of unlabelled images on the Internet to be exploited for learning. Extensive experiments on the challenging VOC dataset demonstrate that our approach outperforms the state-of-the-art competitors.
General Information
Organisations: School of Informatics.
Authors: Shi, Z., Hospedales, T. M. & Xiang, T..
Number of pages: 8
Pages: 2984-2991
Publication Date: 1 Dec 2013
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/ICCV.2013.371
2012
  Attribute Learning for Understanding Unstructured Social Activity
Fu, Y, Hospedales, TM, Xiang, T & Gong, S 2012, Attribute Learning for Understanding Unstructured Social Activity. in Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part IV. Lecture Notes in Computer Science (LNCS), vol. 7575, Springer Berlin Heidelberg, pp. 530-543. DOI: 10.1007/978-3-642-33765-9_38
The rapid development of social video sharing platforms has created a huge demand for automatic video classification and annotation techniques, in particular for videos containing social activities of a group of people (e.g. YouTube video of a wedding reception). Recently, attribute learning has emerged as a promising paradigm for transferring learning to sparsely labelled classes in object or single-object short action classification. In contrast to existing work, this paper for the first time, tackles the problem of attribute learning for understanding group social activities with sparse labels. This problem is more challenging because of the complex multi-object nature of social activities, and the unstructured nature of the activity context. To solve this problem, we (1) contribute an unstructured social activity attribute (USAA) dataset with both visual and audio attributes, (2) introduce the concept of semi-latent attribute space and (3) propose a novel model for learning the latent attributes which alleviate the dependence of existing models on exact and exhaustive manual specification of the attribute-space. We show that our framework is able to exploit latent attributes to outperform contemporary approaches for addressing a variety of realistic multi-media sparse data learning tasks including: multi-task learning, N-shot transfer learning, learning with label noise and importantly zero-shot learning.
General Information
Organisations: School of Informatics.
Authors: Fu, Yanwei, Hospedales, Timothy M., Xiang, Tao & Gong, Shaogang.
Number of pages: 14
Pages: 530-543
Publication Date: 2012
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-642-33765-9_38
  A Unifying Theory of Active Discovery and Learning
Hospedales, TM, Gong, S & Xiang, T 2012, A Unifying Theory of Active Discovery and Learning. in Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V. Lecture Notes in Computer Science (LNCS), vol. 7576, Springer Berlin Heidelberg, pp. 453-466. DOI: 10.1007/978-3-642-33715-4_33
For learning problems where human supervision is expensive, active query selection methods are often exploited to maximise the return of each supervision. Two problems where this has been successfully applied are active discovery – where the aim is to discover at least one instance of each rare class with few supervisions; and active learning – where the aim is to maximise a classifier’s performance with least supervision. Recently, there has been interest in optimising these tasks jointly, i.e., active learning with undiscovered classes, to support efficient interactive modelling of new domains. Mixtures of active discovery and learning and other schemes have been exploited, but perform poorly due to heuristic objectives. In this study, we show with systematic theoretical analysis how the previously disparate tasks of active discovery and learning can be cleanly unified into a single problem, and hence are able for the first time to develop a unified query algorithm to directly optimise this problem. The result is a model which consistently outperforms previous attempts at active learning in the presence of undiscovered classes, with no need to tune parameters for different datasets.
General Information
Organisations: School of Informatics.
Authors: Hospedales, Timothy M., Gong, Shaogang & Xiang, Tao.
Number of pages: 14
Pages: 453-466
Publication Date: 2012
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-642-33715-4_33
  Towards Person Identification and Re-identification with Attributes
Layne, R, Hospedales, TM & Gong, S 2012, Towards Person Identification and Re-identification with Attributes. in Computer Vision - ECCV 2012. Workshops and Demonstrations - Florence, Italy, October 7-13, 2012, Proceedings, Part I. Lecture Notes in Computer Science (LNCS), vol. 7583, Springer Berlin Heidelberg, pp. 402-412. DOI: 10.1007/978-3-642-33863-2_40
Visual identification of an individual in a crowded environment observed by a distributed camera network is critical to a variety of tasks including commercial space management, border control, and crime prevention. Automatic re-identification of a human from public space CCTV video is challenging due to spatiotemporal visual feature variations and strong visual similarity in people’s appearance, compounded by low-resolution and poor quality video data. Relying on re-identification using a probe image is limiting, as a linguistic description of an individual’s profile may often be the only available cues. In this work, we show how mid-level semantic attributes can be used synergistically with low-level features for both identification and re-identification. Specifically, we learn an attribute-centric representation to describe people, and a metric for comparing attribute profiles to disambiguate individuals. This differs from existing approaches to re-identification which rely purely on bottom-up statistics of low-level features: it allows improved robustness to view and lighting; and can be used for identification as well as re-identification. Experiments demonstrate the flexibility and effectiveness of our approach compared to existing feature representations when applied to benchmark datasets.
General Information
Organisations: School of Informatics.
Authors: Layne, Ryan, Hospedales, Timothy M. & Gong, Shaogang.
Number of pages: 11
Pages: 402-412
Publication Date: 2012
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-642-33863-2_40
  Stream-based joint exploration-exploitation active learning
Loy, CC, Hospedales, TM, Xiang, T & Gong, S 2012, Stream-based joint exploration-exploitation active learning. in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 1560-1567. DOI: 10.1109/CVPR.2012.6247847
Learning from streams of evolving and unbounded data is an important problem, for example in visual surveillance or internet-scale data. For such large and evolving real-world data, exhaustive supervision is impractical, particularly when the full space of classes is not known in advance; joint class discovery (exploration) and boundary learning (exploitation) therefore become critical. Active learning has shown promise in jointly optimising exploration-exploitation with minimal human supervision. However, existing active learning methods either rely on heuristic multi-criteria weighting or are limited to batch processing. In this paper, we present a new unified framework for joint exploration-exploitation active learning in streams without any heuristic weighting. Extensive evaluation on classification of various image and surveillance video datasets demonstrates the superiority of our framework over existing methods.
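For contrast with the unified streaming framework, the following is a naive stream-based query rule of the heuristic, threshold-based kind the abstract argues against. predict_proba and novelty are assumed callables, and the thresholds and budget handling are arbitrary illustrative choices.

import numpy as np

def stream_active_learner(stream, predict_proba, novelty, budget,
                          tau_unc=0.6, tau_nov=0.9):
    """Naive streaming baseline: two hand-set thresholds decide, point by
    point, whether to spend a label on exploitation (an uncertain point near
    a class boundary) or exploration (a novel point poorly explained by the
    known classes). predict_proba(x) -> class posterior vector;
    novelty(x) -> score in [0, 1]."""
    queried = []
    for t, x in enumerate(stream):
        if budget == 0:
            break
        p = predict_proba(x)
        uncertain = p.max() < tau_unc      # exploitation trigger
        novel = novelty(x) > tau_nov       # exploration trigger
        if uncertain or novel:
            queried.append(t)              # ask the oracle for a label here
            budget -= 1
    return queried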
General Information
Organisations: School of Informatics.
Authors: Loy, C. C., Hospedales, T. M., Xiang, T. & Gong, S.
Number of pages: 8
Pages: 1560-1567
Publication Date: Jun 2012
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/CVPR.2012.6247847
  Person Re-identification by Attributes
Layne, R, Hospedales, TM & Gong, S 2012, Person Re-identification by Attributes. in British Machine Vision Conference, BMVC 2012, Surrey, UK, September 3-7, 2012. BMVA Press, pp. 1-11. DOI: 10.5244/C.26.24
Visually identifying a target individual reliably in a crowded environment observed by a distributed camera network is critical to a variety of tasks in managing business information, border control, and crime prevention. Automatic re-identification of a human candidate from public space CCTV video is challenging due to spatiotemporal visual feature variations and strong visual similarity between different people, compounded by low-resolution and poor quality video data. In this work, we propose a novel method for re-identification that learns a selection and weighting of mid-level semantic attributes to describe people. Specifically, the model learns an attribute-centric, parts-based feature representation. This differs from, and complements, existing low-level feature representations for re-identification, which rely purely on bottom-up statistics for feature selection and are therefore limited in their ability to discriminate and reliably identify target people across different camera views, particularly under the partial occlusion caused by crowding. Our experiments demonstrate the effectiveness of our approach compared to existing feature representations when applied to benchmark datasets.
General Information
Organisations: School of Informatics.
Authors: Layne, Ryan, Hospedales, Timothy M. & Gong, Shaogang.
Number of pages: 11
Pages: 1-11
Publication Date: 2012
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.5244/C.26.24
  Video Behaviour Mining Using a Dynamic Topic Model
Hospedales, TM, Gong, S & Xiang, T 2012, 'Video Behaviour Mining Using a Dynamic Topic Model' International Journal of Computer Vision, vol 98, no. 3, pp. 303-323. DOI: 10.1007/s11263-011-0510-7
This paper addresses the problem of fully automated mining of public space video data, a highly desirable capability under contemporary commercial and security considerations. This task is especially challenging due to the complexity of the object behaviours to be profiled, the difficulty of analysis under the visual occlusions and ambiguities common in public space video, and the computational challenge of doing so in real-time. We address these issues by introducing a new dynamic topic model, termed a Markov Clustering Topic Model (MCTM). The MCTM builds on existing dynamic Bayesian network models and Bayesian topic models, and overcomes their drawbacks in sensitivity, robustness and efficiency. Specifically, our model profiles complex dynamic scenes by robustly clustering visual events into activities and these activities into global behaviours with temporal dynamics. A Gibbs sampler is derived for offline learning with unlabeled training data and a new approximation to online Bayesian inference is formulated to enable dynamic scene understanding and behaviour mining in new video data online in real-time. The strength of this model is demonstrated by unsupervised learning of dynamic scene models for four complex and crowded public scenes, and successful mining of behaviours and detection of salient events in each.
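The generative story can be illustrated by ancestral sampling: a Markov chain over clip-level behaviours, behaviours emitting activities (topics), and activities emitting quantised visual events. The dimensions and Dirichlet-sampled parameters below are toy placeholders; in the model proper they are learned from data by the Gibbs sampler.

import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (placeholders, not values from the paper).
n_behaviours, n_activities, n_visual_words = 3, 5, 50
clip_len, n_clips = 20, 10

# These parameters would normally be learned; here they are drawn from
# Dirichlet priors just to make the generative story runnable.
trans = rng.dirichlet(np.ones(n_behaviours), size=n_behaviours)          # behaviour Markov chain
behav_to_act = rng.dirichlet(np.ones(n_activities), size=n_behaviours)   # p(activity | behaviour)
act_to_word = rng.dirichlet(np.ones(n_visual_words), size=n_activities)  # p(visual word | activity)

def generate_video(n_clips, clip_len):
    """Ancestral sampling of a simplified MCTM-style generative story:
    a Markov chain over clip-level behaviours, each behaviour emitting
    activities, each activity emitting quantised visual events."""
    b = rng.integers(n_behaviours)
    clips = []
    for _ in range(n_clips):
        b = rng.choice(n_behaviours, p=trans[b])                 # behaviour dynamics
        acts = rng.choice(n_activities, size=clip_len, p=behav_to_act[b])
        words = [rng.choice(n_visual_words, p=act_to_word[a]) for a in acts]
        clips.append((b, words))
    return clips

print(generate_video(n_clips, clip_len)[0])   # (behaviour index, visual-word ids) for one clip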
General Information
Organisations: School of Informatics.
Authors: Hospedales, Timothy M., Gong, Shaogang & Xiang, Tao.
Number of pages: 21
Pages: 303-323
Publication Date: Jul 2012
Publication Information
Category: Article
Journal: International Journal of Computer Vision
Volume: 98
Issue number: 3
ISSN: 0920-5691
Original Language: English
DOIs: 10.1007/s11263-011-0510-7
2011
  Finding Rare Classes: Active Learning with Generative and Discriminative Models
Hospedales, TM, Gong, S & Xiang, T 2011, 'Finding Rare Classes: Active Learning with Generative and Discriminative Models' IEEE Transactions on Knowledge and Data Engineering, vol 25, no. 2, pp. 374-386. DOI: 10.1109/TKDE.2011.231
Discovering rare categories and classifying new instances of them are important data mining issues in many fields, but fully supervised learning of a rare class classifier is prohibitively costly in labeling effort. There has therefore been increasing interest both in active discovery, to identify new classes quickly, and active learning, to train classifiers with minimal supervision. These goals occur together in practice and are intrinsically related, because examples of each class are required to train a classifier. Nevertheless, very few studies have tried to optimise them together, meaning that data mining for rare classes in new domains makes inefficient use of human supervision. Developing active learning algorithms to optimise both rare class discovery and classification simultaneously is challenging because discovery and classification have conflicting requirements in query criteria. In this paper, we address these issues with two contributions: a unified active learning model to jointly discover new categories and learn to classify them by adapting query criteria online; and a classifier combination algorithm that switches between generative and discriminative classifiers as learning progresses. Extensive evaluation on a range of standard UCI and vision data sets demonstrates the superiority of this approach over existing methods.
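A crude sketch of the generative/discriminative switching idea, using off-the-shelf scikit-learn models and cross-validation accuracy on the labels seen so far as the switching signal. This is a stand-in for the online adaptation described above, not the paper's actual algorithm.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pick_classifier(X_lab, y_lab):
    """Fit a generative model (GaussianNB) and a discriminative model
    (logistic regression) and keep whichever cross-validates better on the
    labels collected so far; generative models tend to win with very few
    labels, discriminative models later."""
    gen, disc = GaussianNB(), LogisticRegression(max_iter=1000)
    cv = min(3, int(np.bincount(y_lab).min()))   # folds limited by the rarest class
    if cv < 2:                                   # too few labels to validate: default to generative
        return gen.fit(X_lab, y_lab)
    acc_gen = cross_val_score(gen, X_lab, y_lab, cv=cv).mean()
    acc_disc = cross_val_score(disc, X_lab, y_lab, cv=cv).mean()
    best = gen if acc_gen >= acc_disc else disc
    return best.fit(X_lab, y_lab)

# Toy usage on synthetic two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
model = pick_classifier(X, y)
print(type(model).__name__, model.predict(X[:3]))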
General Information
Organisations: School of Informatics.
Authors: Hospedales, T. M., Gong, S. & Xiang, T.
Number of pages: 13
Pages: 374-386
Publication Date: 15 Nov 2011
Publication Information
Category: Article
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 25
Issue number: 2
ISSN: 1041-4347
Original Language: English
DOIs: 10.1109/TKDE.2011.231
  Learning Tags from Unsegmented Videos of Multiple Human Actions
Hospedales, TM, Gong, S & Xiang, T 2011, Learning Tags from Unsegmented Videos of Multiple Human Actions. in 2011 IEEE 11th International Conference on Data Mining. IEEE, 978-1-4577-2075-8, pp. 251-259. DOI: 10.1109/ICDM.2011.90
Providing methods to support semantic interaction with growing volumes of video data is an increasingly important challenge for data mining. To this end, there has been some success in recognition of simple objects and actions in video; however, most of this work requires strongly supervised training data. The supervision cost of these approaches therefore renders them economically non-scalable for real world applications. In this paper we address the problem of learning to annotate and retrieve semantic tags of human actions in realistic video data with sparsely provided tags of semantically salient activities. This is challenging because of (1) the multi-label nature of the learning problem and (2) the fact that realistic videos are often dominated by (semantically uninteresting) background activity unsupported by any tags of interest, leading to a strong irrelevant data problem. To address these challenges, we introduce a new topic model based approach to video tag annotation. Our model simultaneously learns a low dimensional representation of the video data, which of its dimensions are semantically relevant (i.e. supported by tags), and how to annotate videos with tags. Experimental evaluation on three different video action/activity datasets demonstrates the challenge of this problem and the value of our contribution.
General Information
Organisations: School of Informatics.
Authors: Hospedales, T. M., Gong, S. & Xiang, T.
Number of pages: 9
Pages: 251-259
Publication Date: 1 Dec 2011
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/ICDM.2011.90
  Identifying Rare and Subtle Behaviors: A Weakly Supervised Joint Topic Model
Hospedales, TM, Li, J, Gong, S & Xiang, T 2011, 'Identifying Rare and Subtle Behaviors: A Weakly Supervised Joint Topic Model' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 33, no. 12, pp. 2451-2464. DOI: 10.1109/TPAMI.2011.81
One of the most interesting and desired capabilities for automated video behavior analysis is the identification of rarely occurring and subtle behaviors. This is of practical value because dangerous or illegal activities often have few or possibly only one prior example to learn from, and are often subtle. Rare and subtle behavior learning is challenging for two reasons: (1) contemporary modeling approaches require more data and supervision than may be available, and (2) the most interesting and potentially critical rare behaviors are often visually subtle, occurring among more obvious typical behaviors or defined by only small spatio-temporal deviations from typical behaviors. In this paper, we introduce a novel weakly supervised joint topic model which addresses these issues. Specifically, we introduce a multiclass topic model with partially shared latent structure and associated learning and inference algorithms. These contributions permit modeling of behaviors from as few as one example, even without localization by the user and when occurring in clutter, and subsequent classification and localization of such behaviors online and in real time. We extensively validate our approach on two standard public-space data sets, where it clearly outperforms a range of contemporary alternatives.
General Information
Organisations: Moray House School of Education.
Authors: Hospedales, T. M., Li, J., Gong, S. & Xiang, T.
Number of pages: 14
Pages: 2451-2464
Publication Date: 1 Dec 2011
Publication Information
Category: Article
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 33
Issue number: 12
ISSN: 0162-8828
Original Language: English
DOIs: 10.1109/TPAMI.2011.81
  Finding Rare Classes: Adapting Generative and Discriminative Models in Active Learning
Hospedales, TM, Gong, S & Xiang, T 2011, Finding Rare Classes: Adapting Generative and Discriminative Models in Active Learning. in Advances in Knowledge Discovery and Data Mining: 15th Pacific-Asia Conference, PAKDD 2011, Shenzhen, China, May 24-27, 2011, Proceedings, Part II. Lecture Notes in Computer Science (LNCS), vol. 6635, Springer Berlin Heidelberg, pp. 296-308. DOI: 10.1007/978-3-642-20847-8_25
Discovering rare categories and classifying new instances of them is an important data mining issue in many fields, but fully supervised learning of a rare class classifier is prohibitively costly. There has therefore been increasing interest both in active discovery, to identify new classes quickly, and active learning, to train classifiers with minimal supervision. Very few studies have attempted to jointly solve these two inter-related tasks, which occur together in practice. Optimizing both rare class discovery and classification simultaneously with active learning is challenging because discovery and classification have conflicting requirements in query criteria. In this paper we address these issues with two contributions: a unified active learning model to jointly discover new categories and learn to classify them; and a classifier combination algorithm that switches between generative and discriminative classifiers as learning progresses. Extensive evaluation on several standard datasets demonstrates the superiority of our approach over existing methods.
General Information
Organisations: School of Informatics.
Authors: Hospedales, Timothy M., Gong, Shaogang & Xiang, Tao.
Number of pages: 13
Pages: 296-308
Publication Date: 2011
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-642-20847-8_25
  Generative Probabilistic Modeling: Understanding Causal Sensorimotor Integration
Vijayakumar, S, Hospedales, T & Haith, A 2011, Generative Probabilistic Modeling: Understanding Causal Sensorimotor Integration. in J Trommershauser, K Kording & MS Landy (eds), Sensory Cue Integration. Oxford University Press, pp. 63-81.
This chapter argues that many aspects of human perception are best explained by adopting a modeling approach in which experimental subjects are assumed to possess a full generative probabilistic model of the task they are faced with, and that they use this model to make inferences about their environment and act optimally given the information available to them. It applies this generative modeling framework in two diverse settings—concurrent sensory and motor adaptation, and multisensory oddity detection—and shows, in both cases, that the data are best described by a full generative modeling approach.
General Information
Organisations: Neuroinformatics DTC.
Authors: Vijayakumar, S., Hospedales, Timothy & Haith, Adrian.
Number of pages: 19
Pages: 63-81
Publication Date: 2011
Publication Information
Category: Chapter
Original Language: English
2010
  Learning Rare Behaviours
Li, J, Hospedales, TM, Gong, S & Xiang, T 2010, Learning Rare Behaviours. in Computer Vision - ACCV 2010 - 10th Asian Conference on Computer Vision, Queenstown, New Zealand, November 8-12, 2010, Revised Selected Papers, Part II. Lecture Notes in Computer Science (LNCS), vol. 6493, Springer Berlin Heidelberg, pp. 293-307. DOI: 10.1007/978-3-642-19309-5_23
We present a novel approach to detect and classify rare behaviours which are visually subtle and occur sparsely in the presence of overwhelming typical behaviours. We treat this as a weakly supervised classification problem and propose a novel topic model, Multi-Class Delta Latent Dirichlet Allocation, which learns to model rare behaviours from a few weakly labelled videos as well as typical behaviours from uninteresting videos by collaboratively sharing features among all classes of footage. The learned model is able to accurately classify unseen data. We further explore a novel method for detecting unknown rare behaviours in unseen data by synthesising new plausible topics to hypothesise any potential behavioural conflicts. Extensive validation using both simulated and real-world CCTV video data demonstrates the superior performance of the proposed framework compared to conventional unsupervised detection and supervised classification approaches.
General Information
Organisations: School of Informatics.
Authors: Li, Jian, Hospedales, Timothy M., Gong, Shaogang & Xiang, Tao.
Number of pages: 15
Pages: 293-307
Publication Date: 2010
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1007/978-3-642-19309-5_23
2009
  A Markov Clustering Topic Model for mining behaviour in video
Hospedales, T, Gong, S & Xiang, T 2009, A Markov Clustering Topic Model for mining behaviour in video. in 2009 IEEE 12th International Conference on Computer Vision. IEEE, pp. 1165-1172. DOI: 10.1109/ICCV.2009.5459342
This paper addresses the problem of fully automated mining of public space video data. A novel Markov Clustering Topic Model (MCTM) is introduced which builds on existing Dynamic Bayesian Network models (e.g. HMMs) and Bayesian topic models (e.g. Latent Dirichlet Allocation), and overcomes their drawbacks in accuracy, robustness and computational efficiency. Specifically, our model profiles complex dynamic scenes by robustly clustering visual events into activities and these activities into global behaviours, and correlates behaviours over time. A collapsed Gibbs sampler is derived for offline learning with unlabeled training data, and significantly, a new approximation to online Bayesian inference is formulated to enable dynamic scene understanding and behaviour mining in new video data online in real-time. The strength of this model is demonstrated by unsupervised learning of dynamic scene models, mining behaviours and detecting salient events in three complex and crowded public scenes.
General Information
Organisations: School of Informatics.
Authors: Hospedales, T., Gong, S. & Xiang, T.
Number of pages: 8
Pages: 1165-1172
Publication Date: Sep 2009
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.1109/ICCV.2009.5459342
  A Unified Bayesian Framework for Adaptive Visual Tracking
Zelniker, EE, Hospedales, TM, Gong, S & Xiang, T 2009, A Unified Bayesian Framework for Adaptive Visual Tracking. in Proceedings of the British Machine Vision Conference, 18, BMVA Press, pp. 1-11. DOI: 10.5244/C.23.18
We propose a novel method for tracking objects in a video scene that undergo drastic changes in their appearance. These changes may arise due to out-of-plane rotation, abrupt or gradual changes in illumination in outdoor scenarios, or changing position with respect to near light-sources indoors. The key problem with most existing models is that they are either non-adaptive (rendering them non-robust to object appearance change) or use a single tracker output to heuristically update the appearance model at each iteration (rendering them vulnerable to drift). In this paper, we take a step toward general real-world tracking, in a principled manner, proposing a unified generative model for Bayesian multi-feature, adaptive target tracking. We show the performance of our method on a wide variety of video data, with a focus on surveillance scenarios.
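A toy particle-filter step illustrating the contrast drawn above: the appearance template is adapted toward the posterior-weighted mean appearance rather than the output of the single best particle. The motion model, feature space, and adaptation rate are illustrative assumptions, not the unified generative model of the paper.

import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, template, frame_patches,
                         motion_std=2.0, adapt_rate=0.1):
    """One step of a toy particle filter with a softly adapted appearance
    template: instead of updating appearance from the single best particle
    (a common cause of drift), the template moves toward the posterior-weighted
    mean appearance. frame_patches(x) returns the appearance feature at x."""
    # Predict: diffuse particles under a random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Update: weight particles by appearance likelihood against the template.
    obs = np.array([frame_patches(x) for x in particles])
    loglik = -0.5 * np.sum((obs - template) ** 2, axis=1)
    weights = weights * np.exp(loglik - loglik.max())
    weights = weights / weights.sum()
    # Adapt: posterior-weighted appearance update instead of winner-take-all.
    posterior_appearance = (weights[:, None] * obs).sum(axis=0)
    template = (1 - adapt_rate) * template + adapt_rate * posterior_appearance
    # Resample to avoid degeneracy.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles)), template

# Toy usage: a 1-D "frame" whose appearance response peaks at position 5.
frame_patches = lambda x: np.array([np.exp(-0.5 * (x - 5.0) ** 2)])
particles = rng.normal(5.0, 3.0, size=50)
weights = np.full(50, 1.0 / 50)
template = np.array([1.0])
particles, weights, template = particle_filter_step(particles, weights, template, frame_patches)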
General Information
Organisations: School of Informatics.
Authors: Zelniker, Emanuel E., Hospedales, Timothy M., Gong, Shaogang & Xiang, Tao.
Number of pages: 11
Pages: 1-11
Publication Date: 2009
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.5244/C.23.18
  Multisensory Oddity Detection as Bayesian Inference
Hospedales, T & Vijayakumar, S 2009, 'Multisensory Oddity Detection as Bayesian Inference' PLoS One, vol 4, no. 1, e4205. DOI: 10.1371/journal.pone.0004205
A key goal for the perceptual system is to optimally combine information from all the senses that may be available in order to develop the most accurate and unified picture possible of the outside world. The contemporary theoretical framework of ideal observer maximum likelihood integration (MLI) has been highly successful in modelling how the human brain combines information from a variety of different sensory modalities. However, in various recent experiments involving multisensory stimuli of uncertain correspondence, MLI breaks down as a successful model of sensory combination. Within the paradigm of direct stimulus estimation, perceptual models which use Bayesian inference to resolve correspondence have recently been shown to generalize successfully to these cases where MLI fails. This approach has been known variously as model inference, causal inference or structure inference. In this paper, we examine causal uncertainty in another important class of multisensory perception paradigm, that of oddity detection, and demonstrate how a Bayesian ideal observer also treats oddity detection as a structure inference problem. We validate this approach by showing that it provides an intuitive and quantitative explanation of an important pair of multisensory oddity detection experiments, involving cues across and within modalities, for which MLI previously failed dramatically, allowing a novel unifying treatment of within and cross modal multisensory perception. Our successful application of structure inference models to the new 'oddity detection' paradigm, and the resultant unified explanation of across and within modality cases, provide further evidence to suggest that structure inference may be a commonly evolved principle for combining perceptual information in the brain.
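A Gaussian toy version of a structure-inference observer for a three-interval oddity task: each hypothesis about which interval is odd implies a different latent structure (two intervals sharing a source, one drawn from its own), and the observer reports the posterior over hypotheses. The noise and prior parameters are arbitrary; this illustrates the inference pattern, not the paper's fitted model.

import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def odd_one_out_posterior(x, sigma_noise=1.0, mu0=0.0, sigma_prior=5.0):
    """Under hypothesis H_k, the two intervals other than k share one latent
    source and interval k has its own, all sources drawn from a common
    Gaussian prior. Returns the posterior over which interval is odd."""
    v_n, v_p = sigma_noise ** 2, sigma_prior ** 2
    post = np.zeros(3)
    for k in range(3):
        i, j = [m for m in range(3) if m != k]
        # Evidence that x[i] and x[j] share a source: p(x_i) p(x_j | x_i).
        p_i = gauss(x[i], mu0, v_n + v_p)
        post_var = 1.0 / (1.0 / v_n + 1.0 / v_p)
        post_mean = post_var * (x[i] / v_n + mu0 / v_p)
        p_j = gauss(x[j], post_mean, v_n + post_var)
        # Evidence that x[k] comes from its own, independent source.
        p_k = gauss(x[k], mu0, v_n + v_p)
        post[k] = p_i * p_j * p_k
    return post / post.sum()

print(odd_one_out_posterior(np.array([0.1, -0.2, 3.5])))   # most mass on interval 2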
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Hospedales, Timothy & Vijayakumar, Sethu.
Keywords: Informatics, Computer Science
Number of pages: 13
Publication Date: Jan 2009
Publication Information
Category: Article
Journal: PLoS One
Volume: 4
Issue number: 1
ISSN: 1932-6203
Original Language: English
DOIs: 10.1371/journal.pone.0004205
2008
  An Adaptive Machine Director
Hospedales, TM & Williams, O 2008, An Adaptive Machine Director. in Proceedings of the British Machine Vision Conference 2008, Leeds, September 2008. BMVA Press, pp. 1-10. DOI: 10.5244/C.22.118
We model the class of problem faced by a video broadcast director, who must act as an active perception agent to select a view of interest to a human from a range of possibilities. Real-time learning of a broadcast direction policy is achieved by efficient online Bayesian learning of the model’s parameters based on intermittent user feedback. In contrast to existing machine direction systems, which are dedicated to a particular scenario, our novel approach allows flexible learning of direction policies for novel domains or for viewer specific preferences. We illustrate the flexibility of our approach by applying our model to a selection of scenarios with audio-visual input including teleconferencing, meetings and dance entertainment.
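A bandit-style caricature of learning a direction policy from intermittent feedback: a Beta posterior per candidate view, updated only when the viewer reacts, with Thompson sampling to pick the broadcast view. The actual model described above is richer (it conditions on audio-visual features); this sketch only illustrates the online Bayesian learning loop, and all names and parameters are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

class SimpleDirector:
    """Keep a Beta posterior over how often each camera view pleases the
    viewer and sample from it to choose the broadcast view (Thompson
    sampling); feedback is incorporated only when it arrives."""

    def __init__(self, n_views):
        self.alpha = np.ones(n_views)   # pseudo-counts of positive feedback
        self.beta = np.ones(n_views)    # pseudo-counts of negative feedback

    def choose_view(self):
        samples = rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def feedback(self, view, liked):
        # Intermittent feedback: called only when the viewer reacts.
        if liked:
            self.alpha[view] += 1
        else:
            self.beta[view] += 1

director = SimpleDirector(n_views=3)
v = director.choose_view()
director.feedback(v, liked=True)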
General Information
Organisations: Neuroinformatics DTC.
Authors: Hospedales, Timothy M. & Williams, Oliver.
Number of pages: 10
Pages: 1-10
Publication Date: 2008
Publication Information
Category: Conference contribution
Original Language: English
DOIs: 10.5244/C.22.118
  Implications of noise and neural heterogeneity for vestibulo-ocular reflex fidelity
Hospedales, T, van Rossum, MCW, Graham, BP & Dutia, MB 2008, 'Implications of noise and neural heterogeneity for vestibulo-ocular reflex fidelity' Neural Computation, vol 20, no. 3, pp. 756-778. DOI: 10.1162/neco.2007.09-06-339
The vestibulo-ocular reflex (VOR) is characterized by a short-latency, high-fidelity eye movement response to head rotations at frequencies up to 20 Hz. Electrophysiological studies of medial vestibular nucleus (MVN) neurons, however, show that their response to sinusoidal currents above 10 to 12 Hz is highly nonlinear and distorted by aliasing for all but very small current amplitudes. How can this system function in vivo when single cell response cannot explain its operation? Here we show that the necessary wide VOR frequency response may be achieved not by firing rate encoding of head velocity in single neurons, but in the integrated population response of asynchronously firing, intrinsically active neurons. Diffusive synaptic noise and the pacemaker-driven, intrinsic firing of MVN cells synergistically maintain asynchronous, spontaneous spiking in a population of model MVN neurons over a wide range of input signal amplitudes and frequencies. Response fidelity is further improved by a reciprocal inhibitory link between two MVN populations, mimicking the vestibular commissural system in vivo, but only if asynchrony is maintained by noise and pacemaker inputs. These results provide a previously missing explanation for the full range of VOR function and a novel account of the role of the intrinsic pacemaker conductances in MVN cells. The values of diffusive noise and pacemaker currents that give optimal response fidelity yield firing statistics similar to those in vivo, suggesting that the in vivo network is tuned to optimal performance. While theoretical studies have argued that noise and population heterogeneity can improve coding, to our knowledge this is the first evidence indicating that these parameters are indeed tuned to optimize coding fidelity in a neural control system in vivo.
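The population argument can be caricatured with a Poisson rate-coding sketch: a single noisy, spontaneously active cell cannot follow a 15 Hz head-velocity signal, but the asynchronous population sum can. The neuron model and parameter values below are stand-ins, not the conductance-based MVN model of the paper.

import numpy as np

rng = np.random.default_rng(0)

def population_vor_sketch(n_neurons=200, duration=1.0, dt=0.001,
                          freq=15.0, baseline=40.0, gain=30.0):
    """Each 'MVN-like' neuron fires sparsely and noisily around a spontaneous
    baseline modulated by the head-velocity signal; independent Poisson spiking
    keeps the population asynchronous, so the summed population rate tracks the
    signal even though single cells cannot."""
    t = np.arange(0.0, duration, dt)
    signal = np.sin(2 * np.pi * freq * t)                    # head velocity (arbitrary units)
    rates = np.clip(baseline + gain * signal, 0.0, None)     # rectified firing rates (Hz)
    spikes = rng.random((n_neurons, t.size)) < rates * dt    # independent Poisson-like spiking
    population_rate = spikes.sum(axis=0) / (n_neurons * dt)  # instantaneous population rate (Hz)
    single_cell_rate = spikes[0].sum() / duration            # one cell: too few spikes per cycle
    return t, signal, population_rate, single_cell_rate

t, sig, pop_rate, single = population_vor_sketch()
print(f"single cell mean rate ~{single:.0f} Hz; "
      f"corr(population rate, signal) = {np.corrcoef(sig, pop_rate)[0, 1]:.2f}")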
General Information
Organisations: Centre for Integrative Physiology.
Authors: Hospedales, Timothy, van Rossum, M. C. W., Graham, Bruce P. & Dutia, M. B.
Number of pages: 23
Pages: 756-778
Publication Date: Mar 2008
Publication Information
Category: Article
Journal: Neural Computation
Volume: 20
Issue number: 3
ISSN: 0899-7667
Original Language: English
DOIs: 10.1162/neco.2007.09-06-339
  Structure Inference for Bayesian Multisensory Scene Understanding
Hospedales, T & Vijayakumar, S 2008, 'Structure Inference for Bayesian Multisensory Scene Understanding' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 30, no. 12, pp. 2140-2157. DOI: 10.1109/TPAMI.2008.25
We investigate a solution to the problem of multi-sensor scene understanding by formulating it in the framework of Bayesian model selection and structure inference. Humans robustly associate multimodal data as appropriate, but previous modelling work has focused largely on optimal fusion, leaving segregation unaccounted for and unexploited by machine perception systems. We illustrate a unifying, Bayesian solution to multi-sensor perception and tracking which accounts for both integration and segregation by explicit probabilistic reasoning about data association in a temporal context. Such explicit inference of multimodal data association is also of intrinsic interest for higher level understanding of multisensory data. We illustrate this using a probabilistic implementation of data association in a multi-party audio-visual scenario, where unsupervised learning and structure inference is used to automatically segment, associate and track individual subjects in audiovisual sequences. Indeed, the structure inference based framework introduced in this work provides the theoretical foundation needed to satisfactorily explain many confounding results in human psychophysics experiments involving multimodal cue integration and association.
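A minimal Gaussian data-association step in the same spirit: compare the evidence that an audio and a visual detection share one source against the evidence for two independent sources, and fuse only when the common-source structure wins. The sensor noise levels, prior, and function name are illustrative assumptions, not the paper's model.

import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def associate_audio_visual(x_a, x_v, sig_a=4.0, sig_v=1.0, sig_prior=20.0, p_same=0.5):
    """Toy structure inference for data association: return the posterior
    probability that an audio location estimate x_a and a visual location
    estimate x_v arise from one common source, plus the fused estimate to use
    when integration is appropriate."""
    va, vv, vp = sig_a ** 2, sig_v ** 2, sig_prior ** 2
    # Evidence under the common-source structure, p(x_a, x_v | same).
    post_var_a = 1.0 / (1.0 / va + 1.0 / vp)
    post_mean_a = post_var_a * (x_a / va)
    ev_same = gauss(x_a, 0.0, va + vp) * gauss(x_v, post_mean_a, vv + post_var_a)
    # Evidence under the independent-sources structure, p(x_a, x_v | diff).
    ev_diff = gauss(x_a, 0.0, va + vp) * gauss(x_v, 0.0, vv + vp)
    p_common = ev_same * p_same / (ev_same * p_same + ev_diff * (1 - p_same))
    # Precision-weighted fusion, appropriate only if the common structure is inferred.
    fused = (x_a / va + x_v / vv) / (1.0 / va + 1.0 / vv)
    return p_common, fused

print(associate_audio_visual(2.0, 1.5))    # nearby cues -> likely common source
print(associate_audio_visual(2.0, 15.0))   # distant cues -> likely separate sources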
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Hospedales, Timothy & Vijayakumar, Sethu.
Keywords: Informatics, Pattern Recognition, Scene Analysis, Sensor fusion
Number of pages: 18
Pages: 2140-2157
Publication Date: Dec 2008
Publication Information
Category: Article
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 30
Issue number: 12
Original Language: English
DOIs: 10.1109/TPAMI.2008.25
2007
  Structure Inference for Bayesian Multisensory Perception and Tracking
Hospedales, T, Cartwright, JJ & Vijayakumar, S 2007, Structure Inference for Bayesian Multisensory Perception and Tracking. in IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6-12, 2007. pp. 2122-2128.
We investigate a solution to the problem of multisensor perception and tracking by formulating it in the framework of Bayesian model selection. Humans robustly associate multi-sensory data as appropriate, but previous theoretical work has focused largely on purely integrative cases, leaving segregation unaccounted for and unexploited by machine perception systems. We illustrate a unifying, Bayesian solution to multi-sensor perception and tracking which accounts for both integration and segregation by explicit probabilistic reasoning about data association in a temporal context. Unsupervised learning of such a model with EM is illustrated for a real world audio-visual application.
General Information
Organisations: Institute of Perception, Action and Behaviour .
Authors: Hospedales, Timothy, Cartwright, Joel J. & Vijayakumar, Sethu.
Number of pages: 7
Pages: 2122-2128
Publication Date: 2007
Publication Information
Category: Conference contribution
Original Language: English

Projects:
DREAM - Deferred Restructuring of Experience in Autonomous Machines
Timothy, Hospedales (Principal investigator)
Period: 31/08/2016 – 31/12/2018
Funding Organisation: EU government bodies

Principled multimodal cue integration for perceptual interference (PhD)

Personal Website