Possible PhD Projects
A list of example PhD Project suggestions (including directly funded projects).
Computational epigenetics
Supervisor: Guido Sanguinetti
The overwhelming majority of quantitative biology has focused on
studying molecules like mRNA, which decay within hours at most. How can
this help us explain phenomena that take years to establish, e.g.
ageing, cancer, neurodegenerative diseases? People increasingly think
that a determining factor is so-called "epigenetics", i.e. changes in
the spatial organisation or chemical state of DNA (e.g. how it is wrapped
around histones, or its methylation state; for a very accessible review see
here). Data about these epigenetic factors are becoming increasingly
available thanks to next generation sequencing. Can we use computational
methods to discover whether there are networks connecting these various
epigenetic factors, and connecting epigenetics with genetics?
Machine learning for spatio-temporal systems
Supervisor: Guido Sanguinetti
Advances in remote sensing technologies mean that there is an increasing
number of data sets detailing physical processes at high spatial and
temporal resolution. As an example, our collaborator Dr John Quinn,
Makerere University Kampala, is gathering a very large data set in the
following way: farmers in Uganda often own GPS phones, and they are
asked to send photographs of suspect cassava plants (the main staple in East
Africa) to a server in Kampala, where a computer vision algorithm
classifies the images into a number of disease classes. We therefore
get a nation-scale data set of occurrences of diseased plants as events
in space and time. How do we analyse such data and extract
information, e.g. about the dynamics of the spread? Can we make online
predictions which can be useful to decision makers? I would be very
interested in working on these questions, perhaps building on this
online general estimation tool for a class of spatio-temporal models I
recently worked on with collaborators in systems engineering.
Machine Learning Markets
Supervisor: Amos Storkey
Develop methods for the large-scale development of structured machine learning.
There are a number of stumbling blocks to progress in machine learning
and statistical modelling. These include the existence of a plethora of
algorithms combined with poor knowledge of the performance of most of
them, and the fact that machine learning is typically done from scratch
on each new problem. Machine Learning Markets help overcome these
issues by extending prediction market mechanisms to machine learning.
They are a meta-modelling
approach, and provide principled methods for combining models, building
hierarchical models, and deriving new features. Because the markets are
robust to new agents joining or leaving, they can continuously improve
as modelling capability improves. This PhD will involve integrating
ideas from machine learning, economics, game theory, statistical
physics, information theory and numerical analysis to establish both the
theoretical basis for machine learning markets and their practical
development.
Direct funding is available for this project from Microsoft Research, Cambridge.
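As a rough illustration of how a market can combine models, the sketch below
is a toy under stated assumptions, not the project's mechanism. It assumes
each agent is a fixed probabilistic predictor that stakes its wealth on
outcomes in proportion to its own beliefs (Kelly-style betting, which
corresponds to log utility). Under that assumption the market price is the
wealth-weighted average of the agents' beliefs, and wealth flows towards
agents that predict well. The function names and the numbers are hypothetical.

```python
def market_price(wealth, beliefs):
    """Wealth-weighted average of agent beliefs over outcomes.

    wealth:  list of non-negative floats, one per agent
    beliefs: list of probability vectors (lists), one per agent
    """
    total = sum(wealth)
    n_outcomes = len(beliefs[0])
    return [
        sum(w * b[k] for w, b in zip(wealth, beliefs)) / total
        for k in range(n_outcomes)
    ]


def settle(wealth, beliefs, outcome):
    """Redistribute wealth once the true outcome is revealed.

    Each agent spent wealth * belief[k] on outcome k at the market
    price, so it receives wealth * belief[outcome] / price[outcome].
    Total wealth is conserved.
    """
    price = market_price(wealth, beliefs)
    return [w * b[outcome] / price[outcome] for w, b in zip(wealth, beliefs)]


# Two hypothetical agents predicting a binary outcome.
wealth = [1.0, 1.0]
beliefs = [[0.9, 0.1], [0.4, 0.6]]
price = market_price(wealth, beliefs)            # the combined prediction
new_wealth = settle(wealth, beliefs, outcome=0)  # agent 0 gains, agent 1 loses
```

In this toy, a new agent joins simply by appending one wealth value and one
belief vector; nothing else in the market has to change.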
Deep Learning for Sequences and Diffusions
Supervisor: Amos Storkey
Feature production methods for sequences.
Recently, the development of hierarchical models for unsupervised
learning has been improved through the use of a variety of deep learning
processes that build the unsupervised model in a layer-wise manner. We
will investigate the development of deep hierarchical models for
sequences, including image sequences and music sequences. We will look
at different forms of hierarchical models, develop novel models, and
compare relative performances of these model forms. A qualified student
may also be interested in examining the implications of these models
from a computational neuroscience perspective.
Statistical NLP for Programming Languages
Supervisor: Charles Sutton
Find syntactic patterns in corpora of programming language text.
The goal of this project is to apply the advanced statistical techniques
from natural language processing to a completely different and new
textual domain: programming language text. Think about how you program
when you are using a new library or new environment for the first time.
You "program by search engine", i.e., you search for examples of people
who have used the same library, and you copy chunks of code from them. I
want to systematise this process and apply it at a large scale. We have
collected a corpus of 1.5 billion lines of source code from 8000
software projects, and we want to find syntactic patterns that recur
across projects. These can then be presented to a programmer as she is
writing code, providing an autocomplete functionality that can suggest
entire function bodies. Statistical techniques involved include language
modelling, data mining, and Bayesian nonparametrics. This also raises
some deep and interesting questions in software engineering, e.g., why
do syntactic patterns occur in professionally written software when they
could be refactored away?
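To make the language-modelling ingredient concrete, here is a minimal sketch
rather than the project's method: a bigram model over code tokens that
suggests the most frequent next tokens. The tokeniser regex, the three-line
toy corpus, and the function names are all assumptions made for illustration;
a real system would use a proper lexer and a far richer model.

```python
import re
from collections import Counter, defaultdict

# Crude tokeniser (an assumption): identifiers, integers, or any single
# non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    return TOKEN.findall(code)


def train_bigrams(snippets):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for snippet in snippets:
        tokens = tokenize(snippet)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def suggest(counts, prev_token, k=3):
    """Return up to k tokens most often seen after prev_token."""
    return [tok for tok, _ in counts[prev_token].most_common(k)]


# Toy corpus standing in for a large collection of source files.
corpus = [
    "for i in range(n): total += i",
    "for x in items: print(x)",
    "for key in table: print(key)",
]
model = train_bigrams(corpus)
after_colon = suggest(model, ":")   # "print" follows ":" twice, "total" once
after_for = suggest(model, "for")   # the three loop variables seen so far
```

Suggesting entire function bodies would need patterns over syntax trees
rather than adjacent tokens, but the counting idea is the same.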
Structure Learning for Computer Systems
Supervisor: Charles Sutton
Automatically determine the structure of models to describe the
performance of warehouse-scale and cloud applications.
Modern computer systems have become more complex than ever before, with
distributed systems becoming a mainstream computing tool. Low latency is
a crucial design goal for these systems, because users will not adopt an
interactive Web service that is slow. Understanding the performance of a
distributed system is extremely difficult because of the many
interactions between components. In this project, we will address this
problem by attempting to learn the structure of models to describe the
performance of these systems. Possible structures include networks of
nonparametric regression models, networks of queues, or more complex
performance models such as stochastic process algebras. The idea is that
the learned structure will be useful for visualisation, i.e., that it
will provide a compact, interpretable description of the system's
performance, so that performance bugs in the system will be visually
apparent as bottlenecks in the learned queueing network. Essentially,
the learned model will serve as a summary of the large amount of
performance data used to generate it. Structure learning is a
notoriously complex problem in machine learning, so this new application
may serve as a challenge problem for this area.
Unsupervised learning for hierarchical image modelling
Supervisor: Chris Williams
Develop models for shapes and appearances of image regions and objects.
It is highly desirable to frame image understanding in terms of
hierarchical generative probabilistic models. These allow top-down and
bottom-up flows of information to take place, in order to provide a
scene interpretation. Encoded within such a model would be knowledge at
various levels, e.g. lower-level models of regions and boundaries, and
at a higher level the shape and appearance of object classes, and their
contextual relationships. Due to the difficulties in obtaining
appropriate annotated data, such models should be learned in a largely
unsupervised fashion from image data. Hinton's "deep learning" agenda is
attractive here in that it provides an upgrade path from lower-level to
higher-level regularities.
The specific PhD project would develop components that would fit into
this framework; for example one might decompose an image into regions
based on visual texture, and at a higher level model the typical shapes
and appearances of co-occurring regions that arise from object classes.
Models for Understanding Time Series from Intensive Care Units
Supervisor: Chris Williams
Identifying physiological and artifactual events in patient monitoring
data so as to make "smart alarms" for medical staff possible.
Patients in intensive care are monitored by many sensors (heart rate,
blood pressure, temperature, etc.), giving rise to time-series data that
has rich structure. The goal of this project is to identify various
events in the data streams, both physiological and artifactual. If this
can be achieved reliably, then identified or predicted physiological
events could be flagged to medical staff, as a "smart alarm".
Artifactual events (such as a probe recalibration) need to be identified
and then discounted. The methods for this work will be based on the
Factorial Switching Linear Dynamical System (FSLDS; Quinn, Williams and
McIntosh, 2009), but there are many new directions to explore. The work
will be carried out in collaboration with Intensive Care Units in Scotland.
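The FSLDS itself is described in the cited paper. The sketch below only
illustrates the general model class it belongs to, a switching linear
dynamical system, by sampling from a toy single-channel version with one
"normal" regime and one "artifact" regime. Every number in it (transition
probabilities, noise levels, the baseline heart rate) and the function name
are assumptions made for illustration. Inference, i.e. recovering the regime
from the readings, is the hard part and is not shown.

```python
import random


def simulate_switching_lds(n_steps, seed=0):
    """Sample from a toy two-regime switching linear dynamical system.

    All numbers here are invented for illustration only.
    Regime 0: the monitor reading tracks the true heart rate.
    Regime 1: an artifact (e.g. a disconnected probe); the reading is
              unrelated to the true heart rate.
    """
    rng = random.Random(seed)
    stay_prob = [0.95, 0.80]   # P(regime stays the same), per regime
    obs_gain = [1.0, 0.0]      # scalar observation matrix, per regime
    obs_noise = [1.0, 0.5]     # observation noise std dev, per regime
    baseline = 70.0            # hypothetical resting heart rate (bpm)

    regime, deviation = 0, 0.0
    regimes, readings = [], []
    for _ in range(n_steps):
        # Discrete switch variable: first-order Markov chain.
        if rng.random() > stay_prob[regime]:
            regime = 1 - regime
        # Hidden continuous state: linear-Gaussian dynamics that keep
        # evolving whichever regime is active.
        deviation = 0.9 * deviation + rng.gauss(0.0, 1.0)
        true_rate = baseline + deviation
        # Observation: linear-Gaussian given the current regime.
        noise = rng.gauss(0.0, obs_noise[regime])
        readings.append(obs_gain[regime] * true_rate + noise)
        regimes.append(regime)
    return regimes, readings


regimes, readings = simulate_switching_lds(200, seed=1)
```

A "smart alarm" would work in the opposite direction: given only the
readings, estimate which regime was active at each time step.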


