Possible PhD Projects

A list of example PhD Project suggestions (including directly funded projects).

Below is a list of suggested PhD topics in machine learning and bioinformatics. This list is not exhaustive. Staff will have other ideas for projects beyond those listed here, and students are welcome to propose a project of their own. In any case, we advise that you contact a potential supervisor before you apply.

If you would like to do a PhD in a Data Science related area, consider the 4-year Data Science Ph.D. programme.

Computational epigenetics

Supervisor: Guido Sanguinetti

The overwhelming majority of quantitative biology has focused on studying molecules like mRNA, which decay within hours at most. How can this help us explain phenomena that take years to establish, e.g. ageing, cancer, neurodegenerative diseases? People increasingly think that a determining factor is so-called "epigenetics", i.e. changes in the spatial organisation or chemical state of DNA (e.g. how it is wrapped around histones, its methylation state; for a very accessible review see here). Data about these epigenetic factors are becoming increasingly available thanks to next-generation sequencing. Can we use computational methods to discover whether there are networks connecting these various epigenetic factors, and connecting epigenetics with genetics?

Machine learning for spatio-temporal systems

Supervisor: Guido Sanguinetti

Advances in remote sensing technologies mean that there is an increasing number of data sets detailing physical processes at fine spatial and temporal resolution. As an example, our collaborator Dr John Quinn, Makerere University Kampala, is gathering a very large data set in the following way: farmers in Uganda often own GPS phones, and they are asked to send photographs of suspect cassava plants (the main staple in East Africa) to a server in Kampala, where a computer vision algorithm classifies the pictures into a number of disease classes. We therefore get a nation-scale data set of occurrences of diseased plants as events in space and time. How do we analyse such data and extract information, e.g. about the dynamics of the spread? Can we make online predictions that are useful to decision makers? I would be very interested in working on these questions, perhaps building on this online general estimation tool for a class of spatio-temporal models I recently worked on with collaborators in systems engineering.

Integrating Genetic, Phenotypic and Clinical Data to Improve our Understanding of Autism

Supervisor: Ian Simpson

The growing number of large cohort genetic studies for neurological diseases that include detailed phenotypic and clinical information represents a tremendous opportunity to bring computer science and machine learning methods to bear on joint analysis, to derive new insights into disease aetiology and to inform future patient care. Several recent studies have illustrated the potential of such methods, with particular interest in the use of text-mining with literature and electronic health records (1-3). There are currently no significant studies that apply these approaches directly to autism, though at least one proof-of-concept approach gives a tantalising insight into the potential utility for both basic scientists and clinicians (4). The Developmental Disorders Genotype to Phenotype (DDG2P) and SFARI-base databases contain genetic and phenotypic data for patients with intellectual disability (5) (13,500 probands) and autism (6) (c. 50,000 patients) respectively. These initiatives have also collected behavioural, medical history, treatment, developmental milestone, growth, age and gender information for patients which, if linked to existing genotypic and phenotypic data, will allow us to quantitatively evaluate the relationships between these features at the level of patients. To achieve this, these new data need to be efficiently and reliably extracted using text-mining approaches that draw on the domain expertise of expert clinical curators and computer scientists. The benefit of creating these large linked data sets for ASD is that they can be used in a wide range of downstream application areas: gene prioritisation, pathway analysis, mechanistic modelling, clinical profiling and diagnosis, treatment options and efficacy, and the identification of new outcome measures for clinical trials.

Using high-performance computers to analyse next-generation sequencing data

Supervisor: Ian Simpson

The emergence of rapid high-throughput assay technologies for biomolecules has facilitated a revolution in the kinds of questions that can be asked of biological systems. The predominant types of data include short nucleic acid sequence fragments derived from genomic DNA or RNA molecules, and metabolite or peptide data. All of these are typically large in terms of the number of elements, but more importantly they harbour complex information about the samples from which they are derived. The challenge is both to modify existing analytical methods and to create novel ones that can be efficiently applied to these kinds of data in a way that informs us about the underlying biology. In this project we want to study how we can harness the power of high-performance computers to improve genome and transcriptome assembly, comparative analysis, classification and regression, validation and cross-comparison across multiple next-generation data sets. This will involve the development of novel algorithms, statistics and parallel programming methods with a view to delivering software packages and applications that can be applied to next-generation data on HPC architectures by expert and non-expert users.

Machine Learning Markets

Supervisor: Amos Storkey

Develop methods for large scale development of structured machine learning.

There are a number of stumbling blocks to progress in machine learning and statistical modelling. These include the existence of a plethora of algorithms combined with poor knowledge of the performance of most of them, and the fact that machine learning is typically done from scratch on each new problem. Machine Learning Markets help overcome these issues. Machine Learning Markets involve extending prediction market mechanisms for doing machine learning. They are a meta-modelling approach, and provide principled methods for combining models, building hierarchical models, and deriving new features. Because the markets are robust to new agents joining or leaving, they can continuously improve as modelling capability improves. This PhD will involve integrating ideas from machine learning, economics, game theory, statistical physics, information theory and numerical analysis to establish both the theoretical basis for machine learning markets and the practical development of them.

Direct funding is available for this project from Microsoft Research, Cambridge.

Deep Learning for Sequences and Diffusions

Supervisor: Amos Storkey

Feature production methods for sequences.

Recently, the development of hierarchical models for unsupervised learning has been improved through the use of a variety of deep learning processes that build the unsupervised model in a layer-wise manner. We will investigate the development of deep hierarchical models for sequences, including image sequences and music sequences. We will look at different forms of hierarchical models, develop novel models, and compare the relative performance of these model forms. A qualified student may also be interested in examining the implications of these models from a computational neuroscience perspective.

Statistical NLP for Programming Languages

Supervisor: Charles Sutton

Find syntactic patterns in corpora of programming language text.

The goal of this project is to apply advanced statistical techniques from natural language processing to a completely different and new textual domain: programming language text. Think about how you program when you are using a new library or new environment for the first time. You "program by search engine", i.e., you search for examples of people who have used the same library, and you copy chunks of code from them. I want to systematise this process, and apply it at a large scale. We have collected a corpus of 1.5 billion lines of source code from 8000 software projects, and we want to find syntactic patterns that recur across projects. These can then be presented to a programmer as she is writing code, providing an autocomplete functionality that can suggest entire function bodies. Statistical techniques involved include language modelling, data mining, and Bayesian nonparametrics. This also raises some deep and interesting questions in software engineering, e.g., why do syntactic patterns occur in professionally written software when they could be refactored away?
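To give a flavour of what language modelling over source code means, here is a minimal sketch of a bigram model that counts which token follows which and suggests likely next tokens. It is only an illustration under simplifying assumptions, not the method used in the project: the tokeniser, the function names (tokenize, train_bigram, suggest) and the three-line toy corpus are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(code):
    # Hypothetical, very rough tokeniser: identifiers, integer literals,
    # and any other non-space character as a single-character token.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def train_bigram(snippets):
    # counts[prev][nxt] = number of times token `nxt` directly followed `prev`.
    counts = defaultdict(Counter)
    for snippet in snippets:
        tokens = tokenize(snippet)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def suggest(counts, prev_token, k=3):
    # Return up to k tokens most frequently seen after prev_token.
    followers = counts.get(prev_token, Counter())
    return [tok for tok, _ in followers.most_common(k)]


# Toy corpus standing in for a real collection of source files (assumption).
corpus = [
    "for i in range(n): total += i",
    "for x in items: print(x)",
    "for line in f: print(line)",
]
model = train_bigram(corpus)
print(suggest(model, ":", k=1))
```

A model this small only captures adjacent-token habits. Suggesting whole function bodies, as described above, would need far richer models and a far larger corpus.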

Structure Learning for Computer Systems

Supervisor: Charles Sutton

Automatically determine the structure of models to describe the performance of warehouse-scale and cloud applications.

Modern computer systems have become more complex than ever before, with distributed systems becoming a mainstream computing tool. Low latency is a crucial design goal for these systems, because users will not adopt an interactive Web service that is slow. Understanding the performance of a distributed system is extremely difficult because of the many interactions between components. In this project, we will address this problem by attempting to learn the structure of models to describe the performance of these systems. Possible structures include networks of nonparametric regression models, networks of queues, or more complex performance models such as stochastic process algebras. The idea is that the learned structure will be useful for visualisation, i.e., that it will provide a compact, interpretable description of the system's performance, so that performance bugs in the system will be visually apparent, for example as bottlenecks in a learned queueing network. Essentially, the learned model will serve as a summary of the large amount of performance data used to generate it. Structure learning is a notoriously complex problem in machine learning, so this new application may serve as a challenge problem for this area.

Models for Understanding Time Series from Intensive Care Units

Supervisor: Chris Williams

Identifying physiological and artifactual events in patient monitoring data so as to make "smart alarms" for medical staff possible.

Patients in intensive care are monitored by many sensors (heart rate, blood pressure, temperature, etc.), giving rise to time-series data that have rich structure. The goal of this project is to identify various events in the data streams, both physiological and artifactual. If this can be achieved reliably, then identified or predicted physiological events could be flagged to medical staff as a "smart alarm". Artifactual events (such as a probe recalibration) need to be identified and then discounted. The methods for this work will be based on the Factorial Switching Linear Dynamical System (FSLDS; Quinn, Williams and McIntosh, 2009), but there are many new directions to explore. The work will be carried out in collaboration with Intensive Care Units in Scotland.