Department of Mathematics

MSc Projects

MSc project list for MSc Data Science and MSc Human and Social Data Science 

Please find the current list of suggested MSc projects below:

For further details, please speak to the project supervisor using the contact details on their profile page.

You should email your project choice by Monday 31st March 2022 to your Course Convenor (Prof Enrico Scalas) and the PGT Office.

Projects
Mr Colin Ashby

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Flexible OCR From Historic Scientific American

The overall aim of the project is to analyse the use of, and attitudes to, items of technology throughout the years. We have access to the scanned pages of Scientific American from 1845 to the present day. OCR has been performed with a commercial product, achieving around 80% accuracy. The format and vocabulary have varied over time, somewhat confounding efforts to extract and analyse the text. Techniques from Computer Vision may be applied to segment different articles, especially in the older pages, which are more densely packed. Techniques from NLP may be able to enhance the OCR result by dealing better with unknown words that are either mis-recognised or whose spelling has changed over time.
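
As a rough illustration of the NLP angle, the sketch below fuzzy-matches suspect OCR tokens against a period-appropriate vocabulary; the vocabulary file name and the similarity cutoff are assumptions for illustration only, not part of any existing pipeline.

```python
# Minimal sketch: correct suspect OCR tokens by fuzzy-matching them against a
# period-appropriate word list. The vocabulary file and cutoff are illustrative.
import difflib
import re

def load_vocabulary(path="vocab_1845_1900.txt"):  # hypothetical word list
    with open(path, encoding="utf-8") as f:
        return set(line.strip().lower() for line in f if line.strip())

def correct_token(token, vocab, cutoff=0.85):
    """Return the token unchanged if known, else the closest vocabulary word."""
    word = token.lower()
    if word in vocab or not word.isalpha():
        return token
    candidates = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return candidates[0] if candidates else token

def correct_text(ocr_text, vocab):
    tokens = re.findall(r"\w+|\W+", ocr_text)          # keep punctuation/whitespace
    return "".join(correct_token(t, vocab) if t.strip().isalpha() else t
                   for t in tokens)
```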

Analysing The Emergence Of The AI Industry

Build a picture of the emerging AI sector by analysing mergers, acquisitions and company data.

Use crunchbase.com and/or extracted entity lists, linked to company website scrapes (or perhaps golden.com), to extract merger and acquisition information for companies in the AI sector. Information of interest includes company size and area of expertise, in order to analyse whether takeovers are hostile or complementary.

Examination Of Psychology Methods

This project looks into:

• How the lack of robustness of research models may have contributed to the current issues that psychology and social sciences are experiencing with replicability and credibility

• We’ve already conducted a couple of studies to see whether researchers attend to the statistical assumptions of their models and related issues with robustness. The general conclusion is: they don’t, and when they do, they only rarely detect any real problems - not because the problems aren’t there but because they are using inappropriate methods.

Aims of the study:

• We want to get a fairly comprehensive overview of the reported statistical practice around the models used by pulling information out of papers published in the past 10 years.

• The kinds of things we want to know include:

o What kind of models are typically applied, and how has this changed over the years?

o How often do researchers use robust models instead of OLS (Ordinary Least Squares) models?

o For OLS models, how often are assumption checks and outlier checks reported? What kind of methods are being used to deal with potential problems? (A minimal sketch of an OLS vs robust comparison follows below.)
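
A minimal sketch of the kind of OLS vs robust comparison and assumption check referred to above, using synthetic data with heavy-tailed errors (statsmodels; all data and settings are illustrative):

```python
# Minimal sketch (illustrative, synthetic data): compare an OLS fit with a
# robust (Huber) fit and run one standard assumption check, the kind of
# reporting this project would tally across published papers.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=200)   # heavy-tailed errors
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
robust = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

# Breusch-Pagan test for heteroscedasticity on the OLS residuals
bp_stat, bp_pvalue, _, _ = het_breuschpagan(ols.resid, X)

print("OLS slope:   ", ols.params[1])
print("Huber slope: ", robust.params[1])
print("Breusch-Pagan p-value:", bp_pvalue)
```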

Bring Your Own Project

I'm interested in discussing any project ideas around Information Extraction from the Internet, Named Entity Recognition, Speech Recognition or Natural Language Processing in general. Also applications developed around these technologies.

Dr Adam Barrett

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Predictive modelling of depression and anxiety across the life course

What are the factors in childhood and adolescence that can predict the onset and persistence of depression and/or anxiety across the life span? This project will tackle this question by analysing the National Child Development Study (British 1958 birth cohort) dataset [1], which has a large amount of longitudinal data on a cohort of over 17,000 people born in the UK in 1958, e.g. on physical and educational development, economic circumstances, health behaviour and family life.

[1] https://cls.ucl.ac.uk/cls-studies/1958-national-child-development-study/ Data: British 1958 birth cohort

Data analytics on inflation and national debt

The pandemic has triggered an enormous shock to the global economy, and it is challenging to understand how governments’ economic policy should respond. Governments have provided a lot of stimulus to protect livelihoods, yet there is a danger now that rising inflation could lead to a choking off of credit, both public and private. This project will explore two new comprehensive databases produced by the World Bank, on inflation [1] and fiscal space [2], covering over 190 countries. Data science / machine learning will be applied to understand the dynamics of different measures of inflation at the national and international level, and the relation to public and private debt dynamics. See [3] for a previous historical case-study on Canada, and [4] for a paper on applications of machine learning at the Bank of England.

[1] https://www.worldbank.org/en/research/brief/inflation-database

[2] https://www.worldbank.org/en/research/brief/fiscal-space

[3] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2679090

[4] https://www.bankofengland.co.uk/working-paper/2017/machine-learning-at-central-banks

Emergence in machine/deep learning

What machine/deep learning techniques can we use to generate “emergent” properties?

In this project, the core task is to conduct a thorough survey of the landscape of machine/deep learning techniques that can help us generate properties associated with emergence in either simulated or empirical data. An emergent property, broadly speaking, is one that applies only to an entire set of elements forming a system, but not to the elements themselves. Imagine a set of dots whose arrangement yields a circle – the property of “being a circle” can only be attributed to the whole set of dots, not to single dots. While this is a trivial example of emergence (used only to set intuitions), “truly” emergent phenomena most likely involve interesting interactions of the elements at the lower level, thereby giving rise to new, unexpected higher-level properties. Think, for instance, of a flock of starling birds, which seems to have a “life of its own”.

Thus, macro properties may or may not be emergent, according to some (non-trivial) definition of emergence. The degree to which a candidate macro property is emergent can be quantified – see, e.g., causal emergence (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008289) or dynamical independence (https://arxiv.org/abs/2106.06511), which are two different “flavours” of emergence involving different aspects to characterize the emergent phenomenon in question. The diversity of aspects of emergence, including possible attempts to quantify those aspects, is currently under active investigation.

One challenge in studying current proposals for measures of emergence is the difficulty of finding suitable macro variables in either simulated or empirical data to probe for their degree of emergence. A robust measure should yield zero where we deem the macro property in question to be trivial and thus not emergent. Conversely, it should yield results above zero where we can fairly assume that the macro property in question is an emergent one. Whereas we may distinguish emergent phenomena from non-emergent ones in the natural world (think of the starling flock, which we intuitively deem emergent), we are less able to do so in the realm of computer-generated systems or noisy empirical data. What methods from machine/deep learning can we use to generate macro variables we potentially deem interesting in the context of emergence? The unsupervised learning procedure proposed in the paper introducing dynamical independence might be a starting point – what other methods can we use to help the discovery of emergent macro variables?

To navigate this search, both Matlab and Python code are available to apply causal emergence to macro variables of interest and guide judgements about suitability. Anyone interested in this project would need to be able to implement simulation models and/or use empirical data, as well as the machine/deep learning methods of interest (a few simulation models to start with, as well as experimental neuronal datasets, are available). Prerequisites are intermediate coding skills, intermediate to advanced knowledge and experience in machine/deep learning, and an interest in the concept of emergence and its quantification.

Prof Luc Berthouze

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Exploring The Science Behind Cancer Diagnostics Through Network Modelling

This project aims at exploring the scientific field behind three types of cancers (lung, prostate and colorectal) through the use of publication data from 1990 to 2015. It builds on a publication dataset of ~370 000 publications. A large number of variables are available, such as co-authorship (between individuals and organisations), publication year, cited references, discipline of the paper/health classification...

The two main areas of interest are to (i) explore how the specific field of diagnostics has evolved within these three cancers and (ii) whether there is any specific area in cancer which is more attractive to companies. Analysis of the data will involve the deployment of text mining/processing techniques, citation and co-citation analysis as well as state of the art methods from network science such as centrality measures and modularity-based community structure detection.

Estimation Of Dynamical Model Of Synchronisation

It is quite commonly accepted that synchronisation of neural oscillations underlies many brain functions (with excessive synchronisation being characteristic of a number of neurological disorders, e.g., epilepsy, Parkinson's Disease - PD). A lot of the work in this area relies on models involving phase oscillators (e.g., Kuramoto oscillators) seeking to represent the activity of weakly coupled oscillators. A number of techniques have been proposed to try to estimate the parameters of such phase oscillator models from real neural data. One such method is [1] (and see also [2] for an earlier, simpler, version). The aim of this project will be to replicate this method and then to investigate its behaviour (and robustness) when presented with data from a system at or near a critical transition. There will also be the possibility of applying it to data from healthy and PD brains.
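
As a toy illustration of the kind of synthetic data on which such an estimation method could be tested, the sketch below simulates two noisy, asymmetrically coupled phase oscillators; all parameters are illustrative and this is not the method of [1] itself.

```python
# Minimal sketch: simulate two noisy coupled phase oscillators (a toy Kuramoto
# system) to generate data for testing a phase-interaction estimation method.
import numpy as np

def simulate_phases(T=60.0, dt=1e-3, w=(2*np.pi*6.0, 2*np.pi*10.0),
                    k=(1.5, 0.5), noise=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    theta = np.zeros((n, 2))
    for t in range(1, n):
        d01 = np.sin(theta[t-1, 1] - theta[t-1, 0])   # influence of oscillator 2 on 1
        d10 = np.sin(theta[t-1, 0] - theta[t-1, 1])   # influence of oscillator 1 on 2
        drift = np.array([w[0] + k[0] * d01, w[1] + k[1] * d10])
        theta[t] = theta[t-1] + drift * dt + noise * np.sqrt(dt) * rng.normal(size=2)
    return theta

theta = simulate_phases()
# A fitted model should recover the asymmetric coupling strengths k = (1.5, 0.5).
```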

References:

[1] Onojima T, Goto T, Mizuhara H, Aoyagi T (2018). A dynamical systems approach for estimating phase interactions between rhythms of different frequencies from experimental data. PLoS Computational Biology, 14(1), e1005928.

[2] Ota K, Aoyagi T (2014). Direct extraction of phase dynamics from fluctuating rhythmic data based on a Bayesian approach. arXiv preprint arXiv:1405.4126.

Helpful supporting material: http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005928.s003&type=supplementary

Skills: Interest in complex systems, Bayesian statistics, synchronisation

Assessing Synchronisation In Spike Trains

The metric of coherence [1] is commonly used in neurophysiology to quantify synchronisation by characterising the presence of a linear interaction between point processes, time series, or hybrids of the two. It has been used (among other things) to describe the presence of synchronisation between motor cortex and spinal motor neurones as well as between motor units, and how such synchronisation changes with age and with task. This is important because dynamic synchronisation is hypothesised to underlie communication in the brain — the so-called communication through coherence hypothesis [2]. The metric of coherence comes with confidence limits; however, the derivation of those limits involves a log transform of the spectra, the reliability of which depends on the availability of a sufficiently long record [3]. Recently, we have introduced a new time-domain measure, which is insensitive to the density of events and/or the length of the record [4]. The goal of this project is to empirically compare both approaches. This will be done using two sets of data: (a) synthetic data from the so-called common shock model [5], a mathematical equivalent of the physiological concept of common drive [6]; and (b) real data from the seminal paper of Rosenberg et al. [1]. This project will involve a collaboration with Dr Simon Farmer at UCL.
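
A minimal sketch of the coherence side of the comparison, using synthetic spike trains built around a shared "common shock" component (illustrative rates and bin sizes; not the actual datasets mentioned above):

```python
# Minimal sketch: generate two spike trains sharing a common-shock component,
# bin them, and estimate their coherence with Welch's method.
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(1)
fs, duration = 1000, 100               # 1 ms bins, 100 s record
n = fs * duration
common = rng.random(n) < 0.005         # shared (common-shock) events
t1 = common | (rng.random(n) < 0.01)   # independent events added to each train
t2 = common | (rng.random(n) < 0.01)

f, Cxy = coherence(t1.astype(float), t2.astype(float), fs=fs, nperseg=4096)
print("peak coherence:", Cxy.max())
```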

References

[1] Rosenberg et al. Progress in Biophysics and Molecular Biology. 1989 Jan 1;53(1):1–31.

[2] Fries. Neuron. 2015 Oct 7;88(1):220–35.

[3] Halliday et al. Prog Biophys Mol Biol. 1995;64(2–3):237–78.

[4] Messager et al. 2019. https://arxiv.org/abs/1904.04813v2

[5] Genest et al. Statistics & Probability Letters. 2018 Sep 1;140:202–9.

[6] Sears and Stagg. The Journal of Physiology. 1976 Dec 1;263(3):357–81.

Skills: Interest in complex systems, synchronisation, computational neuroscience; Reasonable Python skills

Dr Masoumeh Dashti

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Statistical inference of the permeability of the rocks in a subsurface flow problem from measurements of the pressure of the flow

Consider the problem of describing the motion of a fluid (water or oil, for instance) through subsurface rocks. To solve this problem, one needs to know the permeability properties of the rocks through which the fluid moves. It is, however, very difficult to measure this directly. What one can measure instead is the pressure of the flow through the rocks (by drilling holes in the ground and reaching the flow). The pressure and the permeability functions are related by Darcy's law, and the problem of interest in this project is recovering the permeability from measurements of the pressure. We will use a Bayesian approach to recover the permeability function. This includes: 1) learning about the fundamentals of the Bayesian approach to inverse problems for functions and the appropriate MCMC methods for such problems; and 2) writing a Python code to estimate the permeability function in a simplified version of the subsurface flow problem described above.
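
As a hedged illustration of step 1, the sketch below runs a preconditioned Crank-Nicolson (pCN) sampler on a toy problem in which the forward map is a placeholder linear operator; in the actual project the forward map would come from numerically solving the Darcy pressure equation.

```python
# Minimal sketch of a pCN sampler on a toy problem: infer coefficients u
# (standing in for a discretised log-permeability with a Gaussian prior) from
# noisy observations y = G(u) + noise, where G is a placeholder linear map.
import numpy as np

rng = np.random.default_rng(0)
d = 20                                    # dimension of discretised unknown
A = rng.normal(size=(5, d)) / np.sqrt(d)  # placeholder forward map G(u) = A u
u_true = rng.normal(size=d)
sigma = 0.05
y = A @ u_true + sigma * rng.normal(size=5)

def misfit(u):                            # negative log-likelihood Phi(u)
    r = y - A @ u
    return 0.5 * np.sum(r**2) / sigma**2

beta, n_steps = 0.2, 20000
u = rng.normal(size=d)                    # start from a prior draw
samples = []
for _ in range(n_steps):
    prop = np.sqrt(1 - beta**2) * u + beta * rng.normal(size=d)  # pCN proposal
    if np.log(rng.random()) < misfit(u) - misfit(prop):          # accept/reject
        u = prop
    samples.append(u.copy())

posterior_mean = np.mean(samples[n_steps // 2:], axis=0)
```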

References:

[1] A. M. Stuart. Inverse problems: A Bayesian perspective. Acta Numerica, 19:451–559, 2010.

[2] http://smai.emath.fr/cemracs/cemracs21/data/presentation-speakers/dashti.pdf

Dr Darya Gaysina

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Predictive modelling of depression and anxiety across the life course

What are the factors in childhood and adolescence that can predict the onset and persistence of depression and/or anxiety across the life span?

This project will tackle this question by analysing the National Child Development Study (British 1958 birth cohort) dataset [1], which has a large amount of longitudinal data on a cohort of over 17,000 people born in the UK in 1958, e.g. on physical and educational development, economic circumstances, health behaviour and family life.

[1] https://cls.ucl.ac.uk/cls-studies/1958-national-child-development-study/

Dr Nicos Georgiou

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Traffic flow models via totally asymmetric simple exclusion processes

The totally asymmetric simple exclusion process (TASEP) is a stochastic particle system in which particles move in only one direction, without being able to overtake each other. The model has been used several times to model traffic flow on narrow highways, for which there are rigorous mathematical results, and has even been implemented to make predictions about traffic in city grids, though in that case without the mathematical rigour.

The goal of this project is three-fold. First, there is the theoretical component of understanding the mathematics behind the hydrodynamic limits of the particle system and finding the limiting PDE. Second, we will use freely available traffic data and develop statistical tests to identify and estimate the relevant parameters that appear in the hydrodynamic limit above. The third is to develop Monte Carlo algorithms that take the estimated parameters, build the stochastic model, and show us the traffic progression in a given road network.
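
For the third component, a minimal Monte Carlo sketch of TASEP on a ring with random-sequential updates is given below; it is a simplified stand-in for a full continuous-time (Gillespie) simulation, and the density and rate are illustrative.

```python
# Minimal sketch: discrete-time, random-sequential-update TASEP on a ring.
import numpy as np

def simulate_tasep(L=200, density=0.3, p=1.0, sweeps=5000, seed=0):
    rng = np.random.default_rng(seed)
    sites = np.zeros(L, dtype=int)
    sites[rng.choice(L, int(density * L), replace=False)] = 1
    jumps = 0
    for _ in range(sweeps):
        for _ in range(L):                        # one sweep = L update attempts
            i = rng.integers(L)
            j = (i + 1) % L
            if sites[i] == 1 and sites[j] == 0 and rng.random() < p:
                sites[i], sites[j] = 0, 1         # particle hops to the right
                jumps += 1
    return jumps / (sweeps * L)                   # average current per site

# The hydrodynamic flux for TASEP is p * rho * (1 - rho); compare:
print(simulate_tasep(), 1.0 * 0.3 * 0.7)
```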

Helpful mathematical background: random processes, Monte Carlo simulations, statistical inference.

Bibliography:

[1] N. Georgiou, R. Kumar and T. Seppäläinen TASEP with discontinuous jump rates https://arxiv.org/pdf/1003.3218.pdf

[2] H.J. Hilhorst and C. Appert-Rolland, A multi-lane TASEP model for crossing pedestrian traffic flows https://arxiv.org/pdf/1205.1653.pdf

[3] J.G. Brankov, N.C. Pesheva and N. Zh. Bunzarova, One-dimensional traffic flow models: Theory and computer simulations. Proceedings of the X Jubilee National Congress on Theoretical and Applied Mechanics, Varna, 13-16 September, 2005(1), 442–456.

Prof Peter Giesl

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Deep neural networks, reinforcement learning and Lyapunov functions

Recently, the areas of neural networks and reinforcement learning on the one hand, and Lyapunov functions on the other, have been used to support each other. Lyapunov functions are a tool in dynamical systems to certify stability: they are scalar-valued functions that decrease along solutions of a dynamical system.

Lyapunov functions have been used to certify the safety of the results obtained by reinforcement learning [1,2]. Conversely, deep neural networks have been used to compute Lyapunov functions for dynamical systems [3].

The aim of the dissertation is to explore these relations both theoretically and through examples. For this project, previous knowledge of dynamical systems is very useful.
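
A minimal PyTorch sketch of the idea in [3], under toy assumptions: train a small network V(x) to be positive and to decrease along trajectories of a simple system that is known to be stable; the system, margins and sampling region are illustrative choices, not the construction used in the paper.

```python
# Minimal sketch: learn a candidate Lyapunov function V(x) for dx/dt = f(x).
import torch

def f(x):                                   # a simple globally stable toy system
    return -x + 0.1 * torch.sin(x)

V = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(),
                        torch.nn.Linear(32, 1))
opt = torch.optim.Adam(V.parameters(), lr=1e-3)

for step in range(3000):
    x = 4 * torch.rand(256, 2) - 2          # sample states in [-2, 2]^2
    x.requires_grad_(True)
    v = V(x)
    grad_v = torch.autograd.grad(v.sum(), x, create_graph=True)[0]
    vdot = (grad_v * f(x)).sum(dim=1, keepdim=True)   # dV/dt along trajectories
    loss = (torch.relu(-v + 0.01).mean()              # push V(x) > 0
            + torch.relu(vdot + 0.01).mean()          # push dV/dt < 0
            + V(torch.zeros(1, 2)).abs().mean())      # pin V(0) = 0
    opt.zero_grad()
    loss.backward()
    opt.step()
```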

Bibliography:

[1] Theodore Perkins and Andrew Barto. Lyapunov design for safe Reinforcement Learning. J. Machine Learning Research 3 (2002), 803-832.

[2] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. 32nd conference on neural information Processing Systems, Montreal, Canada, 2018.

[3] Lars Grüne. Computing Lyapunov functions using deep neural networks. Journal of Computational Dynamics 8 (2021), 131-152.

Mr Anthony Groves and Prof Enrico Scalas

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Library Search: what's the use?

In 2015 the University of Sussex Library began using the Ex Libris Primo tool to make our collections discoverable to users, in the shape of Library Search. Primo is widely used by academic libraries across the UK and there is a need to better understand how it is being used in order to provide effective support and instruction: what keywords and search strategies are people using; are Boolean operators and wildcards being added? How are searches being modified when too many results are returned? We have access to data about millions of searches that have been performed and would like to work with you to find out what this can tell us about our community’s search behaviour. If you would be interested in working with Library data for your dissertation and are interested in the search behaviour of your fellow students, please get in touch!

Data: Millions of searches by our users
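
As a hedged sketch of what a first pass over the search logs might look like, the snippet below flags Boolean operators, wildcards and phrase searches; the file name and column names are assumptions about the export format, not the actual Primo schema.

```python
# Minimal sketch (hypothetical export and column names): characterise queries.
import pandas as pd

searches = pd.read_csv("primo_searches.csv")           # assumed export, one row per search
q = searches["query"].fillna("").str.strip()            # assumed query-text column

searches["uses_boolean"] = q.str.contains(r"\b(?:AND|OR|NOT)\b", regex=True)
searches["uses_wildcard"] = q.str.contains(r"[\*\?]", regex=True)
searches["phrase_search"] = q.str.contains(r'"[^"]+"', regex=True)
searches["n_terms"] = q.str.split().str.len()

print(searches[["uses_boolean", "uses_wildcard", "phrase_search"]].mean())
```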

Exploring usage of Library e-resources

More than half of the Library’s knowledge base budget is used on subscriptions to electronic journals and database packages. These packages can be extremely expensive, so it is important for the Library to be able to understand how often, and by which schools, the resources are used. For its physical book stock, the Library has always had detailed statistics on how often books are borrowed. However, this is not the situation with its electronic holdings. Some of the suppliers of electronic materials provide overview statistics of how often they have been used, but this lacks any information about when resources have been accessed, and by which category of Library user (Undergraduate/Postgraduate, School of study). Since April, the Library has been populating a MySQL database with this information and we would be very interested in seeing whether the data contains any patterns that might inform the Library on the relevance and cost effectiveness of its electronic holdings. The sort of questions that might be explored are: Which resources provide the best and worst value for money? Are any departments particularly high users? Do any Schools or Departments have a low usage level? This could inform our liaison and promotion work with schools.

Data: e-resources access

Prof Tim Hitchcock and Prof Enrico Scalas

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Digital History: The Bills of Mortality, from 1657 to 1758

This project concerns digital history [1] and focuses on the Bills of Mortality [2]. The 'Bills' contain a consistent record of the causes of death for Londoners between 1657 and 1758 - including the period of the plague in 1665 - and collectively build a unique and historically significant picture of health and mortality in the past. The purpose of the project is to digitise, plot and analyse part of these data to gain insights into health problems in the early modern era and to identify hitherto unrecognised patterns in the data. Other databases are available to study if several students are interested in working on a project in digital history. Please feel free to contact us.

[1] C. Annemieke Romein et al., State of the Field: Digital History, History, 2020. https://onlinelibrary.wiley.com/doi/full/10.1111/1468-229X.12969

[2] A Collection of the Yearly Bills of Mortality, from 1657 to 1758 inclusive. Together with several other Bills of an earlier Date, A. Millar, London, 1759. https://www.dropbox.com/s/i2wyj07eu6gpkyw/Bills%20of%20Mortality.pdf?dl=0

Public dataset: https://reshare.ukdataservice.ac.uk/854104/

Dr Hsi-Ming Ho

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

NLP meets temporal logics.

Recently there have been some attempts to apply modern NLP / machine learning techniques to traditional logical/mathematical reasoning tasks, e.g., simple arithmetic or Boolean formula satisfiability. A recent work, which applies the Transformer architecture to the problem of finding solutions to temporal logic formulas, can be found at the link below:

https://openreview.net/forum?id=dOcQK-f4byz

The goal of the project is to advance the state of the art by applying similar techniques to more expressive temporal logics (e.g., with timing constraints). Results are likely to degrade as the formulas become more complicated, so it is expected that some novel ideas will be needed to improve performance.

Document layout analysis for exam scripts

Document image analysis (DIA) plays a key role in modern social sciences and humanities research. While there have been many success stories in various individual projects based on recent advances in using deep learning for DIA, many of these required specialised training or post-processing steps to achieve satisfactory performance. The goal is to make use of a recently proposed unified framework for DIA [1] to develop a semi-automated workflow for exam marking, based on the assumption that scanned PDF exam scripts have rather restricted (and known) possible layouts.

[1] Shen et al. LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. https://arxiv.org/pdf/2103.15348.pdf
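
A minimal sketch of how the toolkit of [1] might be applied to one scanned exam-script page, following the usage shown in the LayoutParser README; the model path, label map and image file below are illustrative assumptions, and the exact API should be checked against the current documentation.

```python
# Minimal sketch (illustrative; verify against the LayoutParser docs):
# detect layout regions on one scanned exam-script page.
import cv2
import layoutparser as lp

image = cv2.imread("exam_script_page1.png")             # assumed scanned page
image = image[..., ::-1]                                # BGR -> RGB

model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",    # pre-trained layout model
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
layout = model.detect(image)
for block in layout:
    print(block.type, block.coordinates)                # region type and bounding box
```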

Dr Helfrid Hochegger

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Mining cancer gene dependencies to identify and validate novel cell cycle regulators.

Large scale gene dependency screens have delivered a dataset for over 900 cell lines and 14,000 genes. This is a rich resource for data mining and machine learning to identify new functional associations and dependencies. Dr Hochegger and Dr Istvan Kiss have been collaborating over the past year to identify new cell cycle regulators using correlation and cluster analysis of gene dependency data in this dataset. This project aims to further develop this approach. In parallel, we aim to functionally validate already predicted cell cycle regulators in various cancer cell lines. The project will have two components: complex network analysis of gene dependency correlations, and analysis of high throughput microscopy data using Python image processing and quantification workflows.

The student will be exposed to both data science and cancer cell biology in an interdisciplinary environment. They will become fluent in applying state of the art programming languages (Python, Matlab), set up machine learning algorithms and develop new code to query cancer cell line and tumour genomics databases. In parallel, the student will learn to analyse high throughput gene depletion screens and will set up automated image segmentation and analysis workflows to detect cell cycle phenotypes. This will be done on an already identified set of approximately 30 novel cell cycle regulators, while novel hits will be generated using more refined ML algorithms. The interdisciplinary outlook of this project provides a good opportunity for the student to learn a broad scope of data science skills for a successful career as a cancer biologist.
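
A minimal sketch of the correlation/cluster step described above, under assumptions about the file layout (cell lines as rows, genes as columns) and with example seed genes; none of the names below come from the actual dataset.

```python
# Minimal sketch (hypothetical file and gene names): correlate gene dependency
# profiles across cell lines and cluster candidate co-dependent genes.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# assumed matrix: rows = cell lines, columns = genes, values = dependency scores
dep = pd.read_csv("gene_dependency.csv", index_col=0)

corr = dep.corr()                                     # gene-by-gene correlation
seed_genes = ["CDK1", "PLK1", "CCNB1"]                # example known cell cycle regulators
candidates = (corr[seed_genes].mean(axis=1)
              .drop(labels=seed_genes)
              .sort_values(ascending=False)
              .head(30))
print(candidates)

# hierarchical clustering of the co-dependency profiles of the top candidates
genes = list(candidates.index) + seed_genes
Z = linkage(corr.loc[genes, genes], method="average")
clusters = fcluster(Z, t=4, criterion="maxclust")
```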

Prof Istvan Kiss and Prof Luc Berthouze

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Epidemic and contact network parameter inference using mean-field models of epidemics on networks

Epidemic propagation on networks is a well-studied problem [1]. In particular, mean-field models have proved useful in dealing with the complexity caused by the high-dimensionality of exact models, while revealing how properties of the network (e.g. average degree, degree distribution, clustering) influence how a disease invades and spreads, and how to design optimal control measures.

In this study, however, we set out to investigate to what extent such mean-field models can be used for inference purposes. We investigate whether, and to what extent, mean-field models can be used to make useful inferences about the parameters of the disease dynamics and of the network [2]. We focus on SIR-like epidemics and will use mean-field models such as the pairwise and edge-based compartmental models. The parameter inference will be tested using synthetic data based on Gillespie simulations on different networks, as well as realistic data, such as from the ongoing COVID-19 pandemic. The inference will be done using maximum-likelihood and Bayesian methods. The main aim of the study will be to systematically map out the usefulness of mean-field-model-based inference and describe how the quality of the inference may depend on the available data, the complexity of the mean-field model and the parameters of the epidemic.
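
As a simplified illustration of the inference pipeline, the sketch below fits a plain homogeneous mean-field SIR model to synthetic daily incidence counts by maximum likelihood; it stands in for the pairwise/edge-based models and the Gillespie-simulated data the project would actually use, and all values are illustrative.

```python
# Minimal sketch: maximum-likelihood fit of a mean-field SIR model to
# synthetic daily-incidence counts.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize
from scipy.stats import poisson

N, I0, days = 10000, 10, 60

def incidence(beta, gamma):
    def rhs(t, y):
        s, i = y
        return [-beta * s * i / N, beta * s * i / N - gamma * i]
    sol = solve_ivp(rhs, [0, days], [N - I0, I0], t_eval=np.arange(days + 1))
    new_inf = -np.diff(sol.y[0])                 # daily new infections
    return np.clip(new_inf, 1e-9, None)

rng = np.random.default_rng(0)
data = rng.poisson(incidence(0.4, 0.2))          # synthetic "observed" epidemic

def neg_log_lik(params):
    beta, gamma = params
    return -poisson.logpmf(data, incidence(beta, gamma)).sum()

fit = minimize(neg_log_lik, x0=[0.3, 0.1], bounds=[(1e-3, 2), (1e-3, 2)])
print("estimated beta, gamma:", fit.x)
```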

References:

[1] Kiss, I. Z., Miller, J. C., & Simon, P. L. (2017). Mathematics of epidemics on networks. Cham: Springer, 598.

[2] Di Lauro, F., Croix, J. C., Dashti, M., Berthouze, L., & Kiss, I. Z. (2020). Network inference from population-level observation of epidemics. Scientific Reports, 10(1), 1-14.

Dr Naercio Magaia

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

An AI-based incentive framework for the next-generation vehicular networks

With the rapid development of 5G communication systems, the advancements of vehicular communication technology in assisting Intelligent Transportation Systems (ITS) can effectively improve passengers’ safety and provide infotainment for enjoyable journeys [1]. There have been significant advancements in electronics aiding car drivers, along with communications outside the vehicle, e.g., vehicle-to-everything (V2X), to enable mobile Internet and entertainment. Together with V2X connectivity, data collected from infrastructure Road Side Units and various other sensors placed on roads, buildings and pedestrians, which constitute the next-generation vehicular networks, hereafter Internet of Vehicles (IoVs), could be disseminated to the vehicle cloud, thereby ensuring efficient emergency warning notifications, managing city traffic levels, and providing drivers with congestion control information.

However, security is a key challenge in ITS for IoVs since, nowadays, there exist a plethora of security attacks that can negatively impact its reliability [2]. Some examples of common threats to vehicles in ITS include denial of service, jamming, man in the middle, Sybil, eavesdropping, and broadcast and message tampering.

When further investigating individual behaviors, and considering that each vehicle or group of vehicles is controlled by a rational entity, vehicles might misbehave (maliciously or not) by reporting false safety messages while using valid security credentials, which is critical for most ITS-based applications [3]. This problem is further exacerbated by the absence or impracticality of centralized control in vehicular environments. Conventional incentive schemes allow nodes to learn from their surroundings, i.e., from neighboring vehicles, and then assess the veracity of reported events and the honesty of vehicles. However, these schemes have their own vulnerabilities, such as false accusations and praise, which can be exploited to manipulate the entire network for the benefit of malicious vehicles/entities.

The aim of this project is to design a novel AI-based incentive framework by considering social norm complexity and past reputations [4] to improve cooperation among nodes.

References

[1] N. Magaia, Z. Sheng, P. R. Pereira, and M. Correia, “REPSYS: A robust and distributed incentive scheme for in-network caching and dissemination in Vehicular Delay-Tolerant Networks,” IEEE Wirel. Commun. Mag., pp. 1–16, 2018.

[2] C. Bernardini, M. R. Asghar, and B. Crispo, “Security and privacy in vehicular communications: Challenges and opportunities,” Vehicular Communications, vol. 10. pp. 13–28, 2017.

[3] N. Magaia and Z. Sheng, “ReFIoV: A Novel Reputation Framework for Information-Centric Vehicular Applications,” IEEE Transactions on Vehicular Technology, pp. 1–1, 2018.

[4] F. P. Santos, F. C. Santos, and J. M. Pacheco, “Social norm complexity and past reputations in the evolution of cooperation,” Nature, vol. 555, no. 7695, pp. 242–245, Mar. 2018.

A bio-inspired content dissemination mechanism for information-centric connected vehicle networks

With the rapid development of 5G communication systems, the advancements of vehicular communication technology in assisting Intelligent Transportation Systems (ITS) can effectively improve passengers’ safety and provide infotainment for enjoyable journeys. There have been significant advancements in electronics aiding car drivers, along with communications outside the vehicle, e.g., vehicle-to- everything (V2X), to enable mobile Internet and entertainment. Together with V2X connectivity, data collected from infrastructure Road Side Units and various other sensors placed on roads, buildings and pedestrians, which constitute the next-generation connected vehicle networks, could be disseminated to the vehicle cloud hence ensuring efficient emergency warning notifications, the management of the city traffic levels, and provide drivers with congestion control information.

However, the amount of data required for novel vehicular applications such as multimedia content sharing will continue to increase, along with the need to minimize latency, as a result of an increasing number of connected vehicles and more evolved use cases [1]. Although 5G mobile technology enables broad coverage and high bandwidth to provide multimedia content downloading services for moving vehicles, such networks will most probably become overloaded and congested, especially during peak times and in urban central areas, as services and user demands increase. Consequently, they will face severe performance hits in terms of low network bandwidth, missed calls, and unreliable coverage. However, the opportunistic contacts enabled by V2X communications can provide high bandwidth for the transmission of data and enable vehicles to build relationships with other objects they might come into contact with.

The aim of this project is to devise a novel bio-inspired content dissemination scheme leveraging on efficient caching mechanisms in-network (i.e., through information-centric networking [2], [3]) or at the edge of 5G networks [4].

References

[1] N. Magaia, Z. Sheng, P. R. Pereira, and M. Correia, “REPSYS: A robust and distributed incentive scheme for in-network caching and dissemination in Vehicular Delay-Tolerant Networks,” IEEE Wirel. Commun. Mag., pp. 1–16, 2018.

[2] A. Ioannou and S. Weber, “A Survey of Caching Policies and Forwarding Mechanisms in Information-Centric Networking,” IEEE Communications Surveys and Tutorials, vol. 18, no. 4. pp. 2847–2886, 2016.

[3] H. Khelifi et al., “Named Data Networking in Vehicular Ad hoc Networks: State-of-the-Art and Challenges,” IEEE Commun. Surv. Tutorials, pp. 1–1, 2019.

[4] K. Zhang, Y. Mao, S. Leng, Y. He, and Y. ZHANG, “Mobile-Edge Computing for Vehicular Networks: A Promising Network Paradigm with Predictive Off-Loading,” IEEE Veh. Technol. Mag., vol. 12, no. 2, pp. 36–44, 2017.

Prof Michael Melgaard

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Higher-Order Singular-Value Decomposition and its Use in Big Data

The singular-value decomposition (SVD) is a factorization of a real or complex matrix. It has numerous useful applications in real life, e.g., image compression, handwritten digit classification and recommendation systems. More powerful low-rank tensor decompositions have been developed; in particular, the Tucker decomposition can be seen as a higher-order SVD -- a simple and robust algorithm for obtaining quasi-optimal low-rank approximations. The aim is to study the Tucker decomposition, the higher-order SVD and, if possible, go beyond. Possible projects include applications in computer graphics, machine learning, scientific computing, and signal processing, e.g., image reconstruction and noise filtering, sensor measurements, genomic signal processing, low memory optimization etc.
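
A minimal NumPy sketch of the higher-order SVD (truncated Tucker decomposition) via mode-n unfoldings, with a random tensor and illustrative ranks:

```python
# Minimal sketch: plain-NumPy HOSVD of a 3-way tensor.
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])                        # leading left singular vectors
    core = T
    for mode, U in enumerate(factors):                  # core = T x_1 U1^T x_2 U2^T ...
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, mode)), 0, mode)
    return core, factors

T = np.random.default_rng(0).normal(size=(30, 40, 50))
core, factors = hosvd(T, ranks=(5, 5, 5))
print(core.shape, [U.shape for U in factors])
```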

Bibliography:

[1] H. Fanaee, J. Gama, EigenEvent: An algorithm for event detection from complex data streams in Syndromic surveillance. Intelligent Data Analysis. 19 (2015), no 3, 597-616.

[2] B. Huber, R. Schneider, S. Wolf, A randomized tensor-train singular-value decomposition. Springer International Publishing, Cham, 2017.

[3] Y. Li, H.L. Nguyen, D.P. Woodruff, Turnstile streaming algorithms might as well be linear sketches. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pp. 174-183. ACM, 2014.

[4] O.A. Malik, S. Becker, Low-rank tucker decomposition of large tensors using tensorsketch. In Advances in Neural Information Processing Systems, (pp. 10116--10126), 2018.

[5] G. Zhou, A. Cichocki, S. Xie, Decomposition of big tensors with low multilinear rank (2014).

Tensor Methods in Deep Learning

Tensor methods are increasingly finding significant applications in deep learning, including the design of memory and compute efficient network architectures, improving robustness to random noise and adversarial attacks, and aiding the theoretical understanding of deep networks. The aim is to study how tensor methods can be used in deep learning and/or in probabilistic modeling. Possible projects include: supervised/unsupervised learning, grid-search and/or DMRG-type algorithms, hidden Markov models, convolutional rectifier networks etc.

Bibliography:

[1] Bahadori, M. T., Yu, Q. R., and Liu, Y. (2014). Fast multivariate spatio-temporal analysis via low rank tensor learning. In NIPS.

[2] N. Cohen, A. Shashua. Convolutional rectifier networks as generalized tensor decompositions. arXiv preprint arXiv:1603.00162, 2016.

[3] Kuznetsov, Maxim A.; Oseledets, Ivan V. Tensor train spectral method for learning of hidden Markov models (HMM). Comput. Methods Appl. Math. 19 (2019), no. 1, 93–99.

[4] A. Novikov, M. Trofimov, and I. Oseledets. Exponential Machines. arXiv e-prints, art. arXiv:1605.03795, May 2016.

[5] B. Romera-Paredes, M. H. Aung, N. Bianchi-Berthouze, M. Pontil (2013). Multilinear multitask learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1444-1452.

[6] E. M. Stoudenmire and D. J. Schwab. Supervised learning with tensor networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4799–4807. Curran Associates, Inc., 2016.

Dimensionality Reduction using Higher-Order Tensors

Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. This project reviews dimensionality reduction, e.g., Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA) and/or Locally Linear Embedding (LLE), before it focusses on recent advances using tensor formats (or decompositions), e.g, the canonical format, the tensor train format, the Tucker representation and the hierarchical Tucker format, etc. The project could be directed towards different applications depending on the student's interest, e.g., neuroscience (neural data, medical images etc), biology (low-rank tensor model for gene–gene interactions) or multichannel EEG signals.

Bibliography:

[1] Y-H. Cheng, T-M. Huang, S-Y. Huang, Tensor decomposition for dimension reduction, WIREs Comput Stat. 2020; 12:e1482.

[2] M. Udell, C. Horn, R. Zadeh, and S. Boyd, Generalized Low Rank Models, Foundations and Trends in Machine Learning, 2016.

[3] E. L. Mackevicius et al, Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience, BioRxiv 273128; doi: https://doi.org/10.1101/273128

Latent Variable Models and Method of Moments

With unlabeled data, how do you discover topics in documents, clusters of points, hidden communities in social networks, or dynamics of a system? Learning is easy with cluster labels but what about learning without cluster labels? There is a growing body of work that shows this is possible, both statistically and computationally. The aim is to study new algorithms for learning latent variable models, techniques for developing new learning algorithms based on spectral decompositions, and analytical techniques for understanding the aforementioned models and algorithms. Projects include: community detection through tensor methods, topic models (say, co-occurrence of words in a document), latent trees etc.

Bibliography:

[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, M. Telgarsky, Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773-2832, 2014.

[2] F. Huang, U. N. Niranjan, M. Umar Hakeem, A. Anandkumar, Online tensor methods for learning latent variable models. J. Mach. Learn. Res. 16 (2015), 2797–2835

[3] V. Kuleshov, A. Chaganty, and P. Liang. Tensor factorization via matrix factorization. In Artificial Intelligence and Statistics, pages 507–516, 2015.

[4] A. Lewbel. A local generalized method of moments estimator. Economics Letters, 94(1):124–128, 2007.

[5] A.-H. Phan, A. Cichocki, A. Uschmajew, P. Tichavský, G. Luta, D. P. Mandic, Tensor networks for latent variable analysis: novel algorithms for tensor train approximation. IEEE Trans. Neural Netw. Learn. Syst. 31 (2020), no. 11, 4622–4636.

Deep Learning applied in Finance

Networks with a large number of layers are called deep neural networks, and they have a wide variety of applications such as speech recognition, image classification and game intelligence. In recent years, applications in the financial sector have emerged. Banks typically use risk sensitivities known as Greeks, derived from classical models, to hedge their options books, but these methods are limited in their ability to factor in transaction costs and additional market information. The purpose is to study how machines can learn from large amounts of historical data to make more precise decisions. Projects include: deep hedging, calibration, option pricing, etc.

Bibliography:

[1] J. Bergstra, Y. Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (2012), 281-305.

[2] J. Berner, P. Grohs, G. Kutyniok, P. Petersen, The Modern Mathematics of Deep Learning, in "Theory of Deep Learning", Cambridge University Press, 2021, to appear.

[3] J.M. Hutchinson, A. Lo, T. Poggio. A nonparametric approach to pricing and hedging derivative securities via learning networks. Journal of Finance 49 (1994), no. 3, 851-889.

[4] H. Bolcskei, P. Grohs, G. Kutyniok, P. Petersen, Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1 (2019), no. 1, 8–45.

[5] A. Hernandez. Model calibration: Global optimizer vs. neural network. SSRN, July 2017.

[6] D. T. Tran, M. Magris, J. Kanniainen, M. Gabbouj, A. Iosidis, Tensor Representation in High-Frequency Financial Data for Price Change Prediction, IEEE Symposium Series on Computational Intelligence (SSCI), 2017.

Deep Learning applied to Quantum Chemistry/Physics

The goal of quantum chemistry is to predict chemical and physical properties of molecules based solely on the arrangement of their atoms in space, avoiding the need for resource-intensive and time-consuming laboratory experiments. In principle, this can be achieved by solving the Schrödinger equation, but in practice this is extremely difficult. During the past 10 years, real progress has been made by the low-rank tensor-train approach known as the Density Matrix Renormalization Group (DMRG), which is based on an iterative optimization algorithm for wave functions (another possible project idea [2]). In addition, very recently, the deep learning approach to the problem has resulted in exciting new progress on this challenging many-body problem, with immense potential impact in, e.g., material science, drug design etc. Possible projects: deep variational Monte Carlo simulations, solving the (generalized) eigenvalue problem for the Schrödinger equation for few-body problems using deep learning, grid-based electronic structure calculations using the tensor decomposition approach etc.

Bibliography:

[1] J. Hermann, Z. Schätzle, F. Noé, Deep-neural-network solution of the electronic Schrödinger equation. Nat. Chem. 12, 891–897 (2020). https://doi.org/10.1038/s41557-020-0544-y

[2] Rakhuba, M. V.; Oseledets, I. V. Grid-based electronic structure calculations: the tensor decomposition approach. J. Comput. Phys. 312 (2016), 19–30.

[3] E. Stoudenmire and D. J. Schwab. Supervised learning with tensor networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4799–4807. Curran Associates, Inc., 2016.

[4] E. M. Stoudenmire and D. J. Schwab. Supervised Learning with Quantum-Inspired Tensor Networks. arXiv e-prints, art. arXiv:1605.05775, May 2016.

Dr Peter Overbury

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Automatic identification of bats from video data.

The problem of automating observations of animals is of great importance to almost all fields of ecology (see https://www.microsoft.com/en-us/ai/ai-lab-snow-leopard). Here, there is a need to process 700 hours of footage of bats crossing country roads in the UK in order to determine their flight height. This will allow us to better understand the collision risk posed to bats by vehicles, particularly for the rare British woodland bat, the barbastelle. Further, the automation of this task could allow for wider citizen science projects into this bat's life cycle and behaviors, which are still not fully understood. As such, this project has good scope for publication and could even be expanded to the identification of other flying objects, such as drones, and/or publication of the footage as a data set for other ML/computer vision problems.
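
One plausible first step, sketched below with OpenCV, is background subtraction to pull candidate moving objects out of each clip; the file name and thresholds are assumptions, and real footage would need further filtering to separate bats from insects, birds and noise.

```python
# Minimal sketch (illustrative thresholds): detect moving objects in one clip.
import cv2

cap = cv2.VideoCapture("bat_survey_clip.mp4")          # assumed video file
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    mask = cv2.medianBlur(mask, 5)                     # suppress pixel noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    detections = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 20]
    # each (x, y, w, h) box is a candidate bat; y gives image-plane height
cap.release()
```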

Biometric application of variational diffusion models

Problem:

• Investigate the impact of diffusion-based models on biometrics - Dr Peter Overbury and Dr Julie Weeds

Objectives:

• Use variational diffusion models to aid biometric attack detection

• Explore the implementation of novel diffusion model techniques to generate synthetic imagery

• Improve model inference times by optimising ML operations

Reading

• https://arxiv.org/pdf/2107.00630.pdf

• https://arxiv.org/pdf/2105.05233.pdf

Supervisors: Dr Peter Overbury & Dr Julie Weeds

About Iproov

Iproov is a world leader in face authentication technology for online identity verification. We are based in London and are happy to support students in person or remotely. We are looking for a few students to take up projects along these lines (the project above is one example). If you are interested in doing a project with us, please send a copy of your CV to edward.crookenden@iproov.com or peter.overbury@iproov.com, along with the projects that you are most interested in. If there’s something else you’re interested in that you think would fit with the work we do at Iproov, please contact us and let us know.

Unsupervised video feature disentanglement - Dr Peter Overbury and Dr Julie Weeds

Problem:

• How to efficiently reconstruct a 3D scene from a series of sparse-view images

Objectives:

• Explore a number of new deep learning methods including Neural reflectance surfaces, neural radiance fields and video autoencoders to learn 3D structure and camera trajectory

• Current 3D reconstruction methods can be extremely computationally expensive.

• Improve model training and inference times without loss of precision

Reading

• https://zlai0.github.io/VideoAutoencoder/resources/video_autoencoder.pdf

• https://arxiv.org/pdf/2110.07604.pdf

• https://arxiv.org/pdf/2003.08934.pdf

Supervisors: Dr Peter Overbury & Dr Julie Weeds

Unsupervised representation learning of structured data - Dr Peter Overbury and Dr Julie Weeds

Problem:

• Detecting anomalous samples of structured data

Objectives:

• Research effectiveness of unsupervised representation learning with our large structured dataset

• Explore learnt embedding space for useful features

• Anomaly detection strategies using the generated feature embeddings

Reading

● https://assets.amazon.science/60/53/7b0e54fb4ee0bbcba20dc0c5348a/record2vec-unsupervised-representation-learning-for-structured-records.pdf

Supervisors: Dr Peter Overbury & Dr Julie Weeds

Generative facial modelling - Dr Peter Overbury and Dr Julie Weeds

Problem:

• Investigating how machine learning could be used to commit identity fraud

Objectives:

• Explore state-of-the-art methods in deep learning for reconstructing high-fidelity facial imagery

• Investigate computer vision techniques and develop statistical algorithms for creating and mimicking short video captures of real humans

Reading

● https://arxiv.org/abs/2106.12423

● https://arxiv.org/pdf/2112.00532.pdf

● https://arxiv.org/pdf/2007.03898.pdf

Supervisors: Dr Peter Overbury & Dr Julie Weeds

Time series forecasting of video data - Dr Peter Overbury and Dr Julie Weeds

Problem:

• Explore using spatio-temporal machine learning methods to generate unseen video frame(s) based on seen video frames. This could be:

o given a sequence of previous frames predict the next frame

o given a single frame generate a realistic video of many frames

o given a small subset of frames from an original video, predict the missing frames

Objectives:

• Explore the field and get a good understanding of the current solutions to the above problem. Understand the challenges of video prediction with our dataset vs those used in papers

• Implement one of these solutions yourself and improve upon it or come up with your own solution altogether

• Train on our dataset with the aim of achieving accurate frame prediction / convincing video generation

Reading

• https://arxiv.org/pdf/2004.05214.pdf

Supervisors: Dr Peter Overbury & Dr Julie Weeds

Gabor filters for improving DNN robustness to adversarial attacks - Dr Peter Overbury and Dr Ben Evans

Deep Convolutional Neural Networks (DNNs) are often vulnerable to adversarial attacks, and this is a growing threat in the real world of face authentication. Research has shown that bio-inspired Gabor filters can be used to improve DNNs' robustness to these kinds of attacks by making them less reliant on overly specific details in the data.
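
One simple way to add such a front end, sketched below under illustrative parameter choices, is to build a bank of Gabor kernels with OpenCV and freeze them as the first convolutional layer of a PyTorch model; the face-matching backbone it would feed is left as a placeholder and is not part of the sketch.

```python
# Minimal sketch: fixed Gabor filter bank as a frozen first conv layer.
import cv2
import numpy as np
import torch

def gabor_bank(n_orientations=8, ksize=11, sigma=2.0, lambd=5.0, gamma=0.5):
    kernels = [cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, 0,
                                  ktype=cv2.CV_32F)
               for theta in np.linspace(0, np.pi, n_orientations, endpoint=False)]
    return torch.tensor(np.stack(kernels)).unsqueeze(1)     # (out, in=1, k, k)

weights = gabor_bank()
gabor_layer = torch.nn.Conv2d(1, weights.shape[0], kernel_size=weights.shape[-1],
                              padding=weights.shape[-1] // 2, bias=False)
with torch.no_grad():
    gabor_layer.weight.copy_(weights)
gabor_layer.weight.requires_grad_(False)                     # fixed, bio-inspired front end

# gabor_layer would then feed a trainable face-matching backbone, e.g.:
# features = backbone(gabor_layer(grayscale_batch))
```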

Problem:

● Investigate the impact of adding Gabor filters to the training of facial matching DNN both on the overall performance of the DNN and also the effect on their robustness to adversarial attacks

Reading

https://www.biorxiv.org/content/10.1101/2021.02.18.431827v2

Supervisors: Dr Peter Overbury & Dr Ben Evans

Dr Andrew Penn

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Open science data standards and electrophysiology analysis

We recently developed and released open source software, Eventer [1,2], as a machine learning solution to the challenge of detecting spontaneous synaptic events measured by electrophysiology or imaging. Eventer supports a wide variety of traditional proprietary data formats from commercial data acquisition packages. However, recent open science initiatives have led to much investment in the development of open source data acquisition software and standardized electrophysiology data formats. The purpose of this dissertation is to incorporate wider support for open science data standards in Eventer, in particular Neurodata Without Borders [3].

Bibliography:

[1] Winchester, G., Liu, S., Steele, O.G., Aziz, W. and Penn, A.C. (2020) Eventer. Software for the detection of spontaneous synaptic events measured by electrophysiology or imaging. http://doi.org/10.5281/zenodo.3991677

[2] Steele, O.G., Liu, S., Winchester, G., Aziz, W., Chagas, A. and Penn, A. Eventer: Software you can train to detect spontaneous synaptic responses for you (TP001324) in BNA 2021 Festival of Neuroscience Poster abstracts. (2021). Brain and Neuroscience Advances. https://doi.org/10.1177/23982128211035062

[3] Rübel, O., Tritt, A., Dichter, B., Braun, T., Cain, N., Clack, N., Davidson, T. J., Dougherty, M., Fillion-Robin, J.-C., Graddis, N., Grauer, M., Kiggins, J. T., Niu, L., Ozturk, D., Schroeder, W., Soltesz, I., Sommer, F. T., Svoboda, K., Ng, L., Frank, L. M., Bouchard, K. (2019) NWB:N 2.0: An Accessible Data Standard for Neurophysiology. bioRxiv. https://doi.org/10.1101/523035 Website: https://www.nwb.org/

Learning descriptive feature sets for accurate classification of spontaneous synaptic events

Neurons receive inputs from thousands of other neurons at connections called chemical synapses. Measuring synaptic activity is an instrumental step towards understanding how neurons encode and process information. Dr Penn’s lab recently developed and released the first version of some open source software, Eventer [1,2], as a machine learning solution to the challenge of classifying candidate spontaneous synaptic events measured by electrophysiology or imaging and detected by deconvolution [3]. The accurate detection and classification of spontaneous synaptic activity is particularly difficult when events occur frequently and their waveforms overlap; a problem often associated with in vivo recordings. Currently, training the software requires manual classification and uses a set of hand-crafted features extracted from a training data set as input to a Random Forests algorithm. The software serves as a framework for using machine learning to improve consistency and automate the correct identification of synaptic events.

The purpose of this MSc project is to develop a data-driven framework for creating sets of general-purpose descriptive features that enable accurate classification of candidate synaptic events. These features will be used in downstream machine learning models and could be used to make an improved version of the Eventer tool. The types of features to consider could include convolutional filters of the raw signal or spectrogram. These features will be learned from example datasets using either a fully-supervised or semi-supervised learning paradigm. The resulting feature sets will be used to train and evaluate classifiers on a range of challenging data sets from our lab and from online data archives (e.g. DANDI). The existing toolkit is in MATLAB, and developments would need to be compatible with this. However, Python frameworks (e.g. PyTorch and TensorFlow) could still be used for learning the features, which could then be translated into MATLAB.
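
For orientation, the sketch below caricatures the current hand-crafted-feature approach: a few waveform summaries per candidate event feed a Random Forest. The feature choices and data are placeholders, not Eventer's actual feature set; the project would replace such features with learned, general-purpose ones.

```python
# Minimal sketch (illustrative features, placeholder data): hand-crafted
# features per candidate event window plus a Random Forest classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def handcrafted_features(window, dt=1e-4):
    """window: 1D array of current samples around one candidate event."""
    peak = window.min()                                   # inward currents are negative
    peak_idx = window.argmin()
    baseline = window[:10].mean()
    amplitude = baseline - peak
    rise_time = peak_idx * dt
    decay = window[peak_idx:].mean() - peak
    return [amplitude, rise_time, decay, window.std()]

# X_windows: (n_events, n_samples) candidate traces; y: manual labels (1 = real event)
rng = np.random.default_rng(0)
X_windows = rng.normal(size=(500, 200))                   # placeholder data
y = rng.integers(0, 2, size=500)                          # placeholder labels

X = np.array([handcrafted_features(w) for w in X_windows])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
```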

Bibliography:

[1] Winchester, G., Liu, S., Steele, O.G., Aziz, W. and Penn, A.C. (2020) Eventer. Software for the detection of spontaneous synaptic events measured by electrophysiology or imaging. http://doi.org/10.5281/zenodo.3991677

[2] Steele, O.G., Liu, S., Winchester, G., Aziz, W., Chagas, A. and Penn, A. Eventer: Software you can train to detect spontaneous synaptic responses for you (TP001324) in BNA 2021 Festival of Neuroscience Poster abstracts. (2021). Brain and Neuroscience Advances. https://doi.org/10.1177/23982128211035062

[3] Pernía-Andrade, A.J., Goswami, S.P., Stickler Y., Fröbe, U., Schlögl, Alois, and Jonas, Peter (2012) A Deconvolution-Based Method with High Sensitivity and Temporal Resolution for Detection of Spontaneous Synaptic Currents In Vitro and In Vivo. Biophys J 103, 1429–1439.

Prof Simon Peeters and Prof Fiona Mathews 

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Why are bats being battered by wind farms?

Wind farms negatively affect over 30 bat species [1] and have potential consequences for the population viability of at least one species [2]. This problem will only get worse with the rapid growth in the number of wind farms worldwide. Somehow, bats seem to be attracted to wind farms [3]. If we can better understand why bats spend time around wind turbines, we will be able to develop mitigation strategies.

One hypothesis is that the sound generated by the wind turbines could be attracting them. You will use the available data sets, including a large set of sound recordings from wind farms in the UK, to try to understand why bats seem drawn to explore the areas around wind turbines.
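
For illustration only (assuming the librosa package and a hypothetical recording file name), one simple starting point would be to summarise each recording as a mel-spectrogram-based feature vector before any statistical modelling or machine learning:

```python
# Illustrative sketch: turning a sound recording into a simple feature vector.
# "turbine_site_01.wav" is a hypothetical file name.
import numpy as np
import librosa

y, sr = librosa.load("turbine_site_01.wav", sr=None)          # keep native sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)    # (64, n_frames)
log_mel = librosa.power_to_db(mel)

# Summarise each mel band by its mean and standard deviation over time,
# giving one fixed-length vector per recording for downstream models.
features = np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])
print(features.shape)                                          # (128,)
```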

This project is suitable for students with an interest in statistical inference, data visualisation, machine learning, and big data. Some knowledge of machine learning and Python will be needed. Knowledge of C++ would be an advantage.

[1] Thaxter, C. B. et al. Proc. R. Soc. B 284, 20170829 (2017).

[2] Frick, W. F. et al. Biol. Conserv. 209, 172–177 (2017).

[3] Mathews, F. et al. Scientific Reports 11:3636 (2021). https://www.nature.com/articles/s41598-021-82014-9.pdf

Dr Warrick Roseboom

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Project 1

In recent decades, academic positions, promotions and research funding have come to depend largely on researchers producing large numbers of “papers” as an indicator of work produced. The quality of such papers is often irrelevant to the outcome. Because of this incentive structure, as text generation methods improve there is a substantial risk that the scientific literature will be overrun with entirely synthetic papers produced by paper mills and published in for-profit predatory journals. How can we detect such generated, “fake” content, especially in the “softer” sciences? What should happen to papers classified as “fake”? This project would involve building or augmenting scientific paper generators and training classifiers to distinguish between selected “real” and “fake” papers, then ideally applying the tool in the “wild” to some degree.
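
A minimal baseline sketch, assuming you have already assembled small collections of genuine and generated abstracts (the lists below are placeholders): a bag-of-words classifier of the kind that could serve as a first benchmark before moving to stronger models:

```python
# Minimal baseline sketch: TF-IDF features + logistic regression to separate
# "real" from "fake" papers. real_texts and fake_texts are placeholder lists
# of document strings that a student would assemble themselves.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

real_texts = ["...genuine abstract 1...", "...genuine abstract 2..."]
fake_texts = ["...generated abstract 1...", "...generated abstract 2..."]

texts = real_texts + fake_texts
labels = [0] * len(real_texts) + [1] * len(fake_texts)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, texts, labels, cv=2)   # use more folds with real data
print(scores.mean())
```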

See: https://www.science.org/content/article/hoax-detecting-software-spots-fake-papers, https://en.wikipedia.org/wiki/SCIgen, doi:10.1007/978-3-319-24403-7_8

Project 2

Sports analytics has become big business. Professional teams spend heavily on analytics staff to maximise team performance. Simultaneously, “fantasy leagues” have also become very common. At their core, these reflect two sides of the same underlying problem: how to predict the performance of a team from simple features, such as easily tracked performance measures (metres run, passes made, points scored, depending on the sport). There are two aspects I’m interested in. One is predicting team success based on these simple measures of individual players. In different sports (particularly US sports) there have been many models produced and discussed in different spheres (e.g. https://projects.fivethirtyeight.com/carmelo/).

One idea would be to work on understanding which of these approaches is generally better, and to augment them as necessary with what is learnt in order to obtain a better performing model. This could be done from an informed analytics view (as is often the case in existing models), wherein different parts of the model are developed and informed by domain knowledge. Alternatively, if sufficient data can be obtained for the relevant case (basketball, football, whatever you are interested in), it could be approached with a largely pure machine learning approach, then trying to break the model down to figure out which aspects make it work. A similar, though distinct, idea would be to build a model that would win a fantasy league. The complexity of the application and outcome would depend on your background skills and interests.

Project 3

The neural foundations of conscious experience remain unknown. Recent efforts to characterise some aspects have received a great deal of attention (for example: DOI: 10.1126/sciadv.aat7603). We have access to a large data set of intracranial neural recordings taken from patients under controlled visual stimulation or during free-viewing conditions, as well as during sleep (see DOI: 10.1016/j.cub.2016.11.024). In this project we will look for patterns of neural responses that can be associated with specific stimulus content or conscious level. This might be approached in a manner biased more towards mining or in a more structured way (e.g. DOI: 10.1523/JNEUROSCI.4399-14.2015) depending on skills, background, and interests.

Prof Enrico Scalas

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Data Science and Quantum Programming

The relatively new field of quantum programming [1] builds on recent developments in quantum computing and quantum technology: quantum computers, which use qubits instead of ordinary bits, require new programming skills. The purpose of this dissertation is to analyse the interface between quantum computing and data science [2].

Bibliography:

[1] W. Zeng et al., First quantum computers need smart software, Nature 549, 149–151 (14 September 2017) doi:10.1038/549149a https://www.nature.com/news/first-quantum-computers-need-smart-software-1.22590

[2] https://math.nist.gov/quantum/zoo/ (see in particular: Machine Learning/SimulatedAnnealing)

Data-driven models for wealth inequality

The recent book by T. Piketty (Capital in the Twenty-First Century) [1] brought renewed attention to the important issue of wealth inequality. In the last twenty years, physicists and mathematicians have developed models to derive the wealth distribution using discrete and continuous stochastic processes (random exchange models) as well as related Boltzmann-type kinetic equations [2]. In this literature, the usual concept of equilibrium in economics is either replaced or complemented by statistical equilibrium. The purpose of this dissertation is to collect data on the distribution of wealth and to corroborate/falsify such models using data science tools.
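
As an illustration of the kind of random exchange model discussed in this literature (a minimal sketch with made-up parameters, not the specific model of reference [2]), the following simulates pairwise wealth exchanges and summarises the resulting distribution, which could then be compared against empirical wealth data:

```python
# Minimal sketch of a random exchange (agent-based) wealth model:
# at each step two randomly chosen agents pool their wealth and split it
# uniformly at random. Parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_steps = 1000, 200_000
wealth = np.ones(n_agents)                    # everyone starts with one unit

for _ in range(n_steps):
    i, j = rng.choice(n_agents, size=2, replace=False)
    total = wealth[i] + wealth[j]
    share = rng.uniform()
    wealth[i], wealth[j] = share * total, (1 - share) * total

# Summary statistics of the resulting distribution (close to exponential for
# this simple rule), which can be compared with empirical data.
print(wealth.mean(), np.median(wealth), wealth.max())
```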

Bibliography:

[1] T. Piketty, Capital in the 21st Century, Harvard University Press, 2014

[2] B. Duering, N. Georgiou, E. Scalas, A stylised model for wealth distribution. In: Aruka, Yuji and Kirman, Alan (eds.) Economic Foundations of Social Complexity Science. Springer Singapore, 2017 pp. 95-117. ISBN 9789811057045

Probabilistic causation

According to the Stanford Encyclopedia of Philosophy [1]: “Probabilistic Causation” designates a group of theories that aim to characterize the relationship between cause and effect using the tools of probability theory. The central idea behind these theories is that causes change the probabilities of their effects. In recent years, computer scientists have developed new theories and tools [2, 3] to deal with probabilistic causation. The aim of this project is to explore these ideas and to apply specific tools [3] to one or more problems of cause identification.

References

[1] https://plato.stanford.edu/entries/causation-probabilistic/

[2] J. Pearl, M. Glymour, N.P. Jewell, Causal Inference in Statistics, Wiley, New York, 2016.

[3] https://cran.r-project.org/web/packages/causaleffect/vignettes/causaleffect.pdf

Data Science and Glottochronology: The Classification of Germanic Languages

Glottochronology is the study of chronological relationships between languages. Sometimes, these are known in some detail. For example, French, Italian, Spanish, Catalan, Portuguese and Romanian are related languages and they derive from a known language: Latin. They are called Romance languages, and we know that they started evolving from Latin approximately 2000 years ago, at the time of the early Roman empire in the case of Italian, French, Spanish, Portuguese and Catalan, and a bit later in the case of Romanian. In other cases, the proto-language is not known. This is the case for the Germanic languages, the group of Indo-European languages to which English belongs. Even though Proto-Germanic is not attested and is a reconstructed language, the timeline of the differentiation of the Germanic languages overlaps with that of the Romance languages. Early information on the Germanic tribes is given by Tacitus in his book Germania [1].

Typically, glottochronological studies use phonetics as a tool for studying the family tree of languages, together with a list of words known as the Swadesh list. In this project we shall study the family tree of the Germanic languages using data science classification methods on subsets of the Swadesh list. A starting point is the set of results obtained in this online project: http://www.elinguistics.net/
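
A minimal sketch of one possible pipeline (with a tiny made-up word list, purely for illustration): compute normalised edit distances between languages over a subset of Swadesh items and build a hierarchical clustering from them:

```python
# Illustrative sketch: hierarchical clustering of languages from average
# normalised edit distance over a (tiny, made-up) subset of Swadesh items.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

swadesh = {                       # toy data only; a real study would use full lists
    "English": ["water", "hand", "night"],
    "German":  ["wasser", "hand", "nacht"],
    "Dutch":   ["water", "hand", "nacht"],
    "Swedish": ["vatten", "hand", "natt"],
}

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(a), len(b)]

langs = list(swadesh)
dist = np.zeros((len(langs), len(langs)))
for i, li in enumerate(langs):
    for j, lj in enumerate(langs):
        if i < j:
            pairs = zip(swadesh[li], swadesh[lj])
            dist[i, j] = dist[j, i] = np.mean(
                [edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs])

tree = linkage(squareform(dist), method="average")
print(dendrogram(tree, labels=langs, no_plot=True)["ivl"])   # leaf order of the tree
```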

[1] Publius Cornelius Tacitus, Germania. An English translation is available (more than one, indeed): https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.02.0083

Dr Sylvia Schroeder

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

The neural encoding of visual stimuli during different behavioural states

We have recorded the activity of hundreds of neurons in the early visual system of the mouse (retina and superior colliculus) [1]. The data are continuous traces (collected using two-photon imaging), which reflect the activity of each neuron at a temporal resolution of tens to hundreds of milliseconds. During the experiments, the mouse saw gratings moving in different directions, while we recorded the mouse’s behaviour (running speed, eye movements, and pupil size). We want to develop a machine learning model that predicts the visual stimulus from the activity of the neural population and then ask the following questions:

(a) Is the prediction better when the mouse was running?

(b) If we train the model only with data from when the mouse was not running, how accurate is the prediction during periods when the mouse was running?

(c) How do models trained with data during running versus not running differ?
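
A schematic sketch of question (b), under strong simplifying assumptions (random placeholder arrays, one population-activity vector per stimulus presentation): train a decoder on stationary trials only and test it on running trials:

```python
# Schematic sketch for question (b): X holds one population-activity vector per
# stimulus presentation, y the grating direction shown, and "running" flags
# whether the mouse was running. All arrays here are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_neurons = 400, 120
X = rng.normal(size=(n_trials, n_neurons))            # neural activity per trial
y = rng.integers(0, 8, size=n_trials)                 # 8 grating directions
running = rng.random(n_trials) > 0.5                  # behavioural state per trial

decoder = LogisticRegression(max_iter=1000)
decoder.fit(X[~running], y[~running])                 # train on stationary trials only

# A proper analysis would also hold out stationary test trials for comparison.
print("accuracy on running trials:   ", decoder.score(X[running], y[running]))
print("accuracy on stationary trials:", decoder.score(X[~running], y[~running]))
```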

[1] Schröder S, Steinmetz NA, Krumin M, et al. Arousal Modulates Retinal Output. Neuron. 2020;107(3):487-495.e9. doi:10.1016/j.neuron.2020.04.026

Dr Kate Shaw

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Machine Learning Visualisation with ATLAS data from the Large Hadron Collider

The Large Hadron Collider (LHC) at CERN [1] accelerates particles to near the speed of light before colliding them millions of times a second inside massive particle detectors, such as the ATLAS detector, to study the fundamental particles of the Universe. Cutting-edge machine learning techniques are used to analyse the huge amount of collision data collected by ATLAS to study rare fundamental particles such as the Higgs boson and to search for new physics such as dark matter [2]. This project, working with ATLAS physicists and data science experts, will develop interactive data visualisation tools for non-specialist users to understand and analyse data using machine learning techniques to search for and discover new rare fundamental particles. This is part of the ATLAS Open Data project [3] led by Sussex researchers [4], which provides ATLAS data, tools and software frameworks to the public [5]. This project will produce the first ever tool allowing non-specialists to visualise machine learning on LHC data. This project is suitable for students with an interest in data visualisation, machine learning, and big data. Some knowledge of Java, machine learning, and Python or C++ will be needed.

[1] https://home.cern/

[2] https://iml.web.cern.ch/

[3] https://atlas.cern/resources/opendata

[4] https://home.cern/news/news/knowledge-sharing/atlas-releases-13-tev-open-data-science-education

[5] http://opendata.atlas.cern/release/2020/documentation/visualization/histogram-analyser-2_13TeV.html

Dr Ivor Simpson

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Analysing cell microscopy images with machine learning - Dr Helfrid Hochegger

Quantifying cell properties from microscopy images is an integral part of experimental cell biology. Recent work has demonstrated the potential of multi-channel imaging for measuring progression through the cell-cycle [1]. Such data can be used to quantify differences between different cell-lines, which could be used to measure the effects of genetic changes.

This project has two directions that could be investigated. (1) Improvements to cell segmentation models using semi-supervised learning: recent works have created general-purpose machine learning approaches for labelling the pixels of cells [2,3]. However, due to the diversity of cell appearance and to overlapping cells, these may fail in some situations. One route this project could take is leveraging a small quantity of labelled images of the target cells and many unlabelled images (or videos) to improve segmentation performance [4, 5]. (2) Learning the statistical relationships between the different image channels through a generative model, such as a variational autoencoder [6]. This may lead to a representation that enables more reliable classification of cell-cycle stages. It also enables analysis of how observing the cell in one image channel can predict the other channels [7], which may enable subsequent optimisation of the acquisition process.

Skills: Python, Machine Learning, Interests in computer vision and cell biology helpful

References:

[1] Zerjatke, Thomas, et al. "Quantitative cell cycle analysis based on an endogenous all-in-one reporter for cell tracking and classification." Cell reports 19.9 (2017): 1953-1966.

[2] Stringer, Carsen, et al. "Cellpose: a generalist algorithm for cellular segmentation." Nature Methods 18.1 (2021): 100-106.

[3] Schmidt, Uwe, et al. "Cell detection with star-convex polygons." MICCAI 2018.

[4] Bortsova, Gerda, et al. "Semi-supervised medical image segmentation via learning consistency under transformations." International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2019.

[5] https://github.com/xiaomengyc/Few-Shot-Semantic-Segmentation-Papers

[6] Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2013

[7] Ivanov, Oleg, Michael Figurnov, and Dmitry Vetrov. "Variational autoencoder with arbitrary conditioning." ICLR 2019

Detecting biomarkers from Neuroimaging data - Prof Jamie Ward

Magnetic resonance imaging (MRI) has been demonstrated as a powerful and flexible technique for understanding the human brain. In recent years large scale studies, such as the Human Connectome Project (HCP)[1], have identified acquisition protocols to allow consistent high-quality data collection and preprocessing [2][3]. This preprocessing creates a set of measurements, which describe individual anatomical and functional properties.

This project seeks to develop a framework for identifying biomarkers [4], patterns of measurements that are indicative of a particular condition, from pre-processed HCP data. Specifically, this project will explore how different machine learning analysis strategies can be used to extract biomarkers, with the added goal of identifying approaches that robustly detect interpretable differences between populations. A student on this project will have access to HCP data from people with synaesthesia (unusual experiences such as music triggering vision) which could be compared to normative samples or clinical data from other sources. The student could either run a pre-planned analysis or search the literature to develop an alternative biomarker.

References:

[1] https://www.humanconnectome.org/

[2] Glasser, Matthew F., et al. "The minimal preprocessing pipelines for the Human Connectome Project." Neuroimage 80 (2013): 105-124.

[3] Glasser, Matthew F., et al. "A multi-modal parcellation of human cerebral cortex." Nature 536.7615 (2016): 171-178

[4] Woo, Choong-Wan, et al. "Building better biomarkers: brain models in translational neuroimaging." Nature neuroscience 20.3 (2017): 365.

Machine Learning For Controlling Ultrafast Lasers - Dr Juan Totero Gongora

Optical frequency combs are an advanced class of lasers emitting ultra-precise pulses of light [1]. They are ideal candidates to provide the fast-beating “optical heart” required by transformative technologies such as portable atomic clocks, highly sensitive hazardous-chemical detectors, wearable devices for high-precision medical diagnostics, and computer chips operating at photonic speeds.

Despite recent technological breakthroughs, frequency combs remain surprisingly hard to control and stabilise at the high emission powers required by real-life applications. This is because existing linear analysis techniques are poorly suited to modelling highly nonlinear states, which limits our access to the extensive range of potential emission regimes. Recent results have demonstrated the vast potential of Machine Learning (ML) in stabilising lasers, delivering improved emission performance (e.g. higher pulse intensity) in a fraction of the time required by standard techniques [2]. In the case of a frequency comb laser, for example, a deep-learning model could iteratively learn to “predict” which combination of input parameters can improve the laser emission. Driving a micro-comb laser into, and maintaining it in, an arbitrary high-energy state, however, remains elusive and requires a more advanced approach merging AI predictions with a precise understanding of the system’s internal nonlinear dynamics, which are not necessarily known a priori. This task eludes standard techniques and requires developing an entirely new conceptual approach to tackle real-life laser dynamics.

This interdisciplinary project aims to overcome this conceptual gap. You will investigate extending an existing approach [3] for characterising and controlling a real-life ultrafast laser using variational autoencoders and LSTMs trained on simulated laser outputs. This project follows on from a successful MSc dissertation in this area, with several areas of interest to explore depending on the student's interests.

References:

[1] H. Bao et al., ‘Laser cavity-soliton microcombs’, Nature Photonics, p. 1, Mar. 2019, doi: 10.1038/s41566-019-0379-5.

[2] G. Genty et al., ‘Machine learning and applications in ultrafast photonics’, Nature Photonics, pp. 1–11, Nov. 2020, doi: 10.1038/s41566-020-00716-4.

[3] Baumeister, Thomas, Steven L. Brunton, and J. Nathan Kutz. "Deep learning and model predictive control for self-tuning mode-locked lasers." JOSA B 35.3 (2018): 617-626. Video related to this paper: https://youtu.be/b4wZyAh99wM

Machine Learning acceleration of hyperspectral imaging at terahertz frequencies - Dr Juan Totero Gongora

Terahertz radiation is a form of electromagnetic radiation with a frequency range lying between microwaves and infrared light. In recent years, terahertz science has attracted sizeable research efforts, due to a large number of potential applications across biology, material characterisation, security, and industrial diagnostics. On one side, many common materials (plastic, paper, fabrics) are transparent to terahertz waves. At the same time, terahertz waves carry only an infinitesimal amount of energy and are therefore much safer than highly ionizing X-rays when inspecting the interior of a sample. Besides, a wide range of complex materials and compounds (for example, chemical and pharmaceutical substances or explosives) exhibit a unique and very distinguishable response when illuminated with short pulses of terahertz light. The ability to see inside objects while discriminating their material composition is at the heart of current research on the development of advanced imaging devices based on ultrashort terahertz pulses.

Despite recent technological breakthroughs, further developments in this research area are limited by the availability of terahertz cameras and imaging sensors. To tackle this limitation, the Emergent Photonics (EPic) Lab at Sussex is focusing on developing single-pixel imaging approaches, also known as computational imaging or ghost-imaging techniques [1]–[3]. In these approaches, rather than employing a sensor composed of a large number of pixels, the sample is illuminated with a series of known spatial patterns, and the transmitted terahertz waves corresponding to each pattern are sequentially acquired by a single-pixel detector. By combining the spatial information of the incident patterns and their corresponding time-dependent outputs, one can reconstruct the properties of the sample through a numerical inversion process. For more information on our research, see [4].
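
To make the inversion step concrete, here is a minimal sketch (with a synthetic object and random illumination patterns, purely for illustration) of reconstructing an image from pattern/measurement pairs by linear least squares; real terahertz data would of course require the time-resolved, and typically regularised, formulations discussed in the references:

```python
# Minimal sketch of single-pixel image reconstruction by least squares.
# A synthetic 16x16 object is measured through random binary patterns; each
# measurement is the total signal transmitted for one pattern.
import numpy as np

rng = np.random.default_rng(0)
size = 16
obj = np.zeros((size, size))
obj[4:12, 6:10] = 1.0                                   # a simple synthetic object

n_patterns = 400                                        # more patterns than pixels here
patterns = rng.integers(0, 2, size=(n_patterns, size * size)).astype(float)
measurements = patterns @ obj.ravel()                   # one scalar per pattern
measurements += rng.normal(scale=0.01, size=n_patterns) # small detector noise

# Recover the object by solving the linear system in the least-squares sense.
recovered, *_ = np.linalg.lstsq(patterns, measurements, rcond=None)
recovered = recovered.reshape(size, size)
print(np.abs(recovered - obj).mean())                   # small reconstruction error
```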

However, understanding how to extract the sample information from the measurements becomes increasingly challenging when dealing with extremely small objects (i.e., much smaller than the incident wavelength) or samples with a complex three-dimensional structure (e.g., a biological cell sample). On the one hand, the experimental measurements for each incident pattern can become extremely long, leading to impractical overall imaging times. On the other hand, when considering objects much smaller than the incident wavelength, the spatial and temporal properties of the sample become entangled.

This multidisciplinary MSc research project aims to bring the power of Machine Learning (ML) to bear on these conceptual and technological gaps. ML has emerged as an ideal tool to “disentangle” complex information in standard imaging, making it possible, for example, to significantly reduce the number of patterns required in computational imaging techniques, or to retrieve the image of samples concealed by scattering materials (e.g., fog) [5]–[7]. You will investigate and evaluate the most suitable strategies to characterise a real-life sample in the typical experimental conditions of terahertz single-pixel imaging. The initial part of the project will focus on analysing numerical simulation data, while in the final part of the project your methodology will be assessed and validated against real-life experimental datasets.

Further details and a detailed research plan will be discussed at the initial meeting of the project.

[1] L. Olivieri, J. S. Totero Gongora, A. Pasquazi, and M. Peccianti, ‘Time-Resolved Nonlinear Ghost Imaging’, ACS Photonics, vol. 5, no. 8, pp. 3379–3388, Aug. 2018, doi: 10.1021/acsphotonics.8b00653.

[2] L. Olivieri et al., ‘Hyperspectral terahertz microscopy via nonlinear ghost imaging’, Optica, OPTICA, vol. 7, no. 2, pp. 186–191, Feb. 2020, doi: 10.1364/OPTICA.381035.

[3] J. S. T. Gongora et al., ‘Route to Intelligent Imaging Reconstruction via Terahertz Nonlinear Ghost Imaging’, Micromachines, vol. 11, no. 5, Art. no. 5, May 2020, doi: 10.3390/mi11050521.

[4] https://www.optica-opn.org/home/newsroom/2020/february/toward_hyperspectral_terahertz_microscopy/

[5] S. Resisi, S. M. Popoff, and Y. Bromberg, ‘Image Transmission Through a Dynamically Perturbed Multimode Fiber by Deep Learning’, Laser & Photonics Reviews, vol. 15, no. 10, p. 2000553, 2021, doi: 10.1002/lpor.202000553.

[6] P. Caramazza, O. Moran, R. Murray-Smith, and D. Faccio, ‘Transmission of natural scene images through a multimode fibre’, Nat Commun, vol. 10, no. 1, p. 2029, May 2019, doi: 10.1038/s41467-019-10057-8.

[7] F. Tonolini, J. Radford, A. Turpin, D. Faccio, and R. Murray-Smith, ‘Variational Inference for Computational Imaging Inverse Problems’, Journal of Machine Learning Research 21, 46 (2020).

Analysing neuronal activity in two-photon calcium imaging - Prof Miguel Maravall

This project will examine imaging data acquired by two-photon calcium microscopy in the mouse brain; the experimental setup is described in [1]. The data reflect the activity of individual neurons recorded while mice perform a trained sensory-guided behaviour. The ultimate aim is to understand how neurons respond to sensory and behavioural variables as the mouse interacts with its environment. These data present two analysis challenges: firstly, the observed image data have a low temporal sampling frequency (~10 Hz) and their relationship with neuronal activations is not a straightforward linear function; secondly, neuronal activations are likely to correlate both with sensory stimuli under the control of the experimenter and, potentially, with actions of the mouse and other factors. This makes disentangling neuronal activations related to the stimulus challenging without incorporating observations of confounding factors. The project could lead to publishable insights into how groups of neurons collectively predict stimuli and other variables.

Depending on the student's interests, this project could investigate two directions of study. (1) Develop a multivariate model for predicting stimuli given the simultaneously imaged signal from sets of neurons. This approach would investigate a combination of temporal feature engineering and/or learning, to independently interpret the signal at each neuron, in combination with a regularised linear classifier to provide interpretability in terms of neuronal contribution. (2) Build an implementation of a variational autoencoder for probabilistic inference of spikes from calcium imaging data, following a similar approach to [2]. Subsequently, the dataset can be re-analysed using the predicted spikes.
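
A minimal sketch of the first direction (with random placeholder arrays standing in for temporally engineered features per neuron): a regularised linear classifier whose sparse weights indicate which neurons contribute to predicting the stimulus:

```python
# Minimal sketch of direction (1): an L1-regularised linear classifier
# predicting the stimulus from per-trial neuronal features. Arrays are random
# placeholders standing in for engineered features per neuron.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_neurons = 300, 80
X = rng.normal(size=(n_trials, n_neurons))       # one feature per neuron per trial
y = rng.integers(0, 2, size=n_trials)            # stimulus identity per trial

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

# Sparsity of the weights gives an interpretable measure of which neurons
# the decoder relies on.
contributing = np.flatnonzero(clf.coef_[0])
print(f"{len(contributing)} of {n_neurons} neurons have non-zero weight")
```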

Skills: Programming (preferably Python), knowledge of statistics and/or machine learning

1. Michael R. Bale, Malamati Bitzidou, Elena Giusto, Paul Kinghorn, Miguel Maravall, Sequence Learning Induces Selectivity to Multiple Task Parameters in Mouse Somatosensory Cortex, Current Biology, Volume 31, Issue 3, 2021.

2. Speiser, Artur, et al. "Fast amortized inference of neural activity from calcium imaging data with variational autoencoders." NeurIPS 2017.

General reading on analysing calcium imaging data can be found in: Giovannucci, Andrea, et al. "CaImAn an open source tool for scalable calcium imaging data analysis." Elife 8 (2019): e38173.

Learning descriptive feature sets for accurate classification of spontaneous synaptic events - Dr Andrew Penn

Aims:

1. Develop a learning framework to create descriptive features of spontaneous synaptic activity.

2. Investigate machine learning model choices for classifying candidate events.

3. Compare the performance (Receiver Operating Characteristic, ROC) with earlier software versions over a range of problem data sets.

Neurons receive inputs from thousands of other neurons at connections called chemical synapses. Measuring synaptic activity is an instrumental step to understanding how neurons encode and process information. Dr Penn’s lab recently developed and released the first version of an open source software package, Eventer [1,2], as a machine learning solution to the challenge of classifying candidate spontaneous synaptic events measured by electrophysiology or imaging and detected by deconvolution [3]. The accurate detection and classification of spontaneous synaptic activity is particularly difficult when events occur frequently and their waveforms overlap; a problem often associated with in vivo recordings. Currently, training the software requires manual classification and uses a set of hand-crafted features extracted from a training data set as input to a Random Forests algorithm. The software serves as a framework for using machine learning to improve consistency and automate the correct identification of synaptic events.

The purpose of this MSc project is to develop a data-driven framework for creating sets of general-purpose descriptive features that enable accurate classification of candidate synaptic events. These features will be used in downstream machine learning models and could be used to make an improved version of the Eventer tool. The types of features to consider could include convolutional filters of the raw signal or spectrogram. These features will be learned from example datasets using either a fully-supervised or semi-supervised learning paradigm. The resulting feature sets will be used to train and evaluate classifiers on a range of challenging data sets from our lab and from online data archives (e.g. DANDI). The existing toolkit is in MATLAB, and developments would need to be compatible with this. However, Python frameworks (e.g. PyTorch and TensorFlow) could still be used for learning the features, which could then be translated into MATLAB.

Bibliography:

[1] Winchester, G., Liu, S., Steele, O.G., Aziz, W. and Penn, A.C. (2020) Eventer. Software for the detection of spontaneous synaptic events measured by electrophysiology or imaging. http://doi.org/10.5281/zenodo.3991677

[2] Steele, O.G., Liu, S., Winchester, G., Aziz, W., Chagas, A. and Penn, A. Eventer: Software you can train to detect spontaneous synaptic responses for you (TP001324) in BNA 2021 Festival of Neuroscience Poster abstracts. (2021). Brain and Neuroscience Advances. https://doi.org/10.1177/23982128211035062

[3] Pernía-Andrade, A.J., Goswami, S.P., Stickler Y., Fröbe, U., Schlögl, Alois, and Jonas, Peter (2012) A Deconvolution-Based Method with High Sensitivity and Temporal Resolution for Detection of Spontaneous Synaptic Currents In Vitro and In Vivo. Biophys J 103, 1429–1439.

Machine learning for Ecoacoustic monitoring - Dr Alice Eldridge

Monitoring, understanding, and predicting the integrity of our planetary biosphere is the most critical sustainability issue of our time. The emerging science of Ecoacoustics offers the exciting possibility of eavesdropping on ecosystems to assess their health (Sueur & Farina 2015). This project will investigate how cutting-edge machine learning systems can learn compact and informative representations of such data. This will include exploration of learning strategies, architectures and inductive biases, and capabilities for ecological monitoring and prediction. Whereas bioacoustics focuses on the investigation of signals between individuals (inferring behavioural information from intra- and interspecific signals), Ecoacoustics investigates the role of sound at higher ecological and evolutionary organisational units, from population and community up to landscape scales.

This project is in collaboration with Alice Eldridge (Music), and builds upon an internship in summer 2021 that established an initial codebase for these analyses, reimplementing and extending the work of Sethi et al. (2020) with application to a Sussex-collected dataset (Eldridge et al. 2018).

The objective of this project will be to investigate the application of approaches to representation learning using variational auto-encoders, contrastive learning or other semi-supervised approaches to create a rich representation of environmental audio data that provides meaningful ecological indicators and is predictive of animal diversity.

This project would suit a student with interest in advanced machine learning methods and good programming skills.

References:

Sueur, J. and Farina, A. (2015). Ecoacoustics: the ecological investigation and interpretation of environmental sound. Biosemiotics, 8(3):493–502. (Introduction to the field.)

Sethi, S.S., Jones, N.S., Fulcher, B.D., Picinali, L., Clink, D.J., Klinck, H., Orme, C.D.L., Wrege, P.H. and Ewers, R.M., 2020. Characterizing soundscapes across diverse ecosystems using a universal acoustic feature set. Proceedings of the National Academy of Sciences, 117(29), pp.17049-17055.

Eldridge, A., Guyot, P., Moscoso, P., Johnston, A., Eyre-Walker, Y. and Peck, M., 2018. Sounding out ecoacoustic metrics: Avian species richness is predicted by acoustic indices in temperate but not tropical habitats. Ecological Indicators, 95, pp.939-952.

Skills: Machine learning, PyTorch; knowledge of/interest in audio data helpful.

Analysing speech motion from MRI data - Prof Mara Cercignani and Dr Leandro Beltrachini (University of Cardiff)

Teaching people to speak a new language can be complicated by the need to produce unfamiliar sounds with the mouth. This teaching process could be greatly improved by providing illustrations of muscle movements during speech. Collaborators (Professor Mara Cercignani and Dr Leandro Beltrachini, University of Cardiff) are collecting a dataset of MRI videos of subjects speaking Welsh to create such a teaching aid.

This project will investigate computer vision analysis methods, such as optical flow [1] and super-resolution [2] to create high-quality videos illustrating how people move when they are pronouncing different sounds. This project has a similar inspiration to [3], but will use different analysis tools and datasets.

Skills: Python, interests in computer vision and/or machine learning helpful.

References:

[1] Teed, Zachary, and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. ECCV 2020.

[2] Ledig, Christian, et al. Photo-realistic single image super-resolution using a generative adversarial network. CVPR 2017.

[3] https://www.seeingspeech.ac.uk/

Generative modelling of neurodegenerative disease progression

Neurodegenerative diseases such as Alzheimer's disease are known to have heterogeneous presentation across populations. This can complicate differential diagnosis between the various forms of dementia, and predicting how a patient's state may degrade over time.

This project will use a public dataset [1] containing neurological biomarkers to build statistical models of disease progression. A good starting point may be modifying the pySuStaIn project [2], with an investigation of robust likelihood functions that are less affected by outlier measurements. Alternative approaches could involve using generative models, such as variational autoencoders [3] or transformer-based sequence models [4].

[1] https://tadpole.grand-challenge.org/

[2] https://github.com/ucl-pond/pySuStaIn

[3] Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2013

[4] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

Efficient GAN training with pre-trained autoencoders - Dr Justin Pinkney (Lambda Lab, https://twitter.com/Buntworthy)

Training a state-of-the-art GAN at high resolution requires thousands of GPU-hours and is unachievable without substantial high-end GPU resources. This project aims to leverage recent advances in pretrained autoencoders, which have made applying transformers and diffusion models to high-resolution image generation tractable, and utilise them to speed up existing state-of-the-art GAN training.

Skills: Python, PyTorch; interests in Computer Vision helpful.

References

Karras, Tero, Samuli Laine, and Timo Aila. 2018. “A Style-Based Generator Architecture for Generative Adversarial Networks.” arXiv [cs.NE]. arXiv. http://arxiv.org/abs/1812.04948.

Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. “High-Resolution Image Synthesis with Latent Diffusion Models.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/2112.10752.

Warping StyleGAN for caricature generation - Dr Justin Pinkney (Lambda Lab, https://twitter.com/Buntworthy)

A caricature of an individual, compared to a photograph, involves both large-scale deformation and textural modifications, whilst also maintaining the perceived identity. This is a challenging problem in image-to-image translation and is closely related to various face modification methods such as photo-to-anime or “toonification”.

Existing caricature generation methods have often strived to break the process down into two steps: an explicit structural deformation and a stylisation step. However, these methods have in general been limited in the quality and resolution of the generated images. This project aims to directly integrate spatial deformation with high-quality pretrained generative models of faces in order to generate caricature images with higher resolution and fidelity.

Skills: Python, PyTorch, interests in Computer Vision helpful

References:

Karras, Tero, Samuli Laine, and Timo Aila. 2018. “A Style-Based Generator Architecture for Generative Adversarial Networks.” arXiv [cs.NE]. arXiv. http://arxiv.org/abs/1812.04948.

Shi, Yichun, Debayan Deb, and Anil K. Jain. 2018. “WarpGAN: Automatic Caricature Generation.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1811.10100.

Jang, Wonjong, Gwangjin Ju, Yucheol Jung, Jiaolong Yang, Xin Tong, and Seungyong Lee. 2021. “StyleCariGAN: Caricature Generation via StyleGAN Feature Map Modulation.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/2107.04331.

Tracing the Visual Genealogies of C19th Scientific Instrument Illustrations - Dr Alex Butterworth

Background

‘Tools of Knowledge: modelling the communities of scientific instrument makers in Britain, 1550-1914’ is an AHRC-funded collaboration between the University of Sussex (Dr Alex Butterworth), which is leading on the data processing and digital interpretation, the Department of History and Philosophy of Science at the University of Cambridge, and the National Museum of Scotland, with the National Maritime Museum, Greenwich, as a partner organisation.

The project is creating a semantically modelled database of over ten thousand named instrument makers and businesses, their family and professional relationships with other individuals and institutions, together with the thousands of instruments - historical ‘tools of knowledge’ - made by them that still exist in museums. It is drawing on a wide range of written and visual sources and collections data, and applying many different computational methods - new ‘tools of knowledge’ - to extract, model, analyse and interpret. It aims to generate transformative insights into an area of activity crucial to the changing intellectual, industrial and commercial life of Britain over several centuries.

The project is intended to enable new research at all levels, and offers MSc students an early opportunity to explore some of the materials with which we are working, using data science/AI methods to analyse the wealth of collected data.

The Task

From the mid-eighteenth century until the early twentieth, prominent individuals and companies produced trade catalogues that included large numbers of engraved illustrations of scientific instruments. The ‘Tools of Knowledge’ project is collecting a significant corpus of these illustrations that will be available for exploration using Computer Vision methods.

The main objective of this MSc project is to identify visual similarities between the illustrated instruments that can be used to generate a provisional ‘family tree’ for the development of different instrument types (by cross-reference to dates of publication). Beyond this, the generation of new illustrations of GAN-‘imagined’ instruments - hybrids of types or speculative extensions of branches of the tree - might be a further creative output.

This project offers the opportunity to work with real world data and to contribute to a major digital humanities research project. Furthermore, if significant results are obtained there is potential for the dissertation outcomes to be publicised on the project web page, and/or co-authorship in a research publication.

Skills: Machine learning, Python, interest in computer vision helpful.

Prof Evi Soutoglou

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Analysis of 3D gene positioning during differentiation of Stem cells

Every day, the DNA in our cells is subjected to thousands of breaks, which must be repaired to maintain the health of the organism, as unrepaired or incorrectly repaired DNA breaks can lead to disease, such as cancer. DNA double-strand breaks (DSBs) are amongst the most deleterious DNA lesions, since they can lead to chromosomal translocations, which are a leading cause of cancer. To preserve genomic stability, cells have evolved various DNA repair pathways, including homologous recombination (HR), non-homologous end joining (NHEJ) and microhomology-mediated end joining (MMEJ). The NHEJ and MMEJ pathways have been implicated in the formation of chromosomal translocations. HR and NHEJ/MMEJ repair are compartmentalized into different parts of the mammalian nucleus, suggesting that repair pathway choice may in part be regulated by where in the nucleus the broken DNA is located (Lemaitre et al., 2014, Schep et al., 2021). It remains unknown how the nuclear position of a DNA break affects the frequency and the nature of chromosomal translocations. To tackle this question, we induce DSBs in specific gene loci in two different cell types and will identify chromosomal translocation partners (which genomic loci have been incorrectly joined together) using a dedicated sequencing assay (LAM-HTGTS, Hu et al., 2016). Specifically, we induce DSBs at loci that are located at the nuclear periphery in mouse embryonic stem cells (ESC), but which relocate to the nuclear centre during differentiation into neural precursor cells (NPC). We will then compare the frequency of translocations and any changes in translocation partners between the two cell types to assess the effect nuclear positioning has on chromosomal translocations. We have selected candidate genes (n=20) that have previously been shown to have a differential location in ESC and NPC (Peric-Hupkes et al., 2010, Therizols et al., 2014).

As an important first step, the position of these candidate genes must be validated in both cell types, before and after break induction. To assess a gene's position, we use 3D fluorescence in situ hybridization (FISH) to fluorescently label specific individual genes of interest, and we immunostain the lamin B protein to demarcate the nuclear edge. We currently perform manual image analysis, on approximately 200 nuclei per gene, per cell type, using ImageJ. We measure the position of the gene loci relative to the nuclear periphery by measuring the distance between the centre of the FISH signal and the lamin B staining. We would like to develop an automatic image analysis tool for this project, which is capable of: accurate nuclei segmentation; distance measurements between the gene locus and the nuclear periphery; segmentation of nuclei into shells of equal areas (*); and positioning of the gene loci relative to these shells. Overall, the question we aim to answer has significant implications for understanding the mechanisms controlling the formation of chromosomal translocations within the nuclear environment and will aid the understanding of why certain translocations are recurrent in cancer.
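
As a rough sketch of the distance-measurement step only (assuming scikit-image and SciPy, with synthetic arrays standing in for the lamin B and FISH channels; this is not the lab's pipeline), the locus-to-periphery distance can be read from a distance transform of the segmented nucleus:

```python
# Rough sketch of one analysis step: segment the nucleus from the lamin B
# channel and measure the distance of a FISH spot from the nuclear periphery.
# lamin_img and fish_img are synthetic placeholders (one nucleus per image).
import numpy as np
from scipy import ndimage
from skimage.filters import threshold_otsu, gaussian

yy, xx = np.mgrid[:256, :256]
nucleus_true = (yy - 128) ** 2 + (xx - 128) ** 2 < 90 ** 2     # placeholder nucleus mask
lamin_img = gaussian(nucleus_true.astype(float), sigma=3)       # stand-in for lamin B channel
fish_img = np.zeros((256, 256))
fish_img[120, 140] = 1.0                                        # stand-in for a FISH spot

# 1. Segment the nucleus by thresholding the lamin B channel.
nucleus = lamin_img > threshold_otsu(lamin_img)
nucleus = ndimage.binary_fill_holes(nucleus)

# 2. Distance of every pixel inside the nucleus from the nuclear edge.
dist_to_edge = ndimage.distance_transform_edt(nucleus)

# 3. Locate the FISH spot (brightest pixel here) and read off its distance.
spot = np.unravel_index(np.argmax(fish_img), fish_img.shape)
print("distance of locus from nuclear periphery (pixels):", dist_to_edge[spot])
```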

Useful references associated with FISH analysis:

Zaki et al., 2020. DOI: 10.1002/cyto.a.24257

Meaburn and Misteli, 2008. The Journal of Cell Biology, Vol. 180, No. 1, 39–50. DOI: 10.1083/jcb.200708204

Shachar et al., 2015. DOI: 10.1101/sqb.2015.80.027417

Nandy et al., 2009. DOI: 10.1109/IEMBS.2009.5332922

Nandy et al., 2011. DOI: 10.1109/IEMBS.2011.6091480

Gudla et al., 2008. DOI: 10.1002/cyto.a.20550

Therizols et al., 2014. DOI: 10.1126/science.1259587

Dr Ali Taheri

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Geometry of Information and Reproducing Kernel Hilbert Spaces

This project has its roots in Pure Mathematics (Functional and Complex Analysis), where reproducing kernel Hilbert spaces (RKHS) first appeared in the theory of Bergman spaces (RKHS of Szego and Bergman types). The theory has since developed beyond the boundaries of pure mathematics and has found numerous applications in areas including Machine Learning and Data Analysis. In this project we focus on the general theory of RKHS and work on some of its concrete applications, primarily in Information Geometry and Data Analysis, as well as certain applications in the theory of ordinary and partial differential equations.
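
As a small computational illustration of the RKHS machinery in a data-analysis setting (a sketch with synthetic data, using only NumPy), consider kernel ridge regression with a Gaussian kernel: the estimator lives in the RKHS generated by the kernel and, by the representer theorem, is obtained from the Gram matrix:

```python
# Small illustration: kernel ridge regression with a Gaussian (RBF) kernel.
# By the representer theorem the estimator is f(x) = sum_i alpha_i k(x, x_i),
# where alpha solves (K + lam * I) alpha = y. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-3, 3, size=40))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=40)

def gaussian_kernel(a, b, sigma=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

lam = 1e-2
K = gaussian_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

x_test = np.linspace(-3, 3, 5)
y_pred = gaussian_kernel(x_test, x_train) @ alpha
print(np.round(y_pred, 3), np.round(np.sin(x_test), 3))   # prediction vs truth
```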

References:

1 Theory of Reproducing Kernels and Applications by S. Saitoh and Y. Sawano (Springer)

2 Reproducing Kernel Hilbert Spaces in Probability and Statistics by A. Berlinet and C. Thomas-Agnan (Springer)

3 An Introduction to the theory of Reproducing Kernel Hilbert Spaces by V.I. Paulsen and M. Raghupathi (CUP)

4 Information Geometry and its Application by S. Amari (Springer)

5 Mathematical Foundation of Data Analysis by J.M. Phillips (Springer)

Learning Riemannian Manifolds: Distance, Geodesics and Motion Skills

Learning a distance function or metric on a given data manifold is of great importance in machine learning and pattern recognition. The problem begins by first embedding the manifold into a suitable Euclidean space (using a variety of methods and techniques) and then learning the distance function and its corresponding geodesics. Here, PDE techniques such as heat flow on vector fields, gradient flows, Wasserstein spaces and Lyapunov functions, in addition to tools from Riemannian Geometry and Geometric Analysis, play a key role. Applications of this project may include optimal transport, geometric robotics and image processing, to name a few.

References:

1 Learning with Kernels by B. Schölkopf and A.J. Smola (MIT Press)

2 Support Vector Machines by I. Steinwart and A. Christmann (Springer)

3 Information Geometry and its Application by S. Amari (Springer)

4 Mathematical Foundation of Data Analysis by J.M. Phillips (Springer)

5 Geometric Science of Information by F. Nielsen and F. Barbaresco (Eds) (Springer)

Dr Chandrasekhar Venkataraman

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Data-driven modelling of immune-tumour interactions

The ability of immune cells to successfully infiltrate solid tumours, a process known as trafficking, is crucial to the body’s anti-tumour response. Factors such as stress and drug therapies can reduce immune cell trafficking levels, inhibiting the anti-tumour response. In a recent study [1], images of solid tumours and immune cells were obtained in different settings corresponding to varying stress and drug levels. Using novel image analysis algorithms, trafficking levels were quantified for the different cases. The goal of this dissertation is to develop mathematical models for immune cell trafficking into solid tumours that are guided by these data, and moreover to develop methods to validate/parameterise the models against the data.

[1] Al-Hity, G., Yang, F., Campillo-Funollet, E., Greenstein, A.E., Hunt, H., Mampay, M., Intabli, H., Falcinelli, M., Madzvamuse, A., Venkataraman, C. and Flint, M.S., 2021. An integrated framework for quantifying immune-tumour interactions in a 3D co-culture model. Communications Biology, 4(1), pp.1-12.

Prof Jamie Ward

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Developing a connectome biomarker for synaesthesia

Magnetic resonance imaging (MRI) has been demonstrated as a powerful and flexible technique for understanding the human brain. In recent years large scale studies, such as the Human Connectome Project (HCP)[1], have identified acquisition protocols to allow consistent high-quality data collection and preprocessing [2][3]. This preprocessing creates a set of measurements, which describe individual anatomical and functional properties.

This project seeks to develop a framework for identifying biomarkers [4], patterns of measurements that are indicative of a particular condition, from pre-processed HCP data. Specifically, this project will explore how different machine learning analysis strategies can be used to extract biomarkers, with the added goal of identifying approaches that robustly detect interpretable differences between populations. A student on this project will have access to HCP data from people with synaesthesia (unusual experiences such as music triggering vision) which could be compared to normative samples or clinical data from other sources. The student could either run a pre-planned analysis or search the literature to develop an alternative biomarker.

References:

[1] https://www.humanconnectome.org/

[2] Glasser, Matthew F., et al. "The minimal preprocessing pipelines for the Human Connectome Project." Neuroimage 80 (2013): 105-124.

[3] Glasser, Matthew F., et al. "A multi-modal parcellation of human cerebral cortex." Nature 536.7615 (2016): 171-178

[4] Woo, Choong-Wan, et al. "Building better biomarkers: brain models in translational neuroimaging." Nature neuroscience 20.3 (2017): 365.

Dr Julie Weeds

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Mapping AI

This project aims to use AI to understand AI! Using keywords to collect a corpus of articles from different sub-fields of AI, this project will explore variation in terminology across time and sub-fields, using diachronic word embeddings (Kutuzov et al., 2018), and will investigate the diffusion of concepts from one area to another.

Kutuzov et al. (2018). Diachronic word embeddings and semantic shifts: a survey. https://www.aclweb.org/anthology/C18-1117/

Marking Assistant

The aim of this project is to use NLP and ML methods to develop a marking assistant for GCSE-level short-answer questions. A dataset with around 30 student answers per question (4 questions as of 2020) is being developed and will be provided by Eastbourne College.

Benomran and Ab Aziz (2013). Automatic essay grading for short answers in English Language. Journal of Computer Science. https://www.researchgate.net/publication/269338716_Automatic_essay_grading_system_for_short_answers_in_English_language

Taghipour and Ng (2016). A neural approach to automated essay scoring. EMNLP. https://www.aclweb.org/anthology/D16-1193/

Exploiting automatic machine translation to find wildlife exploitation studies in other languages

Automatic methods for searching for studies on academic databases, and on the Internet more generally, are being used to identify articles for biodiversity and macroecological datasets (Cornford et al. 2020). These datasets may be used to provide evidence for the sustainability (or not) of various forms of wildlife exploitation, including hunting by indigenous populations. However, ignoring non-English-language studies may skew the results (Konno et al. 2020). The aim of this project is to exploit automatic translation methods in order to identify non-English-language studies and assess the effect on the conclusions drawn.

Cornford et al. 2020. Fast, scalable, and automated identification of articles for biodiversity and macroecological datasets. Global Ecology and Biogeography. https://onlinelibrary.wiley.com/doi/full/10.1111/geb.13219

Konno et al. 2020. Ignoring non-English-language studies may bias ecological meta-analyses. https://onlinelibrary.wiley.com/doi/10.1002/ece3.6368

MeeTwo: Mental health chatroom for young people

MeeTwo Education Ltd provides a safe (and fully moderated) online space for young people to talk about issues affecting their mental health. Anonymised posts (with metadata) and a collection of mental health resources, developed over the past 4 years, are available. These provide an opportunity for a student to develop a project in a number of ways, including automated moderation of posts, topic tracking, relevance matching/search, and recommendation.

Wang et al. 2020. Improving mental health using machine learning to assist humans in the moderation of forum posts. Health Informatics. http://users.sussex.ac.uk/~juliewe/wang20.pdf

Fact Checking, fake news and Confirmation Bias

Being able to automatically check facts on the internet is a hot topic in NLP and machine learning. However, the way in which we check facts often leads to confirmation bias: we will find evidence supporting the fake fact that we are checking. If I search for “vaccines are bad for you”, the top results will be about side effects and adverse effects, whereas if I search for “vaccines are good for you”, the top results are about the benefits of vaccines. This project will look at rewording potential facts as queries or neutral statements designed to return more balanced results.

• https://www.scientificamerican.com/article/the-psychology-of-fact-checking1/

• https://fever.ai/

• https://misinforeview.hks.harvard.edu/article/the-presence-of-unexpected-biases-in-online-fact-checking/

• https://libguides.reynolds.edu/fakenews/bias

Revision Assistant

The aim of this project would be to help students find and learn definitions of key terms and phrases for an area of study. For example, the input would likely be a set of lecture notes. NLE techniques would be applied to identify key terms / phrases and then link them to other occurrences of those key terms - potentially identifying occurrences which are definitional or most useful for a glossary. A game or quiz element could also be introduced to automatically generate questions and answers based on the linked content.

Automatic Transcription

Over the past year, we have all become familiar with the use of automatic transcription on Zoom and other platforms, and also with its current inadequacy in appropriately supporting non-native speakers. The aim of this project will be to post-process and improve the output of an automatic transcription service using available information about the topic or discourse. For example, the automatic transcription of lectures could be improved by using a language model based on the notes for that module.

Plagiarism Detection

This project will look at plagiarism detection methods for application to Jupyter notebooks. The expectation is that the project would focus on the text cells, but an alternative project could be developed looking at code cells. A variety of methods could be investigated, the simplest being word or n-gram overlap. Extensions could include looking for style markers which suggest that the text has been copied from another source (even when that source is unknown).
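
A minimal sketch of the simplest baseline mentioned above, word n-gram overlap between the text cells of two notebooks (the file names are hypothetical placeholders):

```python
# Minimal sketch: Jaccard overlap of word n-grams between the markdown cells of
# two Jupyter notebooks. File names are hypothetical placeholders.
import json

def markdown_text(path):
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    cells = [c for c in nb["cells"] if c["cell_type"] == "markdown"]
    return " ".join("".join(c["source"]) for c in cells)

def ngrams(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = ngrams(markdown_text("submission_a.ipynb"))
b = ngrams(markdown_text("submission_b.ipynb"))
jaccard = len(a & b) / max(1, len(a | b))
print(f"trigram Jaccard overlap: {jaccard:.3f}")
```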

Source Language Detection

When text is translated from one language to another, tell-tale markers of the source language remain which native speakers might detect as disfluencies. These markers vary according to the source language. This project would look at developing a dataset and classifying translated texts according to the source language.

Measuring Probabilistic and Semantic Cognitive Biases in Language Models

The project aims to explore the extent to which large pre-trained language models (LMs) (e.g., BERT, T5, GPT) encode and reflect specific human biases. The investigation will focus on the conjunction-fallacy heuristic, a bias involving semantic knowledge and probabilistic reasoning. This bias is expressed when the conjunction of two events (or categories) is considered more likely than either event alone. In a typical scenario, humans are given a brief description of a person – such as “Charlie is smart, loving, outgoing” – and are later asked whether such a person falls under category A or category A&B (e.g., “a teacher” vs. “a teacher who likes reading”). The project lends itself to multiple ways of tackling the influence of semantic, social and gender-related biases in LMs. It will require some data collection (human annotation of existing data), designing strategies to query LMs for probability estimation, and comparing such predictions against human judgements, and possibly an investigation of the decision process followed by the model.

https://towardsdatascience.com/dont-fall-for-the-conjunction-fallacy-d860ed89053e
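
One possible way to query an LM for the required probability estimates (a sketch assuming the Hugging Face transformers library and GPT-2; the prompts are illustrative, not taken from the study): compare the model's log-probability for the single category against the conjunction:

```python
# Sketch: comparing the log-probability a causal LM assigns to a single category
# versus a conjunction, given a short description. Prompts are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_prob(text):
    # the model's loss is the mean token cross-entropy, so the total sequence
    # log-probability is approximately -loss * number_of_predicted_tokens
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

context = "Charlie is smart, loving and outgoing. Charlie is"
print(log_prob(context + " a teacher."))
print(log_prob(context + " a teacher who likes reading."))
```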

Impact of Lockdown on Children’s Mental Health - Co-Supervisor Sam Cartwright-Hatton

This project would be looking at questions such as “Did lockdown harm children’s mental health?” and would be based on the analysis of surveys of parents carried out by Psychologists at Sussex during lockdown. The dataset contains responses from around 2500 parents over the period of a year with parents responding between 1 and 8 times each. The data is largely quantitative (rather than free text) and this project would likely suit a Human and Social Data Scientist.

Prof David Weir

For more information on the projects listed below, please use the details provided on the supervisor's profile page.

Active learning methods for low resource document classification

In a wide variety of NLP applications, the need arises to create one or more bespoke document classifiers. For example, suppose that you want to analyse the conversation on Twitter concerned with attitudes towards COVID-19 vaccination. A first step would be to identify a set of search terms that could be used to collect a reasonable sample of this conversation. A typical second step would involve the creation of a (relevancy) document classifier that is intended to filter out tweets that contain one of the search terms, but which are not relevant to the conversation being targeted (perhaps because the matching search term was ambiguous). Once a (mostly) relevant set of documents (tweets) has been assembled, it is common to create a further document classifier (or collection of classifiers) that attempts to divide up the dataset according to which of a number of identifiable sub-topics the documents (tweets) are concerned with. As this example illustrates, this kind of scenario is likely to require the creation of a number of bespoke document classifiers: i.e. classification problems for which no suitable labelled data exists.

State-of-the-art classifiers are currently built using very large pre-trained masked language models (e.g. BERT), and to achieve high performance this typically involves the use of several thousand labelled examples.

In this project you will explore active learning methods that can be used to produce classifiers with reasonable performance from only tens or hundreds of labelled examples. Active learning methods are algorithms that, at each iteration, select the most useful documents to be labelled next.
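A minimal sketch of one common active learning strategy (uncertainty sampling) is given below, using a TF-IDF plus logistic regression classifier from scikit-learn; the toy seed set, pool and batch size are illustrative assumptions, and a BERT-based classifier could be substituted for the linear model.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(labelled_texts, labels, pool_texts, batch_size=10):
    """Return indices of the pool documents the current model is least sure about."""
    vec = TfidfVectorizer(min_df=1)
    X_lab = vec.fit_transform(labelled_texts)
    clf = LogisticRegression(max_iter=1000).fit(X_lab, labels)
    probs = clf.predict_proba(vec.transform(pool_texts))
    uncertainty = 1.0 - probs.max(axis=1)      # low top-class probability = uncertain
    return np.argsort(-uncertainty)[:batch_size]

# One iteration: an annotator would label the selected documents, they would be
# moved into the labelled set, and the loop would repeat.
labelled = ["vaccines are unsafe says user", "booked my covid jab today"]
labels = ["anti", "pro"]
pool = ["not sure about the second dose", "lovely weather at the beach",
        "my arm hurts after the vaccine", "government mandates are wrong"]
print(uncertainty_sampling(labelled, labels, pool, batch_size=2))
```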

The methods under consideration would be evaluated on a wide variety of datasets involving different types of document classification, such as sentiment analysis, and topic-based classification.

Good Python programming skills and familiarity with NLP are essential for this project.

Supervisors: David Weir and Shaun Ring

Links

BERT: https://arxiv.org/abs/1810.04805

RoBERTa: https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/

Data augmentation methods for low resource document classification

In a wide variety of NLP applications, the need arises to create one or more bespoke document classifiers. For example, suppose that you want to analyse the conversation on Twitter concerned with attitudes towards COVID-19 vaccination. A first step would be to identify a set of search terms that could be used to collect a reasonable sample of this conversation. A typical second step would involve the creation of a (relevancy) document classifier that is intended to filter out tweets that contain one of the search terms, but which are not relevant to the conversation being targeted (perhaps because the matching search term was ambiguous). Once a (mostly) relevant set of documents (tweets) has been assembled, it is common to create a further document classifier (or collection of classifiers) that attempts to divide up the dataset according to which of a number of identifiable sub-topics the documents (tweets) are concerned with. As this example illustrates, this kind of scenario is likely to require the creation of a number of bespoke document classifiers: i.e. classification problems for which no suitable labelled data exists.

State-of-the-art classifiers are currently built using very large pre-trained masked language models (e.g. BERT), and to achieve high performance this typically involves the use of several thousand labelled examples.

In this project you will explore data augmentation methods that can be used to produce classifiers with reasonable performance from only tens or hundreds of labelled examples. Data augmentation methods are algorithms that expand (augment) a (potentially small) set of labelled instances by generating new labelled instances based on various heuristics, for example by rephrasing sentences in a document in ways that are presumed not to change the document's label. Data augmentation techniques that could be explored range from simple approaches, such as swapping words, to the use of generation models such as AUG-BERT.
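A minimal sketch of the simplest kind of augmentation mentioned above (randomly swapping word pairs within a document, on the assumption that the label is unchanged) is shown below; more sophisticated approaches would substitute synonyms or use a generation model such as AUG-BERT.

```python
# Minimal sketch: expand a small labelled set by randomly swapping word pairs,
# assuming the swaps do not change the document's label.
import random

def swap_augment(text, n_swaps=2, seed=None):
    rng = random.Random(seed)
    tokens = text.split()
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

def augment_dataset(texts, labels, copies=3):
    new_texts, new_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        for k in range(copies):
            new_texts.append(swap_augment(text, seed=k))
            new_labels.append(label)      # label is presumed unchanged
    return new_texts, new_labels

texts, labels = ["the vaccine rollout was very efficient"], ["positive"]
aug_texts, aug_labels = augment_dataset(texts, labels)
print(len(aug_texts), "documents after augmentation")
```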

The methods under consideration would be evaluated on a wide variety of datasets involving different types of document classification, such as sentiment analysis, and topic-based classification.

Good Python programming skills and familiarity with NLP are essential for this project.

Supervisors: David Weir and Shaun Ring

Links

BERT: https://arxiv.org/abs/1810.04805

RoBERTa: https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/

AUG-BERT: https://link.springer.com/chapter/10.1007/978-981-13-9409-6_266

Summarisation of Interview Responses in a Project Success Dataset

This project will involve the analysis of an existing 241,750-word dataset consisting of transcripts of interviews with project professionals on the role of different organisational factors in ensuring successful projects across a range of industrial sectors.

The transcribed interviews cover a number of topics, including sustainability, technology and data, and interpersonal skills; appropriate keywords will be used to identify which topic each interviewee's response relates to.

Using these topic-specific keywords, it will be possible to collect the text of all of the responses relating to each of the topics of interest, and the idea underlying this project is to explore Natural Language Processing (NLP) methods that are designed to provide a summary of these collections.

The project will involve investigating the effectiveness of state-of-the-art abstractive and extractive summarisation methods on this dataset. Abstractive summarisation involves the generation of a new piece of text that provides a summary of the input text, whereas extractive summarisation selects content from the input texts and uses it to form the summary.

There are many methods that could be explored, including those found here: https://paperswithcode.com/area/natural-language-processing/text-summarization
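As an indication of the starting point, the sketch below runs an off-the-shelf abstractive summariser over a collection of topic-specific responses using the Hugging Face pipeline API; the model name and length limits are illustrative assumptions, and extractive methods would be evaluated alongside approaches of this kind.

```python
# Minimal sketch: abstractive summarisation of pooled interview responses
# with an off-the-shelf Hugging Face summarisation pipeline.
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn")

# Placeholder: in the project these would be all responses matching the
# keywords for one topic, e.g. "sustainability".
topic_responses = [
    "Response one about sustainability ...",
    "Response two about sustainability ...",
]

text = " ".join(topic_responses)
summary = summariser(text, max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
```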

Supervisors

David Weir (Informatics)

David Eggleton (SPRU)

Classification of long documents with transformer models

Transformer models such as BERT offer state-of-the-art performance on document classification problems. One drawback of such models, however, is that they are limited to documents of up to 512 tokens. While this is adequate for many scenarios, e.g. the classification of tweets, there are situations where this presents a problem, e.g. the classification of news articles.

It is possible to break up a long document into suitably sized chunks and classify each one separately, before producing an overall classification of the whole document based on the class of each of the chunks. There are, however, problems with this approach, in particular:

• how best to combine the decisions on each of the chunks to produce a decision for the whole document

• how to handle the fact that the labelled data used to train the classifier will not provide chunk-level labels, but only document-level labels.
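The simplest chunk-and-aggregate strategy might look something like the sketch below, which splits a document into word chunks that fit within the 512-token limit, classifies each chunk, and averages the per-chunk class probabilities; the particular model, chunk size and mean-pooling rule are illustrative assumptions.

```python
# Minimal sketch: classify a long document by splitting it into chunks that fit
# within BERT-style 512-token limits and averaging per-chunk class probabilities.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"   # illustrative model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def classify_long_document(text, words_per_chunk=300):
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    probs = []
    for chunk in chunks:
        inputs = tokenizer(chunk, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1))
    mean_probs = torch.cat(probs).mean(dim=0)        # simple mean over chunks
    return model.config.id2label[int(mean_probs.argmax())], mean_probs

label, probs = classify_long_document("A very long news article ... " * 200)
print(label, probs.tolist())
```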

In this project you will explore the effectiveness of approaches to this problem, considering a variety of different datasets and types of classification problems.

Supervisors: David Weir and Shaun Ring

Links

BERT: https://arxiv.org/abs/1810.04805

Measuring sentence similarity: tackling the cross-encoders asymmetry problem

This project considers methods for the problem of measuring the similarity of two sentences. There are two dominant approaches both of which involve the use of a BERT (or RoBERTa) masked language model: cross-encoders and bi-encoders.

Cross-encoders generally achieve higher performance than bi-encoders on various sentence-pair tasks. However, an asymmetry issue arises from the fact that cross-encoders concatenate the two sentences in their input scheme. The issue is that this has the potential to violate the reasonable assumption that sentence order should not have an impact on a measure of sentence similarity. This issue is not discussed in recent work involving the use of cross-encoders.

In this project, you will explore methods that mitigate this asymmetry in BERT (RoBERTa) cross-encoders, in particular attempting to decrease the percentage of inconsistent predictions while maintaining the overall performance. For example, you could try making modifications to BERT (RoBERTa)’s structure, such as changing its positional embeddings.
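The asymmetry can be observed directly, as in the sketch below, by scoring a sentence pair in both orders with an off-the-shelf cross-encoder from the sentence-transformers library; the specific model is an illustrative assumption.

```python
# Minimal sketch: a cross-encoder can assign different similarity scores to
# the same pair of sentences depending on the order of concatenation.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-base")   # illustrative model

s1 = "A man is playing a guitar on stage."
s2 = "Someone performs music in front of an audience."

score_12, score_21 = model.predict([(s1, s2), (s2, s1)])
print("score(s1, s2):", score_12)
print("score(s2, s1):", score_21)   # in general not identical to score(s1, s2)
```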

Good Python programming skills and familiarity with NLP are essential for this project.

Supervisors: David Weir and Qiwei Peng

Links

BERT: https://arxiv.org/abs/1810.04805

RoBERTa: https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/

Machine Storytelling

Language models have advanced rapidly in recent years, with OpenAI’s GPT-3 language model capable of generating natural language text that can be difficult to distinguish from human-authored texts. The student(s) will examine the limitations and opportunities of current language models, and formulate a dissertation project around the development of a new application/tool within this space.

Areas that could be explored include:

• sustaining coherence over longer outputs

• creating stories with beginnings, middles and ends

• enriching generated stories with knowledge of worlds or settings

• multimedia storytelling incorporating imagery and sound

• training with smaller text corpora (e.g. languages other than English)

• location-specific storytelling using geolocation data

• distinguishing generated texts from human-authored texts

• applying text generation within games, immersive storytelling, or other interactive media contexts

• addressing biases embedded in training data (e.g. racism)

• developing tools to evaluate specific social and ethical risks of new language models

• enabling more human oversight and input during text generation

• developing machine-human collaboration interfaces incorporating generated text

• developing more sustainable and less energy-intensive approaches to text generation

• developing countermeasures against possible abuses of language models (e.g. extremism, fraud)

• improving text generation for specific genres or kinds of texts.

Other topics related to NLP can also be considered.
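As a small indication of the kind of generation the project would build on, the sketch below produces story continuations with an off-the-shelf GPT-2 model via the Hugging Face text-generation pipeline; the model, prompt and length settings are illustrative assumptions rather than part of the project brief.

```python
# Minimal sketch: generating story continuations with an off-the-shelf
# GPT-2 model via the Hugging Face text-generation pipeline.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")   # illustrative model
set_seed(42)

prompt = "The lighthouse keeper had not spoken to anyone in three years, until"
outputs = generator(prompt, max_length=80, num_return_sequences=2)
for i, out in enumerate(outputs, 1):
    print(f"--- continuation {i} ---")
    print(out["generated_text"])
```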

Supervisors: This interdisciplinary project will be jointly supervised by Ben Roberts, Jo Walton (MAH) and either David Weir or Julie Weeds.

References:

Alexander, Anne, Alan Blackwell, Caroline Bassett, and Jo Walton. 2021. Ghosts, Robots, Automatic Writing: An AI Level Study Guide. CDH. https://www.cdh.cam.ac.uk/ghostfictions

Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜’. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. FAccT ’21. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922

Hua, Minh, and Rita Raley. 2020. ‘Playing With Unicorns: AI Dungeon and Citizen NLP’. DHQ: Digital Humanities Quarterly 14.4. https://www.proquest.com/scholarly-journals/playing-with-unicorns-ai-dungeon-citizen-nlp/docview/2553526112/se-2

Weidinger, Laura, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, et al. 2021. ‘Ethical and Social Risks of Harm from Language Models’. arXiv:2112.04359 [cs], December. http://arxiv.org/abs/2112.04359

Discussion-based Community detection in social networks

Much of the existing work on community detection in social networks is based on networks where the edges in the network are derived from the way in which users (nodes in the network) interact. For example, in the case of Twitter, this could be based on who a user follows, or on who a user mentions, retweets, or replies to.

An interesting alternative to this is to base the (strength of) edges in a network on the extent to which users are contributing similar content to an overall conversation taking place on the platform on one or more topics. Communities found on such networks correspond to groups of users who are contributing similar content.

To create a network of this sort, it is possible to convert the content that a user posts into a point in high dimensional space, and this can be done with the use of transformers such as BERT. Users that are mapped to points that are close to each other in the space are more likely to be clustered together into the same community.
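A minimal sketch of this pipeline is given below: each user's pooled content is embedded with a sentence-transformer, a similarity-thresholded graph is built, and communities are found with NetworkX. The example users, the model name and the 0.3 threshold are all illustrative assumptions.

```python
# Minimal sketch: embed each user's pooled content with a sentence-transformer,
# build a similarity-thresholded graph, and detect communities with NetworkX.
import networkx as nx
from sentence_transformers import SentenceTransformer, util

# Placeholder: in the project, each entry would be all posts by one user on the topic.
user_content = {
    "user_a": "vaccines are safe and effective, get your booster",
    "user_b": "boosters protect against severe illness",
    "user_c": "mandates are government overreach and must end",
}

model = SentenceTransformer("all-MiniLM-L6-v2")            # illustrative model
users = list(user_content)
embeddings = model.encode([user_content[u] for u in users])
similarities = util.cos_sim(embeddings, embeddings)

G = nx.Graph()
G.add_nodes_from(users)
for i in range(len(users)):
    for j in range(i + 1, len(users)):
        sim = float(similarities[i][j])
        if sim > 0.3:                                      # illustrative threshold
            G.add_edge(users[i], users[j], weight=sim)

communities = nx.algorithms.community.greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in communities])
```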

In this project, a suitable dataset would be created and used to explore the potential of this approach.

Supervisors: David Weir and Chris Inskip

Links

BERT: https://arxiv.org/abs/1810.04805

A tool for visualising and exploring network (graph) structured data in 2D or 3D

There is a great deal of interest in the data science community in algorithms that construct and analyse social networks. This project will be concerned with building a tool that can be used to visualise and explore such networks in various ways.

The underlying data from which the network is produced may be derived from social networks, but ideally the tool will be agnostic to any specific domain of data. The tool may be built as an interactive counterpart to existing network processing libraries (e.g. NetworkX in Python).

Such tooling will aim to facilitate the exploration of “communities” or “clusters” present in network structure, with functionality to, for example:

• Interact with nodes and edges (e.g. adding and removing annotations)

• Show node and edge annotations (e.g. through the use of textual labels, colours, sizes)

• Highlight differences between changing network structures (e.g. across time)

• Adjust the layout/position of nodes and edges (e.g. through the use of existing layout algorithms such as ForceAtlas)

• Submit edits/annotations made to the network for processing and (re)visualise the result.
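As a pointer to the kind of interactive counterpart mentioned above, the sketch below exports a NetworkX graph, with community-based node colours and hover annotations, to an interactive HTML view using the pyvis library; pyvis is only one possible starting point, and the example graph and colour scheme are illustrative assumptions.

```python
# Minimal sketch: turn a NetworkX graph into an interactive HTML visualisation
# with pyvis, colouring nodes by detected community.
import networkx as nx
from pyvis.network import Network

G = nx.karate_club_graph()                     # example graph shipped with NetworkX
communities = nx.algorithms.community.greedy_modularity_communities(G)

palette = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00"]
for i, community in enumerate(communities):
    for node in community:
        G.nodes[node]["color"] = palette[i % len(palette)]
        G.nodes[node]["title"] = f"community {i}"   # hover annotation

net = Network(height="750px", width="100%", notebook=False)
net.from_nx(G)
net.show("network.html")                       # writes an interactive HTML file
```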

This would be part of an existing project involving machine learning on network structured data.

Supervisors: David Weir and Chris Inskip

Tracing the Idea of a Scientific Instrument across Centuries

Objective and Sources

Scientific instruments played a key role in the intellectual and industrial endeavours and developments of British history, from the sixteenth century to the twentieth. Over this extended historical period, how these instruments were understood and referred to, in diverse kinds of published texts, changed significantly. This MSc project will set out to identify the appearance of such references in historical corpora and to trace some aspect(s) of these changes.

Access will be available, for each student, to one of the following: a medium-scale corpus of mixed texts from pre-1700 and the eighteenth century (English language); a large corpus of nineteenth-century medical texts (multiple European languages but primarily English); the Hansard record of parliamentary debate. These will enable such questions as: how were the ideas of ‘compass’ or ‘microscope’ used literally or figuratively; in what contexts was scientific measurement referenced in Parliament; how were names invented to refer to new kinds of instruments, and what new processes were they associated with in nineteenth-century medical texts?

Background

‘Tools of Knowledge: modelling the communities of scientific instrument makers in Britain, 1550-1914’ is a thirty-month-long AHRC-funded project that began in early 2020. It is a collaboration between the University of Sussex (Dr Alex Butterworth), which is leading on the data processing and digital interpretation, the Department of History and Philosophy of Science at the University of Cambridge and the National Museum of Scotland, with the National Maritime Museum, Greenwich as a partner organisation.

The project is creating a semantically modelled database of over ten thousand named instrument makers and businesses, their family and professional relationships with other individuals and institutions, together with the thousands of instruments - historical ‘tools of knowledge’ - made by them that still exist in museums. It is drawing on a wide range of written and visual sources and collections data, and applying many different computational methods - new ‘tools of knowledge’ - to extract, model, analyse and interpret. It aims to generate transformative insights into an area of activity crucial to the changing intellectual, industrial and commercial life of Britain over several centuries.

The project is intended to enable new research at all levels, and offers a number of MSc students of AI/Data Science at Sussex an early opportunity to explore some of the materials with which we are working, using data science methods to address real project needs.

Methods

• Named entity recognition and part of speech tagging (a brief sketch follows this list)

• Word/graph embedding

• Information extraction of event graphs (event type, actor, place, date, etc.)

• Carry out analysis using off the shelf visualisation software OR

• For those with specialist and relevant coding interests: development of bespoke interactive visualisation, coded in React/D3.js as part of project’s design-led DH activities
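As a small illustration of the first method listed above, the sketch below runs spaCy's off-the-shelf named entity recogniser and part-of-speech tagger over a biography-style sentence; the example sentence and model are illustrative, and the project would need to assess how well such models cope with historical spelling and vocabulary.

```python
# Minimal sketch: named entity recognition and POS tagging with spaCy's
# off-the-shelf English pipeline
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative, biography-style sentence.
text = ("John Bird, mathematical instrument maker of the Strand, London, "
        "supplied a mural quadrant to the Royal Observatory in 1750.")
doc = nlp(text)

print("Entities:")
for ent in doc.ents:
    print(f"  {ent.text!r:40} {ent.label_}")

print("Tokens:")
for token in doc:
    print(f"  {token.text:15} {token.pos_:6} {token.dep_}")
```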

Support

The MSc project will be supervised by Professor David Weir and Dr Alex Butterworth, with advisory support from researchers in the TAG Lab (and the AHRC-funded Tools of Knowledge project). The aim will be to explore and implement generalisable processes and pipelines. However, these will be prototyped and tested in relation to specific research questions and targeted objectives, in consultation with specialist academic historians, digital humanities practitioners, and data visualisation designers.

Benefits

• Chance to work with real world data and corpora that are the subject of active research

• Access to datasets that are mostly already cleaned and prepared and with gazetteers of names and entity keywords

• Opportunity to make recognised contribution to a major DH research project, discussed in project blog posts

• Possibility of recognition of visible association with the project for students producing work of practical value

• Opportunity to be named on co-authored papers with project team in cases of exceptional work that contributes to a research publication.

Supervisors: Alex Butterworth and David Weir

Identifying Person and Object References in Online Sources

Objective and Sources

The ‘Tools of Knowledge’ project seeks to expand the information about instrument makers, types of scientific instruments, and associated places and institutions recorded in its database by creating links to Wikidata and by identifying (and scraping) relevant online data. The task for this MSc project is to work with the directories and gazetteers created with reference to the project’s database and to discover, link and collect related information discovered on the web.

Work on this MSc project will be enabled by access to the Tools of Knowledge ‘SENSIM’ database of makers, instruments, dates, etc., and the associated controlled vocabularies and taxonomies.

Background

‘Tools of Knowledge: modelling the communities of scientific instrument makers in Britain, 1550-1914’ is a thirty-month-long AHRC-funded project that began in early 2020. It is a collaboration between the University of Sussex (Dr Alex Butterworth), which is leading on the data processing and digital interpretation, the Department of History and Philosophy of Science at the University of Cambridge and the National Museum of Scotland, with the National Maritime Museum, Greenwich as a partner organisation.

The project is creating a semantically modelled database of over ten thousand named instrument makers and businesses, their family and professional relationships with other individuals and institutions, together with the thousands of instruments - historical ‘tools of knowledge’ - made by them that still exist in museums. It is drawing on a wide range of written and visual sources and collections data, and applying many different computational methods - new ‘tools of knowledge’ - to extract, model, analyse and interpret. It aims to generate transformative insights into an area of activity crucial to the changing intellectual, industrial and commercial life of Britain over several centuries.

The project is intended to enable new research at all levels, and offers a number of MSc students of AI/Data Science at Sussex an early opportunity to explore some of the materials with which we are working, using data science methods to address real project needs.

Methods

• Named entity recognition and part of speech tagging

• Word/graph embedding

• Information extraction of event graphs (event type, actor, place, date, etc.)

• Carry out analysis using off the shelf visualisation software OR

• For those with specialist and relevant coding interests: development of bespoke interactive visualisation, coded in React/D3.js as part of project’s design-led DH activities

Support

The MSc project will be supervised by Professor David Weir and Dr Alex Butterworth, with advisory support from researchers in the TAG Lab (and the AHRC-funded Tools of Knowledge project). The aim will be to explore and implement generalisable processes and pipelines. However, these will be prototyped and tested in relation to specific research questions and targeted objectives, in consultation with specialist academic historians, digital humanities practitioners, and data visualisation designers.

Benefits

• Chance to work with real world data and corpora that are the subject of active research

• Access to datasets that are mostly already cleaned and prepared and with gazetteers of names and entity keywords

• Opportunity to make recognised contribution to a major DH research project, discussed in project blog posts

• Possibility of recognition of visible association with the project for students producing work of practical value

• Opportunity to be named on co-authored papers with project team in cases of exceptional work that contributes to a research publication

Supervisors: Alex Butterworth and David Weir.

Extracting ‘Event’ data from textual biographies

Objective and Sources

The ‘Tools of Knowledge’ project aims to discover events in the lives of the instrument makers and the stories of the instruments they made, and to model and analyse these as data. Examples of these events might be: when and where was a person educated or involved in a legal dispute; or when were their instruments used in particular experiments or bought by a particular collector? The task, therefore, is to identify likely events (as entity-graph structures) in digitised corpora collected for the current research projects, indexed to controlled vocabularies/taxonomies (names of persons, instrument types, institutions, event types, etc.) produced by the project.

A student working on this MSc project will be provided with a number of small to medium-sized corpora and may choose to work with any or all of them. These comprise:

• Six digitised texts of encyclopaedia-type works: short (1-3 page) biographies and object entries (English)

• Automatically transcribed (HCR: handwritten character recognition) text from 10k index card records; not ‘cleaned’, so presenting a fuzzy-matching challenge (English)

• Free-text database entries: short (1-5 sentence) summary notes relating to object histories or person biographies (English)
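For the uncleaned HCR index-card text in particular, one obvious first step is fuzzy matching of candidate name strings against the project's controlled vocabularies. The sketch below does this with the rapidfuzz library (the standard library's difflib would also work); the gazetteer entries, noisy mentions and score threshold are illustrative assumptions.

```python
# Minimal sketch: link noisy transcribed name strings to a gazetteer of
# instrument makers using fuzzy string matching (pip install rapidfuzz).
from rapidfuzz import fuzz, process

gazetteer = ["Jesse Ramsden", "John Dollond", "Edward Troughton"]   # illustrative entries

noisy_mentions = ["Jcsse Ramsdcn", "E. Troughton & Simms", "Dolland, John"]

for mention in noisy_mentions:
    match, score, _ = process.extractOne(mention, gazetteer,
                                         scorer=fuzz.token_sort_ratio)
    linked = match if score >= 70 else None      # illustrative threshold
    print(f"{mention!r:25} -> {linked} (score {score:.0f})")
```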

Background

‘Tools of Knowledge: modelling the communities of scientific instrument makers in Britain, 1550-1914’ is a thirty-month-long AHRC-funded project that began in early 2020. It is a collaboration between the University of Sussex (Dr Alex Butterworth), which is leading on the data processing and digital interpretation, the Department of History and Philosophy of Science at the University of Cambridge and the National Museum of Scotland, with the National Maritime Museum, Greenwich as a partner organisation.

The project is creating a semantically modelled database of over ten thousand named instrument makers and businesses, their family and professional relationships with other individuals and institutions, together with the thousands of instruments - historical ‘tools of knowledge’ - made by them that still exist in museums. It is drawing on a wide range of written and visual sources and collections data, and applying many different computational methods - new ‘tools of knowledge’ - to extract, model, analyse and interpret. It aims to generate transformative insights into an area of activity crucial to the changing intellectual, industrial and commercial life of Britain over several centuries.

The project is intended to enable new research at all levels, and offers a number of MSc students of AI/Data Science at Sussex an early opportunity to explore some of the materials with which we are working, using data science methods to address real project needs.

Methods

• Named entity recognition and part of speech tagging

• Word/graph embedding

• Information extraction of event graphs (event type, actor, place, date, etc.)

• Carry out analysis using off the shelf visualisation software OR

• For those with specialist and relevant coding interests: development of bespoke interactive visualisation, coded in React/D3.js as part of project’s design-led DH activities

Support

The MSc project will be supervised by Professor David Weir (TAG Lab, NLP/Data Science) and Dr Alex Butterworth (SHL, DH/History and Literature), with advisory support from researchers in the TAG Lab (and the AHRC-funded Tools of Knowledge project). The aim will be to explore and implement generalisable processes and pipelines. However, these will be prototyped and tested in relation to specific research questions and targeted objectives, in consultation with specialist academic historians, digital humanities practitioners, and data visualisation designers.

Benefits

• Chance to work with real world data and corpora that are the subject of active research

• Access to datasets that are mostly already cleaned and prepared and with gazetteers of names and entity keywords

• Opportunity to make recognised contribution to a major DH research project, discussed in project blog posts

• Possibility of recognition of visible association with the project for students producing work of practical value

• Opportunity to be named on co-authored papers with project team in cases of exceptional work that contributes to a research publication.

Supervisors: Alex Butterworth and David Weir

Automatically Find Key Phrases In A Corpus - Utilising Dependency Analysis

This project concerns the exploration of approaches to the problem of identifying a set of key phrases that provides a good indication of the content of a corpus.

The project would involve the following steps:

Select existing datasets from e.g. Kaggle (no ethical review required), or collect data using a platform’s public API such as Reddit or YouTube (ethical review required).

Then apply a keyword finding algorithm to find terms of interest.

Implement an algorithm to extract important uses of these terms in coherent phrases using dependency analysis (a brief sketch of this step appears after these steps).

Design an evaluation to compare strategies.

Consider single large documents vs large collections of small documents.

Consider how “important” might be defined.
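As flagged above, a minimal sketch of the dependency-analysis step is given below using spaCy: for each occurrence of a keyword, the phrase formed by its syntactic subtree in the dependency parse is extracted. The keyword list and input sentence are illustrative placeholders.

```python
# Minimal sketch: for each occurrence of a keyword, extract the phrase formed by
# its syntactic subtree in the dependency parse (spaCy).
import spacy

nlp = spacy.load("en_core_web_sm")

keywords = {"vaccine", "rollout"}                  # illustrative keyword list
text = ("Parents discussed the slow rollout of the vaccine for younger children, "
        "and many praised the vaccine booking system.")

doc = nlp(text)
for token in doc:
    if token.lemma_.lower() in keywords:
        phrase = " ".join(t.text for t in token.subtree)
        print(f"{token.text:10} -> {phrase}")
```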

Supervisor: Andrew Robertson