Big data approach to understand disease transmission and emergence

The EID2 database compiles data on pathogens and their hosts, helping us to anticipate, understand and combat infectious diseases worldwide.

Disease is a complicated puzzle

Disease transmission and emergence are complex and can be difficult to track or predict. Our changing climate also means that infectious diseases are expanding their reach and emerging in new areas.

Pathogens (such as bacteria, viruses, fungi, and parasites) usually have a wide range of known host species that act as reservoirs.

Reservoirs are groups of organisms or locations, such as contaminated water, where a pathogen is naturally maintained without always causing disease. For example, humans are the reservoir for measles, whilst aquatic environments are a well-known reservoir for cholera.

Other pathogens circulate in vector species, such as ticks, mosquitoes, or lice, which can readily spread infections.

Mapping relationships can offer huge benefits

Hosts can also transmit diseases to other hosts, within or outside their own species. This creates a highly complex network between pathogens, vectors and hosts.

Effectively mapping these relationships improves our understanding of potential routes of disease transmission, which can offer huge benefits: for example, in knowing what the disease risks are in a population or geographical area and how best to manage and eliminate them.

60% of known infectious diseases are zoonotic, as are an estimated 75% of new and emerging diseases. Many high-profile endemic or epidemic diseases, such as HIV, originated in animals. Tracking cases where infections move into a new species or mutate into new strains plays an important role in disease preparedness.

The issue

There is an ongoing lack of comprehensive and centralised data on the interactions between vectors, hosts, and pathogens. A 2013 study reviewed 347 clinically important diseases and found only 4% were comprehensively mapped.

Better data would provide clues and information to help answer key questions concerning disease, including:

  • how diseases are being transmitted between hosts or between vectors and hosts
  • potential evolution of transmission routes
  • which species or environments are acting as wild reservoirs for diseases
  • where novel pathogens or emerging diseases might come from
  • where significant outbreaks might occur

Dr Maya Wardeh, a Tenure-Track Fellow at the University of Liverpool, says:

It’s important to not just focus on 1 species. With a centralised database, we can start seeing hidden links, like how reservoirs of certain pathogens interlink with the vectors or target populations.

The project

Researchers at the University of Liverpool developed the ENHanCEd Infectious Diseases Database (EID2), an open-access database that annotates and integrates data on pathogens, vectors, hosts, and locations. EID2 was developed with Biotechnology and Biological Sciences Research Council (BBSRC) funding.

EID2 collects data from 2 sources:

  • genetic sequences and associated metadata from GenBank (over 139 million sequences). Crucially, the metadata adds useful information about each genetic sample, including which organism it was taken from and where. EID2 stores locations at the level of administrative division (such as states in the US, counties in the UK, or departments in France)
  • publications data from PubMed (over 8 million publications). EID2 searches PubMed records for publications in which a pathogen co-occurs with a host or a location. It uses these cross-references as evidence for host-pathogen or pathogen-location associations. This method can produce false positives, so EID2 only accepts evidence for a relationship when it has four or more supporting publications. Regular manual verification of this process helps keep EID2 data accurate
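The publication threshold described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not EID2's actual pipeline: the record format, the names `MIN_PUBLICATIONS` and `supported_associations`, and the sample records are all hypothetical. Only the rule of four or more supporting publications comes from the description above.

```python
from collections import defaultdict

# Threshold taken from the text: four or more supporting publications.
MIN_PUBLICATIONS = 4


def supported_associations(records, min_publications=MIN_PUBLICATIONS):
    """Return (pathogen, partner) pairs backed by enough distinct publications.

    `records` is a hypothetical iterable of (publication_id, pathogen, partner)
    tuples, where `partner` is a host or a location that co-occurs with the
    pathogen in that publication.
    """
    publications = defaultdict(set)
    for publication_id, pathogen, partner in records:
        # Using a set means each publication counts once per pair.
        publications[(pathogen, partner)].add(publication_id)
    return {
        pair: len(ids)
        for pair, ids in publications.items()
        if len(ids) >= min_publications
    }


# Made-up example records, for illustration only.
example_records = [
    (1, "pathogen A", "host X"),
    (2, "pathogen A", "host X"),
    (2, "pathogen A", "host X"),  # repeated mention in the same publication
    (3, "pathogen A", "host X"),
    (4, "pathogen A", "host X"),
    (5, "pathogen A", "host Y"),  # only one publication, so rejected
]

print(supported_associations(example_records))
```

Counting distinct publications rather than raw mentions is a design choice in this sketch: it stops a single paper that names a pair many times from clearing the threshold on its own.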

The team have also incorporated climate data (from 1900 to 2000) and demographic data sourced from:

  • the Tyndall Centre for Climate Change Research
  • the Food and Agriculture Organisation
  • the Socioeconomic Data and Applications Center

This data is stored on a grid system: a map of the globe divided into squares, each with its own reference.

EID2 links pathogen occurrences and their locations to the grid, which can be used to create occurrence maps and frequency histograms of climate profiles. This allows users to see whether a pathogen occurs more often under particular climate conditions, such as a given range of temperature or rainfall.
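As a rough illustration of how occurrences might be linked to a grid and summarised as a frequency histogram, here is a minimal Python sketch. The 0.5 degree cell size, the function names, the input formats and the sample values are all assumptions made for this example; the article does not describe EID2's actual grid resolution or data structures.

```python
import math
from collections import Counter

# Assumed resolution for this sketch, not a documented EID2 value.
CELL_SIZE_DEGREES = 0.5


def grid_cell(latitude, longitude, cell_size=CELL_SIZE_DEGREES):
    """Return the (row, column) reference of the grid square containing a point."""
    row = math.floor((latitude + 90.0) / cell_size)
    column = math.floor((longitude + 180.0) / cell_size)
    return row, column


def climate_histogram(occurrences, cell_temperature, bin_width=5.0):
    """Count pathogen occurrences per temperature bin.

    `occurrences` is a list of (latitude, longitude) points and
    `cell_temperature` maps a grid reference to a mean temperature in
    degrees Celsius. Both are hypothetical inputs.
    """
    histogram = Counter()
    for latitude, longitude in occurrences:
        temperature = cell_temperature.get(grid_cell(latitude, longitude))
        if temperature is None:
            continue  # no climate data for this square
        bin_start = math.floor(temperature / bin_width) * bin_width
        histogram[bin_start] += 1
    return dict(sorted(histogram.items()))


# Made-up occurrences and temperatures, for illustration only.
example_occurrences = [(53.4, -3.0), (53.41, -2.99), (0.2, 30.1)]
example_temperatures = {(286, 354): 9.5, (180, 420): 23.0}

print(climate_histogram(example_occurrences, example_temperatures))
```

Each key in the result is the lower edge of a temperature bin, so the histogram can be read as "number of recorded occurrences in squares whose mean temperature falls in this range".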

The EID2 database can be easily viewed via its online portal.

A new opportunity for interdisciplinary learning

The EID2 database was established and developed with interdisciplinary funding from multiple UK Research and Innovation (UKRI) councils, including:

  • BBSRC
  • Natural Environment Research Council (NERC)
  • Medical Research Council (MRC)

This included:

  • a NERC ERA-NET Environmental Health award
  • BBSRC Tools and Resources Development Fund (TRDF) awards
  • an MRC National Productivity Investment Fund (NPIF) fellowship

The TRDF awards enabled EID2 development to be finalised and more data to be added, including crop plant pathogens and notifiable animal diseases.

Dr Maya Wardeh was funded through the NPIF fellowship to study disease networks and how transmission routes affect them. She then progressed to predicting host-pathogen interactions using EID2 data.

Looking at the big picture

Dr Wardeh says:

I came from a computer science background, but since I started working with EID2, I’m fascinated with the big picture. It’s really shifted my career.

Host-pathogen interactions and networks are cool because they allow us to present the big picture, all those very complex interactions, in ways computers can understand.

When you start looking at networks, you start seeing pathways that haven’t been found or studied.

Using computational methods to fill that picture was the next logical step.

EID2 makes you think about things, such as how things are transmitted, how viruses are different to bacteria, and how reservoirs differ.

Only when you have the big picture can you zoom in to try and find a solution.

Finding 40-fold more potential hosts of coronaviruses

The recent pandemic kickstarted a flurry of research into coronaviruses. SARS-CoV-2 is one of 6 coronavirus species that can infect humans, but many more circulate in animal populations.

Dr Wardeh used the genetic sequences stored on EID2 to collect information on known and potential mammalian hosts of coronaviruses. Her work was also funded by BBSRC Impact Acceleration Account COVID funding, as well as being underpinned by earlier EID2 funding.

The more coronavirus species there are within a single host, the more likely it is that one of them will mutate, potentially enabling it to infect other host species, including humans.

The study identified potential hosts of multiple coronaviruses, with species such as wild boars, domestic cats, and several bat species hosting the most.

Machine learning techniques predicted that 40-fold more species act as hosts for four or more subtypes of coronaviruses than are currently known.

The study suggests that, because the scale of pathogen-host interactions has been underappreciated, the potential for novel coronaviruses to be generated may also have been underestimated.

Estimating the impact of climate change on disease

Climate change is causing the relationship between climate and disease to evolve rapidly. EID2 has been used to estimate its impact on pathogens and vectors.

A study (McIntyre et al., 2017) used EID2 to assess European human or domestic animal pathogens for climate sensitivity. 157 human or animal pathogens, or both, present in Europe were systematically reviewed for evidence of sensitivity to 190 climate driver terms. These terms were divided into 11 groups, such as altitude, temperature, and rainfall, among others.

63% of the pathogens reviewed were climate sensitive, with 82% of pathogens driven by rainfall, temperature, humidity, and wind speed. This highlights how climate changes make predicting disease emergence or re-emergence an increasingly complex task.

Finding optimal temperature of pathogens and vectors

EID2 data was readily applied to find the optimal temperatures of vector-borne diseases. Researchers took the location recorded for each malaria sample and linked it to climate data, which allowed them to find the optimal temperature of both the pathogen and its vector.

Key knowledge gaps on the relationship between environmental factors and pathogens, especially concerning sensitivity to climate change, can be filled by utilising EID2 data. This could improve disease preparedness and help researchers and policymakers prioritise pathogens of concern.

For example, the climate sensitivity study was cited in 2 United Nations reports:

  • ‘Food systems at risk: new trends and challenges’ in 2019
  • ‘Climate change: unpacking the burden on food safety’ in 2020

Both reports used the study to evidence how many pathogens are climate sensitive, as well as how there are complex pathogen-environment relationships that will likely be affected by climate change.

Patterns within disease networks

Dr Wardeh also collaborated on a 2020 paper with Professor Matthew Baylis and Dr Konstans Wells that used the EID2 database to explore virus sharing between mammals.

The study found that domesticated animals were central to host-virus networks and played a dominant role in the spread of viruses. It also found that RNA and DNA viruses were spread differently through networks. Carnivorous and bat species played a major role in the spread of RNA viruses but only minor roles in DNA virus spread. Cattle, pigs, horses and sheep played a major role in DNA viral sharing.
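To make the idea of a host-virus network more concrete, the sketch below builds a simple host-to-host sharing network from host-virus pairs and counts how many other hosts each host shares a virus with. This is a toy illustration only, not the method used in the study: the function names and the sample associations are hypothetical, and the degree count is just one basic way of describing how central a host is.

```python
from collections import defaultdict
from itertools import combinations


def host_sharing_network(associations):
    """Build a host-host network where edge weight = number of shared viruses.

    `associations` is a hypothetical iterable of (host, virus) pairs.
    """
    hosts_by_virus = defaultdict(set)
    for host, virus in associations:
        hosts_by_virus[virus].add(host)

    shared = defaultdict(int)
    for hosts in hosts_by_virus.values():
        # Every pair of hosts infected by the same virus shares that virus.
        for host_a, host_b in combinations(sorted(hosts), 2):
            shared[(host_a, host_b)] += 1
    return dict(shared)


def degree(network):
    """Number of distinct hosts each host shares at least one virus with."""
    neighbours = defaultdict(set)
    for host_a, host_b in network:
        neighbours[host_a].add(host_b)
        neighbours[host_b].add(host_a)
    return {host: len(others) for host, others in neighbours.items()}


# Made-up associations, for illustration only.
example_associations = [
    ("host A", "virus 1"),
    ("host B", "virus 1"),
    ("host A", "virus 2"),
    ("host C", "virus 2"),
    ("host A", "virus 3"),
    ("host B", "virus 3"),
]

network = host_sharing_network(example_associations)
print(network)
print(degree(network))
```

In this toy data, "host A" shares viruses with both other hosts, so it has the highest degree. A host in that position is the kind of node the text describes as central to the network.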

Dr Wardeh also led a 2021 study using EID2 data to predict unknown hosts of known viruses. The study found 20,000 unknown associations between viruses and mammals. This means current knowledge could be underestimating links by a factor of 4.3 and that viruses could have a larger average host range by a factor of 3.2.

The 2020 study was cited in a 2020 EU publication, ‘The link between biodiversity loss and the increasing spread of zoonotic diseases’. It was also cited in a 2022 International Union for Conservation of Nature report, ‘Situation analysis on the roles and risks of wildlife in the emergence of human infectious diseases’.

Future applications and continuous improvement

These examples of EID2 data use are only a portion of potential applications for the data. Ongoing collaborations include linking mammal and bird data sets and linking animal and plant data sets to look at associations.

Dr Wardeh says:

For every project, we build on the previous projects to find new ways of connecting these data.

Improvements to the database have also been discussed, with the EID2 team wanting to expand their source database for publications beyond the current use of PubMed. This would allow for more plant-specific papers to be sourced. Dr Wardeh would also like to link to other data sources so that users could cross-reference publications, sequences, and published data all in one place.

In terms of an even bigger picture, Dr Wardeh envisions being able to integrate experimental data. She says:

We want to not only predict but also improve these predictions with links to experiments.

We could predict the vector competence for a pathogen, which is how likely a mosquito species is to be a vector for a pathogen.

That will remain a probability until we take it to the lab and generate useful data to confirm or deny that hypothesis.

That’s not only useful for our models but also useful for everyone.

For me, personally, the link between computer science and experimental biology is where the future lies in general, especially with the addition of disease surveillance in the field.

EID2 could be a useful tool for guiding lab experiments.

A broad range of applications

By using openly accessible information in a new integrated way, data from EID2 has been used in work to:

  • trace the history of human and animal diseases
  • predict the effects of climate change on pathogens
  • produce maps showing which diseases are most likely in particular areas
  • categorise the complex relationships between human, plant, and animal carriers and hosts of numerous pathogens

EID2 is a valuable tool in our arsenal to combat emerging infectious diseases.

Top image credit: imaginima, iStock, Getty Images Plus via Getty Images
