Kaarin Goodburn, of the Chilled Food Association, and Edward Haynes, of Fera, look at the development of genome sequencing technologies and their potential uses in tracking food contamination micro-organisms. They consider technical, practical and policy issues to be resolved from a food perspective.
Introduction
Since the discoveries of the structure of DNA and the genetic code, the importance of understanding genetic variability and diversity in relation to the biological world has become clear.
The ease and speed of identification of microorganisms has undergone major development over the last three decades: from plating out on to selective media, Gram staining, biochemical reactions, and sero- and phage typing, through amplification-based techniques (e.g. PCR, Amplified Fragment Length Polymorphism, Multiple Locus Variable Number Tandem Repeat Analysis) and restriction enzyme-based techniques (e.g. ribotyping, Pulsed Field Gel Electrophoresis), to, most recently, whole genome sequencing (WGS). WGS is just one of the techniques covered by the umbrella term Next Generation Sequencing (NGS). In just over 20 years since the first sequencing of a bacterium’s genome, technology has advanced such that a microorganism can now be sequenced in a day or less using WGS.
Internationally, researchers, governments and industry are now using WGS to track pathogens' movements in greater detail than was previously possible. The result has been the creation of libraries of genome data, such as that produced by the GenomeTrakr Network, which receives more than 1000 genomes a month from laboratories in Argentina, Australia, Austria, Canada, Denmark, Germany, Italy, Ireland, the UK and the USA.
WGS can help to:
• rapidly determine potential sources of single cases of illness, outbreaks and contamination events
• determine microorganisms’ characteristics, including resistance to antibiotics
• reveal the epidemiology of previously unlinked cases of infection.
Technologies
A range of molecular tools has been developed to distinguish between different strains of pathogenic microorganisms. The purpose of this is to be able to differentiate between unrelated infection or contamination events and link together those that are epidemiologically connected. Many of these tools are applicable to bacterial pathogens and can distinguish between different lineages with varying degrees of discrimination. Some of these techniques are outlined below:
MALDI-TOF
Matrix Assisted Laser Desorption/Ionisation-Time of Flight is a rapid, proteomics-based approach to identify bacteria to species level cheaply. MALDI-TOF compares the protein profile obtained from a bacterial culture to a library of known patterns to accurately and rapidly identify the species. The approach currently struggles to identify bacteria to subspecies or below, but advances in that direction are being made [1].
Serotyping
Serotyping differentiates bacterial subtypes based on the presence of different cell surface antigens. This approach is particularly well established for some bacteria, such as Escherichia coli or Salmonella enterica. Indeed, there are over 2,500 known serotypes of Salmonella. However, in the UK, approximately half of Salmonella infections are caused by either Salmonella Enteritidis or Salmonella Typhimurium [2]. Therefore, if a clinical isolate belongs to one of these serotypes, serotyping gives limited additional information about its potential source.
MLST
Multi Locus Sequence Typing (MLST) is a DNA-based approach that uses the DNA sequences of around seven conserved genes within a bacterial species to identify subtypes. This approach is discriminatory, portable (methodologies and data are easily transferred between laboratories) and gives information about bacterial population structure, which can be useful for understanding outbreaks. Importantly, this approach gives a nomenclature which makes independent studies highly comparable.
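As a minimal sketch of how such a nomenclature can work, each of the seven genes can be given an allele number and the combination of numbers looked up in a shared table to give a sequence type (ST). The locus names, allele numbers and ST values below are invented for illustration and are not taken from any real scheme.

```python
from typing import Dict, Optional, Tuple

# Hypothetical seven-locus scheme: these locus names are placeholders.
LOCI = ("locA", "locB", "locC", "locD", "locE", "locF", "locG")

# Hypothetical profile table: allele-number combination -> sequence type (ST).
PROFILES: Dict[Tuple[int, ...], int] = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 1, 3, 1, 1): 2,
    (4, 2, 1, 5, 3, 1, 2): 3,
}


def sequence_type(alleles: Dict[str, int]) -> Optional[int]:
    """Return the ST for an isolate's allele numbers, or None if the profile is not in the table."""
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(profile)


def loci_differing(a: Dict[str, int], b: Dict[str, int]) -> int:
    """Count the loci at which two isolates carry different allele numbers."""
    return sum(a[locus] != b[locus] for locus in LOCI)
```

Because two laboratories only need to exchange seven small integers (or a single ST) to compare isolates, this kind of lookup is what makes the approach portable.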
PFGE
Previously thought of as the gold standard for pathogen subtyping, Pulsed Field Gel Electrophoresis (PFGE) involves enzymatic digestion of DNA followed by gel electrophoresis with a voltage which periodically changes direction. This allows very fine level discrimination between strains. However, the technique can be cumbersome and there may be an element of subjectivity about the interpretation of the banding patterns from imaging the gel.
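The digestion step can be illustrated with a small in silico sketch: cutting a sequence wherever an enzyme's recognition site occurs yields a set of fragment lengths, which is what the banding pattern on the gel reflects. This is a simplification under stated assumptions: the molecule is treated as linear (bacterial chromosomes are usually circular), the cut is placed at the start of the site, and the function name is hypothetical.

```python
from typing import List


def fragment_lengths(sequence: str, site: str) -> List[int]:
    """Cut a linear sequence at the start of every occurrence of `site`.

    Returns the fragment lengths, longest first (the order they would
    roughly appear in from the top of a gel).
    """
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Skip zero-length fragments (e.g. when the site is at position 0).
    lengths = [end - start for start, end in zip(bounds, bounds[1:]) if end > start]
    return sorted(lengths, reverse=True)
```

Two isolates whose fragment length lists differ would be expected to give different banding patterns; in practice the bands are read from a gel image, which is where the subjectivity mentioned above arises.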
Strain typing for epidemiology
Whole Genome Sequencing
Bacterial WGS is, by comparison to previous techniques, conceptually relatively straightforward. DNA is extracted from a pure bacterial culture, sequenced in a high throughput manner and then interpreted bioinformatically. Several different analytical approaches can be taken dependent upon the objective, for example:
• Compare sequences to a reference database of known sequences using a rapid approach such as k-mer analysis (comparing the number of shared short patterns of DNA sequence between samples).
• Map back to a reference genome to obtain an accurate number of nucleotide differences between samples. Samples should be sequenced at multiple times coverage (each nucleotide in the genome is covered by multiple sequence reads) to allow for error correction. Reasonable coverage would be 20 times.
• Assemble the individual reads into longer, continuous sections of sequence (contigs). This gives information about gene content, which enables determination of type for backwards compatibility to some earlier schemes (e.g. MLST) or identification of genes that may be absent in a reference genome as a result of horizontal gene transfer (movement of genetic material between different organisms).
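The k-mer comparison mentioned above can be sketched in a few lines. The example below is illustrative only: it scores two sequences by the proportion of distinct k-mers they share (the Jaccard index, chosen here as one possible measure), uses a very short k for readability where real tools typically use much longer k-mers, and ignores the reverse strand.

```python
from typing import Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of distinct substrings of length k in the sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(seq_a: str, seq_b: str, k: int = 5) -> float:
    """Shared k-mers divided by all distinct k-mers seen in either sequence."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 0.0  # neither sequence is long enough to contain a k-mer
    return len(a & b) / len(a | b)
```

A score near 1 suggests two samples are very similar and worth a slower, more accurate comparison (such as mapping to a reference); a score near 0 suggests they are unrelated.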
In an outbreak or contamination response situation, we are primarily interested in using these comparisons for intraspecies analysis: if two isolates are of different Listeria species, for example, we already know they are different, and strain typing them will not give more information about whether they share a source. Proposed genomic measures for delineating prokaryote species include percentage identity at the 16S ribosomal RNA gene, with cut-offs varying from 97% to 99% identity, although some species have indistinguishable 16S sequences, especially at the shorter loci targeted for NGS. Alternatively, an average nucleotide identity (ANI) of 95-96% across homologous genes has been suggested as the level above which strains belong to the same species [3]. This could equate to tens of thousands of single nucleotide polymorphisms (SNPs, i.e. single-base differences between genomes) across the thousand or more conserved genes shared by different strains of the same species, without even considering larger scale differences due to the presence or absence of mobile genetic elements.
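The step from an ANI threshold to "tens of thousands" of SNPs is simple arithmetic, sketched below. The average gene length of 1,000 bases is an assumption made for illustration (it is not a figure from the studies cited), and every difference is treated as a single-base substitution.

```python
def expected_snps(ani_percent: float,
                  shared_genes: int = 1000,
                  mean_gene_length_bp: int = 1000) -> int:
    """Approximate number of single-base differences implied by an ANI value.

    shared_genes and mean_gene_length_bp are illustrative assumptions.
    """
    aligned_bases = shared_genes * mean_gene_length_bp
    return round(aligned_bases * (100.0 - ani_percent) / 100.0)
```

Under these assumptions, 95% ANI corresponds to about 50,000 differences and 96% to about 40,000, so two strains can sit comfortably within one species while differing at a very large number of positions.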
Analysis of such large datasets requires access to bioinformatics support and powerful computational infrastructure. Some examples of this, such as the CLIMB infrastructure [4], have now become cloud-based. Some commercial graphical user interface-based software is available for straightforward applications (such as MLST), while more complex or bespoke investigations often involve the use of text-based input and the creation of novel programmes.
The comparison of large numbers of genome sequences isolated from foodstuffs and clinical cases gives the possibility of linking smaller and more geographically dispersed outbreaks than was previously possible. This is the premise behind the FDA’s GenomeTrakr project, which isolates and sequences L. monocytogenes, Salmonella, Shigella/E. coli and Campylobacter spp. and shares the sequences freely on the National Center for Biotechnology Information (NCBI) SRA (Sequence Read Archive). This has implications for data sharing, especially of associated metadata. So-called ‘minimal metadata’ are supplied by FDA to protect personally identifiable information whilst being epidemiologically useful. These metadata consist of information such as the year of collection, the geographical state of collection and the foodstuff from which the isolate was obtained. The need to standardise and maximise usefulness of metadata has been recognised. Although major sequence uploaders, such as FDA, CDC and PHE, tend to focus on the priority pathogens mentioned, a bioproject for sharing sequence data from any pathogen can be set up in conjunction with NCBI, with some of the other species already being sequenced including Vibrio parahaemolyticus, Citrobacter freundii and Acinetobacter spp.
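One way to picture ‘minimal metadata’ is as a fixed record holding only the three kinds of information listed above, with everything else dropped before sharing. The sketch below is illustrative: the class, the field names and the keys of the fuller record (including the ISO-style date string) are all hypothetical and do not reflect the actual FDA or NCBI submission format.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class MinimalMetadata:
    """Hypothetical minimal record: year, state and food source only."""
    collection_year: int
    collection_state: str
    isolation_source: str


def to_minimal(full_record: Dict[str, Any]) -> MinimalMetadata:
    """Keep only the three minimal fields; anything else in the record is dropped.

    Assumes 'collection_date' is a string starting with a four-digit year.
    """
    return MinimalMetadata(
        collection_year=int(str(full_record["collection_date"])[:4]),
        collection_state=str(full_record["state"]),
        isolation_source=str(full_record["food"]),
    )
```

Reducing the collection date to a year and discarding any names or addresses is the kind of trade-off described above: the record remains useful for spotting patterns while revealing less about individuals or businesses.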
[Image: Salmonella]
Next Generation Sequencing
The ability to routinely perform WGS on bacteria has come about as a consequence of increased throughput of DNA sequencing provided by so-called NGS technologies. These can be divided, simplistically, into high throughput platforms, which produce many short sequences (e.g. Illumina or Ion Torrent devices), and lower throughput platforms, which generate fewer but much longer sequences (e.g. PacBio or Oxford Nanopore). The different platforms are best suited to different applications. For example, high throughput sequencers are a reliable and cost-effective way to rapidly generate accurate, draft quality bacterial genomes. The long read sequencers are ideal for creating high quality complete genomes (against which drafts can be mapped) or for other applications, such as investigating microbial populations. This can be achieved by using PCR to amplify a conserved gene from a total DNA extract from the sample of interest and then sequencing the resulting mixture of PCR products. Unlike genome sequencing, where reads from around the genome can be assembled into continuous sections of DNA, this PCR amplicon sequencing relies on a single PCR product being traversed by a single read. Long read sequencers allow longer PCR products to be sequenced, which are more likely to allow discrimination between closely related taxa.
WGS applications and current issues
WGS can be applied in food safety contexts to identify pathogens isolated from food or environmental samples. These data can then be compared to clinical isolates from patients to determine if they are linked and to help track the source. This relies on having a well populated database against which to compare samples. Some countries have made greater strides towards this than others.
WGS data can be used to:
• Help predict the geographic origins of pathogenic isolates, potentially aiding significantly in outbreak delineation.
• Establish links to wider, apparently disconnected outbreaks and therefore provide a greater appreciation of events.
• Investigate historic events and re-occurrences and potentially make linkages to ongoing food production operations.
• Establish trends or shifts in successful variants, e.g. any increase in aggressive antibiotic resistant strains.
• Identify shifts in and uptake of genes, e.g. stx1, stx2 and eae in E. coli.
• Potentially reveal new pathogens.
• Reveal similarities/differences between e.g. Citrobacter and Salmonella.
However, no technology is without disadvantages and difficulties in application. In particular, it should be noted that the presence of a gene does not necessarily indicate presence of a live organism. An isolate should therefore be required in advance of legal and regulatory action.
The basis of using similarities in genome sequence between isolates of a foodborne pathogen to infer a connection is the assumption that the genotypes found in one region are likely to be different from the genotypes found in another. However, it can be difficult to identify definitively a causal link or directionality of transmission without further information and metadata. This may be critical to a food business that finds itself implicated in an incident on the basis of WGS data.
Current technical issues with WGS focus on the validation of data generated. Low quality sequence or badly curated accessions can lead to erroneous conclusions being drawn. The impact of poor analysis and understanding of sequence data is evident in recent reports of the presence of unusual contaminants, such as human or rat DNA, in processed foods.
It is therefore extremely important that genomics professionals should consult with food scientists and technologists to ensure that conclusions are viable, not only from genetic data, but also from the perspective of food production technologies and practices.
The benefits of this collaborative approach were seen in a pilot study between Fera Science and a food manufacturer interested in exploring the origin of Listeria spp. detected in its facility. Combining the microbiological sampling and expertise of the manufacturer’s contract lab with Fera’s WGS capabilities allowed the sequencing and interpretation of genomes from a number of different Listeria spp. This information enabled researchers to suggest potential sources of contamination. For example, the same type of L. ivanovii was detected in a drain and on finished products; this could indicate transmission in one or other direction or contamination from a common source. Greater temporal sampling would allow an investigation into whether the drain was persistently contaminated with the same type of L. ivanovii. In another example, a finished product contained the same type of L. innocua as a raw ingredient, which was not used to make the product. This indicates that there had been some transmission event from the raw ingredient area of the plant to the high care environment, which is useful information to aid further investigations.
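The kind of comparison described in this pilot study, looking for the same type turning up in more than one part of a facility, can be sketched as a simple grouping step. The function name, site names and type labels used here are hypothetical, and a shared type on its own says nothing about the direction of transmission.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def types_shared_between_sites(
    isolates: Iterable[Tuple[str, str]]
) -> Dict[str, Set[str]]:
    """Group (site, type) pairs by type and keep types found at two or more sites."""
    sites_by_type: Dict[str, Set[str]] = defaultdict(set)
    for site, type_label in isolates:
        sites_by_type[type_label].add(site)
    return {t: sites for t, sites in sites_by_type.items() if len(sites) >= 2}
```

Adding a sampling date to each record would allow the same grouping to be repeated over time, which is the ‘greater temporal sampling’ needed to judge whether a site such as a drain is persistently contaminated.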
Genetics timeline
1859 Charles Darwin publishes 'On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life’
1866 Gregor Mendel discovers the basic principles of genetics. This is formally acknowledged in 1900
1869 Friedrich Miescher first identifies what he calls ‘nuclein’ in the nuclei of human white blood cells – deoxyribonucleic acid (DNA)
1882 Walther Flemming discovers mitosis and chromosomes
1905 William Bateson coins the term ‘genetics’
1909 Wilhelm Johannsen coins the term ‘gene’
1950 Erwin Chargaff discovers that DNA composition is species specific
1953 Discovery of DNA double helix structure by James Watson, Francis Crick, Rosalind Franklin and Maurice Wilkins
1961 Marshall Nirenberg and others crack the genetic code, linking the DNA sequence to protein synthesis
1977 Fred Sanger and Walter Gilbert independently develop efficient DNA sequencing methods; Sanger's becomes known as Sanger Sequencing
1977 First genome is sequenced using Sanger Sequencing - 5,386 bases - Escherichia coli bacteriophage φX174
1983 Polymerase chain reaction (PCR) developed by Kary Mullis for amplifying DNA
1986 First automated DNA sequencer marketed
1990 The Human Genome Project begins, aiming to sequence the entire human genome
1995 First bacterial genome sequenced, Haemophilus influenzae, 1.8m bases
1996 Dolly the sheep is cloned
1998 First multi-cellular organism sequenced – nematode Caenorhabditis elegans – 100m bases
2003 Human Genome Project completed to 99.9% accuracy. Humans have between 20,000 and 25,000 genes and ~3 × 10⁹ base pairs
2005 First next-generation sequencing platform launched, capable of sequencing 1m bases in a day
2014 The $1000 genome - sequencer launched capable of sequencing 45 genomes a day at a cost of $1000 per genome
2016 Sequencing carried out in space on the International Space Station, using MinION
Advantages of Whole Genome Sequencing (WGS)
· Precision and accuracy – high resolution and discriminatory power, reproducibility.
· Comparatively low cost – NGS (2nd and 3rd generation) can generate millions of reads (35 basepairs–100s of kilobasepairs in length) in single runs.
· Rapid - 3rd generation sequencing can produce long reads by sequencing single molecules in real time.
· Powerful and highly attractive tool to assist with epidemiological investigations.
In the near future, WGS technology for routine use will permit accurate identification and characterisation of bacterial isolates, but application of findings in commercial and legal scenarios presents issues to be resolved.
Current technical issues regarding WGS
Issues include:
• Individual DNA sequences generated on NGS platforms are usually not 100% accurate. Accuracy could be improved by techniques such as using high fidelity polymerases or PCR-free library preparation. Accurate WGS relies on the generation of a consensus sequence by sequencing the same region multiple times.
• Currently, RNA sequencing is laborious and relies on the generation of complementary DNA from RNA transcripts, which can introduce biases. In the near future, however, some technologies (e.g. Nanopore) will be able to sequence RNA molecules directly. This will have a number of implications, as the presence of RNA shows which genes are being transcribed, indicating what an organism is doing and implying that it is, or recently was, alive (due to RNA’s generally more rapid degradation than DNA).
• Method and laboratory certification is not well developed. There is some accreditation for some small parts of the workflows and some proficiency testing occurs either in-house (e.g. FDA) or organised by consortia (e.g. Global Microbial Identifier (GMI)). Some internal guidelines have been developed by some organisations (e.g. CDC). GMI is also working on standards.
• Risk of sample contamination. Some NGS techniques (e.g. metabarcoding, RNAseq) are at greater risk of contamination through their reliance either on PCR or on identifying pathogens from low numbers of sequences. In a laboratory environment, the risk of contamination can be reduced by physical separation of stages (e.g. PCR setup and PCR product clean-up), physical protection of samples from user contamination (with appropriate personal protective equipment), regular disinfection and use of appropriate controls. When genome sequencing from pure bacterial cultures, the target DNA will be in massive excess, and when mapping back to a reference the coverage can be a guide to identity. Plasmids or conserved regions might have higher than average coverage, but accidental contaminants will likely have relatively low coverage in normal circumstances.
• Risk from external contaminants when using portable sequencers. It would be difficult to sequence microbial genomes without the use of culture-based techniques, as background genetic material will be in excess, unless some form of enrichment is used or the sampling matrix contains low amounts of host DNA (e.g. urine). How robustly sequencers will perform in field use, e.g. when sampling irrigation water, is not currently clear.
• Assembling a complete and fully accurate genome sequence from short read data is not always possible. Some areas of the genome are highly repetitive and isolates may have sequences that have been inserted or deleted, both of which can complicate assembly and subsequent alignment (use of a reference genome to piece a genome back together following sequencing of DNA fragments). Long read technologies and a range of assembly software can assist in resolving these problems, as can complementary techniques, such as mate-pair sequencing.
• It is currently challenging to rapidly compute and interpret the relevant information from large data sets.
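The idea of building a consensus from repeated sequencing of the same region can be sketched as a majority vote at each position. The example below is a simplification under stated assumptions: the reads are taken to be already aligned to the same coordinates, '-' marks positions a read does not cover, and the function names and the minimum depth of 3 are illustrative choices rather than recommended settings.

```python
from collections import Counter
from typing import List


def mean_coverage(read_count: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each base of the genome."""
    return read_count * read_length / genome_length


def consensus(aligned_reads: List[str], min_depth: int = 3) -> str:
    """Majority-vote consensus over reads aligned to the same coordinates.

    'N' is emitted where fewer than min_depth reads cover a position.
    If two bases tie, the one seen first at that position is used.
    """
    if not aligned_reads:
        return ""
    length = max(len(read) for read in aligned_reads)
    result = []
    for i in range(length):
        column = [r[i] for r in aligned_reads if i < len(r) and r[i] != "-"]
        if len(column) < min_depth:
            result.append("N")
        else:
            result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)
```

With enough reads over each position, an error in any single read is outvoted by the others, which is why multiple times coverage matters; the same depth figure also helps with contamination, since a stray contaminant would normally contribute only a handful of reads.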
Taken together it is apparent that making direct linkage between clinical, food or environmental samples is not straightforward. Dr Peter Gerner-Smidt of CDC stated at the September 2016 IFSH Whole-Genome Sequencing for Food Safety Symposium that: ‘A WGS match between a food isolate and a clinical isolate does NOT mean the food caused the patient’s illness. They likely share an ancestor somewhere in the food production chain’… ‘Epidemiological and traceback information remains critical.’
[Image: Genome sequencing]
Practical application issues
The current technical issues to be resolved in WGS are compounded by additional practical issues when it is used in food safety applications:
• How many SNPs can differ between isolates while the identity of the microorganism is retained?
• What is a valid case definition and how will agencies interpret and apply it?
• What is an outbreak?
• What are the implications of WGS for identifying antibiotic resistance genes?
• How will any emerging food safety concerns be communicated, and by whom to whom?
• How will partners and stakeholders agree in advance on appropriate use of data? Will this need to be done at international level? What would interpretation criteria be?
• IP and isolate ownership require national agreements to enable metadata sharing and a common approach to metadata.
• What data should be publicly available?
• Is funding appropriate to develop and use the technology with sufficient confidence to bring legal action?
• What level of certainty is there when sample numbers are low?
• How should methods, metadata and attribution be standardised/harmonised internationally?
• What are the requirements for IT and bioinformatics infrastructure?
• How should data integration, sharing and knowledge transfer be managed?
Given these issues, validated interpretation criteria must be established for the use of WGS and the weight of all evidence should be assessed, i.e. WGS in parallel to the primary tests and epidemiological evidence, and carefully interpreted on a case-by-case basis. One problem with epidemiological investigation of foodborne outbreaks and incidents is that people may be asked to complete historic food consumption questionnaires covering periods of several weeks, where recall may be incomplete or inaccurate, potentially highlighting incorrect food vehicles.
Conclusions
WGS is a major advance in gene sequencing technology, presenting many opportunities for gaining greater understanding of the identity, genetic nature and presence of microorganisms.
However, the technology is still being developed. Issues remain regarding the classification and handling of data and what conclusions are viable regarding the discovery of an organism in a sample, be it a food or environmental sample.
The quantity of data that is becoming available is unprecedented. How will it be applied? What conclusions can be drawn? The uses of the data require careful consideration by all using WGS, particularly governments and their agencies, laboratories and food businesses.
Kaarin Goodburn MBE FIFST RSPH Chilled Food Association and Dr Edward Haynes Fera
1. Lartigue, M.-F. 2013. Matrix-assisted laser desorption ionization time-of-flight mass spectrometry for bacterial strain characterization. Infection, Genetics and Evolution, 13, 230-235.
2. Public Health England 2014. PHE Gastrointestinal Infections Data, Summary of Salmonella Surveillance, 2013.
3. Kim, M., Oh, H. S., Park, S. C. & Chun, J. 2014. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. International Journal of Systematic and Evolutionary Microbiology, 64, 346-351.
4. Connor, T. R., Guest, M., Southgate, J., Ismail, M., Bakke, M., Poplawski, R., Loman, N. J., Thompson, S. E., Thompson, S., Bull, M. J., Kitchen, C., Smith, A., Richardson, E., Sheppard, S. K. & Pallen, M. J. 2016. CLIMB (the Cloud Infrastructure for Microbial Bioinformatics): an online resource for the medical microbiology community. Microbial Genomics, 2.