The Wellcome Trust Sanger Institute focuses on large-scale sequencing and analysis of human, animal and bacterial genomes. As a major player in the Human Genome Project (HGP), the Sanger Institute sequenced approximately one third of the 3200 million “letters” of the human genome.
The work of the Sanger Institute relies upon computers as much as it does on traditional laboratory methods. MySQL, the world’s leading open-source database, is used in many of the Institute’s mission-critical programs to manage massive amounts of data for human gene and disease identification.
For example, Sanger’s Pathogen Group sequences and analyzes a wide range of disease-causing bacteria and protozoa, including the organisms that cause malaria, leprosy, tuberculosis, salmonellosis and bubonic plague. Arcturus, a data management system for the genome assembly process, is a classic LAMP application comprising Apache, MySQL, and Perl.
“The Wellcome Trust has an open release policy on both the genome data and the software which is produced at the Sanger Institute – it is available to anyone,” says Dr. David Harper, senior computer programmer in the Pathogen Group. “MySQL is easy to install and maintain, and it doesn’t require a DBA to manage. Anyone who can run their own Linux box can run MySQL.”
Reading a genome means reading its chromosomes, each of which contains many millions of letters of DNA. However, current sequencing technology can only read short snippets of DNA, up to about a thousand letters long. To solve this dilemma, Sanger researchers shatter multiple copies of the chromosome into many thousands of fragments, then reassemble the pieces by looking for overlaps between the fragments. The sequenced fragments are referred to as “reads”, and overlapping reads are merged into contiguous sequences, or “contigs”. The DNA sequence itself is then reconstructed from the contigs.
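The overlap-and-merge idea can be illustrated with a toy greedy assembler. This is a minimal sketch, not the method Arcturus or the Sanger assembly software uses: the function names, the `min_len` threshold and the sample fragments are all assumptions made for illustration, and a real assembler must also cope with sequencing errors, repeats and reverse-complemented fragments.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Overlaps shorter than `min_len` (an illustrative threshold) count as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap.

    Stops when no pair overlaps by at least `min_len` letters and
    returns the remaining contiguous sequences.
    """
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the pair, keeping the shared letters only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up fragments that overlap end to end.
print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
```

The pairwise comparison is quadratic in the number of fragments, which is one reason assembly at the scale of half a million fragments needs significant computing power and indexed storage.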
Assembling genomes from these fragments requires significant computing power. Initially, all data was stored and manipulated via flat files, which worked well for sequencing the early, small genomes. However, the challenge of larger genomes and more projects necessitated the move to the MySQL database.
The Pathogen Group supports two clusters of HP AlphaServer ES40 and ES45 machines. Multiple MySQL servers run in each cluster. Each server supports several dozen databases, each containing tens of gigabytes of data.
According to Harper, the Pathogen Group uses a cluster alias that allows clients to address the entire cluster via a single name and IP address. This helps achieve load balancing and, if one host fails, a CAA daemon automatically restarts its MySQL servers on another host in the cluster.
“All our servers must be very robust,” says Harper. “In our experience of running 4 servers per cluster over the last 18 months, the only reason a server dies is that the host machine has gone down or the cluster has been rebooted.”
Backup is achieved through replication. Each MySQL server in one cluster is mirrored by a replication slave server in the other cluster, and vice versa. The clusters are in different locations to protect against catastrophes such as fire or an irreparable file system crash. Each week, a full backup is run on each server and automatically copied to its slave server. Binary logs are flushed and backed up nightly.
Within the Arcturus program, multiple MySQL servers run on different host/port combinations. Each server holds data on several organisms, and each organism has its own database comprising 30 tables. These tables are accessed via a Perl module that acts as a proxy for all queries, insertions and updates.
Primary data – reads and contigs – are put into static tables. A typical project has half a million reads. Fluid data, such as mapping information from reads to contigs, are typically stored in “long and lean” tables, which have a small number of fixed-length columns but many millions of rows.
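A “long and lean” mapping table might look something like the following. This is only a sketch under stated assumptions: the table name `read_to_contig`, its columns and the sample rows are hypothetical, and Python's built-in SQLite is used as a stand-in so the example runs without a MySQL server. The actual Arcturus schema is not described here.

```python
import sqlite3

# In-memory SQLite database standing in for one organism database.
conn = sqlite3.connect(":memory:")

# A few narrow integer columns; the table grows by rows, not by columns.
conn.execute(
    """
    CREATE TABLE read_to_contig (
        read_id      INTEGER NOT NULL,
        contig_id    INTEGER NOT NULL,
        contig_start INTEGER NOT NULL,
        contig_end   INTEGER NOT NULL,
        PRIMARY KEY (read_id, contig_id)
    )
    """
)
# Index the column used to look up every read placed in a given contig.
conn.execute("CREATE INDEX idx_contig ON read_to_contig (contig_id)")

# Made-up mapping rows: (read_id, contig_id, contig_start, contig_end).
rows = [(1, 10, 1, 700), (2, 10, 650, 1400), (3, 11, 1, 820)]
conn.executemany("INSERT INTO read_to_contig VALUES (?, ?, ?, ?)", rows)

reads_in_contig_10 = [
    row[0]
    for row in conn.execute(
        "SELECT read_id FROM read_to_contig "
        "WHERE contig_id = ? ORDER BY contig_start",
        (10,),
    )
]
print(reads_in_contig_10)
```

Keeping every column fixed-length lets the storage engine compute row positions directly, which tends to help when a table holds many millions of rows and is updated often.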
In addition, all servers have a common database of shared information, which contains a list of organisms. Clients can find the correct instance for a particular organism by querying any server. The common database also includes an organism database model, which describes the relationships between tables in an organism database, along with organism-independent biochemical and biological information. Finally, it holds user administration data.
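The organism lookup could work roughly as follows. This is a hedged sketch rather than the Arcturus implementation: the `organisms` table, its columns, the host names and the `find_instance` helper are all hypothetical, and SQLite again stands in for a MySQL connection to any one server's common database.

```python
import sqlite3
from typing import Optional, Tuple


def find_instance(common_db: sqlite3.Connection, organism: str) -> Optional[Tuple[str, int]]:
    """Return the (host, port) of the server holding `organism`, or None."""
    row = common_db.execute(
        "SELECT host, port FROM organisms WHERE name = ?", (organism,)
    ).fetchone()
    return (row[0], row[1]) if row else None


# Stand-in for the common database, which every server carries a copy of.
common = sqlite3.connect(":memory:")
common.execute(
    "CREATE TABLE organisms (name TEXT PRIMARY KEY, host TEXT NOT NULL, port INTEGER NOT NULL)"
)
common.executemany(
    "INSERT INTO organisms VALUES (?, ?, ?)",
    [
        ("organism_a", "dbhost-a.example", 3306),  # hypothetical entries
        ("organism_b", "dbhost-b.example", 3307),
    ],
)

print(find_instance(common, "organism_a"))
print(find_instance(common, "unknown_organism"))
```

Because the list is replicated to every server, a client needs only one known entry point and can then connect to whichever host and port the lookup returns.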
New organism databases and tables are created easily through a script process. Meta-data on the organism is automatically propagated to all other MySQL databases.
Ensembl is another Sanger Institute project relying on MySQL. This software system provides public access to assembled DNA sequences as well as information on the genes that have been identified. MySQL is used both as a backend database server for the web site www.ensembl.org and as part of the pipeline that generates the gene-identification data. The database is several hundred GB in size and is spread across a federated database of 18 MySQL servers.
“The mission of the Wellcome Trust is to foster and promote research with the aim of improving human and animal health,” says Harper. “The Pathogen Sequencing Unit at the Sanger Institute is dedicated to furthering the Wellcome Trust's mission through genome sequencing and analysis. Without robust and reliable databases, this would not be possible.”

