Secure Access for Genomics in the Cloud
Fast, Secure Access via ITMI’s Hybrid Cloud with AWS + Avere
Genomics is a life-and-death game of big data—the DNA of just one person comprises some six billion bits of information, and a single whole-genome sequence can easily exceed 80GB of data. Multiply that by tens of thousands of sequences, and the storage capacity required to house potentially life-saving genetic information can quickly surpass researchers’ on-premises IT systems and budgets.
At the Inova Translational Medicine Institute (ITMI), in Falls Church, Virginia, biomedical teams have already banked more than 7,000 genome sequences and are on track to reach their goal of more than 20,000 in less than two years’ time. Aaron Black, director of Informatics at ITMI, says that a hybrid cloud built with Amazon Web Services (AWS) and Avere Systems technologies provides the high-speed access, scale, and cost efficiency required to store—and make maximum use of—such massive data sets. “Our hybrid cloud framework built on AWS and Avere allows us to fully leverage both cloud and on-premises resources to build a genome-sequence database with virtually unlimited capacity scaling and IT savings in the millions of dollars. The functionality enabled by the ITMI hybrid cloud is helping researchers analyze large and complex data faster—and as a result, accelerate the pace of specific treatment and create the framework for preventive-care innovation.”
Challenge: Slash Costs for Fast, Secure Access to World’s Largest Genome Database
ITMI researchers want to transform healthcare, moving it from a reactive model to a predictive-medicine model that improves patient outcomes. As the basis of its research efforts, the institute collects data across multiple dimensions (including biological data, biological specimens, medical records, and surveys) from thousands of Inova patients and their families. Biological data, including DNA and RNA, is typically sent from ITMI laboratories to an outside vendor for genomic sequencing and then delivered back to ITMI analysts working to identify the causes of diseases and rare disorders, ascertain how particular conditions develop, and ultimately determine optimal courses of treatment or prevention.
As part of this process, ITMI is assembling what is expected to be one of the world’s largest whole-genome-sequence databases. From its current patient population of some 9,300 people representing more than 110 countries, the institute has already banked more than 250,000 samples, stored in excess of 30 billion variants, and correlated more than 46,500 diagnoses derived from the Inova Epic-based electronic medical record.
For Black’s Informatics team, storing and managing such massive amounts of genomic data presents major challenges in scale, security, resilience, durability, and cost. For example, Black estimates that building traditional on-prem storage infrastructure for petabyte-scale data stores would cost in the tens of millions of dollars. Extremely large file sizes also make data durability a concern: experiencing more than two percent decay per month could jeopardize analysts’ ability to reproduce results.
“We also have to be able to manage multiple user and business priorities,” continues Black. “In the research environment, agility and the ability to rapidly iterate on processes matter most. But on the healthcare side, Inova must ensure HIPAA compliance—that can be a particularly daunting challenge when you have to manage the movement and storage of hundreds of millions of large files that can be hundreds of gigabytes each.
“Additionally, because we’re supporting a production environment that must deliver on specific business objectives, building resiliency into both methodology and systems is essential to supporting operations. The infrastructure has to run at peak availability and efficiency. And of course, all of it must be transparent to researchers—they need to focus on the science, not underlying compute and storage platforms.”
Solution: Avere for Responsive, Secure Access for Hybrid Cloud
ITMI’s hybrid cloud, illustrated below, integrates both on- and off-prem infrastructure resources. On the compute side, a 1,024-core SGI UV 2000 with 16TB of cache-coherent shared memory enables high-performance processing for genomic analysis. A ten-node Linux server farm with NetApp storage provides general-purpose compute and storage services, as well as batch access (via Altair’s PBS Professional workload manager and job scheduler) to the SGI cluster. Total on-prem storage capacity includes approximately 1PB of disk storage and 40TB of SSDs.
Avere FlashCloud software integrates public object storage with on-prem NAS into a global namespace (GNS) to enable a consolidated view across ITMI storage resources. Leveraging this technology on an Avere FXT 3850 Edge filer cluster, ITMI can present clients with simple, high-speed access to the on-prem NetApp storage, as well as to virtually unlimited cloud-based Amazon Simple Storage Service (S3) capacity that houses the massive genomic database.
Leveraging Avere FlashMove and Avere virtual FXT Edge filer technologies, ITMI also has the flexibility to more efficiently move large data sets without disruption and to run applications in the Amazon Elastic Compute Cloud (EC2) with high-performance, low-latency data access. FlashMove software provides non-disruptive data mobility within the GNS and creates a seamless on-ramp for moving large datasets to the cloud.
Benefits: Scale, Savings, Speed, Security
Scalability + Durability at Much Lower Cost
Avere’s framework provides a highly available, secure, and scalable data storage solution for managing ITMI’s massive genomic database. “The Avere solution serves as the glue that allows us to tie together disparate storage systems and treat the cloud as on-premises infrastructure—but without the high overhead costs of a traditional data center,” says Black. “We’re currently storing 1.7PB of omic data in S3. To achieve the same durability (99.999999999% in S3) and redundancy that we have in the Amazon cloud would have cost upwards of eight figures to replicate in an on-premises data center. The savings is dramatic. And of course, the cloud offers very simple, immediate, and nearly infinite scaling.”
Maximum Utilization and Value
“In the past, our data workflow was complicated and lengthy,” Black comments. “For example, to get DNA/RNA data to Illumina, our genomic sequencing vendor, we had to buy hundreds of 3TB disks, load and encrypt ITMI lab data onto the disks, then ship them to San Diego. Illumina ran the sequencing processes, stored the results on those disks, and then shipped them back to us. At that point, before anyone could begin to use the data, our Informatics team had to unencrypt the data, run our MD5 algorithms to verify data integrity, catalogue all the data, then push it up to Amazon for storage and shared access. That workflow represented weeks of wait time for analysts, all of the processes were manual, and there was considerable potential for disruption. If a download failed part-way through or if the power went out, we had to restart. No matter how big your pipe is, it takes a long time to download hundreds of terabytes of data, so a failure could mean restarting a seven-day job.”
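The integrity check in that older workflow (running MD5 over returned files before cataloguing them) can be sketched in a few lines. This is a minimal illustration, not ITMI's actual tooling: the function names, the manifest format (a mapping of file name to expected MD5 hex digest), and the chunk size are all assumptions made for the example. Reading in fixed-size chunks keeps memory use flat even for files of many gigabytes.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=8 * 1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_integrity_failures(manifest, root):
    """Compare files under `root` against a hypothetical manifest.

    `manifest` maps a relative file name to its expected MD5 hex digest.
    Returns the names that are missing or whose digest does not match,
    in manifest order.
    """
    failures = []
    for name, expected in manifest.items():
        candidate = Path(root) / name
        if not candidate.is_file() or md5_of_file(candidate) != expected.lower():
            failures.append(name)
    return failures
```

A caller would pass the vendor-supplied digests as the manifest and treat any returned name as a file to re-transfer before cataloguing. MD5 is used here only because the text names it; it detects accidental corruption but is not suitable as a defense against deliberate tampering.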
Today, Illumina returns sequencing results directly to AWS S3 storage buckets, and ITMI researchers, through the Avere filer, have direct, high-speed access to the genome database in the cloud. Using the Amazon Cloud eliminates the need for ITMI to bring up its own firewalls, allowing Black’s team to automate information sharing among researchers.
Analysts primarily study the variations in the human genome, so although genomic files are very large, only about three percent of that data is unique—that is the hot data automatically cached by the Avere, leaving the 97% of rarely accessed cold data in the cloud. Black describes the benefits: “On-prem caching through Avere dramatically accelerates processing and time-to-results—we’ve seen analyses that previously ran for days now complete in a matter of hours.
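To make the hot/cold split concrete, the arithmetic can be sketched as follows. This is a rough back-of-the-envelope illustration only: it assumes the roughly three percent figure applies uniformly to the 1.7PB that Black cites as stored in S3, and it uses decimal units (1PB = 1,000TB). The function name is hypothetical.

```python
def split_hot_cold(total_tb, hot_fraction):
    """Split a data store into hot (cached) and cold (cloud-only) capacity in TB."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    hot_tb = total_tb * hot_fraction
    return hot_tb, total_tb - hot_tb


# Assumed inputs: 1.7PB expressed as 1,700TB, with about 3% of the data hot.
hot_tb, cold_tb = split_hot_cold(1700.0, 0.03)
print(f"hot: {hot_tb:.0f} TB, cold: {cold_tb:.0f} TB")  # prints "hot: 51 TB, cold: 1649 TB"
```

Under those assumptions, only about 51TB would need to sit in the fast on-prem cache tier, while roughly 1,649TB stays in object storage.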
“Avere also lets us maximize utilization of on-premises HPC resources. In the past, because we did not have a fast, easy way to make data accessible to the SGI system, for example, researchers in some cases ended up running algorithms on Amazon EC2, even though we had plenty of compute power on-site. Today, the Informatics team can quantify the comparative cost and turnaround times of jobs, allowing researchers to choose the best combination of on- and off-prem compute/storage options for getting work done. Extracting maximum value from our infrastructure investments is one of the biggest contributors to the fast ROI we’ve achieved on the Avere purchase.”
Cloud Security and Agility
As a FIPS-compliant hybrid cloud NAS solution, Avere also enables secure data transfer. With the ability, through both Avere and AWS, to secure data at rest and in transit, ITMI was able to execute a HIPAA Business Associate Agreement indicating its compliance with personal health information protection guidelines. With Avere, all encryption keys are stored and managed on Inova’s premises, preventing any unauthorized access to private data in the AWS cloud.
Black says that the hybrid cloud gives ITMI more agility at much lower costs. “We’re a relatively new organization charting new courses in genomic research. To support these efforts, we need a flexible infrastructure that can deliver needed performance and capacity—and we often don’t know at the start of a project exactly how much of either we’ll need. Before we architected the hybrid cloud, we either had to overestimate requirements or, if we undersized a job, incur delays while researchers waited for data and compute resources. Today we have seamless scalability in the cloud and maximum utilization of on-premises resources. As a result, production runs at optimal efficiency, and we’re achieving incremental improvements to prediction and outcomes.
“Building the hybrid cloud with AWS and Avere, we’ve been able to deliver infrastructure quickly, resiliently, securely, and cost-effectively. That capability allows us to better support the critical work of researchers using genomic data to inform health care, applying that information to immediate needs and directly integrating genomic research into the practice of medicine.”
About Inova Translational Medicine Institute
The Inova Translational Medicine Institute (ITMI), a not-for-profit research institute, delves into the genomics component of personalized medicine, utilizing genomic and clinical information from patients to develop innovative methods for personalized patient care. Studies at the Institute generate a large genomic and clinical data set that can be used as data in a variety of fields, from computational biology to psychology and biomedical research applications. ITMI utilizes this information to better characterize and predict the onset of disease with the goal of implementing preventive medicine based on the unique genomics of the individual patient. www.inova.org