Cloud Compute for Genomics-driven Cancer Drug Discovery

Making genomics data storage accessible to Amazon EC2 with reduced latency

Discovering a new drug requires extensive research and compute power. Because many bioinformatics applications are easily parallelizable, scientists can take full advantage of a linearly scalable compute infrastructure to accelerate pipelines—and ultimately time to discovery.

Lured by the compute potential of hundreds of on-demand servers, scientists at H3 Biomedicine in Cambridge, Massachusetts, set off on their first journey to the cloud. Even with high-speed access to cloud compute over a private network connection, access to on-premises storage meant a high-latency 920-mile round trip that seemed to last from here to eternity. Users waiting for command-line responses found that the 15-millisecond latency between on-premises storage and the nearest cloud compute infrastructure rendered the service unusable for big-data bioinformatics applications accessing data housed at H3 in Cambridge.

Bret Martin, principal research computing architect at H3, says the introduction of an Avere Systems Virtual FXT (vFXT) Edge filer changed everything. “Avere reduced latency to the cloud by more than 15X, enabling scientific applications to run at maximum performance. Today Avere lets H3 bioinformaticians take full advantage of Amazon Web Services (AWS) Elastic Compute Cloud (EC2) resources to analyze more genomic data and rapidly explore multiple scientific approaches. At the same time, we’ve gained administrative, financial, and other benefits. We’re well on our way to making EC2 our primary compute environment for high-performance bioinformatics and computational chemistry applications.”

Challenge: Reduce Latency to the Cloud

Founded in 2011, H3 Biomedicine integrates human cancer genomics with next-generation synthetic organic chemistry and tumor biology capabilities to translate cancer patients’ genomes into powerful precision therapeutics. Demand for more compute power and high data growth—nearly 30 percent year over year—prompted an evaluation of cloud services to augment on-premises 80-core compute servers backed by EMC Isilon storage.

Martin explains, “We initially took advantage of Amazon’s Simple Storage Service (S3) to store about 20 percent of our data in the cloud, leaving some 80 TB stored locally on the Isilon to support the on-premises HPC infrastructure. Unfortunately, when we explored direct access to the Isilon-housed data from EC2, latency issues appeared insurmountable. Even with a 1 Gbps AWS Direct Connect link between our data center and Amazon’s nearest availability zone in Virginia, high latency made it unusable. Co-locating our Isilon closer to the Amazon cloud and migrating the data both seemed impractical, and utilizing EC2 block storage was expensive.”

Solution: Avere for Fast Cloud Access

Testing an Avere virtual NAS solution, H3 determined the technology eliminated cloud latency issues. The software-only Avere vFXT Edge filer runs in EC2 alongside H3 applications, providing extremely low-latency access to active genomics data and enabling applications to run at peak performance. Martin notes, “Using Amazon’s CloudFormation templates, we spun up the Avere vFXT cluster and started running applications in EC2 within just two hours. Today approximately 70 percent of Isilon data is accessed from EC2 through the Avere vFXT cluster—and the process is transparent to both applications and users. Using Avere, we’re achieving consistent 700-900-microsecond responsiveness between EC2 instances and our on-premises Isilon storage.”
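As a rough illustration of the latency figures quoted above, the following Python sketch compares the 15-millisecond direct path with the 700-900-microsecond path through the Avere vFXT cluster. The two latency values come from the article; the workload of one million sequential, latency-bound file operations is a hypothetical number chosen only for illustration, not a measurement from H3.

```python
# Latency figures quoted in the article.
DIRECT_LATENCY_S = 15e-3                       # 15 ms, EC2 to on-premises storage
CACHED_LATENCY_RANGE_S = (700e-6, 900e-6)      # 700-900 microseconds via the vFXT cluster

# ASSUMPTION: a hypothetical workload of sequential, latency-bound file
# operations. This count is illustrative and does not come from the article.
ASSUMED_SEQUENTIAL_OPS = 1_000_000


def latency_reduction(before_s: float, after_s: float) -> float:
    """Return the factor by which per-operation latency drops."""
    return before_s / after_s


def total_wait_s(latency_s: float, ops: int) -> float:
    """Return time spent waiting if each operation pays one full round trip in sequence."""
    return latency_s * ops


if __name__ == "__main__":
    best_s, worst_s = CACHED_LATENCY_RANGE_S
    low = latency_reduction(DIRECT_LATENCY_S, worst_s)
    high = latency_reduction(DIRECT_LATENCY_S, best_s)
    print(f"Latency reduction: {low:.1f}x to {high:.1f}x")

    direct_h = total_wait_s(DIRECT_LATENCY_S, ASSUMED_SEQUENTIAL_OPS) / 3600
    cached_min = total_wait_s(worst_s, ASSUMED_SEQUENTIAL_OPS) / 60
    print(f"Direct path wait: {direct_h:.1f} hours")
    print(f"Cached path wait (worst case): {cached_min:.1f} minutes")
```

Under these assumptions the reduction works out to roughly 16.7x to 21.4x, which is consistent with the "more than 15X" figure Martin cites. For the hypothetical one million sequential operations, waiting time falls from about 4.2 hours to 15 minutes or less. Real applications overlap and batch I/O, so actual gains depend on the access pattern.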

H3 scientists run a wide array of applications in EC2. The current toolset includes the Burrows-Wheeler Aligner (BWA), Bowtie, the Genome Analysis Toolkit (GATK), Picard, Sailfish, SAMtools, and STAR. Scientists also develop proprietary code in both Python and R.

Benefits: Discovery Speed, Cost Savings, Storage Flexibility

Exploring More Genomics Data, Faster

The Avere vFXT gives H3 the ability to fully scale compute in the cloud, increasing or decreasing capacity within minutes. Martin continues, “In the past, scientists were limited to in-house compute resources when accessing on-premises data, so a pipeline run on a large batch of samples might take weeks to complete on an 80-core server. Now they can scale it out to a few dozen nodes and finish in half a day. Without Avere and EC2, our only options would have been to rent machines at an outside data center—a process that takes days or weeks—or build out our own data center, a route that did not make sense for our business. As one of our managers has noted, being good at operating a data center does not make us a better drug-discovery company. Our resources are better focused on H3’s core mission of delivering new drugs to target cancer.”

Jacob Feala, Bioinformatics Platform group leader at H3, adds, “The nearly instantaneous and infinite capacity of the cloud is extremely empowering to investigators, allowing us to rapidly explore more scientific approaches and more quickly deliver a pipeline of drug candidates. We’re able to take on computational work that we wouldn’t have been able to consider when limited to on-premises compute. Another benefit of Avere is that scientists can use familiar file-system protocols—Avere makes cloud computing accessible to more of our users.”

Precise Provisioning for Cost Savings

“The cloud provides immediate access to resources without technology lock-in,” comments Martin. He also cites financial savings, including the ability to provision resources precisely as needed (without upfront capital investment or the costs of under- or over-provisioning capacity), plus improved user productivity, greater flexibility to pursue and validate data-driven hypotheses, and disaster recovery at a fraction of the cost of a traditional physical primary/secondary-site architecture.

“Other benefits can be uniquely attributed to Avere technology,” Martin observes. “Although, for example, we could have used EC2 block storage, not only would capacity costs have been high, but we’d have had to manage parallel file-sharing environments, manually shipping data and synchronizing between on-premises and cloud storage. Avere eliminates data classification and copying processes: our pipeline automatically determines what gets moved to the Avere vFXT’s cache. The entire process is transparent, requiring no involvement from IT or scientists and saving weeks of time and tens of thousands of dollars.”

Data Storage Flexibility Supporting Full Cloud Migration

“Avere is a critical enabler of H3’s cloud strategy,” concludes Martin. “As on-premises infrastructure components reach end-of-life, we expect to increasingly move our center of gravity to the cloud. We also foresee a time when we maintain no on-premises storage, migrating all of our data to S3 and using a technology like Avere for high-speed access. Avere physical and virtual FXT solutions give us tremendous flexibility to take full advantage of cloud resources to further H3 efforts in drug discovery and development.”

About H3 Biomedicine

Headquartered in Cambridge, Massachusetts, H3 Biomedicine is a privately held biopharmaceutical company focused on the discovery and early development of novel, targeted anti-cancer compounds for the unmet needs of genetically defined patient populations. H3 has combined its expertise in genomics, tumor biology, bioinformatics, and innovative synthetic organic chemistry to create an integrated drug development ecosystem that delivers patient-based, genomics-driven, small-molecule drugs. In less than three years, H3 has developed four discovery platforms, producing two drug candidates for which the company expects to file investigational new drug (IND) applications.

www.H3biomedicine.com