To make big data available to all, reach for the clouds

As the amount of genomics and other data rapidly grows, researchers are turning to cloud computing: commercial services for remote data storage and processing that allow even those with little infrastructure to handle big data.

By Emily Singer
5 January 2012 | 5 min read

Cloud collaboration: Storing data from genomic and other studies in remote servers makes them accessible to more scientists.

Say hello to the age of data democratization.

In the case of genomics, for example, the cost to sequence DNA has dropped exponentially and new machines generate orders of magnitude more data than did their predecessors. Small labs once limited to sequencing a few genes can routinely churn out genomes.

Generating all the data may be easier, but managing them after the fact is another matter.

To deal with their massive volumes of data, many labs are turning to cloud computing: services for remote data storage and analysis offered through commercial vendors, such as online retailer Amazon.com and search giant Google.

“The challenge is how to interact with huge amounts of data emerging from these studies,” says Mark Daly, associate professor of medicine at Massachusetts General Hospital and Harvard Medical School. “In many cases, the logistical challenges of moving large datasets from one investigator to another are prohibitively costly and can be much more effectively distributed through cloud computing.”

Cloud computing relies on assemblies of servers that can be accessed via broadband Internet lines. Users pay per gigabyte of data stored or transferred, giving them the agility to rent space as needed for specific projects.

One of the primary advantages of cloud computing is that it allows even labs with little infrastructure to handle big data.

“You don’t need to buy capital equipment to stitch together a data center,” says Martin Leach, chief information officer at the Broad Institute in Cambridge, Massachusetts. “Using the cloud, you can get something up and running in hours or minutes, rather than months.”

Before the cloud

The National Center for Biotechnology Information, a central repository for medical data launched in 1988, is something of a precursor to cloud computing. Researchers could upload the sequence of a gene or a set of genetic variations from a genome-wide association study, and it would then be available for others to use.

But new technologies are rapidly outstripping the center’s capacity. Whereas traditional sequencing machines generated kilobytes of data, newer versions create gigabytes, five to six orders of magnitude more. These machines are also cheaper to buy and to operate, meaning that more people are using them.

Sequencing capacity even in academic centers has jumped from single human genomes to hundreds or potentially thousands, creating a deluge of data that can no longer easily be stored or moved around in traditional ways.

Enter the cloud.

Cloud computing provides a relatively inexpensive way to store data, which can otherwise rapidly accumulate. The cost to sequence a typical human genome ranges between $5,000 and $30,000. Storing data from that genome, roughly 1 terabyte, at dedicated genome sequencing centers like the one at the Broad Institute runs about $1,000.

By contrast, Amazon.com charges about 16 cents per gigabyte per month for storage. That adds up to roughly $160 per month for data from a single genome.
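The arithmetic behind that figure can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function name and constants are hypothetical, the 16-cent rate is the one quoted above rather than a current price, and 1 terabyte is assumed to mean 1,000 gigabytes.

```python
# Back-of-the-envelope storage cost, using the figures quoted in the article.
PRICE_PER_GB_MONTH = 0.16  # dollars; the rate quoted above, not a current price
GB_PER_GENOME = 1000       # assumption: 1 terabyte taken as 1,000 gigabytes


def monthly_storage_cost(genomes: int,
                         gb_per_genome: float = GB_PER_GENOME,
                         price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Return the monthly cost, in dollars, of storing `genomes` genomes."""
    return genomes * gb_per_genome * price_per_gb_month


if __name__ == "__main__":
    # One genome at the quoted rate
    print(f"${monthly_storage_cost(1):,.0f} per month")
```

Because the cost is a simple product, it scales linearly with the number of genomes and with the number of months the data are kept.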

Beyond affordability, cloud computing encourages sharing and collaboration: Once in the cloud, data can be accessed by any number of people anywhere in the world. For example, researchers at the Broad Institute are working on a project called GenomeSpace, a multi-site collaboration that uses the cloud as a way station for data and different collaborative tools.

“They can share data and have it interconverted between different applications,” says Leach. “It’s a great example of how the cloud is being used to improve workflow and drive collaboration between different groups.”

The National Database for Autism Research (NDAR), a repository for human autism research funded by the National Institutes of Health and others, also aims to make its rapidly growing store of data available via the cloud.

The database encompasses information from 22,000 participants, but is set to expand to 100,000 people and 100 terabytes of data by next year.

“It would take even the most equipped labs 50 days to get the data, which is unacceptable,” says NDAR manager Dan Hall. To avoid that downloading bottleneck, “We are moving towards cloud computing as quickly as we can,” Hall says. “Our objective for 2012 is that researchers will be doing analysis in the cloud.”

Challenge in the sky

Cloud computing still has some hurdles to overcome, however.

For medical research involving human participants, security of sensitive personal information is a major concern. Some institutes may balk at allowing data to be stored offsite. And although cloud computing is cheaper than creating onsite data storage, it can still be expensive. Storing the data for 100 genomes would cost roughly $16,000 a month, for instance. 

Though cloud computing theoretically provides unlimited computing power, the data lines that carry data to and from the cloud have limited capacity. Leach says the Broad Institute produces data much faster than they can be uploaded to the cloud. In one test, it took 15 hours to upload less than a terabyte to Amazon’s service. The institute generates 15 terabytes a day.
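A rough calculation shows how wide that gap is. The sketch below is illustrative only: the function names are hypothetical, and it generously assumes the test moved a full terabyte in 15 hours, although the article says the upload was smaller than that.

```python
import math

HOURS_PER_DAY = 24
HOURS_PER_TERABYTE = 15.0  # assumption: treats the test upload as a full terabyte


def upload_hours(terabytes: float,
                 hours_per_terabyte: float = HOURS_PER_TERABYTE) -> float:
    """Hours needed to upload `terabytes` over one link at the tested rate."""
    return terabytes * hours_per_terabyte


def parallel_links_needed(terabytes_per_day: float,
                          hours_per_terabyte: float = HOURS_PER_TERABYTE) -> int:
    """Links, running around the clock, needed to keep pace with daily output."""
    return math.ceil(upload_hours(terabytes_per_day, hours_per_terabyte) / HOURS_PER_DAY)


if __name__ == "__main__":
    daily_output_tb = 15  # terabytes per day, as quoted for the Broad Institute
    print(f"{upload_hours(daily_output_tb):.0f} hours to upload one day's data")
    print(f"{parallel_links_needed(daily_output_tb)} links needed to keep up")
```

Under these assumptions, a single day's output would take more than nine days to upload over one such connection, which is why transfer speed rather than storage is the limiting factor for the largest data producers.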

What’s more, most commercial services have been set up to meet the needs of general users, not scientists.

“The systems available today are not flexible enough for research,” says David Haussler, director of the Center for Biomolecular Science & Engineering at the University of California, Santa Cruz.

For example, cloud servers are not optimized for specialized, high-performance computing, which requires moving lots of data on and off the cloud. This limitation mainly affects high-volume data generators like the Broad Institute.

Cloud computing also has much more prosaic applications. For example, Google Docs, a free application that allows people anywhere in the world to collaborate on a document, is a form of cloud computing.

Leach says he recently converted the entire Broad Institute to a “Google enterprise shop” using Gmail, Google Docs, and Google Talk for instant messaging.

“These cloud-based technologies allow us to rapidly grow and expand our organization,” he says. “It becomes easier to run an institute, and that means more people can do it.”
