Jan 21, 2008

The Google Approach to Large Genomics Data Sets

Getting terabytes of genomics data? Yes, easily -- via Next-Generation Sequencing (NGS), microarrays, mass spectrometry, consumer genotyping ... you name it.

The bioinformatics community has been working on this problem for years. A few milestones: 1) recognizing the importance of metadata (data about data, i.e., the experimental conditions under which the scientific data were acquired); 2) adopting XML and ontologies for data exchange.

However, it is still a great challenge. So, what did Google come up with?

In summary, here is the Google paradigm for large scientific data:
  • Premises
a) The growth of scientific data (in size) outpaces the growth of Internet bandwidth (a back-of-envelope sketch follows this list).
b) The consumption of the data (as user-comprehensible results) is highly asymmetric in size compared with the raw data.
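
To make premise (a) concrete, here is a minimal back-of-envelope sketch in Python. Every number is an assumption chosen for illustration (a 1 TB disk of raw data, a ~6 Mbps broadband uplink, ~24 hours for an overnight courier), and the helper functions are hypothetical, not from any library.

```python
# Back-of-envelope: uploading over the Internet vs. shipping a disk.
# All numbers here are assumptions for illustration, not measurements.

TB = 10**12  # bytes

def upload_hours(size_bytes: float, link_mbps: float) -> float:
    """Hours to push size_bytes through a link of link_mbps megabits/second."""
    return size_bytes * 8 / (link_mbps * 10**6) / 3600

def effective_mbps(size_bytes: float, transit_hours: float) -> float:
    """Effective bandwidth of moving size_bytes by courier in transit_hours."""
    return size_bytes * 8 / (transit_hours * 3600) / 10**6

size = 1 * TB  # one disk of raw NGS data (assumed)
print(f"Upload at 6 Mbps: {upload_hours(size, 6):,.0f} hours")            # ~370 hours
print(f"Overnight courier (24 h): {effective_mbps(size, 24):,.0f} Mbps")  # ~93 Mbps
```

Under these assumptions the courier beats the wire by more than an order of magnitude, and the gap only widens as sequencers outpace last-mile bandwidth.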

  • Solution:
a) UPLOAD: Ship the data to the computational engine via FedEx or UPS.
b) ANALYZE: The data are co-located with the computational engine (at the Google empire??).
c) DELIVER: The analysis or query results (usually much smaller) are delivered to the consumer via the Internet (see the toy sketch below).
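
Premise (b) is what makes the DELIVER step cheap. Here is a toy sketch, with hypothetical data and function names, of how a terabyte-scale stream of raw reads could be reduced server-side to a summary small enough to send back over the wire:

```python
# Toy sketch of the ANALYZE -> DELIVER asymmetry (hypothetical data and names).
from collections import Counter

def summarize(reads):
    """Reduce a stream of (chromosome, position, base) tuples to per-chromosome read counts."""
    return Counter(chrom for chrom, _pos, _base in reads)

# Stand-in for a terabyte-scale read stream living next to the compute engine:
raw_reads = (("chr1", i, "A") for i in range(10**6))

summary = summarize(raw_reads)  # kilobytes, no matter how many reads went in
print(summary)                  # the only payload shipped back over the Internet
```

The raw reads never leave the data center; only the kilobyte-scale answer does.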

Will it work? I think so.
