Open-Access Research is Leading to a Data Deluge

There is a fascinating new report out from the OECD showing how the online availability of research data and papers is transforming scientific publishing. Among its many findings is the arrival of what the authors call a “data deluge”:

E-science and grid computing developments are leading to a “data deluge”, as more sources of large-scale observational data emerge and the volume of scientific data collected is rapidly dwarfing anything in the past. Hey and Trefethen (2002, p3) noted that:

There are a relatively small number of centres around the world that act as major repositories of a variety of scientific data. Bioinformatics, with its development of gene and protein archives, is an obvious example. The Sanger Centre at Hinxton near Cambridge currently hosts 20 Terabytes of key genomic data and has a cumulative installed processing power… of around ½ Teraflop/s. Sanger estimate that genome sequence data is increasing at a rate of 4 times each year and that the associated computer power required to analyse this data will ‘only’ increase at a rate of 2 times per year… A different data/computing paradigm is apparent for the particle physics and astronomy communities. In the next decade we will see new experimental facilities coming online that will generate data sets ranging in size from 100’s of Terabytes to 10’s of Petabytes per year. [Emphasis added]
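The arithmetic behind that emphasis is worth spelling out: if data grows 4× per year while compute grows only 2× per year, the compute available per byte of data halves annually. A minimal sketch (illustrative only, seeded with the 20 TB and ½ Teraflop/s figures quoted above; the growth rates are the estimates Hey and Trefethen cite, not measurements):

```python
# Illustrative projection of the widening gap between data volume and
# compute capacity, using the Sanger figures quoted above as a starting point.
data_tb = 20.0       # 20 Terabytes of genomic data
compute_tflops = 0.5  # roughly 1/2 Teraflop/s of installed processing power

for year in range(1, 6):
    data_tb *= 4          # data quadruples each year (Sanger estimate)
    compute_tflops *= 2   # compute only doubles each year
    per_tb = compute_tflops / data_tb  # Tflop/s available per TB of data
    print(f"year {year}: {data_tb:>8.0f} TB, "
          f"{compute_tflops:>4.1f} Tflop/s, "
          f"{per_tb:.5f} Tflop/s per TB")
```

After five years the data has grown 1,024-fold but the compute only 32-fold, so each terabyte gets 1/32 of the analysis capacity it started with — a compact way of seeing why the authors call it a deluge.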