Definition of data and data types
In the Berkeley study, data was defined as the amount of original information produced in the US and the world in one year (Lyman and Varian, 2000). In the IDC study, digital information was measured in terms of the market for global hardware capacity in one year. The HMI study measured the amount of information (redundant or unique) consumed in households, in total and per day. The Hilbert and Lopez study measured the amount of information storage, flow, and computation over two decades. This project needs to clearly define what is considered ecological data. Scientists generally gather data in the following way. Raw data from a sample is collected and assembled for a given event in time and/or area. This raw data can be analyzed directly or transformed/annotated into another usable form, which could serve as the final data to be analyzed or as an intermediate dataset. Once analyzed, meaning (possibly called information) comes from the data set. That information, along with other information produced, is woven into a published journal article or report. In addition, scientists can produce several types of data: observational, experimental, and computational/generated. Again, each of these can be raw data that is transformed or directly analyzed and that contributes to published journal articles and reports.
Still other types of data may not be quantifiable data (such as discussed above) but qualitative data in the form of expert knowledge. If this is included, a method would have to be devised to estimate expert knowledge from people in various fields of ecology and types of professions (e.g. academia, research stations, government, NGOs). Lesk introduced previous studies that attempted to estimate the knowledge retained in humans in equivalent bytes (Lesk, 1997).
Scope in space, medium, and time
The above studies also vary in scope. Most focus on estimating information worldwide and/or in the US, and they differ in which media they include or leave out. For example, the Berkeley study covered nearly all data held in Print, Film, Optical, and Magnetic media, while IDC focused only on digital information (which most likely includes optical and magnetic).
Due to the wide range of fields, ecological data has historically been collected in many different forms. It has been collected as printed data (e.g. numbers written in a notebook), born-digital data (e.g. DNA sequences from PCR, sonar), images that may be digital or printed (e.g. GIS layers, satellite imagery, remote sensing, species photography), audio that may be analog or digital (e.g. bird calls), and video footage (e.g. camera traps). Most of these data types are probably stored as digital (Magnetic) data, with some in Optical form and the rest as Print. Therefore it will be equally important to define the type of medium the data is transcribed and stored upon.
As most of the previous studies have reported, the largest source of data is contained in Magnetic or digital storage devices. It could be assumed that the majority of ecological data is also stored as Magnetic data. This includes servers (e.g. databases of organizations and labs), hard drives (e.g. personal computers), and flash drives.
Furthermore, the internet also houses a repository of digital information that falls under the category of Magnetic. The Berkeley study defined the World Wide Web as containing two groups of web content: the “surface” web that holds static, publicly available web pages, and the “deep” web that consists of specialized web-accessible databases and dynamic websites. The deep web is less publicly known and is estimated to be about 400 times larger than the surface web (Lyman and Varian, 2000). A lot of ecological information is housed in databases that can be accessed through the web (e.g. Calflora.org), while other databases are probably only accessible to members. Furthermore, people are sharing and generating new information through web-based file hosting services (e.g. Dropbox) and web-based data storage services (e.g. Google Documents). This new medium could become a growing channel for the generation of new ecological data as well. Still, there is probably a wide range of Print and Optical media that ecological data is stored upon, although some portion of this original data is gradually being converted to digital form. Thus, identifying these various forms of media and deciding how to quantify each category may need to be considered to obtain better estimates.
For ecology, a boundary must be defined as to whether all fields of ecology will be included or not. Ecology is often interdisciplinary and overlaps with many other sciences. Lastly, data can be produced by various sources such as government agencies, NGOs, research stations, and consulting groups, in addition to academic universities. Generally, ecological data is gathered by both large scale and small scale projects. While large scale projects receive large amounts of funding, the number of small scale projects is large in comparison (Heidorn, 2008). This project could propose to narrow its scope of ecological data generation to small scale projects.
Furthermore, the studies differ in how they handle time. Some estimate the amount of information in a snapshot year, others aggregate over several years, and others additionally report a rate. This project needs to set its scope in terms of which media it will consider to hold ecological data as well as the time frame.
Dark data
According to the HMI 2009 study, dark data is data generated automatically from machine to machine without the intervention of a human. In one of the IDC study’s examples, sending one email with an attachment to four recipients generated 8 times the amount of data because email servers generate redundant information along the way (IDC, 2008). However, scientific “dark data” is defined as scientific data that is not indexed and stored, and that thus becomes invisible and inaccessible to the scientific community or potential users (Heidorn, 2008). In general, scientists generate more data than they eventually publish and therefore hold this dark data for long periods of time (Heidorn, 2008). Data can turn into dark data for a variety of reasons. Some data could be positive or negative research findings of large and small projects that were never published or simply never analyzed. Eventually this data is never revisited by the researcher and is lost or destroyed. Heidorn claimed that dark data has potentially high impact and value for the scientific community if brought into the spotlight (Heidorn, 2008). Therefore it may be important for this project to estimate the amount of dark data researchers hold.
Generation versus consumption
In addition to the data being generated, there is a wide range of data being communicated and consumed. Ecological data has many inherent values that can be used for policy decisions and natural resources management. Studies have attempted to quantify the amount of data being consumed and flowing through information channels. For example, the HMI study calculated the amount of information consumed by households. The Berkeley study, followed by Hilbert and Lopez, distinguished between storage data and flow data, estimating the annual size of the “stock” of new information contained in storage media as well as the volume of information seen or heard each year in information flows (Berkeley 2000). While potentially interesting, the project most likely will not assess how much ecological data is consumed or flows through various information channels. If, however, attempts were made to estimate ecological information flows, this might translate to estimating information flowing through ecological conferences, presentations, education workshops, school lectures, emails/phone conversations between professors, and interviews. Still, it would be interesting to be able to quantify how much ecological information is shared within the ecological community of academia, land managers, and the general public.
Ecological data growth rate
Several studies such as HMI and IDC have attempted to define the growth rate of information and project that growth rate into future estimates. It may be relevant to include an estimate of the growth rate of ecological data. The HMI Case Studies estimated that the rate of data production in the MIT department of Biological Oceanography and Climate change would increase dramatically (HMI Case Studies). For example, one Biological Oceanography scientist estimated that a rapid decrease in the cost and an increase in the speed of DNA sequencing technologies could cause her lab to produce about 10-20x more data by 2014 (HMI Case Studies). Similarly, the number of journal articles and the number of scientists are growing globally (Mabe and Amin, 2001). It could be that the field of ecology will also follow the trajectory of increased data acquisition over time. The ecological data growth rate may vary depending on the field. In addition, variables that may affect data growth rates include funding, technological advances, and better experimental techniques. Therefore it could be assumed that, in general, the ecological data growth rate is increasing, but there may be periods of reduced data generation due to indirect parallel forces such as economic downturns. Nonetheless, it may be useful to estimate the rate of ecological data growth within a defined time period and to attempt to extrapolate that rate into the near future.
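As a rough illustration of the extrapolation idea above, the sketch below derives a compound annual growth rate from two volume estimates and projects it forward. The function names and every number in it are hypothetical and chosen only for illustration; none come from the cited studies, and a constant rate is itself an assumption that economic downturns or new technology could break.

```python
def annual_growth_rate(start_bytes: float, end_bytes: float, years: float) -> float:
    """Compound annual growth rate implied by two volume estimates."""
    if start_bytes <= 0 or end_bytes <= 0 or years <= 0:
        raise ValueError("volumes and years must be positive")
    return (end_bytes / start_bytes) ** (1.0 / years) - 1.0


def project_volume(current_bytes: float, rate: float, years_ahead: float) -> float:
    """Extrapolate a volume forward, assuming the rate stays constant."""
    return current_bytes * (1.0 + rate) ** years_ahead


# Hypothetical example: a lab's yearly output grows tenfold over five years.
rate = annual_growth_rate(1e12, 1e13, 5)
print(f"implied growth: {rate:.1%} per year")
print(f"three years on: {project_volume(1e13, rate, 3):.3e} bytes")
```

Reporting the implied yearly rate alongside any projection makes it easier to compare fields that were measured over different time spans.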
Methods of Measuring Ecological Data
Most of the previous studies have measured the amount of information within their defined scope in bytes (Lyman and Varian, 2000; Lyman and Varian, 2003; Lesk, 1997; IDC, 2007). The HMI study chose to expand its units of measurement to estimate the amount of information in bytes, words, and hours (HMI, 2009). Due to the expanding use of, and conversion into, Magnetic media, digital units are a logical choice for ecological data. A great amount of ecological data is now in digital form (PDFs, csv files, jpegs, etc.), which makes estimation in bytes practical. Past studies have made attempts to translate non-digital types of information into bytes (Lyman and Varian, 2000; Lyman and Varian, 2003; Lesk, 1997). For example, Lesk estimated one sheet of paper to contain approximately 5000 bytes (Lesk, 1997), and the Berkeley study estimated the percentage of fresh information held on a piece of paper and converted it into bytes (Lyman and Varian, 2000).
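A minimal sketch of such a print-to-bytes conversion is shown below, using the figure of roughly 5000 bytes per sheet attributed to Lesk above. The notebook counts are hypothetical, the helper names are invented for this example, and decimal units (1 KB = 1000 bytes) are an assumption.

```python
BYTES_PER_SHEET = 5000  # approximate figure attributed to Lesk (1997) above


def paper_to_bytes(sheets: int, bytes_per_sheet: int = BYTES_PER_SHEET) -> int:
    """Convert a count of paper sheets into an equivalent number of bytes."""
    if sheets < 0:
        raise ValueError("sheet count cannot be negative")
    return sheets * bytes_per_sheet


def human_readable(n_bytes: float) -> str:
    """Format a byte count using decimal units (an assumption of this sketch)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000
    return f"{n_bytes:.1f} PB"


# Hypothetical example: 40 field notebooks of 200 sheets each.
total = paper_to_bytes(40 * 200)
print(total, human_readable(total))
```

Keeping the per-sheet figure as a parameter makes it easy to rerun the estimate if a different conversion factor is chosen for handwritten notes, tables, or printed images.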
It could also be possible to estimate the amount of information in words (e.g. the number of words in a journal article) and in hours (e.g. hours spent in the field collecting data, or writing a paper). However, these units of measurement may not translate well to other types of data (e.g. imagery). Therefore it may be best for the project to use bytes as its main unit for measuring ecological data. Whichever unit is chosen, some standardization of units is required to compare and estimate the amount of ecological data.
The Berkeley and IDC studies have provided an upper-bound estimate of the total amount of information in existence. It can be assumed that the amount of ecological data is some portion of this larger estimate. As both studies report, a large share of the information is for entertainment (e.g. music, movies), so setting such categories aside may make it possible to zero in on the amount of ecological information. For example, the Berkeley 2000 study estimated the stock of all book titles to be 26 terabytes and the information flow of printed scholarly journals to be 2 terabytes (Lyman and Varian, 2000). Therefore gathering pieces of estimates from these previous studies can help this project obtain a very rough estimate of ecological information as a starting point.
Potential sources of data
For extrapolation purposes, it may be useful to estimate the total number of ecologists in different fields, the number of ecological organizations/research stations, the total number or rate of journal articles being produced, the number of contributions to ecological databases, and other sources where ecological data could potentially be generated. Some supplemental information that may be useful is the length of an ecologist's career and the rate at which ecologists enter and leave the field. Another question is how data generated from these sources will be handled. Data can be generated by a single experiment, by one or more principal investigators, by an entire lab, or by a whole organization. To prevent redundancy, it may be important to standardize and define what we treat as a new source of data.
The main sources of potential ecological data could be researchers. A survey could be established to ask about the number of journal articles they produce in a year, the number of projects they manage, the amount of ecological data their lab produces, the amount of data on their hard drives, and the percentage of their data that is never used. The HMI Case Studies looked at the labs of several researchers in various fields and estimated how much data each researcher and their team produced in a year (HMI Case Studies, 2009). Similar case studies could therefore be conducted for ecologists. Industry, companies, and government could also be a source of ecological knowledge due to government acts such as NEPA and ESA. In addition, automated sources of ecological data, such as weather stations and satellite imagery that take measurements over time, can be an immense source of new data. Still, methods would have to be developed to estimate these categories, which may prove difficult, or the categories may be considered outside the project’s scope.
Lastly, well managed repositories such as the Knowledge Network for Biocomplexity (KNB) could be a good source for estimating data volumes and a useful point of comparison against the amount of data gathered by small project scientists.
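One simple way to turn survey responses like those described above into a population-level figure is to scale the sample mean by the estimated number of sources. The sketch below is only an illustration under strong assumptions: the response values and lab count are hypothetical, and it assumes the surveyed labs are representative of the population.

```python
from statistics import mean


def extrapolate_total(sample_bytes_per_year, population_size):
    """Scale the mean surveyed output up to the whole population."""
    if not sample_bytes_per_year:
        raise ValueError("need at least one survey response")
    if population_size < 0:
        raise ValueError("population size cannot be negative")
    return mean(sample_bytes_per_year) * population_size


# Hypothetical survey: yearly data output reported by five labs, in bytes.
responses = [2e9, 5e9, 1e9, 8e9, 4e9]
# Hypothetical count of comparable labs in the population of interest.
n_labs = 1200
print(f"{extrapolate_total(responses, n_labs):.2e} bytes per year")
```

If labs differ strongly by field or size, the same calculation could be run per group and summed, which is where a clear definition of a "source" becomes important.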
Potential obstacles
Duplication and Redundancy: Data in general can be copied and stored in various types of media. Over time, copying data, especially digital data, has become increasingly cheap and easy. Therefore any attempt to estimate the amount of data generated must deal with the issue of redundancy. Previous studies either chose not to address the issue and included duplicated data, or made attempts to account for redundancy. For example, the HMI study stated that it counted all information, including duplicates, while the Berkeley study accounted for redundancy by clearly stating the assumptions behind its various estimates. Estimating ecological data will also face the issue of redundancy due to the existence of backup drives and of journal articles being published both in print and online. In addition, journal articles simultaneously contribute new information while citing findings of other journal articles. Therefore, if redundancy is to be accounted for, a list of assumptions must be defined clearly in the methods in order to avoid double counting.
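For digital files, one possible way to avoid double counting exact copies (such as backups) is to count each distinct content hash only once. The sketch below is an assumption-laden illustration, not a method taken from any of the cited studies: the file names and contents are hypothetical, and it only detects byte-identical copies, not the same data in a different format or findings repeated across articles.

```python
import hashlib


def total_and_unique_bytes(files):
    """Return (total, unique) byte counts for an iterable of (name, content) pairs.

    Content that hashes to an already-seen SHA-256 digest is treated as a copy.
    """
    seen = set()
    total = 0
    unique = 0
    for _name, content in files:
        total += len(content)
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique += len(content)
    return total, unique


# Hypothetical example: one dataset, its backup copy, and a second dataset.
files = [
    ("a.csv", b"1,2,3\n"),
    ("backup/a.csv", b"1,2,3\n"),
    ("b.csv", b"4,5,6,7\n"),
]
print(total_and_unique_bytes(files))
```

Reporting both numbers keeps the estimate comparable with studies that count duplicates and with those that do not.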
Compression: While many studies have decided to use bytes as the unit of measurement for information, digital information can be compressed (e.g. with WinZip or WinRAR) into a smaller number of bytes. Therefore, depending on whether a data file is compressed and on the compression method used, the estimated total number of bytes can span a large range. The Berkeley study dealt with the problem of compression by including an upper-bound estimate based on “reasonable” compression and a lower-bound estimate based on highly compressed data. The study then took a middle-ground approach to finalizing its estimate of the total amount of information.
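The effect of compression on a byte estimate can be illustrated with a small sketch that brackets a file's size between a compressed lower bound and an upper bound. This is a simplification and not the Berkeley study's procedure: here the upper bound is the uncompressed size, the lower bound uses zlib at its highest level, and the midpoint stands in for a "middle ground" figure. The sample data is hypothetical.

```python
import zlib


def size_bounds(data: bytes):
    """Return (lower, upper, midpoint) byte estimates for one piece of data.

    upper: uncompressed size; lower: size after zlib compression at level 9.
    """
    upper = len(data)
    lower = len(zlib.compress(data, 9))
    # Already-compressed or random data can grow slightly when recompressed.
    lower = min(lower, upper)
    return lower, upper, (lower + upper) / 2


# Hypothetical example: a highly repetitive CSV-like table.
sample = b"site,species,count\n" * 1000
lower, upper, midpoint = size_bounds(sample)
print(lower, upper, midpoint)
```

Repetitive tabular data tends to compress far more than imagery or audio that is already stored in compressed formats, so the width of this range will vary a lot by data type.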
Issue of data loss: Data can also be lost due to deletion and overwriting. If data is considered no longer usable, ecologists could erase it to make room for new data. Data could also be erased because it has been transformed into a new format more relevant to the study or because it is very old. This issue may be interesting and provide insight into how ecologists treat information, but it may be hard to address within the scope of this project.
Issue of Uncertainty: None of the previous studies provided statistical analyses to estimate their uncertainty. Although the Berkeley study did provide upper-bound and lower-bound estimates to try to address its uncertainty, most studies generally gave absolute numbers.
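If this project collects per-lab estimates, one possible way to attach an uncertainty range to their mean is a percentile bootstrap. The sketch below is only one option among many and is not drawn from the cited studies; the survey values, the 95% level, and the number of resamples are all assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_interval(sample, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of a sample."""
    if len(sample) < 2:
        raise ValueError("need at least two observations")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    low = means[int(tail * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return low, high


# Hypothetical survey: gigabytes of data produced per lab in one year.
gb_per_lab = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(bootstrap_interval(gb_per_lab))
```

With only a handful of responses the interval will be wide, which is itself useful information about how far a survey-based estimate can be trusted.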
Other issues
The HMI study asserted that the pure volume of information does not necessarily determine its value or impact (HMI 2009). The IDC study attempted to put a dollar value on information by studying the software industry and the information management markets (IDC, 2007). This may pertain to ecological data because experiments are funded when they have potential real world applications to managing natural resources and preserving ecosystem services, which cost millions of dollars. Similarly, the generation of ecological data can also affect policy and decision making, which also have economic costs. Thus it may be interesting to estimate how much ecological information is worth, though it may not be relevant to the scope of the project.
Wow- this is all great stuff. While reading the last couple of posts, my brain has been zipping off in all kinds of interesting directions. Let’s talk more in person today!
Great! I have still a little more to summarize which I will add to by today. Drop by any time