{"id":185,"date":"2011-06-22T17:33:42","date_gmt":"2011-06-22T22:33:42","guid":{"rendered":"http:\/\/notebooks.dataone.org\/tracking1000datasets\/?p=185"},"modified":"2013-05-09T01:17:13","modified_gmt":"2013-05-09T01:17:13","slug":"thoughts-on-the-processresults-so-far","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-reuse\/thoughts-on-the-processresults-so-far\/","title":{"rendered":"Thoughts on the process\/results so far"},"content":{"rendered":"
Below are some rough thoughts on the data collection process and results so far. They are somewhat scattered at this point, so bear with me. I’ve also included ideas for graphs that could display the findings to date.<\/p>\n
<\/p>\n
Based on my preliminary searches of Web of Science citations, and having completed the search and analysis for most of the accession numbers in Google Scholar, it seems that articles citing the dataset itself, rather than the data collection article, are more likely to actually reuse the data. Most papers cite the dataset directly in the text, giving the repository name or abbreviation followed by the unique identifier. For reuse of data from the GEO and ArrayExpress repositories especially, it was also common to include a table listing the unique identifiers of all datasets reused in the study.<\/p>\n
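As a rough illustration of spotting these in-text dataset citations, here is a minimal sketch that pulls GEO series accessions out of a block of article text. It assumes the accession is written as GSE followed directly by digits; the function name and the sample sentence are made up for the example, and this is not the process actually used for the study.

```python
import re

# Assumption: a GEO series accession is "GSE" followed directly by digits,
# with no space in between.
GSE_PATTERN = re.compile(r"\bGSE\d+\b")


def find_gse_accessions(text):
    """Return the distinct GSE accessions in text, in order of first mention."""
    seen = []
    for match in GSE_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical sentence, for illustration only.
sample = "Expression data came from GEO (GSE1234) and GSE5678; GSE1234 was renormalized."
accessions = find_gse_accessions(sample)
print(accessions)
```

The same idea would cover a table of reused datasets, since the table text can be scanned the same way as the body text.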
<\/p>\n
Data repositories with a more distinctive data identifier allow searches with higher recall and precision. Repositories with a generic identifier, such as a four-digit number, need extra search parameters to raise precision, and those restrictions can exclude some genuine hits. For example, GEOROC has a 4- or 5-digit ID with no associated letter or repository prefix. We therefore had to restrict the search terms to GEOROC9022 OR “GEOROC 9022”, where 9022 is the GEOROC-assigned ID number, because searching for GEOROC 9022 without quotation marks returned far too many unrelated results to sort through. However, the restriction may have excluded real data reuses, as those search terms returned no hits at all for the GEOROC repository. GEO searches were more precise: out of 165 citations collected, only 6 did not cite the dataset. This is a direct result of GEO’s unique identifiers having the three letters GSE directly preceding the 4+ digit accession number, with no space between. Repositories using a DOI for each dataset were also somewhat easier to track, although you had to search for the DOI three ways: without the prefix \u201cdoi:\u201d, with the prefix, and with the prefix followed by a space. Authors do not cite DOIs consistently, and Google does not document how Google Scholar retrieves articles, so building a search string is hit or miss until you find a combination that works.<\/p>\n
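To make the search strings above concrete, here is a small sketch that builds them. The function names are hypothetical and the DOI is a made-up placeholder; the code only assembles the query text described above and makes no claim about how Google Scholar interprets it.

```python
def accession_query(repository, accession):
    """Build the restricted query for a bare numeric ID.

    Combines the fused form (GEOROC9022) with the quoted phrase
    ("GEOROC 9022"), joined by OR.
    """
    return f'{repository}{accession} OR "{repository} {accession}"'


def doi_variants(doi):
    """Return the three spellings to try: bare, 'doi:' prefix, 'doi: ' prefix."""
    return [doi, f"doi:{doi}", f"doi: {doi}"]


print(accession_query("GEOROC", "9022"))

# Placeholder DOI, not a real dataset.
for variant in doi_variants("10.1234/example"):
    print(variant)
```

Each DOI variant would be run as its own search and the results merged by hand, since there is no way to tell in advance which spelling a given author used.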
Potential graphs so far<\/span><\/p>\n