One of the goals of this internship is to use the Metadata Quality API reporting feature to score metadata records within the DataONE system, and then to determine whether that score is related to whether, or how many times, the dataset described by that metadata is downloaded.
The first broad step in this process is to build a sampling frame from which we can draw metadata records: some that are related to datasets that have been downloaded, and some that are not. Next, we collect the sample and generate scores for each sampled metadata record. Finally, we compare the scores for records from the two groups to see if there is a statistically significant difference between them.
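To make the final comparison step concrete, here is a minimal sketch of one way to test for a difference between the two groups. The choice of test is an assumption on my part for illustration: this is a two-sided permutation test on the difference in mean scores, written with only the standard library, and the function name and the example scores are made up.

```python
import random
from statistics import mean


def permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean scores.

    Returns (observed_difference, p_value). Illustrative only: the
    project may end up using a different statistical test.
    """
    rng = random.Random(seed)
    observed = mean(scores_a) - mean(scores_b)
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)

    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled scores.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add one to numerator and denominator so the p-value is never zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up scores for downloaded vs. not-downloaded records.
downloaded = [0.80, 0.85, 0.90, 0.95, 0.90, 0.88, 0.92, 0.87]
not_downloaded = [0.30, 0.35, 0.40, 0.45, 0.38, 0.42, 0.33, 0.36]
difference, p = permutation_test(downloaded, not_downloaded, n_permutations=2000)
```

A permutation test makes few assumptions about how scores are distributed, which seems reasonable before we have seen any real scores.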
Our hypothesis is that metadata records with higher scores will be associated with datasets that have more downloads. That is, if the scoring rubric for the metadata records is aligned with attributes of metadata that are important to potential downloaders, perhaps a record that scores more points will be more appealing to users.
So far, I have identified 5.8 million EML and ISO metadata download events within the logs. There are many cases where a single client has downloaded a record multiple times. Since we are mainly interested in using these events to identify metadata records for downloaded data, we can remove the duplicates, leaving about 850,000 records.
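The deduplication step can be sketched roughly as follows. This assumes each log event has been parsed into a dictionary, and the `client` and `pid` field names are hypothetical; the real log format may differ.

```python
def deduplicate_events(events):
    """Keep only the first event for each (client, pid) pair.

    `events` is an iterable of dicts with hypothetical keys
    "client" and "pid"; the original order is preserved.
    """
    seen = set()
    unique = []
    for event in events:
        key = (event["client"], event["pid"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


# Made-up events: the same client downloads the same record twice.
events = [
    {"client": "10.0.0.1", "pid": "doi:10.5063/AAA"},
    {"client": "10.0.0.1", "pid": "doi:10.5063/AAA"},
    {"client": "10.0.0.2", "pid": "doi:10.5063/AAA"},
    {"client": "10.0.0.1", "pid": "doi:10.5063/BBB"},
]
unique_events = deduplicate_events(events)
```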
I have written scripts to associate metadata PIDs with resource maps. The next step is to run them over the identified records and find their parent resource maps, if they exist. I have also written a script that draws a random sample from the DataONE catalog, and I have started on a script that will get the scores of the metadata records once I’ve identified the sample size.
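As an illustration of the sampling step, here is a minimal sketch of drawing a reproducible random sample without replacement. It assumes the catalog PIDs are already available as a list in memory, which is a simplification; the function name and the PIDs are hypothetical, and this is not the project's actual script.

```python
import random


def draw_sample(pids, sample_size, seed=None):
    """Draw `sample_size` distinct PIDs uniformly at random.

    Passing a fixed `seed` makes the sample reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(list(pids), sample_size)


# Made-up catalog of 100 PIDs.
catalog = [f"pid-{i}" for i in range(100)]
sample = draw_sample(catalog, 10, seed=42)
```

Recording the seed alongside the sample means the same records can be drawn again if the scoring step needs to be rerun.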
The next steps will be to finish data collection and sampling, then proceed with statistical testing. Until then, there are no results to show for this part of the work. Next week, I’ll report on how the sampling and statistical testing go. Check out the GitHub repository for the project as the code for this analysis takes shape. As usual, there are some more technical details of the process on the hpad.