Hi all,
This is my week 5 update. Since I now have an actual scientific use case, I have distinguished two cases: 1) transparency (understanding what has happened and validating the information given) and 2) reproducing (intermediate) outputs.
1) Transparency
This case mainly shows how to capture which objects are the inputs, outputs, and scripts, and how to validate that the provenance information given for the dataset is correct.
This can be done in two ways, either fully automated or half-automated. For the fully automated path, RStudio was used (the use case contains R scripts) to retrieve the package, keep only the provenance relationships from it, serialize those relationships as RDF/XML, and convert the RDF into a dot file so that Graphviz can render a PDF for visualization; the RDF handling is available through R packages such as rdflib (see the sketch below). The drawback here is that the generated workflow graph is not fully connected, although it does contain all the necessary information (the same as the provenance on the DataONE landing page). The half-automated path covers this gap: since the provenance information has already been captured from the package, a user can manually add comments for YesWorkflow (YW) to the scripts, and once the scripts are annotated, YW generates a fully connected workflow with the full provenance information. If none of the above is possible for some reason, provenance can be recaptured with a tool such as recordr and then visualized via the half-automated steps. In the end, both the identification of the objects and the validation of the given provenance information are clear.
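To make the automated path concrete, here is a minimal sketch, assuming the dataone, datapack, and rdflib R packages; the package identifier, the predicate filter, and all file names are illustrative assumptions, not the exact pipeline used.

    # Sketch: pull a package, keep only provenance triples, emit RDF/XML
    # and a Graphviz dot file (identifier below is a hypothetical placeholder)
    library(dataone)
    library(datapack)
    library(rdflib)

    d1c <- D1Client("PROD", "urn:node:KNB")
    pkg <- getDataPackage(d1c, identifier = "resource_map_urn:uuid:...",
                          lazyLoad = TRUE, quiet = TRUE)

    # All relationships in the package as a data frame of triples
    rels <- getRelationships(pkg)

    # Keep only provenance statements (ProvONE / W3C PROV predicates)
    prov <- rels[grepl("prov", rels$predicate, ignore.case = TRUE), ]

    # Serialize the provenance triples as RDF/XML
    g <- rdf()
    for (i in seq_len(nrow(prov)))
      g <- rdf_add(g, subject = prov$subject[i],
                      predicate = prov$predicate[i],
                      object = prov$object[i])
    rdf_serialize(g, "provenance.rdf", format = "rdfxml")

    # Write a dot file for visualization (render with: dot -Tpdf)
    dot <- c("digraph provenance {",
             sprintf('  "%s" -> "%s" [label="%s"];',
                     basename(prov$subject), basename(prov$object),
                     basename(prov$predicate)),
             "}")
    writeLines(dot, "provenance.dot")

For the half-automated path, the manual YW comments look roughly like the following (block and file names are illustrative); YW parses these annotations out of ordinary R comments to build the connected workflow.

    # @begin clean_and_plot
    # @in  raw_data  @uri file:data/raw.csv
    # @out plot_file @uri file:results/plot.png

    raw <- read.csv("data/raw.csv")
    png("results/plot.png")
    plot(raw)
    dev.off()

    # @end clean_and_plot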
2) Reproducibility
Here, based on the information captured and validated in 1), the (intermediate) outputs can be reproduced. However, because the environments differ, this requires some modifications to the scripts (e.g., the paths where the input and intermediate files are located). After these modifications, the artifacts are generated without further trouble. Some variants exist (e.g., the colors of the plots differ) and should be investigated to verify whether such factors affect the results. In this use case, the version of a library changes the colors of a reproduced plot image, but not the results themselves. Still, this is an important step to verify that the reproduced outputs are actually the same as the original ones, i.e., that the analysis and claims of the scientist who published this dataset are preserved (see the sketch below).
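A minimal sketch of this verification step, using only base R and the tools package; all paths and file names are hypothetical. Paths are parameterized so that only base_dir needs editing in a new environment, and both a byte-level and a value-level comparison are shown, since cosmetic variants (such as the plot colors above) fail the former but can pass the latter.

    base_dir <- "~/reproduction"        # adjust for the local environment
    orig  <- file.path(base_dir, "original",   "summary.csv")
    repro <- file.path(base_dir, "reproduced", "summary.csv")

    # Byte-level check: identical files have identical checksums
    identical(unname(tools::md5sum(orig)), unname(tools::md5sum(repro)))

    # Value-level check: tolerates cosmetic differences a byte
    # comparison would flag
    all.equal(read.csv(orig), read.csv(repro))

    # Record the environment, since library versions explain variants
    # such as different default plot colors
    writeLines(capture.output(sessionInfo()),
               file.path(base_dir, "sessionInfo.txt"))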
The next actions would be:
– Is it possible to generate, with full automation, a fully connected provenance workflow that contains the same level of information as the landing page?
– Can the half-automated steps be fully automated? (e.g., can the comments for YW be generated by some package or script? see the sketch after this list)
– Is the provided provenance the original one? If not, how do we capture the provenance of provenance?
– Verifying the results in depth
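On the second question, one speculative sketch: YW comment skeletons could perhaps be generated from the filtered provenance triples (the prov data frame from the earlier sketch). The predicate matching and the block name here are assumptions; real ProvONE graphs would need more careful handling.

    # Emit a YW annotation skeleton from provenance triples:
    # inputs are objects of "used", outputs are subjects of "wasGeneratedBy"
    yw_skeleton <- function(prov, block = "workflow") {
      ins  <- unique(basename(prov$object[grepl("used", prov$predicate)]))
      outs <- unique(basename(prov$subject[grepl("wasGeneratedBy", prov$predicate)]))
      c(sprintf("# @begin %s", block),
        sprintf("# @in  %s", ins),
        sprintf("# @out %s", outs),
        sprintf("# @end %s", block))
    }
    writeLines(yw_skeleton(prov), "yw_comments.R")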
I think this is all for this week. I hope you all have a nice weekend.