Hi All,
This is the second week of my internship, and below are the tasks that were completed during this week.
- Adding patterns for the rest of the file formats to the DataONE magic file.
Continuing last week's work, we created additional patterns for detecting file formats using the file command with the custom magic file. Below are the 13 file formats that we wanted to identify. We initially created a separate magic file for each format, but later combined them into one by adding additional test levels. The custom magic file successfully identifies 12 of the 13 formats.
- http://ns.dataone.org/metadata/schema/onedcx/v1.0
- http://datadryad.org/profile/v3.1
- FGDC-STD-001-1998
- FGDC-STD-001.1999
- http://purl.org/ornl/schema/mercury/terms/v1.0
- http://www.isotc211.org/2005/gmd
- http://www.isotc211.org/2005/gmd-pangaea
- http://www.isotc211.org/2005/gmd-noaa
- eml://ecoinformatics.org/eml-2.1.1
- eml://ecoinformatics.org/eml-2.1.0
- eml://ecoinformatics.org/eml-2.0.1
- eml://ecoinformatics.org/eml-2.0.0
- http://www.openarchives.org/ore/terms
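To illustrate the idea of layered tests, here is a minimal Python sketch. It is not the magic file itself, only a stand-in for the same logic: a top-level test checks that the data looks like XML, and a second-level test looks for one of the format identifiers near the start of the file. The function name, the 4096-byte window, and the choice of identifiers are assumptions made for this example.

```python
# Illustrative only: the real detection is done by the custom magic file.
XML_PROLOG = b"<?xml"

# A few of the format identifiers from the list above.
KNOWN_IDS = [
    b"eml://ecoinformatics.org/eml-2.1.1",
    b"eml://ecoinformatics.org/eml-2.1.0",
    b"eml://ecoinformatics.org/eml-2.0.1",
    b"eml://ecoinformatics.org/eml-2.0.0",
    b"http://www.openarchives.org/ore/terms",
]


def detect_format(data, window=4096):
    """Return the matching format identifier, or None if nothing matches."""
    head = data[:window].lstrip()
    # First level: is this an XML document at all?
    if not head.startswith(XML_PROLOG):
        return None
    # Second level: which known identifier appears near the top?
    for format_id in KNOWN_IDS:
        if format_id in head:
            return format_id.decode("ascii")
    return None
```

Combining the per-format files into one works the same way: the shared XML check is done once, and each format only adds its own second-level test.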
- Develop a unit test in Python for testing the custom magic file.
A Python test was created using the unittest library for functional testing of the magic file. It helps us make sure that adding new patterns for other file formats doesn't break the detection of the existing ones.
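A test of this kind could look roughly like the sketch below. It is not the actual test from the project: the magic file path `dataone.magic`, the sample document, and the assumption that the output of file contains the format identifier are all hypothetical. The test is skipped when the file command or the magic file is not available.

```python
import os
import shutil
import subprocess
import tempfile
import unittest

MAGIC_FILE = "dataone.magic"  # hypothetical path to the custom magic file

# Hypothetical sample document for one of the formats.
EML_SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"></eml:eml>\n'
)


def file_command(magic_file, target):
    """Build the command line: -b prints the result only, -m selects the magic file."""
    return ["file", "-b", "-m", magic_file, target]


def describe(magic_file, target):
    """Run file and return its description of the target."""
    result = subprocess.run(
        file_command(magic_file, target),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


@unittest.skipUnless(
    shutil.which("file") and os.path.exists(MAGIC_FILE),
    "needs the file command and the custom magic file",
)
class MagicFileTest(unittest.TestCase):
    def test_eml_211(self):
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "sample.xml")
            with open(path, "w") as handle:
                handle.write(EML_SAMPLE)
            # Assumption: the magic file reports the format identifier.
            self.assertIn("eml-2.1.1", describe(MAGIC_FILE, path))
```

With one test method per format, running `python -m unittest` after every change to the magic file shows right away if a new pattern affects an existing one.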
- Compile the magic file with new changes, so that it can be distributed and used by others.
After creating and testing the magic file, we wanted to share it with others. It can be used directly by passing it to the file command with the -m option, or it can be compiled on Linux using the steps in the README.
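The two options can be sketched in Python as below. The README remains the reference for the exact steps; this sketch only relies on the standard -m (use this magic file) and -C (compile) options of file, and the file names are hypothetical.

```python
import subprocess


def use_command(magic_file, target):
    """Command that checks target against the given magic file."""
    return ["file", "-m", magic_file, target]


def compile_command(magic_file):
    """Command that compiles the magic file into a binary .mgc file."""
    return ["file", "-C", "-m", magic_file]


def compile_magic(magic_file):
    """Run the compile step; raises CalledProcessError if file reports an error."""
    subprocess.run(compile_command(magic_file), check=True)


# Example (not run here; assumes file is installed and dataone.magic exists):
# compile_magic("dataone.magic")
# subprocess.run(use_command("dataone.magic", "sample.xml"), check=True)
```

As far as I understand, the compile step writes a `.mgc` file named after the input into the current directory, and that compiled file is what would be handed to other users.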
- Submit the dataone magic file in Magic(5) format for inclusion with the libmagic library.
We submitted the DataONE magic file in Magic(5) format to the Linux administrators for inclusion in the next release of file and libmagic. The changes were accepted and committed to the repository, and will be available in the next release of the file command (probably file 5.34 and above).
- Apache-Tika, and customization of the custom-mimetypes.xml file.
Apache-Tika is another tool that can be used for detecting file types and extracting metadata and text from over a thousand different file types. It uses an XML file that contains all the information about the existing file types. Our next step is to develop and pass the custom MIME type information for the DataONE file formats, so that they can be detected using Tika.
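As a rough idea of what such a customization might contain, the sketch below generates a small XML document in Python. The element and attribute names (`mime-info`, `mime-type`, `sub-class-of`, `root-XML` with `namespaceURI` and `localName`) follow the layout of Tika's bundled MIME types file as I understand it, so treat them as an assumption to verify against the Tika documentation. The MIME type name is hypothetical, and this is not the file we will end up using.

```python
import xml.etree.ElementTree as ET


def build_custom_mimetypes(entries):
    """Build a mime-info document from (mime_type, namespace, root_name) tuples."""
    root = ET.Element("mime-info")
    for mime_type, namespace, root_name in entries:
        node = ET.SubElement(root, "mime-type", {"type": mime_type})
        # Mark the format as a specialization of generic XML.
        ET.SubElement(node, "sub-class-of", {"type": "application/xml"})
        # Detect by the namespace and local name of the root element.
        ET.SubElement(
            node, "root-XML",
            {"namespaceURI": namespace, "localName": root_name},
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical entry for one of the DataONE formats.
ENTRIES = [
    ("application/x-dataone-eml-2.1.1", "eml://ecoinformatics.org/eml-2.1.1", "eml"),
]

custom_xml = build_custom_mimetypes(ENTRIES)
```

The appeal of this approach is that detection relies on the root element's namespace rather than on byte patterns, which fits formats that are identified by a namespace URI.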
That’s all for now, see you all next week!
Have a great weekend!
Resource links: GitHub, Project Plan