{"id":1218,"date":"2013-06-07T16:29:41","date_gmt":"2013-06-07T16:29:41","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1218"},"modified":"2013-06-07T16:29:41","modified_gmt":"2013-06-07T16:29:41","slug":"normalizing-data","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/ontology-coverage\/normalizing-data\/","title":{"rendered":"Normalizing data"},"content":{"rendered":"
I spent most of this week normalizing my two data sets (my corpus and my ontology). The normalizing process for the corpus was fairly straightforward and involved a few steps:
1) remove punctuation
2) force lower case
3) remove stop words
4) stem each word
5) remove “number words”
Scripts from last year already covered most of the first three steps: removing punctuation, forcing lower case, and removing stop words.
To stem, I use an existing library (the PyPI stemming package) that offers four stemmers: Lovins, Paice, Porter, and Snowball. These range from heavyweight stemmers (Lovins and Paice) to lightweight rule-based stemmers (Porter and Snowball), and the code is written so that the stemmer can be chosen. Lastly, a “number word” is a word made up entirely of number characters (e.g., 1827 or 832489). While these might be useful to a person, they provide no meaningful value in this type of evaluation. I also included formal test cases.
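The sketch below pulls the five steps together. It assumes the PyPI stemming package's module layout (porter, porter2, lovins, and paicehusk, each exposing a stem() function, with porter2 implementing the Snowball English stemmer); the stop-word list and function names are placeholders, not the actual script.

```python
import re

from stemming import lovins, paicehusk, porter, porter2

# Map the four stemmer names to the package's stem() functions
# (assumed module layout; porter2 is the Snowball English stemmer).
STEMMERS = {
    "lovins": lovins.stem,
    "paice": paicehusk.stem,
    "porter": porter.stem,
    "snowball": porter2.stem,
}

# Placeholder stop-word list; the real script would use a full one.
STOP_WORDS = {"i", "am", "having", "the", "with", "my"}

def normalize(text, stemmer="porter"):
    stem = STEMMERS[stemmer]
    # 1) remove punctuation (replaced with spaces so words stay separated)
    text = re.sub(r"[^\w\s]", " ", text)
    # 2) force lower case
    tokens = text.lower().split()
    # 3) remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4) stem each word
    tokens = [stem(t) for t in tokens]
    # 5) drop "number words" (tokens made up entirely of digits)
    return " ".join(t for t in tokens if not t.isdigit())
```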
The process for the ontology was slightly trickier, as most of the text in the ontology file is metadata (the ontology is stored as XML). For this, I used regular expressions to isolate the features of the ontology to be normalized (e.g., the classes). Each ontology is then saved in a separate directory, allowing comparison between the original and the stemmed version. I then wrote formal test cases to ensure the code was meeting expectations.
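A hypothetical sketch of that isolation step, assuming the classes appear as owl:Class elements in the XML (the regular expression and directory layout here are illustrative, not the actual code):

```python
import os
import re

# Assumed pattern: pull class identifiers from owl:Class elements.
CLASS_RE = re.compile(r'<owl:Class rdf:(?:about|ID)="([^"]+)"')

def extract_classes(owl_xml):
    """Return the class identifiers found in raw OWL/XML text."""
    return CLASS_RE.findall(owl_xml)

def save_stemmed_copy(original_path, stemmed_text, out_dir="stemmed"):
    # Write the normalized version into a separate directory so the
    # original and stemmed ontologies can be compared side by side.
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, os.path.basename(original_path))
    with open(out_path, "w") as f:
        f.write(stemmed_text)
```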
An example:

INPUT: “I am having the…arrgg! Just the hardest day, with, my new vegetable champion 7000.”
OUTPUT: arrgg just hard day new vegetable champion
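For reference, running this example through the pipeline sketch above would look like the following (the exact stems vary with the stemmer chosen, so the output only approximates the hand-worked example):

```python
print(normalize("I am having the…arrgg! Just the hardest day, "
                "with, my new vegetable champion 7000.",
                stemmer="snowball"))
```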
This will allow for better text matching in the future.