{"id":1218,"date":"2013-06-07T16:29:41","date_gmt":"2013-06-07T16:29:41","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1218"},"modified":"2013-06-07T16:29:41","modified_gmt":"2013-06-07T16:29:41","slug":"normalizing-data","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/ontology-coverage\/normalizing-data\/","title":{"rendered":"Normalizing data"},"content":{"rendered":"
I spent most of this week normalizing my two data sets (my corpus and my ontology). The normalizing process for the corpus was fairly straightforward and involved a few steps:
1) remove punctuation
2) force lower case
3) remove stop words
4) stem each word
5) remove “number words”
Scripts from last year already covered most of the first three steps: removing punctuation, forcing lower case, and removing stop words.
To stem, I use an existing library (the PyPI stemming package) that offers four stemmers: Lovins, Paice, Porter, and Snowball. These range from heavyweight stemmers (Lovins and Paice) to lightweight rule-based stemmers (Porter and Snowball), and the code is written so that the stemmer can be chosen. Lastly, a “number word” is a word made up entirely of number characters (e.g., 1827 or 832489). While these might be useful to a person, they provide no meaningful value in this type of evaluation. I also included formal test cases.
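The sketch below pulls the five steps together. It assumes the PyPI stemming package's module layout (porter, porter2, lovins, and paicehusk, each exposing a stem() function, with porter2 implementing the Snowball English stemmer); the stop-word list and function names are placeholders, not the actual script.

```python
import re

from stemming import lovins, paicehusk, porter, porter2

# Map the four stemmer names to the package's stem() functions
# (assumed module layout; porter2 is the Snowball English stemmer).
STEMMERS = {
    "lovins": lovins.stem,
    "paice": paicehusk.stem,
    "porter": porter.stem,
    "snowball": porter2.stem,
}

# Placeholder stop-word list; the real script would use a full one.
STOP_WORDS = {"i", "am", "having", "the", "with", "my"}

def normalize(text, stemmer="porter"):
    stem = STEMMERS[stemmer]
    # 1) remove punctuation (replaced with spaces so words stay separated)
    text = re.sub(r"[^\w\s]", " ", text)
    # 2) force lower case
    tokens = text.lower().split()
    # 3) remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4) stem each word
    tokens = [stem(t) for t in tokens]
    # 5) drop "number words" (tokens made up entirely of digits)
    return " ".join(t for t in tokens if not t.isdigit())
```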
The process for the ontology was slightly trickier, as most of the text in the ontology file is metadata (the ontology is stored as XML). For this, I used regular expressions to isolate the features of the ontology to be normalized (e.g., the classes). Each ontology is then saved in a separate directory, allowing comparison between the original and the stemmed version. I then wrote formal test cases to ensure the code was meeting expectations.
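A hypothetical sketch of that isolation step, assuming the classes appear as owl:Class elements in the XML (the regular expression and directory layout here are illustrative, not the actual code):

```python
import os
import re

# Assumed pattern: pull class identifiers from owl:Class elements.
CLASS_RE = re.compile(r'<owl:Class rdf:(?:about|ID)="([^"]+)"')

def extract_classes(owl_xml):
    """Return the class identifiers found in raw OWL/XML text."""
    return CLASS_RE.findall(owl_xml)

def save_stemmed_copy(original_path, stemmed_text, out_dir="stemmed"):
    # Write the normalized version into a separate directory so the
    # original and stemmed ontologies can be compared side by side.
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, os.path.basename(original_path))
    with open(out_path, "w") as f:
        f.write(stemmed_text)
```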
An example:

INPUT: “I am having the…arrgg! Just the hardest day, with, my new vegetable champion 7000.”
OUTPUT: arrgg just hard day new vegetable champion
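For reference, running this example through the pipeline sketch above would look like the following (the exact stems vary with the stemmer chosen, so the output only approximates the hand-worked example):

```python
print(normalize("I am having the…arrgg! Just the hardest day, "
                "with, my new vegetable champion 7000.",
                stemmer="snowball"))
```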
This will allow for better text matching in the future.