{"id":3499,"date":"2019-06-24T15:23:00","date_gmt":"2019-06-24T15:23:00","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=3499"},"modified":"2019-06-24T15:27:51","modified_gmt":"2019-06-24T15:27:51","slug":"week-5-association-rules-and-midterm-evaluation","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/prov-self\/week-5-association-rules-and-midterm-evaluation\/","title":{"rendered":"Week 5 – Association Rules and Midterm Evaluation"},"content":{"rendered":"\n
Hello World! <\/p>\n\n\n\n
This week was filled with coding and debugging. Based on our previous research and a discussion with Bertram, we decided to focus on the Galaxy Zotero Group and run hands-on experiments. Below is the process for reaching our first goal: exploring the distribution of tags.<\/p>\n\n\n\n
Data Collection<\/strong><\/p>\n\n\n\n Getting the data was not easy for me, because the Galaxy Group builds its website on a somewhat unusual JavaScript infrastructure, which makes it hard for code to capture the real data source. But finally, I got it! Using the Chrome Inspector, I found the API and retrieved the data I wanted. Meanwhile, we need to email the Group to ask for permission to use the data for research.<\/p>\n\n\n\n Data Summary: the data contains 7,754 rows, and each row represents a paper related to the Galaxy Project. The dataset has 26 attributes, including paper title, paper type (journal, conference, \u2026), creators (authors), abstract, DOI, tags, etc. <\/em><\/p>\n\n\n\n Data Cleaning<\/strong><\/p>\n\n\n\n After collecting the raw data, the next step is data cleaning. Since our goal is to explore the tag distribution and paper content, the columns \u201cpaper_title\u201d, \u201cDOI\u201d and \u201ctags\u201d are extracted from the raw data. <\/p>\n\n\n\n