{"id":2953,"date":"2017-06-10T00:03:47","date_gmt":"2017-06-10T00:03:47","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2953"},"modified":"2017-06-16T22:24:11","modified_gmt":"2017-06-16T22:24:11","slug":"exploration-of-search-logs-metadata-quality-and-data-discovery-week-3","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/search-logs\/exploration-of-search-logs-metadata-quality-and-data-discovery-week-3\/","title":{"rendered":"Exploration of Search Logs, Metadata Quality and Data Discovery: Week 3"},"content":{"rendered":"
My goals for week 3 were to collect download logs from a SOLR index, parse those logs into tokens, populate a database with the log information, and relate the download events to the search events by connecting them in time and by remote host address. \u00a0I was able to accomplish these goals after a minor detour through troubleshooting a problem I was having with the SOLR query results. \u00a0Check out the hpad <\/a>for a lot more detail on what was going wrong and how we went about getting the data out of SOLR.<\/p>\n In the end, we wound up exporting the download logs to CSV files, which I imported into the database with a Python function. There are roughly 35 million download events, which makes for slow going when importing into the database. \u00a0At the moment, since we have only about one year of search logs, we can reduce the size of these download data by restricting them to the time interval of the search logs. \u00a0We can further reduce the data by looking more closely at what each download event represents–since it is possible through the DataONE search interface to download an entire dataset or individual files from within, not all download events are necessarily the same.<\/p>\n That will be part of next week’s exploratory data analysis. \u00a0I’ll try to get some simple visualizations up and going so we can get a more intuitive feel for the shape of the data we’ve assembled so far. \u00a0In the meantime, the new program code is up on the GitHub<\/a> repository.<\/p>\n","protected":false},"excerpt":{"rendered":" My goals for week 3 were to collect download logs from a SOLR index, parse those logs into tokens, populate a database with the log information, and relate the download events to the search events by connecting them in time and by remote host address. \u00a0I was able to accomplish Continue reading Exploration of Search Logs, Metadata Quality and Data Discovery: Week 3<\/span>