{"id":1959,"date":"2014-02-04T05:09:12","date_gmt":"2014-02-04T05:09:12","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1959"},"modified":"2014-02-04T15:45:25","modified_gmt":"2014-02-04T15:45:25","slug":"harvesting-dataoneorg-twitter-mentions-via-topsy","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/harvesting-dataoneorg-twitter-mentions-via-topsy\/","title":{"rendered":"Harvesting @DataONEorg Twitter Mentions via Topsy"},"content":{"rendered":"<p>The <a title=\"DataONE Community Engagement via Twitter\" href=\"https:\/\/notebooks.dataone.org\/data-science\/dataone-community-engagement-via-twitter\/\">previous notebook entry<\/a> concerned mentions of @DataONEorg on Twitter.<\/p>\n<p>I established the following:<\/p>\n<p>The oldest tweet is from 2 years ago.<\/p>\n<p>It is dated July 29, 2012.<\/p>\n<p>This tweet is accessible from here:<\/p>\n<p><a href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&amp;window=a&amp;type=tweet&amp;sort=date&amp;offset=990\" target=\"_blank\">http:\/\/topsy.com\/s?q=%40DataONEorg&amp;window=a&amp;type=tweet&amp;sort=date&amp;offset=990<\/a><\/p>\n<p>The very first re-tweet of @DataONEorg was March 15, 2011.<\/p>\n<p>This was 5 months after @DataONEorg joined Twitter (November 18, 2010).<\/p>\n<p>The tweet is accessible via Topsy from this link:<\/p>\n<p><a href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=150&amp;mintime=1288612824&amp;maxtime=1320148851\" target=\"_blank\">http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=150&amp;mintime=1288612824&amp;maxtime=1320148851<\/a><\/p>\n<p>This is valid for the time period November 1, 2010 to November 1, 2011.<\/p>\n<p>I need the missing period between November 1, 2011 and July 29, 2012.<\/p>\n<p>I must generate a new search limited to that time period on Topsy.<\/p>\n<p>The link for the time period November 1, 2011 to July 29, 2012 is:<\/p>\n<p><a 
href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;mintime=1320148824&amp;maxtime=1343563251\">http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;mintime=1320148824&amp;maxtime=1343563251<\/a><\/p>\n<p>I now have four links for four time periods:<\/p>\n<ol>\n<li>November 1, 2010 to November 1, 2011\u00a0<a href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=170&amp;mintime=1288612824&amp;maxtime=1320148851\">http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=170&amp;mintime=1288612824&amp;maxtime=1320148851<\/a><\/li>\n<li>November 1, 2011 to July 29, 2012\u00a0<a href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=340&amp;mintime=1320148824&amp;maxtime=1343563251\">http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=340&amp;mintime=1320148824&amp;maxtime=1343563251<\/a><\/li>\n<li>July 29, 2012 &#8211; July 29, 2013\u00a0<a href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=570&amp;mintime=1343563224&amp;maxtime=1375099251\">http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=570&amp;mintime=1343563224&amp;maxtime=1375099251<\/a><\/li>\n<li>July 29, 2013 &#8211; February 6, 2014\u00a0<a href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=410&amp;mintime=1375099224&amp;maxtime=1391688051\">http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=410&amp;mintime=1375099224&amp;maxtime=1391688051<\/a><\/li>\n<\/ol>\n<p>It is now possible to estimate the number of tweets, based on 10 tweets per page:<\/p>\n<ol>\n<li>November 1, 2010 to November 1, 2011 n = 170, 1,700 tweets<\/li>\n<li>November 1, 2011 to July 29, 2012 n = 340, 3,400 
tweets<\/li>\n<li>July 29, 2012 &#8211; July 29, 2013 n = 570, 5,700 tweets<\/li>\n<li>July 29, 2013 &#8211; February 6, 2014 n = 410, 4,100 tweets<\/li>\n<\/ol>\n<p>Now I need to create a spreadsheet with a unique URL for each page of 10 tweets, counting down from the maximum offset for each time period.<\/p>\n<p>For example:<\/p>\n<p><a href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=170&amp;mintime=1288612824&amp;maxtime=1320148851\">http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=170&amp;mintime=1288612824&amp;maxtime=1320148851<\/a><\/p>\n<p><a href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=160&amp;mintime=1288612824&amp;maxtime=1320148851\">http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=160&amp;mintime=1288612824&amp;maxtime=1320148851<\/a><\/p>\n<p><a href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=150&amp;mintime=1288612824&amp;maxtime=1320148851\">http:\/\/topsy.com\/s?q=%40DataONEorg&amp;type=tweet&amp;sort=date&amp;offset=150&amp;mintime=1288612824&amp;maxtime=1320148851<\/a><\/p>\n<p>And so on.<\/p>\n<p>Summing the four estimates gives a total of about 14,900 tweets.<\/p>\n<p>There would be about 1,500 rows of unique URLs spanning the four time periods.<\/p>\n<p>I have three possibilities in mind for extracting this data.<\/p>\n<p>1. Try the Linky Firefox add-on to collect 10 items 1,500 times (probably impractical).<\/p>\n<p>I can view 100 pages at a time, so that would only take 15 iterations. Worth looking at.<\/p>\n<p>2. Try Xenu link checking software to harvest links as if doing a link check.<\/p>\n<p>3. 
Some other URL scraping tool.<\/p>\n<p>I will investigate this further on a PC, as Xenu works on a PC.<\/p>\n<p>This PHP example did not work:<\/p>\n<p>http:\/\/www.web-max.ca\/PHP\/misc_23.php<\/p>\n<p>This may be worth looking at:<\/p>\n<p>Scraping multiple Pages using the Scraper Extension and Refine &#8211; See more at: <a href=\"http:\/\/schoolofdata.org\/handbook\/recipes\/scraping-multiple-pages-with-refine-and-scraper\/\">http:\/\/schoolofdata.org\/handbook\/recipes\/scraping-multiple-pages-with-refine-and-scraper\/<\/a><\/p>\n<p>I&#8217;ll need to test these possibilities.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The previous notebook entry concerned mentions of @DataONEorg on Twitter. I established the following: The oldest tweet is from 2 years ago. It is dated July 29, 2012. This tweet is accessible from here: http:\/\/topsy.com\/s?q=%40DataONEorg&amp;window=a&amp;type=tweet&amp;sort=date&amp;offset=990 The very first re-tweet of @DataONEorg was March 15, 2011. 
This was 5 months after <a class=\"more-link\" href=\"https:\/\/notebooks.dataone.org\/data-science\/harvesting-dataoneorg-twitter-mentions-via-topsy\/\">Continue reading <span class=\"screen-reader-text\">  Harvesting @DataONEorg Twitter Mentions via Topsy<\/span><span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":35,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[233,23,140,215,227,192,232],"_links":{"self":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/1959"}],"collection":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/comments?post=1959"}],"version-history":[{"count":2,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/1959\/revisions"}],"predecessor-version":[{"id":1961,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/1959\/revisions\/1961"}],"wp:attachment":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/media?parent=1959"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/categories?post=1959"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/tags?post=1959"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}