{"id":1959,"date":"2014-02-04T05:09:12","date_gmt":"2014-02-04T05:09:12","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1959"},"modified":"2014-02-04T15:45:25","modified_gmt":"2014-02-04T15:45:25","slug":"harvesting-dataoneorg-twitter-mentions-via-topsy","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/harvesting-dataoneorg-twitter-mentions-via-topsy\/","title":{"rendered":"Harvesting @DataONEorg Twitter Mentions via Topsy"},"content":{"rendered":"
The previous notebook entry<\/a> concerned mentions of @DataONEorg on Twitter.<\/p>\n I established the following:<\/p>\n The oldest tweet in the all-time search results is from 2 years ago.<\/p>\n It is dated July 29, 2012.<\/p>\n This tweet is accessible from here:<\/p>\n http:\/\/topsy.com\/s?q=%40DataONEorg&window=a&type=tweet&sort=date&offset=990<\/a><\/p>\n The very first re-tweet of @DataONEorg was on March 15, 2011.<\/p>\n This was about 4 months after @DataONEorg joined Twitter (November 18, 2010).<\/p>\n The tweet is accessible via Topsy from this link:<\/p>\n http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=150&mintime=1288612824&maxtime=1320148851<\/a><\/p>\n This link covers the time period November 1, 2010 to November 1, 2011.<\/p>\n I still need the missing period between November 1, 2011 and July 29, 2012.<\/p>\n I must generate a new search limited to that time period on Topsy.<\/p>\n The link for the time period November 1, 2011 to July 29, 2012 is:<\/p>\n http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&mintime=1320148824&maxtime=1343563251<\/a><\/p>\n I now have three links for two time periods:<\/p>\n It is now possible to estimate the number of tweets, based on 10 tweets per page:<\/p>\n Now I need to create a spreadsheet with a unique URL for each page of 10 tweets, counting down from the maximum offset for each time period.<\/p>\n For example:<\/p>\n http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=170&mintime=1288612824&maxtime=1320148851<\/a><\/p>\n http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=160&mintime=1288612824&maxtime=1320148851<\/a><\/p>\n http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=150&mintime=1288612824&maxtime=1320148851<\/a><\/p>\n And so on.<\/p>\n There are a total of 15,500 tweets.<\/p>\n There would be 1,500 rows of unique URLs encompassing each of the four time periods.<\/p>\n I have three possibilities in mind for extracting this data.<\/p>\n 1. 
Try the Linky Firefox add-on to collect 10 items 1,500 times (probably impractical).<\/p>\n I can view 100 pages at a time, so that would take only 15 iterations. Worth looking at.<\/p>\n 2. Try the Xenu link-checking software to harvest links as if doing a link check.<\/p>\n 3. Some other URL scraping tool.<\/p>\n I will investigate this further on a PC, as Xenu works on a PC.<\/p>\n This PHP example did not work:<\/p>\n http:\/\/www.web-max.ca\/PHP\/misc_23.php<\/p>\n This may be worth looking at:<\/p>\n Scraping multiple Pages using the Scraper Extension and Refine: http:\/\/schoolofdata.org\/handbook\/recipes\/scraping-multiple-pages-with-refine-and-scraper\/<\/a><\/p>\n I’ll need to test these possibilities.<\/p>\n <\/p>\n","protected":false},"excerpt":{"rendered":" The previous notebook entry concerned mentions of @DataONEorg on Twitter. I established the following: The oldest tweet is from 2 years ago. It is dated July 29, 2012. This tweet is accessible from here: http:\/\/topsy.com\/s?q=%40DataONEorg&window=a&type=tweet&sort=date&offset=990 The very first re-tweet of @DataONEorg was March 15, 2011. This was about 4 months after Continue reading Harvesting @DataONEorg Twitter Mentions via Topsy<\/span>\n