The previous notebook entry concerned mentions of @DataONEorg on Twitter. I established the following: the oldest tweet returned is dated July 29, 2012, roughly two years ago. That tweet is accessible from here:
http://topsy.com/s?q=%40DataONEorg&window=a&type=tweet&sort=date&offset=990
The very first retweet of @DataONEorg was on March 15, 2011, five months after @DataONEorg joined Twitter (November 18, 2010). That tweet is accessible via Topsy from the first link in the list below, which covers the period November 1, 2010 to November 1, 2011.
That leaves a missing period between November 1, 2011 and July 29, 2012, so I generated a new Topsy search limited to that time period:
http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&mintime=1320148824&maxtime=1343563251
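For reference, the mintime and maxtime parameters are Unix timestamps. Here is a minimal Python sketch for reproducing them from calendar dates; note the values Topsy generated above include a time-of-day offset from when I ran the search, so midnight-UTC values differ slightly:

```python
import calendar
from datetime import datetime

def topsy_timestamp(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return calendar.timegm(datetime(year, month, day).timetuple())

# Rebuild the search link for November 1, 2011 to July 29, 2012.
mintime = topsy_timestamp(2011, 11, 1)   # 1320105600
maxtime = topsy_timestamp(2012, 7, 29)   # 1343520000
print(f"http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date"
      f"&mintime={mintime}&maxtime={maxtime}")
```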
I now have four links covering four time periods:
- November 1, 2010 to November 1, 2011: http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=170&mintime=1288612824&maxtime=1320148851
- November 1, 2011 to July 29, 2012: http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=340&mintime=1320148824&maxtime=1343563251
- July 29, 2012 to July 29, 2013: http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=570&mintime=1343563224&maxtime=1375099251
- July 29, 2013 to February 6, 2014: http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=410&mintime=1375099224&maxtime=1391688051
It is now possible to estimate the number of tweets per period, taking the maximum offset (n) from each link above and assuming 10 tweets per page:
- November 1, 2010 to November 1, 2011: n = 170, ~1,700 tweets
- November 1, 2011 to July 29, 2012: n = 340, ~3,400 tweets
- July 29, 2012 to July 29, 2013: n = 570, ~5,700 tweets
- July 29, 2013 to February 6, 2014: n = 410, ~4,100 tweets
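A quick check of that arithmetic, treating the maximum offsets as page counts:

```python
pages = {
    "2010-11-01 to 2011-11-01": 170,
    "2011-11-01 to 2012-07-29": 340,
    "2012-07-29 to 2013-07-29": 570,
    "2013-07-29 to 2014-02-06": 410,
}
TWEETS_PER_PAGE = 10
total_pages = sum(pages.values())              # 1,490 pages
total_tweets = total_pages * TWEETS_PER_PAGE   # 14,900 tweets
print(total_pages, total_tweets)
```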
Now I need to create a spreadsheet with a unique URL for each page of 10 tweets, counting the offset down from its maximum for each time period. For example, for the first period:
http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=170&mintime=1288612824&maxtime=1320148851
http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=169&mintime=1288612824&maxtime=1320148851
http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=168&mintime=1288612824&maxtime=1320148851
And so on.
That comes to roughly 14,900 tweets in total (170 + 340 + 570 + 410 = 1,490 pages), so there would be about 1,490 rows of unique URLs encompassing the four time periods.
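Here is a minimal sketch of how such a spreadsheet could be generated in Python, one row per page of 10 tweets, with the mintime/maxtime values copied from the four links above (the output file name topsy_urls.csv is my own choice):

```python
import csv

BASE = "http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date"
periods = [
    # (label, maximum offset, mintime, maxtime) from the links above
    ("2010-11-01 to 2011-11-01", 170, 1288612824, 1320148851),
    ("2011-11-01 to 2012-07-29", 340, 1320148824, 1343563251),
    ("2012-07-29 to 2013-07-29", 570, 1343563224, 1375099251),
    ("2013-07-29 to 2014-02-06", 410, 1375099224, 1391688051),
]

with open("topsy_urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["period", "offset", "url"])
    for label, max_offset, mintime, maxtime in periods:
        # Count the offset down from its maximum to the first page.
        for offset in range(max_offset, -1, -1):
            url = f"{BASE}&offset={offset}&mintime={mintime}&maxtime={maxtime}"
            writer.writerow([label, offset, url])
```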
I have three possibilities in mind for extracting this data:
1. The Linky Firefox add-on, collecting 10 items per page across roughly 1,490 pages. Doing that one page at a time is probably impractical, but Linky can open 100 pages at a time, which would take only about 15 iterations, so it is worth looking at.
2. The Xenu link-checking software, harvesting links as if doing a link check.
3. Some other URL-scraping tool (a sketch follows the PHP example below).
I will investigate this further on a PC, as Xenu runs only on Windows.
This PHP example did not work:
http://www.web-max.ca/PHP/misc_23.php
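As a fallback for option 3, here is a minimal Python sketch of the same idea: fetch one results page and keep the outbound links. The filter for twitter.com hrefs is an assumption about Topsy's markup, which would need to be checked against a real page:

```python
# Hypothetical sketch, not a tested scraper; requires the
# requests and beautifulsoup4 packages.
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """Return all links on one Topsy results page that point at twitter.com."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if "twitter.com" in a["href"]]

page = ("http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date"
        "&offset=170&mintime=1288612824&maxtime=1320148851")
for link in scrape_page(page):
    print(link)
```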
This may be worth looking at: "Scraping multiple Pages using the Scraper Extension and Refine", from the School of Data handbook: http://schoolofdata.org/handbook/recipes/scraping-multiple-pages-with-refine-and-scraper/
I’ll need to test these possibilities.