A few years ago I wrote a little Perl script called ‘timesScraper’ that grabs the top stories from the New York Times’ RSS feeds. The idea was to download local copies of the day’s news so I could read them on my laptop when I was away from an internet connection.
More details and the download after the jump.
The script reads a hard-coded list of feeds and their URLs, downloads the stories in each feed, and assembles them as links on an HTML index page. It fetches the “Printer friendly” version of each article, so the article text is complete and the script doesn’t have to deal with images or embedded media. It requires the following Perl modules, most of which are installed by default on Mac OS X or are easily installed using CPAN:
LWP::UserAgent
HTTP::Cookies::Netscape
XML::Simple
Getopt::Long
Encode
HTML::Entities
CGI
Pod::Usage
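To make the feed-to-index step concrete, here is a minimal sketch in Python (not the script's own Perl, and not its actual code). It assumes standard RSS 2.0 items with `title` and `link` elements; the function names `parse_feed` and `build_index` are hypothetical, and the download step is left out entirely.

```python
import html
import xml.etree.ElementTree as ET


def parse_feed(rss_xml):
    """Return a list of (title, link) pairs, one per <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    stories = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and link:  # skip items missing either field
            stories.append((title, link))
    return stories


def build_index(sections):
    """Build an HTML index page.

    sections maps a feed name to a list of (title, href) pairs, where href
    would normally point at the locally saved copy of the article.
    """
    lines = ["<html><body>"]
    for name, stories in sections.items():
        lines.append("<h2>%s</h2>" % html.escape(name))
        lines.append("<ul>")
        for title, href in stories:
            lines.append(
                '<li><a href="%s">%s</a></li>'
                % (html.escape(href, quote=True), html.escape(title))
            )
        lines.append("</ul>")
    lines.append("</body></html>")
    return "\n".join(lines)
```

In a real run, each `href` would be the filename of the saved printer-friendly page rather than the original URL, so the index works offline.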
Since The New York Times requires a free login for some stories, timesScraper looks for a Firefox cookies.txt file. If you’re not using Firefox, you’ll need to specify the path to your own cookies.txt file using the -cookies option.
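For readers curious how a cookies.txt file gets used, the following is a rough Python sketch of the same idea, not the script's Perl implementation (which, going by the module list, presumably relies on HTTP::Cookies::Netscape). It assumes the file is in the Netscape cookies.txt format; `opener_with_cookies` is a hypothetical helper name.

```python
import http.cookiejar
import urllib.request


def opener_with_cookies(cookies_path):
    """Load a Netscape-format cookies.txt and return (opener, jar).

    Requests made through the returned opener send any matching cookies,
    which is what lets a logged-in browser session carry over to a script.
    """
    jar = http.cookiejar.MozillaCookieJar(cookies_path)
    # Keep session cookies and expired entries too; the site decides what is valid.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar
```

Calling `opener.open(url)` then behaves like an ordinary fetch, except that the cookies from the file ride along with each request.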
The other option that timesScraper takes is -wait. This value specifies the maximum number of seconds that timesScraper will wait between fetching articles. The default maximum wait is 10 seconds. Before each fetch, timesScraper picks a random wait between 0 and the maximum. This is designed to be friendly to the Times web server and also to make the requests look less like a robot grabbing pages.
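The random wait is simple enough to sketch in a few lines. This is a Python illustration rather than the script's Perl, and it assumes a uniformly distributed fractional delay; the description above doesn't say whether the script picks whole seconds. The name `polite_pause` and its `sleep` and `rng` parameters are made up for this example.

```python
import random
import time


def polite_pause(max_wait=10.0, sleep=time.sleep, rng=random.uniform):
    """Sleep for a random duration between 0 and max_wait seconds.

    sleep and rng are injectable so the function can be exercised
    without actually waiting. Returns the delay that was chosen.
    """
    delay = rng(0, max_wait)
    sleep(delay)
    return delay
```

A fetch loop would call `polite_pause(max_wait)` between articles, with `max_wait` taken from the -wait option.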