<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>render fast &#187; Perl</title>
	<atom:link href="http://renderfast.com/category/perl/feed/" rel="self" type="application/rss+xml" />
	<link>http://renderfast.com</link>
	<description>A blog about developing software by trial and error by Doug Letterman</description>
	<lastBuildDate>Mon, 24 Jan 2011 17:24:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.4</generator>
		<item>
		<title>Scrape the Times</title>
		<link>http://renderfast.com/2008/05/27/scrape-the-times/</link>
		<comments>http://renderfast.com/2008/05/27/scrape-the-times/#comments</comments>
		<pubDate>Tue, 27 May 2008 16:37:11 +0000</pubDate>
		<dc:creator>Doug</dc:creator>
				<category><![CDATA[Downloads]]></category>
		<category><![CDATA[Perl]]></category>

		<guid isPermaLink="false">http://renderfast.com/?p=12</guid>
		<description><![CDATA[A few years ago I wrote a little Perl script called &#8216;timesScraper&#8217; that grabs the top stories from the New York Times&#8217; RSS feeds. The idea was I could use it to download local copies of the day&#8217;s news for reading on my laptop when I was away from an internet connection. More details and [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-14" style="float: left;" title="nytimes" src="http://renderfast.com/wp-content/uploads/2008/05/nytimes.png" alt="" width="200" height="136" />A few years ago I wrote a little Perl script called &#8216;timesScraper&#8217; that grabs the top stories from the <a title="New York Times RSS Feeds" href="http://www.nytimes.com/services/xml/rss/index.html">New York Times&#8217; RSS feeds</a>. The idea was I could use it to download local copies of the day&#8217;s news for reading on my laptop when I was away from an internet connection.</p>
<p><em>More details and the download after the jump.</em><br />
<span id="more-12"></span></p>
<p>The script reads a hard-coded list of feeds and their URLs, downloads the stories, and assembles them as links on an HTML index page. It fetches the &#8220;Printer friendly&#8221; version of each article so that the article text is complete and the script doesn&#8217;t have to deal with images or embedded media. It requires the following Perl modules, most of which are installed by default on Mac OS X or are easily installed using CPAN:</p>
<pre><a title="CPAN: LWP::UserAgent" href="http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm">LWP::UserAgent</a>
<a title="CPAN: HTTP::Cookies::Netscape" href="http://search.cpan.org/author/GAAS/libwww-perl-5.812/lib/HTTP/Cookies/Netscape.pm">HTTP::Cookies::Netscape</a>
<a title="CPAN: XML::Simple" href="http://search.cpan.org/author/GRANTM/XML-Simple-2.18/lib/XML/Simple.pm">XML::Simple</a>
<a title="CPAN: Getopt::Long" href="http://search.cpan.org/author/JV/Getopt-Long-2.37/lib/Getopt/Long.pm">Getopt::Long</a>
<a title="CPAN: Encode" href="http://search.cpan.org/author/DANKOGAI/Encode-2.25/Encode.pm">Encode</a>
<a title="CPAN: HTML::Entities" href="http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/Entities.pm">HTML::Entities</a>
<a title="CPAN: CGI" href="http://search.cpan.org/author/LDS/CGI.pm-3.37/CGI.pm">CGI</a>
<a title="CPAN: Pod::Usage" href="http://search.cpan.org/author/MAREKR/Pod-Parser-1.35/lib/Pod/Usage.pm">Pod::Usage</a></pre>
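<p>To give a feel for the feed-reading step, here is a minimal sketch in Python rather than Perl. It is not the code from timesScraper: the function name <code>story_links</code> is made up for illustration, it uses Python&#8217;s standard ElementTree parser in place of XML::Simple, and it assumes the feed text has already been downloaded.</p>
<pre>
```python
import xml.etree.ElementTree as ET


def story_links(rss_text):
    """Return (title, link) pairs for every item in an RSS 2.0 document.

    Hypothetical helper for illustration; timesScraper itself is Perl
    and uses XML::Simple.
    """
    root = ET.fromstring(rss_text)
    links = []
    # Each story in an RSS 2.0 feed is an item element with title and link children.
    for item in root.iter("item"):
        links.append((item.findtext("title"), item.findtext("link")))
    return links
```
</pre>
<p>Each pair could then become one link on the index page once the article itself has been saved locally.</p>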
<p>Because The New York Times requires a free login for some stories, timesScraper looks for a Firefox cookies.txt file. If you&#8217;re not using Firefox, you&#8217;ll need to specify the path to your own cookies.txt file using the -cookies option.</p>
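<p>For comparison, the same idea can be sketched in Python, where the standard library&#8217;s <code>MozillaCookieJar</code> plays roughly the role that HTTP::Cookies::Netscape plays in Perl. This is only an illustration under that assumption, not the script&#8217;s actual code; <code>opener_with_cookies</code> and <code>cookies_path</code> are hypothetical names, and the file is assumed to be in the Netscape cookies.txt format.</p>
<pre>
```python
import urllib.request
from http.cookiejar import MozillaCookieJar


def opener_with_cookies(cookies_path):
    """Build a URL opener that sends cookies read from a cookies.txt file.

    Hypothetical helper; assumes a Netscape-format cookies.txt.
    """
    jar = MozillaCookieJar()
    # Keep session cookies and expired entries instead of dropping them on load.
    jar.load(cookies_path, ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```
</pre>
<p>Requests made through the returned opener carry whatever login cookies the file holds for the site being fetched.</p>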
<p>The other option that timesScraper takes is -wait. This value specifies the <em>maximum</em> number of seconds that timesScraper will wait between fetching articles; the default maximum is 10 seconds. For each fetch, timesScraper picks a random wait between 0 and that maximum. This is meant to be friendly to the Times web server and to make the requests look less like a robot grabbing pages.</p>
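<p>The randomized wait is simple to sketch. The Python below is an illustration, not the Perl from the script: <code>polite_pause</code> is a made-up name, and whether timesScraper picks whole or fractional seconds is an assumption (this version uses fractional).</p>
<pre>
```python
import random
import time


def polite_pause(max_wait=10.0, rng=random, sleep=time.sleep):
    """Sleep for a random time between 0 and max_wait seconds and return it.

    Hypothetical helper; rng and sleep are injectable so the pause can
    be exercised without actually waiting.
    """
    delay = rng.uniform(0, max_wait)
    sleep(delay)
    return delay
```
</pre>
<p>Calling <code>polite_pause()</code> between article downloads spreads the requests out at irregular intervals, with 10 seconds as the upper bound by default.</p>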
<p><a title="Download timesScraper" href="http://renderfast.com/wp-content/uploads/2008/05/timesscraper">Download timesScraper</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://renderfast.com/2008/05/27/scrape-the-times/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

