<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Matthew Elliston &#187; Tools</title>
	<atom:link href="http://www.matthewelliston.com/category/development/tools/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.matthewelliston.com</link>
	<description>A site filled with my pictures, thoughts and things I want to remember!</description>
	<lastBuildDate>Wed, 24 Aug 2011 14:03:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Howto scrape an entire website</title>
		<link>http://www.matthewelliston.com/howto-scrape-an-entire-website/</link>
		<comments>http://www.matthewelliston.com/howto-scrape-an-entire-website/#comments</comments>
		<pubDate>Tue, 04 Aug 2009 18:22:06 +0000</pubDate>
		<dc:creator>Matthew</dc:creator>
				<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://www.matthewelliston.com/?p=95</guid>
		<description><![CDATA[I needed to back up an entire site and its content when I couldn&#8217;t get access to the FTP details quickly. Luckily the site was just static content, so I was able to use one of the many tools available in a regular Linux shell. Here is the code I typed into my<a href="http://www.matthewelliston.com/howto-scrape-an-entire-website/" class="read-more">Continue Reading</a>]]></description>
			<content:encoded><![CDATA[<p>I needed to back up an entire site and its content when I couldn&#8217;t get access to the FTP details quickly. Luckily the site was just static content, so I was able to use one of the many tools available in a regular Linux shell.</p>
<p>Here is the code I typed into my terminal:</p>
<pre class="brush: bash">wget -m --tries=5 "http://www.foo.com"</pre>
<p>The &#8220;-m&#8221; option is described in the man page as mirroring: wget follows the links it finds in each page and downloads those pages too. The &#8220;--tries=5&#8221; option limits wget to five attempts per file (the default is 20), so it will not keep retrying a download that fails.</p>
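<p>As a rough sketch of what the mirror flag does: the wget manual lists &#8220;-m&#8221; as shorthand for &#8220;-r -N -l inf --no-remove-listing&#8221;, so the command above could also be spelled out in long form as below. The array name MIRROR_OPTS is only there for illustration, and the real wget line is commented out so nothing is downloaded until you want it to be:</p>
<pre class="brush: bash">#!/usr/bin/env bash
# Long-hand equivalent of "wget -m", as listed in the wget manual.
# MIRROR_OPTS is a hypothetical name used only for this sketch.
MIRROR_OPTS=(
  --recursive          # -r: follow the links found in each page
  --timestamping       # -N: skip files that are not newer than the local copy
  --level=inf          # -l inf: no limit on recursion depth
  --no-remove-listing  # keep the .listing files from FTP retrievals
  --tries=5            # give up on a file after five attempts
)

# Print the command that would be run
echo wget "${MIRROR_OPTS[@]}" "http://www.foo.com"

# Uncomment to actually start the mirror:
# wget "${MIRROR_OPTS[@]}" "http://www.foo.com"
</pre>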
<p>I&#8217;m not sure how well this will work with dynamic sites. It may just capture the HTML output of the server-side scripts, but at least it&#8217;s better than nothing.</p>
<p>Further options can help if a particular website is proving tricky to download. To set the referrer:</p>
<pre class="brush: bash">--referer="http://www.google.com/"</pre>
<p>And to set the user agent (quoted, because the value contains spaces):</p>
<pre class="brush: bash">--user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090717 Fedora/3.5.1-3.fc11 Firefox/3.5.1"</pre>
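<p>Putting the pieces together, a sketch of the full command might look like the following. The site, referrer and user agent are just example values along the lines of those above, the variable names are made up for this sketch, and &#8220;--wait=1&#8221; is an extra option that was not part of the original command, added here on the assumption that a short pause between requests is kinder to the server:</p>
<pre class="brush: bash">#!/usr/bin/env bash
# Sketch only: combines the options discussed in this post.
# SITE, UA and WGET_ARGS are hypothetical names for illustration.
SITE="http://www.foo.com"
UA="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090717 Fedora/3.5.1-3.fc11 Firefox/3.5.1"

WGET_ARGS=(
  -m                                  # mirror the site
  --tries=5                           # at most five attempts per file
  --wait=1                            # assumption: pause one second between requests
  --referer="http://www.google.com/"  # example referrer
  --user-agent="$UA"                  # quoted because the value contains spaces
  "$SITE"
)

# Show one argument per line to confirm the quoting is right
printf '%s\n' "${WGET_ARGS[@]}"

# Uncomment to actually start the download:
# wget "${WGET_ARGS[@]}"
</pre>
<p>Keeping the arguments in an array means the user agent stays a single argument even though it contains spaces and brackets.</p>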
]]></content:encoded>
			<wfw:commentRss>http://www.matthewelliston.com/howto-scrape-an-entire-website/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

