Monday, April 12, 2010

Php web rss crawler?

I was wondering if anyone knows of some code that I could use to crawl a website looking for RSS feeds, similar to how Firefox detects RSS feeds in a web page.

Conceptually, the code is very simple:





1. Find all hyperlinks in the home page that point to pages in the same domain and put them into a queue.





2. For every link in the queue, send a HEAD request to its URL. If the "Content-Type:" header is present and is one of the content types used by RSS feeds (text/xml, application/xml, application/rss+xml, or similar), send a GET request and attempt to parse the body of the response as an RSS feed.





3. While trying URLs, keep adding to the queue all hyperlinks pointing to pages in the same domain found in the currently viewed page.
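
The three steps above can be sketched roughly as follows. This is only an illustration under some assumptions: crawlForFeeds() and the $fetch callback are made-up names, the HEAD and GET requests are collapsed into one callback for brevity, application/atom+xml is added to the type list as a guess for Atom feeds, and only absolute http(s) links are followed (relative links are ignored).

```php
<?php
// Sketch of the queue-based crawl. crawlForFeeds() and $fetch are hypothetical.
// $fetch($url) is assumed to return ['type' => <Content-Type value>,
// 'body' => <response body>], or null if the request failed.
function crawlForFeeds(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $host      = parse_url($startUrl, PHP_URL_HOST);
    $queue     = [$startUrl];
    $seen      = [$startUrl => true];
    $feeds     = [];
    $feedTypes = ['text/xml', 'application/xml', 'application/rss+xml', 'application/atom+xml'];

    while ($queue && $maxPages-- > 0) {
        $url      = array_shift($queue);
        $response = $fetch($url);
        if ($response === null) {
            continue;
        }

        // Step 2: compare the media type, ignoring parameters like "; charset=utf-8".
        $mime = strtolower(trim(explode(';', $response['type'])[0]));
        if (in_array($mime, $feedTypes, true)) {
            $feeds[] = $url; // a fuller version would also parse the body to confirm
            continue;
        }

        // Steps 1 and 3: queue unseen absolute links that stay on the same host.
        preg_match_all('#href\s*=\s*(["\'])(https?://.*?)\1#i', $response['body'], $m);
        foreach ($m[2] as $link) {
            if (!isset($seen[$link]) && parse_url($link, PHP_URL_HOST) === $host) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }

    return $feeds;
}
```

Passing the fetcher in as a callback keeps the queue logic separate from the HTTP code, so it can be tried against canned pages before pointing it at a real site.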





The problem with doing it in PHP is going to be the time limit. By default, PHP kills a web script after 30 seconds of execution time (the max_execution_time setting). So you'll have to find a way to redirect the script to itself every now and then and store the queue between redirects.
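
One possible way to store the queue between runs is a small JSON file plus a time budget per run. This is a sketch, not a tested production approach: loadState(), saveState(), runBatch() and the $visit callback are all hypothetical names, and the 20 second default is just an assumed margin under the 30 second limit. Where the host allows it, set_time_limit() may be a simpler way out.

```php
<?php
// Sketch: keep crawl state in a JSON file so one run can stop early and a later
// run (or a redirect back to the script) can resume. All names are hypothetical.
function loadState(string $file, string $startUrl): array
{
    if (is_file($file)) {
        $state = json_decode((string) file_get_contents($file), true);
        if (is_array($state) && isset($state['queue'], $state['seen'])) {
            return $state;
        }
    }
    // No usable saved state: start a fresh crawl.
    return ['queue' => [$startUrl], 'seen' => [$startUrl]];
}

function saveState(string $file, array $state): void
{
    file_put_contents($file, json_encode($state), LOCK_EX);
}

// Process queued URLs until the queue is empty or the time budget is used up.
// $visit($url) is assumed to return the new URLs discovered on that page.
function runBatch(string $file, string $startUrl, callable $visit, float $budgetSeconds = 20.0): bool
{
    $state    = loadState($file, $startUrl);
    $deadline = microtime(true) + $budgetSeconds;

    while ($state['queue'] && microtime(true) < $deadline) {
        $url = array_shift($state['queue']);
        foreach ($visit($url) as $link) {
            if (!in_array($link, $state['seen'], true)) {
                $state['seen'][]  = $link;
                $state['queue'][] = $link;
            }
        }
    }

    saveState($file, $state);
    return count($state['queue']) === 0; // true once the crawl has finished
}
```

When runBatch() returns false, the calling script could redirect to itself and call it again with the same file.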
Reply: Firefox knows there is a feed from the <head>, no crawling needed. Look at this (from this Yahoo Answers page):





<link rel="alternate" type="application/rss+xml" title="Yahoo! Answers: Answers and Comments for php web rss crawler?" href="/rss/question?qid=20061030221026AA... />





That tag says there is an RSS feed for the page.
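
Reading that tag from PHP could look something like the sketch below. It assumes the DOM extension is enabled; findFeedLinks() is a made-up helper name, and accepting application/atom+xml alongside application/rss+xml is my own addition. The href may be relative, as in the example above, so it would still need to be resolved against the page URL.

```php
<?php
// Sketch: list the feeds a page advertises via <link rel="alternate"> tags.
// findFeedLinks() is a hypothetical helper name, not a built-in function.
function findFeedLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feedTypes = ['application/rss+xml', 'application/atom+xml'];
    $feeds     = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel may hold several space-separated tokens.
        $relTokens = preg_split('/\s+/', strtolower($link->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        $type      = strtolower(trim($link->getAttribute('type')));
        $href      = $link->getAttribute('href');

        if ($href !== '' && in_array('alternate', $relTokens, true) && in_array($type, $feedTypes, true)) {
            $feeds[] = ['href' => $href, 'title' => $link->getAttribute('title')];
        }
    }

    return $feeds;
}
```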





Update:





There is a rarely used variant of <a href> that carries a rel= attribute, which could be used for multiple RSS links, and rel=alternate is even rarer. Either way, you would just have to scrape (search) the page, not crawl.





Crawling every link to find which ones are RSS would find very little.








What is your website where you see this behavior?





Update 2:


You want to crawl web pages searching for RSS feeds. That's a different story.





Any simple PHP or Perl crawler (or even the wget utility) will do it; then simply grep the downloaded pages for the RSS feeds.
Reply: Firefox mostly uses the LINK tags inside the HEAD tag to find the RSS feed. I believe it picks the one with the right content type. That's covered.





Jake Cigar noted that you can attach a REL attribute to a link element. This describes the relationship of the linked resource, and I use it on my websites (it's good semantically as well). Barely any sites use this, but you can look through the A tags and find one with the right REL. You can also look at each A tag to check the URL. Example:


<a href="http://www.example.com/feed.rss" rel="rss">Feed</a>
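
A rough sketch of checking A tags both ways (by REL and by URL) is below. The REL values and file endings it looks for are just guesses on my part, not any standard, and findFeedAnchors() is a made-up name. It assumes the DOM extension is enabled.

```php
<?php
// Sketch: pick out <a> tags that look like feed links.
// The rel values and URL endings checked here are illustrative guesses.
function findFeedAnchors(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $candidates = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '') {
            continue;
        }

        $relTokens  = preg_split('/\s+/', strtolower($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        $relMatches = count(array_intersect($relTokens, ['rss', 'feed', 'alternate'])) > 0;

        // Look only at the path so a query string does not hide the extension.
        $path       = (string) parse_url($href, PHP_URL_PATH);
        $urlMatches = preg_match('/\.(rss|atom|xml)$/i', $path) === 1;

        if ($relMatches || $urlMatches) {
            $candidates[] = $href;
        }
    }

    return array_values(array_unique($candidates));
}
```

Anything this returns is only a candidate; the URL still has to be downloaded and checked before calling it a feed.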





You may want to look into regular expressions. You can extract text from a document quite easily once you've learned regexp. Please look at:


http://us3.php.net/manual/en/ref.pcre.ph...


Example:


preg_match_all('#"(http://[^"]+)"#', $text, $m); // $m[1] holds the URLs


That will fetch all the absolute http:// URLs in the page that are surrounded by double quotes ("). (Plain preg_match would stop after the first match.) Clearly my example will have to be expanded upon to support more types of URLs.
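
As one possible expansion, the pattern could look at href attributes instead of bare quoted strings, so that single-quoted, relative, and https URLs are picked up too. This is a sketch only; extractHrefs() is a made-up name, and a real HTML parser will cope with odd markup better than any regular expression.

```php
<?php
// Sketch: collect href values from a page with one regular expression.
// extractHrefs() is a hypothetical helper name.
function extractHrefs(string $text): array
{
    // Matches href="..." or href='...', case-insensitively.
    // Group 1 is the quote character, group 2 is the URL.
    preg_match_all('#href\s*=\s*(["\'])(.*?)\1#i', $text, $m);

    // Drop duplicates and reindex from zero.
    return array_values(array_unique($m[2]));
}
```

Unquoted attribute values (href=feed.xml) are still missed by this pattern.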





To crawl, you will have to get all the URLs on the page. Regular expressions can be used to match links to other resources. Once you've got a list of URLs, for each one, download it with PHP and check its content and/or its content type. Crawling is probably unnecessary though. Most sites publish their RSS as a LINK in the HEAD tag.
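
Checking the content type and the content might be sketched like this. Both function names are made up, the SimpleXML extension is assumed to be enabled, and treating application/atom+xml and a <feed> root element as feeds is my own addition for Atom. The header value itself could come from get_headers() or cURL.

```php
<?php
// Sketch: decide whether a downloaded resource is a feed.
// isFeedContentType() and isFeedBody() are hypothetical helper names.

// True if a Content-Type header value names one of the usual feed/XML types.
function isFeedContentType(string $contentType): bool
{
    // Drop parameters such as "; charset=utf-8" before comparing.
    $mime = strtolower(trim(explode(';', $contentType)[0]));

    return in_array($mime, [
        'text/xml',
        'application/xml',
        'application/rss+xml',
        'application/atom+xml',
    ], true);
}

// True if the body parses as XML and its root element is <rss> or <feed> (Atom).
function isFeedBody(string $body): bool
{
    if (trim($body) === '') {
        return false;
    }

    $previous = libxml_use_internal_errors(true); // keep parse errors quiet
    $xml = simplexml_load_string($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return false;
    }

    return in_array(strtolower($xml->getName()), ['rss', 'feed'], true);
}
```

The type check alone is not enough, since text/xml and application/xml are used for plenty of things that are not feeds, which is why the body check is there as well.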

