CURL Page Scraping Script

Using cURL and page scraping for specific data is one of the most important things I do when creating databases. I’m not just talking about scraping pages and reposting here, either.

You can use cURL to grab the HTML of any viewable page on the web and then, most importantly, take that data and pick out the bits you need. This is the basis for link analysis scripts, training scripts and compiling databases from sources around the web; there are almost limitless things you can do.

I’m providing a simple PHP class here, which will use cURL to grab a page and then pull out any information between user-specified tags into an array. So, for instance, in our example you can grab all of the links from any web page.

The class is quite simple. I had to get rid of the lovely indentation to make it fit nicely onto the blog, but it’s fairly well commented.

In a nutshell, it does this:

1) Goes to specified URL

2) Uses cURL to grab the HTML of the URL

3) Takes the HTML and scans for every instance of the start and end tags you provide (e.g. <a> and </a>)

4) Returns these in an array for you.

Download taggrab.class.zip

<?php

class tagSpider
{

// holds the cURL handle (used as $this->ch below)
var $ch;

// this is where we dump the html we get
var $html; 

// set for binary type transfer
var $binary; 

// this is the url we are going to do a pass on
var $url;


// automatically executed when the class is instantiated, to reset the variables
function __construct()
{
$this->html = "";
$this->binary = 0;
$this->url = "";
}



// takes url passed to it and.. can you guess?
function fetchPage($url)
{


// set the URL to scrape
$this->url = $url;

if (!empty($this->url)) {

// start cURL instance
$this->ch = curl_init ();

// this tells cUrl to return the data
curl_setopt ($this->ch, CURLOPT_RETURNTRANSFER, 1);

// set the url to download
curl_setopt ($this->ch, CURLOPT_URL, $this->url); 

// follow redirects if any
curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true); 

// legacy option: since PHP 5.1.3 it has no effect, raw data is always returned with CURLOPT_RETURNTRANSFER
curl_setopt($this->ch, CURLOPT_BINARYTRANSFER, $this->binary); 

// grabs the webpage from the internets
$this->html = curl_exec($this->ch); 

// closes the connection
curl_close ($this->ch); 
}

}


// function takes html, puts the data requested into an array
function parse_array($beg_tag, $close_tag)

{
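// match data between the specified tags (preg_quote makes the tags literal text)
$pattern = "/" . preg_quote($beg_tag, "/") . ".*" . preg_quote($close_tag, "/") . "/siU";
preg_match_all($pattern, $this->html, $matching_data);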

// return data in array
return $matching_data[0];
}


}
?>

So that is your basic class, which should be fairly easy to follow (you can ask questions in comments if needed).
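If you want to see what the matching step in parse_array() does without hitting the network, here is a rough sketch that runs the same kind of pattern against a hard-coded string. The function name grab_between and the sample HTML are made up for illustration, and the preg_quote() escaping is an assumption about how you would want special characters in the tags to be treated.

```php
<?php
// Hypothetical helper (not part of the class above): the same matching idea
// as parse_array(), applied to a string so it runs without a network call.
function grab_between($html, $beg_tag, $close_tag)
{
    // preg_quote() makes the tags literal text. Modifiers: "s" lets "." span
    // newlines, "i" ignores case, "U" makes ".*" stop at the first closing tag.
    $pattern = "/" . preg_quote($beg_tag, "/") . ".*" . preg_quote($close_tag, "/") . "/siU";
    preg_match_all($pattern, $html, $matches);

    // index 0 holds the full matches, start and end tags included
    return $matches[0];
}

// made-up sample markup with two links, one of them in upper case
$sample = '<p>Intro</p><a href="/one">One</a> text <A HREF="/two">Two</A>';
$links = grab_between($sample, "<a href=", "</a>");
print_r($links);
```

Because of the "U" modifier each match ends at the nearest closing tag, so the two links come back as two separate array entries rather than one long string.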

To use this, we need to call it from another PHP file and pass it the variables it needs.

Below is tag-example.php which demonstrates how to pass the URL, start/end tag variables to the class and pump out a set of results.

Download tag-example.zip

<?php

// Include our tag grab class
require("taggrab.class.php"); // class for spider

// Enter the URL you want to run
$urlrun="http://www.techcrunch.com/";

// Specify the start and end tags you want to grab data between
$stag="<a href=";
$etag="</a>";

// Make a tag spider
$tspider = new tagSpider();

// Pass URL to the fetch page function
$tspider->fetchPage($urlrun);

// Enter the tags into the parse array function
$linkarray = $tspider->parse_array($stag, $etag); 

echo "<h2>Links present on page: ".$urlrun."</h2><br />";
// Loop to pump out the results
foreach ($linkarray as $result) {

echo $result;

echo "<br/>";
}

?>

So this code will pass the TechCrunch website to the class, looking for any standard a href links. It will then simply echo these out. You could use this in conjunction with the SearchStatus Firefox plugin to quickly see what links TechCrunch is showing bots and which they are following and nofollowing.
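If you would rather have the href, link text and rel value of each link as separate fields, one option is to hand the HTML to PHP's bundled DOM extension instead of matching with a regular expression. This is a different technique from the class above, shown only as a sketch: the function name list_links and the sample markup are invented for the example, and it assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper: list each link's href, visible text and rel value
// using PHP's DOM extension rather than a regular expression.
function list_links($html)
{
    $doc = new DOMDocument();

    // real-world HTML is rarely valid, so keep parser warnings off the page
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = array();
    foreach ($doc->getElementsByTagName("a") as $a) {
        $links[] = array(
            "href" => $a->getAttribute("href"),
            "text" => trim($a->textContent),
            "rel"  => $a->getAttribute("rel"), // empty string if there is no rel
        );
    }
    return $links;
}

// made-up sample markup: one plain link and one nofollowed link
$sample = '<p><a href="/one">One</a> <a href="http://example.com/" rel="nofollow">Two</a></p>';
foreach (list_links($sample) as $link) {
    echo $link["href"] . " [" . $link["text"] . "] " . $link["rel"] . "\n";
}
```

In practice you would pass in the HTML fetched by the class rather than a sample string, and the rel field gives you a quick way to separate followed from nofollowed links.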

You can view a working example of the code here.

As I said, there’s so much you can do from a base like this, so have a think. I might post some proper tutorials on extracting data methodically, saving it to a database then manipulating it to get some interesting results.

Enjoy.

Edit: You’ll of course need the cURL extension for PHP installed on your server for this to work!
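If you are not sure whether cURL is available, a small wrapper can check for the extension and report failed requests instead of quietly returning nothing. The sketch below is one way to do it. The function name fetch_html and the 10 second timeout are assumptions for the example, not something the class above does.

```php
<?php
// Hypothetical helper: fetch a page, returning null when the cURL extension
// is missing, the URL is empty, or the request itself fails.
function fetch_html($url)
{
    if ($url === "" || !extension_loaded("curl")) {
        return null;
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // assumed timeout, in seconds

    $html = curl_exec($ch);
    if ($html === false) {
        // curl_error() describes what went wrong (DNS failure, timeout, ...)
        error_log("cURL error: " . curl_error($ch));
        return null;
    }

    // the handle is released when $ch goes out of scope
    return $html;
}
```

Calling fetch_html("http://www.techcrunch.com/") should then give you either the page HTML or null, which is easy to test for before you start parsing.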

21 responses to “CURL Page Scraping Script”

  • sloth says:

    I should look into cURL. For now I’m happy using file_get_contents, preg_match and preg_match_all.

    Comment by sloth
    December 18th, 2008 @ 2:46 pm

  • sloth says:

    whoa looks like I broke your comments section.

    Comment by sloth
    December 18th, 2008 @ 2:47 pm

  • sloth says:

    and it hasn’t put all the code that I wrote, damn :/

    Comment by sloth
    December 18th, 2008 @ 2:48 pm

  • Mark says:

    Haha, it’s WordPress just parsing the code as live. Well, some of it (:

    Upload the code example to Google code and link to it if you want to.

    Comment by Mark
    December 18th, 2008 @ 5:11 pm

  • sloth says:

    I tell you what. Here’s a link to the tutorial that took care of all my scraping needs. It’s really simple to do.

    http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial

    Comment by sloth
    December 18th, 2008 @ 6:03 pm

  • Shane says:

    Have you run into this before?

    CURLOPT_FOLLOWLOCATION cannot be activated when in safe_mode or an open_basedir

    Comment by Shane
    December 18th, 2008 @ 8:08 pm

  • Jez says:

    If you are into more advanced cURLing, logging into websites for example, then this is an extremely useful class… takes care of cookies etc:

    http://sourceforge.net/projects/snoopy/

    Comment by Jez
    December 23rd, 2008 @ 11:05 am

  • Jez says:

    sorry 4got to subscribe to your comments….

    Comment by Jez
    December 23rd, 2008 @ 11:05 am

  • ScanKey says:

    Yo Mark!

    Does this mean all your code is now gonna be OOP?

    lol, JK!! Love it all, and wish u n urs all the best for the new year!

    Des

    Comment by ScanKey
    December 31st, 2008 @ 2:27 am

  • Mark says:

    @Shane

    Regarding your error:

    http://bugs.typo3.org/view.php?id=4292

    Comment by Mark
    January 2nd, 2009 @ 1:00 am

  • thrifty says:

    what about curl to grab media? Viral marketing i think this may also be handy ;-) try video swiper

    Comment by thrifty
    January 25th, 2009 @ 11:17 pm

  • Helen Hunt says:

    This is a nice tip. I have recently started using PHP after years of Java.

    Nonetheless, I think PHP is much more available on most platforms than Java is, and have found it even more interesting.

    Thanks for this sample code

    Comment by Helen Hunt
    February 6th, 2009 @ 7:02 pm

  • thrifty says:

    Helen you name seems familiar?
    do you have a cousin called mike?

    if not i would be interested in meeting you ? :-) to discover your dark side ;-)

    Comment by thrifty
    February 7th, 2009 @ 3:08 am

  • Mark says:

    SEO Blog.. Dating Website… It’s a fine line apparently, thrifty?

    Comment by Mark
    February 7th, 2009 @ 4:41 am

  • thrifty says:

    mmm yes slightly red faced and i appologise to Helen profusely…sorry
    not an excuse but as you can see by the time i posted…the chances of me being completely sober on a friday night at 3.08 AM is pretty Slim to say the least.
    lesson number one dont Drink and Type.
    sorry again Helen ohh and as well mark its your blog..consider myself slapped on the wrist.

    Comment by thrifty
    February 7th, 2009 @ 5:19 pm

  • Jenni says:

    biterscripting also is good at scraping web pages and harvesting data. They have a few good samples posted over at http://www.biterscripting.com/samples_internet.html .

    Jenni

    Comment by Jenni
    July 20th, 2009 @ 3:46 pm

  • tiong says:

    CURLOPT_FOLLOWLOCATION is not necessary, as it cannot be activated when in safe_mode :)

    Comment by tiong
    March 13th, 2010 @ 3:31 am

  • Steve says:

    Awesome tutorial!
    I had to use curl on my host 1and1.

    http://www.quickscrape.com/ is what I came up with!

    Comment by Steve
    December 2nd, 2010 @ 7:56 am

  • Mark says:

    @Steve

    Sweet job. So, where’s my cut? (=

    Comment by Mark
    December 2nd, 2010 @ 4:56 pm

  • laksh says:

    thanks !!

    Comment by laksh
    March 8th, 2011 @ 10:55 am

  • Stranger says:

    Really good tips. How to use this cURL script to extract a required portion of web page, not just links, and display scraped data in own page inside div?

    Comment by Stranger
    March 15th, 2011 @ 10:47 pm