
Archive for the 'Research & Analytics' Category

CURL Page Scraping Script

Tuesday, December 16th, 2008

Using cURL and page scraping for specific data is one of the most important things I do when creating databases. I’m not just talking about scraping pages and reposting them here, either.

You can use cURL to grab the HTML of any viewable page on the web and then, most importantly, take that data and pick out the bits you need. This is the basis for link analysis scripts, training scripts, compiling databases from sources around the web; there are almost limitless things you can do.

I’m providing a simple PHP class here, which will use cURL to grab a page and then pull out any information between user-specified tags into an array. So for instance, in our example you can grab all of the links from any web page.

The class is quite simple and it’s fairly well commented.

In a nutshell, it does this:

1) Goes to specified URL

2) Uses cURL to grab the HTML of the URL

3) Takes the HTML and scans for every instance of the start and end tags you provide (e.g. <a href= and </a>)

4) Returns these in an array for you.

Download taggrab.class.zip

<?php

class tagSpider
{
    // holds the cURL handle
    var $ch;

    // this is where we dump the html we get
    var $html;

    // set for binary type transfer
    var $binary;

    // this is the url we are going to do a pass on
    var $url;

    // automatically executed on class call to clear variables
    function tagSpider()
    {
        $this->html   = "";
        $this->binary = 0;
        $this->url    = "";
    }

    // takes the url passed to it and.. can you guess?
    function fetchPage($url)
    {
        // set the URL to scrape
        $this->url = $url;

        if (isset($this->url)) {
            // start cURL instance
            $this->ch = curl_init();

            // this tells cURL to return the data rather than print it
            curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1);

            // set the url to download
            curl_setopt($this->ch, CURLOPT_URL, $this->url);

            // follow redirects if any
            curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true);

            // tell cURL whether the data is binary or not
            curl_setopt($this->ch, CURLOPT_BINARYTRANSFER, $this->binary);

            // grabs the webpage from the internets
            $this->html = curl_exec($this->ch);

            // closes the connection
            curl_close($this->ch);
        }
    }

    // scans the fetched html and returns everything found between
    // the start and end tags as an array (one element per match)
    function parse_array($beg_tag, $close_tag)
    {
        // match data between the specified tags
        // (the parentheses act as the pattern delimiters here)
        preg_match_all("($beg_tag.*$close_tag)siU", $this->html, $matching_data);

        // return the full matches
        return $matching_data[0];
    }
}
?>

So that is your basic class, which should be fairly easy to follow (you can ask questions in comments if needed).

To use this, we need to call it from another PHP file to pass the variables we need to it.

Below is tag-example.php which demonstrates how to pass the URL, start/end tag variables to the class and pump out a set of results.

Download tag-example.zip

<?php

// Include our tag grab class
require("taggrab.class.php"); // class for spider

// Enter the URL you want to run
$urlrun = "http://www.techcrunch.com/";

// Specify the start and end tags you want to grab data between
$stag = "<a href=";
$etag = "</a>";

// Make a tag spider
$tspider = new tagSpider();

// Pass URL to the fetch page function
$tspider->fetchPage($urlrun);

// Enter the tags into the parse array function
$linkarray = $tspider->parse_array($stag, $etag);

echo "<h2>Links present on page: " . $urlrun . "</h2><br />";

// Loop to pump out the results
foreach ($linkarray as $result) {
    echo $result;
    echo "<br/>";
}

?>

So this code will pass the Techcrunch website to the class, looking for any standard a href links. It will then simply echo these out. You could use this in conjunction with SearchStatus Firefox Plugin to quickly see what links Techcrunch is showing bots and what they are following and nofollowing.
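As a rough extension (my own sketch, not part of the download): parse_array() hands back whole anchor tags, so if you only want the URLs themselves you need one more regex pass over each match. The sample anchors below are made up for illustration, shaped like what the class would return:

```php
<?php
// Hypothetical anchors, shaped like what parse_array("<a href=", "</a>")
// would hand back from a fetched page.
$anchors = array(
    '<a href="http://example.com/">Example</a>',
    "<a href='http://www.techcrunch.com/' rel=\"nofollow\">TechCrunch</a>",
);

$urls = array();
foreach ($anchors as $anchor) {
    // pull out whatever sits inside the href="..." (or href='...') attribute
    if (preg_match('/href\s*=\s*["\']([^"\']+)["\']/i', $anchor, $m)) {
        $urls[] = $m[1];
    }
}

// one URL per line
foreach (array_unique($urls) as $url) {
    echo $url . "\n";
}
?>
```

Run against real output from the class, that gives you a clean, de-duplicated list of URLs you can store straight into a database.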

You can view a working example of the code here.

As I said, there’s so much you can do from a base like this, so have a think. I might post some proper tutorials on extracting data methodically, saving it to a database then manipulating it to get some interesting results.

Enjoy.

Edit: You’ll of course need the cURL library installed on your server for this to work!
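If you’re not sure whether it is, a quick sanity check (just testing for the extension’s functions on a standard PHP build) would be:

```php
<?php
// tagSpider will fatal-error on curl_init() if the cURL extension
// isn't loaded, so it's worth testing for it up front.
$curl_ok = function_exists('curl_init');

if (!$curl_ok) {
    echo "cURL is not installed/enabled on this server.\n";
} else {
    echo "cURL is available.\n";
}
?>
```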

Posted in Grey Hat, Research & Analytics, Scripting, Search Engine Optimisation | 21 Comments

Blogs Worth Reading

Monday, December 15th, 2008

I’ve never done a round-up of the blogs I read before, which I guess is a bit selfish. So, in no particular order (and this isn’t a complete list), here are some of my favourite blogs, if you’re looking for some inspiration.

Dark SEO Programming is run by Harry. As he puts it, “SEO Tools. I make ’em”. A great guy if you need help with coding and somewhat of a captcha guru, with a sense of humour. Definitely worth keeping up with. I wouldn’t be surprised if this guy starts making big Google waves in the next few years.

Ask Apache is a blog I absolutely love. Great, detailed tutorials on script optimisation, advanced SEO and mod_rewrite. AskApache’s blog posts are the kind of ones that live in your bookmarks, rather than your RSS Reader.

Andrew Girdwood is a great chap from BigMouthMedia I met last year (although I very much doubt he remembers that). Andrew seems to be a vigilante web bug hunter. What I like about his blog is that he is usually the first to find weird things with Google that are going down. This usually gets my brain rolling in the right direction of my next nefarious plan. ^_^

Blackhat SEO Blog run by busin3ss is always worth checking out. He was even kind enough to give me a pre-release copy of YACG mass installer to review (it’s coming soon – I’m still playing!). Apart from his excellent tools, his blog features the darker side of link building, which of course, interests me greatly.

Kooshy is a blog run by a guy I know, who.. Well I think he wants to remain anonymous (at least a little). He’s just got started again after closing down his last blog and moving Internet personas (doesn’t the mystery just rivet you?). Anyway, get in early, I think we can expect some good stuff from here. He’s already done a cool post on Pimpin’ Duplicate Content For Links.

Jon Waraas is run by.. Can you guess? Jon has something that a lot of even really smart Internet entrepreneurs are missing: good old fashioned elbow grease. This guy is a workaholic and it pays off in a big way. Apart from time-saving posts on loads of different ways to monetise your site, build backlinks and flush out your competitors, I get quite a lot of inspiration from his constant stream of effort and ideas. I could definitely take a leaf out of his work ethic book.

Blue Hat SEO is becoming one of the usual suspects really. If you’re here, you probably already know about Eli. Being part of my “let’s only do a post every few months club”, I love Eli’s blog because there is absolutely no fluff. He gets straight down to the business of overthrowing Wikipedia, exploiting social media and answering specific SEO questions. You’ll struggle to find higher quality out there.

SEO Book is probably the most “famous” blog I’m going to mention here. Aaron started off at a disadvantage because, to be honest, I thought he was a massive waste of space for quite a while. (I guess that’s what happens when you spend your SEO youth on Sitepoint listening to the people with xx,xxx posts on there). I bought his SEO Book and for me, at least, it was way too fluffy. I’m pleased he’s started an SEO training service now as it represents much better value. I’m sure he was making a lot of money from his SEO Book, but perhaps milked it too long (like I probably would have). Anyway, I kept with his blog and I’ve been impressed with his attitude and posts. He’s done some really cool stuff, like the SEO Mindmap and more recently a keyword strategy flowchart, which would be useful for those looking for a more structured search approach. He’s also written about algorithm weightings for different types of keywords and of course has some useful SEO Tools.

Slightly Shady SEO – Great name, great blog. Although XMCP will probably take it as an insult, I’ve always regarded Slightly Shady as the blog most similar to mine on this list. Maybe it’s because I wish I’d written some of the posts he has, before he did, hehe. Again, a no BS approach to effective SEO, whether he’s writing about Google’s User Data Empire, hiding from it or site automation it’s all gravy.

The Google Cache is a great blog for analytical approaches to SEO. There are some awesome posts on Advanced Whitehat SEO and using proxies with search position trackers. I like.

SEOcracy is run by a lovely database overlord called Rob. Rob’s a cool guy, he was kind enough to donate some databases to include in the Digerati Blackbox a while back. Most of his databases are stashed away in his content club now, which is well worth a look in. He’s also done some enlightening posts on keyword research, stuffing website inputs and Google Hacking.

This is all I’ve got time for now, apologies if I’ve missed you. There may be a Part II in the near future.

Posted in Affiliate Marketing, Approved Services, Black Hat, Blogging, Digerati News, Google, Grey Hat, Marketing Insights, Research & Analytics, Search Engine Optimisation, Social Marketing, Splogs, Viral Marketing, White Hat, Yahoo | 7 Comments

Understanding Optimum Link Growth

Friday, December 12th, 2008

Good evening all and Merry Christmas to all those who celebrate this time of year (you Pagans, you!). Rather than sit around the fire talking about yesteryear and smashing whiskey glasses into the fire, I’d like to talk to you about the far more interesting subject of link growth.

Link Growth on The Intertubes
For the context of this conversation (and by that I mean one-way lecture), I am assuming that everyone defines link growth as the rate at which a domain as a whole, and specific pages on it, gain new backlinks. More importantly, how quickly search engines discover and “count” these backlinks.

I’ve blogged about link velocity before and generally summarised that it was, of course, a factor in how well your website ranks. However, as with most SEO topics, the devil is in the detail and there are a lot of myths about the detail. So I would like to discuss:

1) What signals do “good” links and “notsogood” links give to your website?

2) How does domain age and your current backlink count play a part in determining your “optimal” link velocity?

3) Can you be harmed by incoming links?

These are what I believe to be some of the most important (though definitely not all) factors contributing to link growth / velocity. As I want to have this blog post finished by Christmas, I’m going to try and stick to these three core points, although I’m sure I’ll end up running off at a tangent like I usually do. If, however, you think I’ve missed something critical, drop me a comment and I’ll see if I can do a follow-up.

The difference between trust & popularity
When talking about links, it’s important to realise that there is a world of difference between a signal of trust and a signal of popularity. They are not mutually exclusive and to rank competitively, you’ll need signals of both trust and popularity, but for now realising they are different is enough.

For instance: Michael Jackson is still (apparently) very popular, but you wouldn’t trust him to babysit your kids now, would you? The guy down the road in your new neighbourhood might be the most popular guy in your street, but you’re not going to trust him until someone you know well gives him the thumbs up.

So for your site to rank well, Google needs to be able to have a degree of trust (e.g. source of incoming links, domain age, site footprints) to ensure you’re not just another piece of two-bit webscum, and it needs to know your content is popular (i.e. good content, link velocity, types of links). As I’ve already said, I’m not going to get into a drawn-out debate about content here; I’m just looking at links.

What comes first, trust or popularity?
It doesn’t really make much logical sense that you’d launch a website with no fanfare and then get a stream of hundreds of low quality links every week.

This kind of sits well with the original plan of the PageRank algorithm, which let’s not forget is actually (originally) trying to calculate the chance that a random surfer clicking around the web will bump into your site. This notion of a random surfer, clicking random links gave Google an excellent abstract to work out the whole “page authority” that the lion’s share of their algorithm sprang from.

Nowadays, you’ll hear lots of people trumpeting about going after quality (i.e. high PR) links rather than lots of “low quality” (low PR) links while trying to remain relevant. From the algorithm-origins point of view, the higher PR pages simply have more of these virtual random surfers landing on them; so there’s more chance of a random surfer clicking your link.

Looking back at “time zero”, when PageRank started to propagate around the web, apart from internal PR stacking all sites were equal, so PageRank was actually collected by raw numbers of links rather than this “quality” (high PR) angle, which is really just a cumulative effect of the PageRank algorithm (at least in its original form).
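To make the random-surfer idea concrete, here’s a toy power-iteration sketch of the published PageRank formula, PR(p) = (1-d)/N + d × Σ PR(q)/outlinks(q) over the pages q linking to p. The four-page link graph and the 0.85 damping factor below are just illustrative assumptions, not anything specific to Google’s live system:

```php
<?php
// Hypothetical 4-page web: each page maps to the pages it links out to.
$links = array(
    'A' => array('B', 'C'),
    'B' => array('C'),
    'C' => array('A'),
    'D' => array('C'),
);
$d = 0.85; // damping factor from the original paper

$pages = array_keys($links);
$n = count($pages);

// start every page with an equal share of "random surfers"
$pr = array_fill_keys($pages, 1 / $n);

// repeatedly redistribute PageRank along the links until it settles
for ($iter = 0; $iter < 50; $iter++) {
    $next = array_fill_keys($pages, (1 - $d) / $n);
    foreach ($links as $page => $outlinks) {
        $share = $pr[$page] / count($outlinks);
        foreach ($outlinks as $target) {
            $next[$target] += $d * $share;
        }
    }
    $pr = $next;
}

arsort($pr);
print_r($pr); // C, with the most inbound links, collects the most surfers
```

Notice that page D, which nobody links to, bottoms out at the (1-d)/N floor no matter how many pages it links out to; raw inbound link counts are everything at “time zero”, exactly as described above.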

Hopefully you’re still with me and not bored of going over fundamentals, but without this level of understanding you’ll have a job getting your head around the more advanced concepts of link growth. Keep in mind here, I’m talking about pure PageRank in its original form (I’m sure it’s been updated since it was published), not ranking factors as a whole. To be honest, when I’m ranking websites (which I’m pretty good at), PageRank normally plays a very, very small role in my decision making; it is, however, useful as an abstract concept when planning linking strategies.

The point I’ve been alluding to here is: for Google to buy into the fact that yes, your site is getting lots of natural “run of the mill” links, you will first need links from higher PageRank pages (or authoritative pages, which are slightly different – bear with me). This line of thinking is of course assuming you don’t use a product like Google Analytics – (“Googlebot: Hmm, 58 visitors per month and 1,200 new incoming links per month, makes perfect sense!”).

Google is also pretty good at identifying “types” of websites and marrying this up to trust relationships. So for instance, I think most people would like a link from the homepage of the BBC News website; it’s a whopping PR9 and has bucket loads of trust. Here’s a question though: Is it a “relevant” link? The BBC News website covers a massive variety of topics, as most news sites do, so what is relevant and what is not pretty much depends on the story. Does a link from the BBC News site mean your site is “popular”? No (although it might make it so). Here’s a good question to ask yourself: of these two scenarios, which is more believable?

1) Brand new site launched :: Couple of links from small blogs :: Gets 2,000 links in first month

2) Brand new site launched :: 1 link from BBC News Homepage :: Gets 2,000 links in first month

Of course, you’ve hopefully identified situation 2 as the far more likely candidate. Let’s consider what Google “knows” about the BBC website:

Googlebot says:

1) I know it’s a news website (varied topics)

2) I know millions of other sites link to it (it’s incredibly popular)

3) Lots of people reference deep pages (the content is of great quality)

4) I see new content hourly as well as all the syndicated content I’m tracking (Fresh – as a news site should be)

5) It’s been around for years and never tried to trick me (another indicator of trust)

6) If they link to somebody, they are likely to send them lots of traffic (PR)

7) If they link to somebody, I can be pretty sure I can trust the person they link to

Despite its critics, I’m a big believer in (at least some kind of) TrustRank system. It makes perfect sense, and if you haven’t read the PDF, it’s very much worth doing so. In a hat tip to the critics: it is incredibly hard to prove, because with the dynamic nature of the web it is almost impossible to separate the effects of PageRank, relevance, timing, content and a myriad of other glossary terms you could throw at any argument. However, without leaps of faith no progress would be made, as we’re all building on theory here.

Side note: While I’m talking about experimentation and proof, I’m still chipping away at my SEO Ranking Factors project (albeit slower than I’d like) and I’ll be willing to share some scripts for “tracking TrustRank” in the new year – dead useful stuff.

Okay, the point I’m making here is that these high trust/authority (whatever you want to call them) sites are a stepping stone to greater things. I would agree with the whitehat doctrine that yes (if it’s your own domain at least), you will require links from these sources if you are to rank well in the future. We’ll look at some examples of how to rank without those links later (:

Trust needs to come before mass popularity, and there are other things you may want to consider apart from just scanning websites and looking for as much green bar as possible. There are other mechanisms which, while I don’t believe Google is using them to the full extent it should (even when they play around with that goddamn WikiSearch – mustn’t get started on that), are still worth watching.

So looking from a Wikinomics aspect, they are less trustworthy but being on the front page of Digg, being popular in Stumble, having lots of delicious bookmarks could all be signals of trust as well as popularity (although at the moment at least, they are easier to game). I would expect, before Google can use these types of signals as strong factors of search, there will need to be more accountability (i.e. mass information empire) for user accounts. This is perhaps one of the things that could make WikiSearch work, being linked to your Google Account, Google can see if you use Gmail, search, docs, video, blogger, analytics, the list goes on – it’s going to be much harder to create “fake” accounts to boost your popularity.

Domain age and link profiles
Domain age definitely has its foot in the door in terms of ranking, however having an old domain doesn’t give you a laminated backstage pass to Google rankings. The most sense you’re going to get out of looking at domain age comes with overlaying it with a link growth profile, which is essentially the time aspect of your link building operation.

Your natural link growth should have an obvious logical curve when averaged out, probably something like this:

This roughly shows that during natural (normalised) organic growth, the number of links you gain per day/week/month will increase (your link velocity goes up). This is an effect of natural link growth, discovery and more visitors to your site. Even if you excuse my horrific graph-drawing skills, the graph is pretty simplified.

How does this fit into link growth then?
I’ll be bold and make a couple of statements:

1) When you have established trust, even the crappiest of crap links will help you rank (proof to come)

2) The more trustage (that’s my new term for trust over time (age)) the greater “buffer” you have for building links quickly

Which also brings us to two conclusions:

3) Straying outside of this “buffer zone” (i.e. 15,000 low quality new links in week 1) can see you penalised.

4) If you’ve got great trust you can really improve your rankings just by hammering any crap links you like at the site.

So, going along with my crap-o-matic graphs:

As I’ve crudely tried to demonstrate in graphical form, your “buffer zone” for links increases almost on a log scale, along with your natural links. Once you’ve established a nice domain authority, it’s pretty much free game with links, within reason.

I s’pose you’re going to want some proof for all these wild claims, aren’t you?

Can incoming links harm your website?
The logical answer to this would be “no”. Why would Google have a system in place that penalises you for bad incoming links? If Google did this, they would actually make their job of ranking decent pages much harder, with SEOs focusing on damaging the competition rather than working on their own sites. It would be a nightmare, with a whole sub-economy of competitor disruption springing up.

That’s the logical answer. Unfortunately, the correct answer is yes. I’ll say it again for the scan readers:

It is possible to damage the rankings of other websites with incoming links

Quote me if you like.

Now by “bad links” I don’t mean the local blackhat viagra site linking to you, that will most likely have absolutely no effect whatsoever. Those kind of sites which Google class “bad neighbourhood” can’t spread their filth by just linking to you, let’s be clear on that. You’re more at risk if someone tricks you into linking to a bad site with some kind of Jedi mind trick.

There are two ways I’ve seen websites’ rankings damaged by incoming links:

1) Hopefully this one is obvious. I experienced this myself after registering a new domain and putting a site up 2 days later – which ranked great for the first couple of weeks. Then, well.. I “accidentally” built 15,000 links to it in a single day. Whoops. I never saw that site in the top 100 again.

2) There is a reliable method of knocking pages out of the index, which I’ve done (only once) and seen others do many, many times. You’re not using “bad” links as such – by this I mean not from dodgy/blackhat or banned sites; they are links from normal sites. Say, for instance, you find a sub-page of a website ranking for a term like “elvis t-shirts” (a random term – I don’t even know what the SERPs are for it) with 500 incoming links to that page. If you get some nice scripts and programs (I won’t open Pandora’s Box here – if you know what I’m talking about then great) and drop 50,000 links over a 2 week period with the anchor text “buy viagra”, you’ll find quite magically that you have totally screwed Google’s relevancy for that page.

I’ve seen pages absolutely destroyed by this technique, going from 1st page to not ranking in the top 500 – inside of a week. Pretty powerful stuff. You’ll struggle with root domains (homepages) but sub-pages can drop like flies without too much problem. Obviously, the younger the site the easier this technique is to achieve.

You said you could just rank with shoddy links?
Absolutely true. Once you’ve got domain authority, it’s pretty easy to rank with any type of link you can get your hands on, which means blackhat scripts and programs come in very useful. To see this in effect, all you have to do is keep your eye on the blackhat SERPs. “Buy Viagra” is always a good search term to see what the BHs are up to. It is pretty common to see Bebo pages, Don’t Stay In pages – or the myriad of other authoritative domains with User Generated Content – rank in the top 10 for “Buy Viagra”. If you check out the backlink profiles of these pages you will see, surprise, surprise, they are utter crap, low quality links.

The domains already have trust and authority – all they need is popularity to rank.

Trust & Popularity are two totally different signals.

Which does your site need?

We have learnt:

1) You can damage sites with incoming links

2) Trust & Authority are two totally different things – Don’t just clump it all in as “PageRank”

3) You can rank pages on authority domains with pure crap spam links (:

Good night.

Posted in Google, Research & Analytics, Search Engine Optimisation | 18 Comments

SEO For The Uneducated

Sunday, October 5th, 2008

I had written a post analysing this.

I’ve deleted it.

It only dampens the impact.

Google, please institute a “did you mean” correction for this search. You have the chance to educate so many people.

Hats off to all those companies producing products for the mass “Wayne & Waynetta” (see “the great unwashed”) market. No wonder you’re making so much money.

Go on. Digg It.

Posted in Advertising, Google, Research & Analytics | 13 Comments

SEO Tools & The Future

Monday, August 11th, 2008

I’m aware this is the longest I’ve been without posting, but, meh. What ya gonna do?

I’ve had a post in drafts for a while, which isn’t quite finished. I’ll most likely be polishing and posting it tomorrow. Some real simple blackhat stuff, how to set up a few automated blogs and socially promote content to get some easy pocket money.

To my dear Elite SEO Tools subscribers. Yes, the service has been canned for now. As those of you “on the inside” saw, I was giving membership subscriptions free for the last few months as we were having problems with Social Storm. Unfortunately, several systems such as Propeller, OnlyWire and a couple of other APIs have significantly changed since launch. Nothing that couldn’t be met with a concerted effort – but something I really don’t have time to continue supporting at the moment.

I was *incredibly* pleased with the feedback from the SEO Tools – with several members signing up for multiple accounts as they were so happy with the results! The most popular tool (as I suspected) was Link Buster: which was a set of methods to automatically generate links. I will most likely look at developing a more extensive version of this in 2008, which will focus on links alone – as it seems to be what most people are having problems with.

As for the social media side – AutoStumble is flying! If you check the site out, I’ve added (at the top – squint) a live counter for the number of stumbles swapped on the network. We are fast approaching 200,000 votes swapped – which is awesome! Like the SEO Tools, I’ve had excellent feedback, both direct and via forum chatter. Of course, there’s a few moaners (or as Aaron Wall calls them – Wankers), which does happen when you are at the “lower” price point with software – but hey, generally looking great…

With such great support and feedback, I am looking to develop AutoStumble to allow automatic account creation on StumbleUpon (yes, that means beating reCAPTCHA) and posting against multiple URLs…

Anyway, that’s the news – watch out for the post tomorrow.

Posted in Black Hat, Blogging, Digerati News, Research & Analytics, Splogs | 5 Comments