How I Make Sense of Google's Complex Algorithm
In the past, we all knew that search engines were just a bit more complicated than they appeared, but we created mental models that did a pretty good job explaining the SERPs and we let it go at that... for a long long time. I like to think of those old models as the "punch list" approach - here's all the factors we think Google measures and combines into their recipe - let's make sure we hit each one.
Then, slowly but surely, something shifted. Keywords plus backlnks could no longer explain the rankings we started to notice. What on earth is going on? Here's what I've been able to put together.
BIG DATA
WE all know Google loves data. I'd guess that they collect at least ten times the number of signals, compared to what they actively use in the algorithm at any time. And they never delete any of it ;) When Panda first crawled out of development, we started hearing a lot more about machine learning - but Google has preferred the machine learning approach from the beginning - and they let their machines free check out the BIG DATA pile just to see what correlates and what doesn't. There's a reason so many of their PhD hires are statisticians.
STATISTICIANS
Today more 200 signals are actively used - and I'm betting it's FAR more. They know when any particular signal (say backlink anchor text) is natural or at least along the same lines as the rest of that market - and when it's been seriously manipulated. Lots of backlinks should correlate with some other mentions here and there. If it's too low (or maybe too high) then it might get devalued or even tossed out.
Read some of the Spam Detection patents - especially the one about Phrase Based Indexing. This statistics thing is really big.
TAXONOMIES - AUTOMATED!
Google has been automating taxonomy generation for a long time. Query terms are assigned taxonomies, websites are assigned taxonomies. When the statisticians play with their big data, I'm pretty sure that they look at statistical relevance with a given taxonomy - let's say within a market place. Clearly signals are used differently for a crafts website than for gambling, for example.
So when two URLs seem to have "the same" signals but one far outranks the other - it's more likely to be the way that signals correlate and interact - as well as signals you're not used to thinking about.
CORRELATING SIGNALS
Historical signals are a big one. Remember that scary big patent full of possibilities? They've definitely been collecting and testing all those kind of data.
How about User Engagement signals of many kinds? All the search engines have been looking at that kind of data because it's so danged hard to fake. At the same time, when Matt Cutts says that bounce rate is "too noisy" a signal foir them to use - he's not just flapping his gums. He knows, mathematically, exactly how useful or not these signals are in generating good quality rankings.
And there are many thousands of correlations to be measured an watch - thousands I tell you.