Latent Dirichlet Allocation (LDA) Correlations Clarified

When SEOMoz announced a relationship between LDA cosine values and Google search rankings, I immediately had reservations about how many people in the community were reading the results. Admittedly, Rand and Ben have been careful to take some of these observations with a grain of salt, making it clear that LDA by no means represents the majority of Google’s ranking algorithm.

That being said, I took special interest because, like many other SEOs who work in competitive spaces, I have long regarded on-page factors as valuable only for long-tail searches. My first and primary concern was that, because SEOMoz’s team was looking at a large keyword set without regard to competitiveness, their data would be skewed by non-competitive spaces. I covered this in greater depth in my previous post, but I will go over it briefly here.

Let’s say you have 100 applicants for 10 new developer positions and you consider a college degree the most important factor. Unfortunately, no applicants have college degrees. You can’t use college degrees to determine who gets the job. Instead, you may have to depend on who has the most experience. If I then run a study on who you chose to hire, my numbers would say that college degrees barely matter at all and that experience is clearly more important. In reality, if 10 of the applicants had held degrees, you would have hired all of them.

Similarly, if we analyze a long-tail search where none of the relevant pages have any backlinks, it will appear that links don’t matter and that on-page factors, like LDA, are highly correlated with rankings. In reality, Google is forced to rely on relevance in these cases. Later, when we compare this factor with others, like inbound links, it will appear stronger not because it actually is, but because in a certain percentage of cases there were no link measurements to consider at all. My analysis appears to bear this out.
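To make the argument concrete, here is a minimal sketch of a toy ranking model. It is purely an assumption for illustration and not a description of Google's algorithm: pages are ordered by inbound links first, and relevance only breaks ties. The page names, link counts, and relevance scores are all made up.

```python
# Toy model (an assumption for illustration, not Google's actual algorithm):
# pages are ordered by inbound links first, with relevance as a tiebreaker.

def rank_pages(pages):
    """Return page names ordered best-first by (links, relevance)."""
    ordered = sorted(
        pages,
        key=lambda page: (page["links"], page["relevance"]),
        reverse=True,
    )
    return [page["name"] for page in ordered]

# Long-tail query: no page has any links, so relevance alone decides the order.
long_tail = [
    {"name": "a", "links": 0, "relevance": 0.4},
    {"name": "b", "links": 0, "relevance": 0.9},
    {"name": "c", "links": 0, "relevance": 0.7},
]

# Competitive query: link counts differ widely, so relevance never gets a say.
competitive = [
    {"name": "x", "links": 50, "relevance": 0.9},
    {"name": "y", "links": 900, "relevance": 0.5},
    {"name": "z", "links": 300, "relevance": 0.7},
]

print(rank_pages(long_tail))    # order follows relevance
print(rank_pages(competitive))  # order follows links
```

Even though the same rule ranks both queries, a study of the long-tail results alone would conclude that relevance explains everything, which is the same trap as the hiring example.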

Although I do not pretend to have the scientific or mathematical background to back this up, I am handy with PHP and Excel.

I pushed 100 competitive keywords and 100 long-tail keywords through Google to get their top 10 rankings. I then pushed their content through SEOMoz’s LDA tool. Then, with some minor data scrubbing, I averaged out the LDA scores for each position 1 through 10 and aggregated them based on whether the term was or was not competitive.
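The averaging step can be sketched roughly as follows. This is a Python illustration rather than the PHP and Excel actually used, and the function name, the tuple layout, and every score in it are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict

def average_by_position(rows):
    """Average scores per ranking position within each keyword group.

    rows: iterable of (group, position, lda_score) tuples.
    Returns {group: {position: mean score}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for group, position, score in rows:
        sums[group][position] += score
        counts[group][position] += 1
    return {
        group: {
            pos: sums[group][pos] / counts[group][pos]
            for pos in sorted(sums[group])
        }
        for group in sums
    }

# Made-up scores for illustration only; not data from the study.
rows = [
    ("long-tail", 1, 0.80), ("long-tail", 1, 0.60),
    ("long-tail", 2, 0.50), ("long-tail", 2, 0.30),
    ("competitive", 1, 0.50), ("competitive", 1, 0.70),
    ("competitive", 2, 0.65), ("competitive", 2, 0.55),
]

averages = average_by_position(rows)
for group, by_position in averages.items():
    print(group, by_position)
```

Keeping the two groups in separate buckets is the whole point: the per-position averages can then be compared side by side instead of being blended into one curve.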

The end result? In non-competitive, long-tail keywords, there is a very strong relationship between LDA and rankings. In competitive, short-tail keywords, there is little relationship between LDA and rankings. Most importantly, when you aggregate the data, the slope from the long-tail terms overwhelms the lack of a trend in the competitive terms…
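The pooling effect can be illustrated with a small synthetic example. The numbers below are invented to show the mechanism and are not the study's results: one group has scores that fall steadily with position, the other has scores with no trend at all, and a hand-written Pearson correlation is computed for each group and for the combined data. Because position 1 is the best ranking, a negative correlation means higher scores go with better rankings.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

positions = list(range(1, 11))

# Synthetic long-tail scores: fall steadily as the position gets worse.
long_tail_scores = [0.90 - 0.05 * (p - 1) for p in positions]

# Synthetic competitive scores: bounce around with no trend by position.
competitive_scores = [0.60, 0.70, 0.60, 0.70, 0.65, 0.65, 0.70, 0.60, 0.70, 0.60]

r_long_tail = pearson(positions, long_tail_scores)
r_competitive = pearson(positions, competitive_scores)
r_pooled = pearson(positions + positions, long_tail_scores + competitive_scores)

print("long-tail:  ", round(r_long_tail, 2))
print("competitive:", round(r_competitive, 2))
print("pooled:     ", round(r_pooled, 2))
```

In this toy setup the pooled figure still shows a sizeable correlation even though half of the data has none, which is why the two groups need to be reported separately.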

What does this mean?

  1. Does this mean LDA doesn’t matter?: Were you listening? Relevance is still a key factor; it is just not nearly as important once popularity measurements can be considered. Don’t spend all your time trying to tweak your LDA score. Once your site is suitably relevant, focus on external factors.
  2. Does this mean SEOMoz was wrong?: Actually, in my opinion, this data indicates that SEOMoz has discovered the primary tool by which Google determines page relevance. It is hugely important.
  3. What should be done about it?: Hopefully, the good folks at SEOMoz will run some tests to help us figure out how much it matters in more competitive spaces. My quick Excel-Fu (admittedly with the assistance of Jeff Staub, our COO) hardly meets scientific standards.

4 Comments

  1. Michael Martinez
    Sep 10, 2010

    Once again SEOmoz has unleashed a load of crap on the SEO world and, much to my disappointment Russ, you’re supporting it.

    There is no science to their “correlation studies”, which don’t produce any scientifically acceptable correlations. They don’t even come close.

    Author Response: See, now you sound silly. You can have a scientific study that brings out non-significant results. However, SEOMoz gladly publishes statistical significance data along with their results. It is science. It might not reveal a 100% answer, but science rarely does.

    You’re concerned about uncompetitive queries affecting the data sets? Dude, anyone who wants to attempt algorithm analysis MUST use uncompetitive queries (because all the competitive queries are completely distorted).

    Author Note: I did not say we shouldn’t look at uncompetitive queries, I said we need to look at them independently. I stand by this claim and I think I have presented the logic and some statistical evidence to back it.

    Furthermore, Latent Dirichlet Allocation does not conform with Google’s primary goal of serving the most relevant result to a query. LDA is an extension of the classic “bag of words” concept that ignores word order and proximity — both of which are important to Google’s relevance algorithms (according to Googlers and Google documents).

    Author Note: Come on Michael, you know better than that. Google’s goals do not necessarily coincide with their actual practices, especially since scalability is so important. We know from Google’s release of n-gram data years ago that the “bag of words” kind of approach is both statistically acceptable and in use at Google. Until now, we didn’t have any data to back up that knowledge. Now we do.

    LDA provides a conceptual map of what an object may be related to. A lot of research has focused on how LDA may be used in image analysis and vision technologies.

    The good folks at SEOmoz need to stop running tests because, frankly, they run lousy tests that have no scientific credibility.

    They make outrageous marketing claims and sell subscriptions to “pro services” and seats at seminars. They’re not providing the SEO community with any insight into or advantage over search algorithms.

    The Google algorithm in particular changes so quickly that even if the SEOmoz tests were credible they cannot publish data fast enough to provide any useful strategies in influencing Google’s search results.

    You need to stop drinking the kool-aid, my friend, and reinvest in a healthy skeptical approach to any claims about “SEO tests”.

  2. Sean Weigold Ferguson
    Dec 16, 2010

    I’m interested in seeing the correlation coefficients for non-averaged values (using every data point). I’m guessing that they drop below 0.3.

    Author Response: That very well might be the case, and it is something we are going to be looking into in the near future.

  3. Glenn Friesen
    Jun 16, 2011

    I compete in several competitive SERPs. For one term, my result was stuck at position 7, no matter what I tried. I applied LDA analysis to the page, and filled out the content using the words that helped raise the LDA score of the page. Though it was a bit too verbose for my personal opinion of what qualifies as “great content”, the text looked good and read well. I expected the result to move to position 5. I waited one day, and now the position of the ranking is 5, and not 7, just as expected.

    After a period to determine if the result is sustainable, I plan to revert the page back to its earlier version to hopefully observe the result return to position 7. Today, I plan to improve another competitive page’s LDA score (to hopefully observe the ranking for that result move up from position 4 to position 2).

    I suspect that if LDA isn’t the model that Google’s using, it’s at least well-correlated to it. The model could very well be a Hidden Markov Model, or phrase based IR or something else — or a combination of things. Regardless, employing the LDA tool seems to actually affect the rankings — something I’ve yet to observe out of any advanced on-page SEO development tactic.

    Thanks for the excellent research regarding the importance of competitiveness in the context of whether LDA matters. Mad props!

  4. Leslie Burkhalter
    Feb 14, 2012

    Thanks for the info. I have done exactly as you say with the links, so let’s hope they work.

Trackbacks/Pingbacks

  1. Discussing LDA and SEO – Whiteboard Friday - [...] I noticed that Russ Jones did work to reproduce our findings.  He used a different dataset and different methodology,…
  2. Latent Dirichlet Allocation Optimization - [...] Latent Dirichlet Allocation (LDA) Correlations Clarified [...]
  3. LDA correlation 0.17 not 0.32 - Blog WorthEstimator.com Article - [...] of the bug and what evidence there was of it: I was looking into the discrepancy between Russ Jones’s…
  4. Content Optimization: Revisiting Topic Modeling, LDA & Our Labs Tool | iSEO Blog - [...] so more polished results may still be several months away.We’ve been excited to have others analyze our work (which…
  5. Content Optimization: Revisiting Topic Modeling, LDA & Our Labs Tool | Traffic Building Tips - Free SEO, Web Traffic And Link Building Tips - [...] been excited to have others analyze our work (which is how we discovered our initial error on the correlation…
  6. New Tool: LDA Content Optimizer - [...] the dust has settled and we can start talking about what does this mean for me as a webmaster.…
  7. SEOMoz was Right on Relevancy – The Birth of nTopic, the LDA Google Search Ranking Factor | The Google Cache: Search Engine Marketing, SEO & PPC - [...] announced their findings, in fact I immediately used their free tool (no longer available) to compare long-tail to short…
  8. Biomagnification and Back Link Penalties | The Google Cache: Search Engine Marketing, SEO & PPC - [...] industry, come from other academic pursuits. While these are regularly computer sciences (like latent dirichlet allocation) or mathematics (like…
