Duplicate Content Round-up: Diagnosis and Correction with Free Tools
Here is the Duplicate Content Tool if that is all you are looking for…
“Duplicate content” has become a standard part of the SEO lexicon over the last year or so (2005-2006), and over that time a handful of common causes have been identified – the most common of which is poor URL handling. One cause, legitimately having duplicate or very similar pages throughout your site, is not a URL-handling problem at all, and for that type of issue I would recommend SEOJunkie’s tool.
To note, there is some skepticism (really, skepticism in the SEO world?) as to the actual effect of duplicate content on a site’s ranking. I believe the largest impact occurs through PR dispersion: inbound and intra-site links point to various versions of the same page, so the alternate versions split the link power rather than one single copy enjoying all of it.
On the other hand, the most common causes are these URL issues which make it appear as if multiple copies of the same page exist on the site. I have listed these below along with solutions…
(1) WWW vs Non-WWW
What it is:
This is by far the most common and most talked about problem of poor URL handling. To be honest, most of us consider this more a problem with Google than with webmasters, but there really is no great way around it (although Google has added a method through Google Sitemaps to help address the issue). Essentially, Google treats the www and non-www forms of your site as two separate sites. When this occurs, Google spiders and caches both versions of your site, causing exact duplicate copies of your pages to be included in the search engine. It also causes PageRank dispersion, where the value of the links pointing to your site is divided between the alternate domains (www and non-www).
Diagnosis:
There are a couple methods of diagnosis:
- By Hand: Visit the non-www and www versions of your domain. If both resolve without one redirecting to the other, you are susceptible.
- By Header: Check the HTTP status returned when one form is requested. This is important, as many www to non-www redirects are performed with 302 redirects, which are not as effective as a 301. A quick header check is sketched just below this list.
- By Cache: If Google has cached different numbers of pages for the non-www and www forms of your domain in a site: command, there is a good chance you are already suffering this penalty.
- By PR Dispersion: If the www and non-www forms of your domain have different PRs, you are most likely suffering from this penalty.
Use The Duplicate Content Tool
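If you prefer to script the header check rather than eyeball it, here is a minimal PHP sketch using get_headers() (this assumes PHP 5 with allow_url_fopen enabled; example.com is just a placeholder for your own domain):
<?php
// Compare the status line returned by the non-www and www forms of the domain.
$urls = array('http://example.com/', 'http://www.example.com/');
foreach ($urls as $url) {
    $headers = get_headers($url);   // element 0 is the status line of the first response
    echo $url . ' => ' . $headers[0] . "\n";
}
// Two "200 OK" lines mean neither form redirects to the other, so you are
// susceptible; a healthy setup shows one form answering with a 301.
?>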
How to fix it:
While the .htaccess method is, IMHO, by far the best, some folks simply cannot create a custom .htaccess file, either because they lack the permissions or because they are not running Apache at all. The alternative is to drop a snippet of code into all of your dynamic pages (PHP, ASP, ColdFusion, or other) that checks HTTP_HOST for www and 301-redirects to the correct domain/page if the www is missing. I have included that code as well.
.htaccess method
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_HOST} ^thegooglecache\.com [NC]
RewriteRule ^(.*)$ http://www.thegooglecache.com/$1 [L,R=301]
PHP Method
<?php
// 301-redirect to the www form of the domain if the www is missing
$host = $_SERVER['HTTP_HOST'];
$file = $_SERVER['REQUEST_URI'];
if(!stristr($host,'www.')) {
    $url = "http://www.$host".$file;
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: $url");
    exit();
}
?>
ASP Method
Thanks to Christin Phelps for this one!
<%
' 301-redirect to the www form of the domain, preserving the requested path
location = Request.ServerVariables("SERVER_NAME")
findwww = Left(location,3)
if (findwww <> "www") then
    newlocation = "http://www." & location & Request.ServerVariables("URL")
    response.Status = "301 Moved Permanently"
    response.AddHeader "Location", newlocation
    response.End
end if
%>
ColdFusion Method
Thanks to Ray Camden and the folks in the #coldfusion channel on DalNet IRC for all their help!
<cfif cgi.http_host is "my-site.com">
<cfheader statuscode="301" statustext="Moved permanently">
<cfheader name="Location" value = "http://www.my-site.com/#cgi.script_name#?#cgi.query_string#">
</cfif>
(2) Default Pages
1. What It Is:
This common method of poor URL handling occurs when a directory can be referred to both by its folder (http://www.openoffice.org) and by its default page (http://www.openoffice.org/index.html). In these cases, both the folder version and the index version can be indexed by Google, possibly causing PR dispersion and a duplicate content penalty.
2. How to Diagnose It:
- By Hand: Just visit your site and check whether the home page loads both with and without the default page in the URL.
- By Header: Check whether a 200 OK is returned for both the folder version and the default-page version (rather than one 301-redirecting to the other). A quick check is sketched just below this list.
Use the Tool
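The same header check from the previous section can be adapted here; a quick PHP sketch (again assuming PHP 5 with allow_url_fopen, and using example.com and index.html as placeholders for your own domain and default page):
<?php
// Request the folder form and the default-page form and compare status lines.
$folder  = get_headers('http://www.example.com/');
$default = get_headers('http://www.example.com/index.html');
echo 'Folder form:  ' . $folder[0] . "\n";
echo 'Default page: ' . $default[0] . "\n";
// Two "200 OK" responses mean both versions are servable and indexable;
// ideally the index.html form answers with a 301 back to the folder form.
?>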
3. How to Fix It:
Once again, the best solution is the .htaccess file, but there are other options as well…
.htaccess Solution:
Make sure you replace .html with .php or .htm on BOTH LINES, depending upon what your default page is…
RewriteCond %{THE_REQUEST} (index\.html) [NC]
RewriteRule ^(.*)index\.html$ http://www.virante.com/$1 [L,R=301]
PHP Solution: Include this on all index.php pages
<?php
// 301-redirect any request for /index.php back to the bare folder URL
if(stristr($_SERVER['REQUEST_URI'],'index.php')) {
    $newuri = str_replace('index.php','',$_SERVER['REQUEST_URI']);
    $host = $_SERVER['HTTP_HOST'];
    $url = "http://$host".$newuri;
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: $url");
    exit();
}
?>
(3) False 404 Pages
1. What it Is
Someone, at some point, got the bright idea of putting landing pages on 404s. Unfortunately, if your page does not correctly return a 404 header, Google is going to cache it. One by one, every dead link pointing to your site will result in an extra PR-sucking vacuum page of useless duplicate content.
There is also speculation that Google purposefully tests non-existent pages to determine whether you may be generating pages on-the-fly based on keywords in the URL. The last thing you want is for your site to appear to function the same way as a spammer’s.
2. How to Diagnose It:
- By Header: The only way to diagnose this is to request a page that does not exist and analyze the header returned. If it comes back as 200 OK rather than 404, you have this problem.
Use The Duplicate Content Tool
3. How to Fix it:
Talk to your webmaster or host to get this fixed ASAP; a sketch of the usual fix follows.
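If you do have access yourself, the usual pattern on Apache is an ErrorDocument 404 directive in .htaccess (e.g. ErrorDocument 404 /404.php) pointing at a custom error page, with the error page itself sending a true 404 status. A minimal PHP sketch, where 404.php is just a placeholder name:
<?php
// Custom error page, wired up in .htaccess with "ErrorDocument 404 /404.php".
// Apache serves this page for missing URLs, but the script must send the real
// 404 status itself; otherwise every dead link can return a cacheable 200 page.
header("HTTP/1.1 404 Not Found");
?>
<html>
<head><title>Page Not Found</title></head>
<body>
<h1>Sorry, that page does not exist.</h1>
<p>Try the <a href="/">home page</a> instead.</p>
</body>
</html>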
(4) Other URL Canonicalization
There are several other types of URL canonicalization problems, most commonly query string issues, and these are much harder to diagnose except by hand. I would recommend hiring an ethical search engine marketing firm like Virante, for whom I work, to do this. One common symptom is a large number of supplemental results, which can also be discovered using the tool. A sketch of one query string fix follows.
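For tracking-style query strings, one approach is to 301-redirect any URL carrying the tracking parameter back to its clean form, so only one version gets indexed. A minimal PHP sketch, assuming a hypothetical tracking parameter named "ref":
<?php
// Strip a hypothetical "ref" tracking parameter and 301-redirect to the
// canonical URL so only one version of the page gets indexed.
if (isset($_GET['ref'])) {
    $params = $_GET;
    unset($params['ref']);                  // drop the tracking value
    $query = http_build_query($params);     // rebuild the remaining query string
    list($path) = explode('?', $_SERVER['REQUEST_URI']);
    $url = "http://" . $_SERVER['HTTP_HOST'] . $path . ($query ? "?$query" : "");
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: $url");
    exit();
}
?>
Note that redirecting the parameter away before the analytics code sees it may defeat the tracking itself, so this only suits parameters you can live without.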
Conclusions
The real solution is simply to be clean and concise with your URLs.
1. Create an .htaccess file that redirects the non-www form of your domain to the www form (or vice versa)
2. Make sure your .htaccess file also redirects default pages (index.html, index.php, etc.) to their folder URLs
3. Make sure your 404 pages return true 404 errors
4. Use the same, absolute URLs throughout your site
5. Check your site: results in Google early and often
6. Hire an SEO if your site is large or is a primary part of your business.
From the comments: In your default page duplicates section, you may want to change your rewrite rule to
RewriteRule ^(.*)/index\.htm$ http://www.ramjack.info/$1/ [L,R=301]
I’ve added a forward slash before index\.htm and after $1. This way, if a page like example.com/stock-index.htm is requested, the server doesn’t try to redirect to “example.com/stock-” and throw a 404 error.
Another commenter asks: How do you get around the problem when the client won’t move away from a web analytics system that appends a string to determine which link the user clicked on the previous page? I’m fighting with it, and they won’t believe that it is seriously detrimental to their pages’ performance.