Missed parts 1 to 4?
- SEO Basics Pt.1: Page Titles and Meta Descriptions
- SEO Basics Pt.2: On-page Tips and Tricks
- SEO Basics Pt.3: How to get (natural) links
- SEO Basics Pt.4: How to Conduct Outreach
You might hear a lot of noise in the SEO space about duplicate content on your website and the problems it could cause for you as a webmaster. In this edition of the SEO Basics series we’ll look at duplicate content, show how canonical tags and a few other methods help you avoid problems, and dispel a few myths along the way. Let’s start with some FAQs:
What is duplicate content?
This term describes any identical or near-identical content – usually text/copy – which is present on more than one page or more than one domain. The difference between page and domain is that you could be duplicating pages on your own website, or someone else could be duplicating your content on their website. The problem faced by search engines when presented with duplicate content is threefold: they don’t know which version to index, they don’t know which version should be given credit (authority, rank etc.) and they don’t know which version should rank for related search queries. Read Google’s description of duplicate content here.
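To make "near-identical" a little more concrete, here is a rough sketch that scores how similar two pieces of copy are. It uses `SequenceMatcher` from Python's standard library; the sample sentences are made up for illustration, and search engines use their own undisclosed methods, so treat the score as a rough guide only.

```python
from difflib import SequenceMatcher


def similarity(text_a: str, text_b: str) -> float:
    """Return a score from 0.0 to 1.0 based on the words two texts share, in order."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()


# Made-up copy for illustration only.
original = "Our blue widget is the best widget for small gardens"
reprint = "Our blue widget is the best widget for small gardens"
tweaked = "Our blue widget is the best widget for large gardens"
unrelated = "Contact the sales team to arrange a free delivery"

print(similarity(original, reprint))    # identical copy scores 1.0
print(similarity(original, tweaked))    # one changed word still scores high
print(similarity(original, unrelated))  # different copy scores low
```

A page that scores close to 1.0 against another page is a good candidate for one of the fixes described later in this post.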
What can cause duplicate content?
On your website duplicate content can be caused by a number of things. On a human level, you could publish the same content more than once across different URLs. However, it’s more often the case that technical idiosyncrasies of our websites cause content to be present under more than one URL. This could be caused by:
- Printer friendly versions of pages
- Session IDs which track users using the URL
- Parameters which change the URL but do not change the site content
- Site hierarchy / navigation which causes the same page to be accessible by more than one URL
- One slightly more nefarious possibility is that another website is scraping / stealing your website’s content
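To see how the technical causes above multiply URLs, here is a small sketch that strips content-neutral parameters and groups the URLs that end up pointing at the same page. The parameter names (`sessionid`, `print`, `ref`) and the example URLs are assumptions for illustration; your own site will use different ones.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that change the URL but not the content.
IGNORED_PARAMS = {"sessionid", "print", "ref"}


def normalise_url(url: str) -> str:
    """Remove parameters that do not change what the page shows."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Group URLs by their normalised form; groups with more than one URL are duplicates."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise_url(url)].append(url)
    return dict(groups)


urls = [
    "http://www.example.com/widgets/",
    "http://www.example.com/widgets/?sessionid=abc123",
    "http://www.example.com/widgets/?print=1",
    "http://www.example.com/widgets/?page=2",  # real pagination, so it stays separate
]

for page, variants in group_duplicates(urls).items():
    print(page, "<-", len(variants), "URL(s)")
```

Running something like this over a crawl export or your server logs is a quick way to spot which pages are reachable under several addresses.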
Can duplicate content harm your website?
Matt Cutts – Google’s prominent, public-facing engineer – has stated on several occasions that, as a general rule, duplicate content won’t harm your website. However, there are a few caveats to this rule which I’ll explore:
- If content is duplicated on a large enough scale, this could be seen as spammy
- Allowing search engine spiders to crawl the same content multiple times wastes your crawl budget
- If you have two versions of a page, which one does Google display? Maybe neither!
Avoid Duplicate Content Issues
So how do you avoid problems which arise as a result of duplicate content? There are several ways to avoid duplicate content issues, and some are more appropriate than others depending on your particular situation. Let’s look at some of the options:
A canonical tag sits in the HTML head of a web page and uses the rel attribute to tell search engines which version of the content is the preferred version. This way, the search engines know which version is the original. To implement the canonical tag, the following HTML should be added to the head section of all versions of the content (the URL within the tag should point to the original version):
<link href="http://www.example.com/original-version/" rel="canonical" />
This approach works well when your site structure means that the same content can be accessed by several URLs. It’s also a good means of ensuring that, if your content is scraped by another website, you’re doing the best you can to indicate that your version is the original. For sites with printer friendly versions of pages, this may be a good option if you can dynamically add this tag to the printer friendly version.
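Once the tag is in place it is worth checking that every version of a page really does point at the original. The sketch below reads the canonical URL out of a page's HTML using Python's built-in `html.parser`; the sample markup and the `find_canonical` helper are illustrative assumptions, not part of any SEO tool.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if tag == "link" and rel == "canonical" and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Illustrative markup for a printer friendly version of a page.
printer_page = (
    "<html><head><title>Widgets (print)</title>"
    '<link href="http://www.example.com/original-version/" rel="canonical" />'
    "</head><body>Same content as the original</body></html>"
)
untagged_page = "<html><head><title>Widgets</title></head><body></body></html>"

print(find_canonical(printer_page))
print(find_canonical(untagged_page))
```

If a duplicate version returns `None`, or a URL other than the original, the tag is missing or pointing at the wrong place.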
Webmasters can control search engine spiders (also called robots) using the Robots Exclusion Protocol, via a text file called robots.txt. You can read more about using robots.txt here but, fundamentally, this file lets you request that search engines do not crawl specific sections of your site.
This means that duplicate versions of your pages which waste crawl budget can still be reached by users but will not be crawled by search engines that respect the file. You can use the robots.txt file to block printer friendly versions of pages, or entire directories if required!
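As a rough illustration, the sketch below defines a small robots.txt and uses Python's standard `urllib.robotparser` module to check which URLs a well-behaved crawler would be allowed to fetch. The `/print/` and `/archive/duplicates/` directories are hypothetical; swap in the paths that hold duplicate content on your own site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block printer friendly pages and one duplicate directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /archive/duplicates/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow this user agent to crawl the URL."""
    return parser.can_fetch(user_agent, url)


for url in (
    "http://www.example.com/widgets/",
    "http://www.example.com/print/widgets/",
    "http://www.example.com/archive/duplicates/page.html",
):
    print(url, "->", "crawl" if is_crawlable(url) else "blocked")
```

Testing rules like this before you upload the file helps you avoid accidentally blocking pages you do want crawled.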
Using parameter handling in Google Webmaster Tools you can tell Google the purpose of different URL parameters on your website. For example, you can tell Google that parameters used for session ID tracking or printer friendly pages do not change the content of the page, so the resulting URLs should not be treated as separate pages. It’s worth taking the time to learn more about parameter handling, as implementing this method incorrectly can have some pretty detrimental effects.
It’s also worth noting that this method only works for Google – setting parameter handling in Google Webmaster Tools will have no effect on Bing, Yahoo or any other search engine.
Our old friend, the 301 redirect, pops up again! In simple cases where you have two versions of a page but you only need one, you should permanently (301) redirect one version of the page to the preferred version. This does a couple of things: it ensures that any users who have the old page bookmarked are taken to the new version instead of seeing a 404 error page, and it tells search engines that the content has moved permanently. They are then able to update their index and pass on any authority generated by the old page. You can learn more about implementing a 301 redirect here.
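How you set up a 301 depends on your server or CMS, so the following is only a minimal sketch of the idea, written as a small Python WSGI application. The paths in `REDIRECTS` and the `respond` helper are hypothetical and exist purely for illustration.

```python
# Hypothetical mapping of retired URLs to the preferred version of each page.
REDIRECTS = {
    "/old-widgets/": "/widgets/",
    "/widgets/index.html": "/widgets/",
}


def app(environ, start_response):
    """Minimal WSGI app: answer 301 for retired paths and 200 for everything else."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # The Location header tells browsers and search engines where the page now lives.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Preferred version of the page\n"]


def respond(path):
    """Call the app directly (no server needed) and return status, headers and body."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], captured["headers"], body


print(respond("/old-widgets/"))
print(respond("/widgets/"))
```

The important part is the status line: a 301 signals a permanent move, whereas a temporary redirect (302) tells search engines to keep the old URL in their index.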
In a later edition we’ll be exploring how to use hreflang to deal with duplicate content issues across multilingual websites – watch this space! In the meantime, if you have any questions please get in touch or let us know in the comments section!