How to recover long lost URLs after a site migration

Tech SEO

A site migration is a project that requires a lot of planning, analysis and, generally, a fair amount of stress for the team undertaking it. In a perfect world, every URL is mapped and redirects are implemented carefully so no pages are lost in Google’s index limbo. From my experience fixing botched migrations over the last few years, that is not always the case.

The main issue I have noticed is clients forgetting to redirect (or deliberately not redirecting) part of their site, which can in some instances have a disastrous impact on the organic performance of the new site.

So, what can you do when you don’t have a crawl of your old site and you want to find the pages that were ranking in Google in the past? I use the Wayback Machine and a script courtesy of Hartator.

Below is a list of all the steps you need to take in order to get a full list of URLs and resources from your ‘long lost’ website. 

  1. Install Ruby on your machine

Ruby is a dynamic, open-source programming language, and you need to install it so your machine can run the downloader script.

I would recommend downloading the latest stable release version available for your machine. You can find them below:

  • On Linux/UNIX, you can use the package management system of your distribution or third-party tools (rbenv and RVM).
  • On macOS machines, you can use third-party tools (rbenv and RVM).
  • On Windows machines, you can use RubyInstaller.
  2. Install wayback_machine_downloader

    1. Open the command prompt with Ruby by searching for it in your file explorer
    2. Run gem install wayback_machine_downloader

To install the downloader, copy and paste the following into your command prompt:

gem install wayback_machine_downloader

Note: when working with the script, always ensure you are in the right command prompt. It should say “Start Command Prompt with Ruby” at the top, which is different from the normal Windows one.

You will most probably be prompted with various messages from the Ruby installer and documentation. All you need to do is accept and allow access. Once the installation has finished, you can close all the different windows.

You need to close all of the above and then open the “command prompt* with Ruby”, as it’s a different command prompt from the one shown before.

*A command prompt is a command-line interpreter application used to execute entered commands. I also call it ‘the black box’.

  3. Download a list of all URLs that have been archived

    1. Open command prompt with Ruby
    2. Select the folder where you want the file to be saved (I have used my default space)
    3. Type in the following (I have used our website as an example):
wayback_machine_downloader https://passion.digital -l > passion_digital_export.txt

Pro tip: you can exclude most of the scripts, images, fonts… by adding an exclude parameter to the command, as in the example below. This will speed up cleaning the file.

wayback_machine_downloader https://passion.digital -l --exclude "/\.(gif|jpg|jpeg|webp|png|txt|json|woff|woff2|js|css|svg|eot|ttf|ico|xml)/i" > passion_digital_export.txt
    4. Find the file on your computer (it will be saved in the folder where you ran the command, C:\Users\{Name}\)

Once you have the file of all URLs, we can start cleaning the data to keep only the pages that matter to you.
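If installing Ruby is a blocker, the Wayback Machine also exposes its index directly through the public CDX API, so you can pull a similar URL list without the gem. A rough Python sketch (the endpoint and parameters come from the CDX API; the domain is just an example):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query(domain):
    """Build a CDX API URL listing unique archived URLs for a domain."""
    params = {
        "url": domain + "/*",   # everything under the domain
        "fl": "original",       # return only the original URL column
        "collapse": "urlkey",   # collapse repeat snapshots of the same URL
        "output": "text",       # plain text, one URL per line
    }
    return CDX_ENDPOINT + "?" + urlencode(params)

def list_archived_urls(domain):
    # Network call: fetch the list and split it into lines.
    with urlopen(cdx_query(domain)) as resp:
        return resp.read().decode("utf-8", "replace").splitlines()

print(cdx_query("passion.digital"))
```

The result is the same kind of one-URL-per-line export the gem produces with -l, ready for the cleaning steps below.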

    5. Open the file in Excel

To do so, open Excel, navigate to the “Data” tab and select “From Text/CSV”.

    6. Clean the file

Now this should be the easy part: a few ‘search and replace’ passes, plus removing irrelevant pages generated by your CMS, and you should be good to go.
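If Excel feels fiddly, the same clean-up can be scripted. A minimal Python sketch, assuming one URL per line in the export; the extension list mirrors the exclude regex above, and the sample URLs are made up:

```python
import re

# Static-asset extensions to drop, matching the --exclude list above.
# The (\?|$) anchor also catches URLs with query strings after the extension.
ASSET_EXT = re.compile(
    r"\.(gif|jpe?g|webp|png|txt|json|woff2?|js|css|svg|eot|ttf|ico|xml)(\?|$)",
    re.IGNORECASE,
)

def clean_urls(lines):
    """Keep unique page URLs, dropping blanks and static assets."""
    seen, out = set(), []
    for line in lines:
        url = line.strip()
        if not url or ASSET_EXT.search(url):
            continue
        if url not in seen:     # de-duplicate while preserving order
            seen.add(url)
            out.append(url)
    return out

sample = [
    "https://passion.digital/",
    "https://passion.digital/blog/",
    "https://passion.digital/logo.png",
    "https://passion.digital/blog/",
]
print(clean_urls(sample))
```

Reading the export file in and writing the cleaned list back out is then a couple of extra lines with open().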

Note that if you run multiple exports of the same site, the command will overwrite the .txt file, so you can’t have it open at the time. If you want to keep previous exports, change the file name in the command.

  4. Check the pages with your crawler of choice

In my case, I use Screaming Frog to find pages that return a 404 and could be redirected on the new site. I use list mode and paste in all the URLs found!
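If you want a quick first pass before loading everything into a crawler, a small Python sketch can flag the 404s. status_of makes live HEAD requests, so treat this as a rough check on a short list rather than a crawler replacement:

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def status_of(url, timeout=10):
    """Return the HTTP status code for a URL (0 on connection failure)."""
    try:
        req = Request(url, method="HEAD", headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code          # 4xx/5xx responses land here
    except URLError:
        return 0                 # DNS failure, timeout, refused connection

def find_404s(urls, check=status_of):
    """Return the URLs whose status is 404; `check` is injectable for testing."""
    return [u for u in urls if check(u) == 404]
```

Anything find_404s returns is a candidate for the redirect map; Screaming Frog is still the better tool for checking thousands of URLs at once.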

In summary, there are many other uses for this script; this is just one example, but it’s a damn good way to identify pages that were forgotten in the redirect map of a migration!