Automatic Website Migration Tool: Free Streamlit App & Source

Easy to Use Streamlit App

I hope that by releasing this Streamlit app, it’ll reach a much wider audience than my previously released scripts by removing the technical barrier to getting started.

That said, if you are a techy and want to dabble – I’m also releasing the full source code for you to play with. Want to use Sentence Transformers? Want to add support for your favourite crawler? Now you can!

I had so many more ideas for this release but had to remind myself that this is a portfolio piece and not a full-blown SaaS project! I will probably revisit it in the future.

Benefits

Simple to use Streamlit interface.
Saves hours / days of time!
Efficient at mapping very large datasets.
Useful for datasets where a 1:1 match will not always be available.
Run the migration multiple times with different parameters to find the best column combination for matching.

App and Source Code

Features

Simple to use website migration via handy Streamlit interface.
Simple column mapping with common crawl column names in five languages mapped automatically.
Outputs a pre-formatted Excel Workbook with median distribution score chart.
Scorecard Visualization showing relevancy score changes between runs.
Matches and scores multiple column combinations at once to find the highest scoring match for each URL.
Complete control over which columns are used for matching.
Use custom columns for even more precise matching.
Full source code release – modify as you see fit!
Choice of match models (TF-IDF, EditDistance, and RapidFuzz).
Supports .csv, .xlsx, and .xls file formats.

Simple to Use Excel Output

The output is a pre-formatted Excel workbook, with median scores bucketed into a chart on a separate worksheet.

It contains the following notable columns.

Source and Destination URLs – Used for the final matching.

Best Match On – The column with the highest matching score.

Highest Matching URL – The corresponding URL to the Best Match On value.

Best Match Content – The content of the highest match for reference.

Median Match Score – The median similarity score of all column matches. (Useful as a confidence score.)

All Column Match Scores – Shows all column match scores. Useful for sanity checking / debugging.
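To make the relationship between these columns concrete, here is a minimal sketch of how they could be derived from per-column match results for a single source URL. It is not the app's actual code: the function name summarise_matches, the input shape, and the example URLs and scores are all assumptions made for illustration.

```python
from statistics import median


def summarise_matches(column_matches):
    """Build one output row from per-column matches.

    column_matches maps a column name to a tuple of
    (destination_url, score, matched_text). This shape is hypothetical.
    """
    # The column whose match scored highest becomes "Best Match On".
    best_column = max(column_matches, key=lambda col: column_matches[col][1])
    best_url, _, best_content = column_matches[best_column]
    scores = {col: match[1] for col, match in column_matches.items()}
    return {
        "Best Match On": best_column,
        "Highest Matching URL": best_url,
        "Best Match Content": best_content,
        "Median Match Score": median(scores.values()),
        "All Column Match Scores": scores,
    }


# Made-up per-column results for one source URL.
row = summarise_matches({
    "Title": ("https://staging.example.com/blue-widget", 0.92, "Blue Widget | Example"),
    "H1": ("https://staging.example.com/blue-widget", 0.80, "Blue Widget"),
    "URL": ("https://staging.example.com/widgets", 0.60, "/widgets"),
})
print(row["Best Match On"], row["Highest Matching URL"], row["Median Match Score"])
```

A low median alongside a high best score suggests that only one column agrees on the match, which is a good reason to check that row by hand.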

Run the app multiple times to get the best scoring combination of columns.

What makes this script different is that it scores each match run based on the median match score of the columns used for matching.

In other words, you can re-run the script using a different combination of columns or a different matching model, then compare the result with the previous run to find the best scoring combination.

Image: the increase in match score after changing the columns used for matching. By switching up the column order, I was able to increase the median match score by 5.5%.

First Run of the App to Get a Benchmark

Second Run of App Using A Different Combination of Columns – Much Better!
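As a rough illustration of comparing two runs, the sketch below reduces each run to a single number and reports the relative change. How the app aggregates scores internally is not documented here, so treating the run score as the median of the per-URL median match scores is an assumption, as are the function names and the sample figures.

```python
from statistics import median


def run_score(median_match_scores):
    """Collapse a run's per-URL median match scores into one number (assumed aggregation)."""
    return median(median_match_scores)


def percent_change(previous, current):
    """Relative change from the previous run's score to the current one, in percent."""
    if previous == 0:
        raise ValueError("previous score must be non-zero")
    return (current - previous) / previous * 100


# Made-up per-URL median scores for two runs over the same URLs.
first_run = [0.62, 0.71, 0.80, 0.55]
second_run = [0.70, 0.74, 0.83, 0.60]

change = percent_change(run_score(first_run), run_score(second_run))
print(f"Run score changed by {change:.1f}%")
```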

Choice of Matching Model

The app is designed to run with sane defaults / balanced settings right out of the box.

However, buried under the Advanced Options setting are options to change the default matching model.

Because the script is designed to score the run based on the columns and matching model used, it is worth experimenting to see if it is possible to increase the overall match score by using a different matching model.

TF-IDF – The default matching model. Suitable for most use cases.

EditDistance – Useful for matching based on character-level differences, such as small text variations.

RapidFuzz – Fast string matching. Use with very large datasets.
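To show how the two main families of model differ, here is a standard-library sketch of each. These are stand-ins written for illustration, not the app's implementation: the TF-IDF version weights character trigrams and compares them with cosine similarity, while the edit-based version uses difflib's SequenceMatcher as a rough proxy for an edit-distance or RapidFuzz-style ratio. All function names and the sample titles are assumptions.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def char_ngrams(text, n=3):
    """Split padded, lower-cased text into overlapping character n-grams."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(documents, n=3):
    """Return one {ngram: weight} dict per document using smoothed IDF."""
    tokenised = [Counter(char_ngrams(doc, n)) for doc in documents]
    # Number of documents each n-gram appears in.
    doc_freq = Counter(gram for counts in tokenised for gram in counts)
    total = len(documents)
    return [
        {
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in counts.items()
        }
        for counts in tokenised
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_tfidf_match(source, candidates):
    """Return the candidate with the highest TF-IDF cosine similarity, plus its score."""
    vectors = tfidf_vectors([source] + candidates)
    scores = [cosine(vectors[0], vec) for vec in vectors[1:]]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def best_edit_match(source, candidates):
    """Return the candidate with the highest character-level similarity ratio, plus its score."""
    scored = [
        (SequenceMatcher(None, source.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    return candidate, score


# Made-up page titles: one live title matched against three staging titles.
staging_titles = ["Blue Widget - Shop", "Red Gadget", "Contact Us"]
print(best_tfidf_match("Blue Widgets", staging_titles))
print(best_edit_match("Blue Widgets", staging_titles))
```

The TF-IDF approach down-weights n-grams that appear in many rows, so shared boilerplate counts for less, whereas the edit-based ratio treats every character equally. That difference is why it is worth trying more than one model on the same data.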

Full Source Code Release – Remix and Mashup

There were so many directions I could have taken this app that I thought it would be interesting to release the source code and see what people come up with.

For example, there are far more advanced options that could be exposed, depending on the type of matching model used.

Instead of manually re-running the code, you could automatically re-run it to self-optimise for the best output, AKA a grid search.
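A grid search along these lines could look like the sketch below. The run_migration callable is hypothetical: it stands for whatever function runs one migration with a given set of columns and a model and returns that run's score. The fake_run scorer and its weights exist only so the example executes.

```python
from itertools import combinations


def grid_search(columns, models, run_migration, min_columns=1):
    """Try every column combination with every model and return the best run.

    run_migration(column_set, model) must return a numeric run score.
    Returns (best_result, all_results), each result being
    (score, column_set, model).
    """
    results = []
    for size in range(min_columns, len(columns) + 1):
        for column_set in combinations(columns, size):
            for model in models:
                score = run_migration(column_set, model)
                results.append((score, column_set, model))
    best = max(results, key=lambda result: result[0])
    return best, results


def fake_run(column_set, model):
    """Stand-in scorer with made-up weights; a real run would return the median match score."""
    weights = {"Title": 0.30, "H1": 0.25, "URL": 0.10}
    bonus = {"TF-IDF": 0.05, "EditDistance": 0.0, "RapidFuzz": 0.02}
    return sum(weights[c] for c in column_set) / len(column_set) + bonus[model]


best, all_runs = grid_search(
    ["Title", "H1", "URL"],
    ["TF-IDF", "EditDistance", "RapidFuzz"],
    fake_run,
)
print(f"Tried {len(all_runs)} runs; best was {best[1]} with {best[2]}")
```

The number of runs grows quickly: n columns give 2^n - 1 combinations per model, so on large crawls it may be worth capping the combination size or searching on a sample first.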

Let me know on Twitter (@leefootseo) if you build anything!

Tips and Tricks

Ensure Staging and Live crawl files have the same columns for matching.
Remove unused columns to save time when uploading.
Consider matching like-for-like templates separately, e.g. product pages to product pages.
Try matching on unique elements such as MPN/SKU and extracted content.
Avoid matching on templated / identical content. (e.g. don’t use meta description if all pages have an identical meta description).
Try matching on different combinations of columns until you get the highest distribution score / highest median match score possible.
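Two of these tips, matching columns across both files and avoiding templated content, can be checked before uploading. The sketch below is one possible pre-flight check using only the standard library. The function names, the 0.5 uniqueness threshold, and the sample crawl data (including the column names) are assumptions for illustration.

```python
import csv
import io


def shared_columns(live_rows, staging_rows):
    """Columns present in both crawl exports, sorted by name."""
    live_cols = set(live_rows[0]) if live_rows else set()
    staging_cols = set(staging_rows[0]) if staging_rows else set()
    return sorted(live_cols & staging_cols)


def uniqueness(rows, column):
    """Share of non-empty values in a column that are distinct (1.0 = all unique)."""
    values = [row[column].strip().lower() for row in rows if row.get(column)]
    return len(set(values)) / len(values) if values else 0.0


def usable_columns(live_rows, staging_rows, min_uniqueness=0.5):
    """Shared columns whose values are varied enough in both files to be worth matching on."""
    return [
        col
        for col in shared_columns(live_rows, staging_rows)
        if uniqueness(live_rows, col) >= min_uniqueness
        and uniqueness(staging_rows, col) >= min_uniqueness
    ]


# Made-up crawl exports; every page shares the same meta description.
live_csv = """Address,Title 1,Meta Description 1
https://example.com/a,Blue Widget,Buy widgets online
https://example.com/b,Red Gadget,Buy widgets online
https://example.com/c,Green Gizmo,Buy widgets online
"""
staging_csv = """Address,Title 1,Meta Description 1,H1-1
https://staging.example.com/blue-widget,Blue Widget,Buy widgets online,Blue Widget
https://staging.example.com/red-gadget,Red Gadget,Buy widgets online,Red Gadget
https://staging.example.com/green-gizmo,Green Gizmo,Buy widgets online,Green Gizmo
"""

live = list(csv.DictReader(io.StringIO(live_csv)))
staging = list(csv.DictReader(io.StringIO(staging_csv)))
print(usable_columns(live, staging))
```

Here the meta description column is dropped because every page shares the same value, and the H1 column is dropped because it only exists in the staging file.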

Instructions

Using the app is easy.

Just crawl the live and staging websites using Screaming Frog, export each crawl to a .csv or .xlsx file, and upload both files to the app.

Image: simple uploader for uploading CSV files ready for matching. The app contains a simple-to-use file uploader.

I recommend experimenting with custom extractions for matching, as well as experimenting with matching like-for-like page templates individually rather than the entire dataset at once.

Alternatively, you can just upload a headered URL list, but you’ll miss out on the benefits of multiple column matches.

Demo Output

Contact Me

If you’d like to run this script as a managed service or would like a bespoke version for your business, please get in touch.