BERT Automatic Interlinking Streamlit App V2

Discover Contextual Connections Between Pages.

The first version of this app was recently highlighted in this excellent article, ‘Mastering Topic Clusters in SEO’. (If you haven’t yet, it’s a must-read!)

This prompted me to check in on the original app, and I’m amazed that it has had over 17,000 unique visitors since its release in May 2022!

I’ve done absolutely nothing to market it other than the initial tweet and being featured as Streamlit’s app of the month in May 2022.

This success led me to realise that it’s high time the app received an overhaul.

Try it: BERT interlinking app V2

What’s it all about?

The premise is to use Sentence Transformers to find deeper relationships between pages on a website.

Initially, it was designed to highlight related Buyer’s Guides on eCommerce category pages, as well as to suggest related products.

Here are some practical applications for this app:

  • Interlink Related Products Within Category Text
  • Show Related Categories on Existing Category Pages
  • Surface Related Content to Assist Conversion (Blogs, Buyers’ Guides, etc.)
  • Display Products Most Closely Related to Category Page for Increased Relevancy
  • Feature Related Blog Posts

A Different Angle

BERT Semantic Page Interlinker V2

“There are tons of page interlinkers out there, why not just use those?”

It’s a fair point: this script doesn’t even use anchor text or link metrics when making recommendations. Its unique approach lies in forming logical connections between pages, aimed at enhancing user satisfaction and potentially increasing revenue.

Example #1:

Here’s an example of the Interlinker correctly making the connection between a page about HP Original Ink and printer paper.

If this was an eCommerce project, the recommendation would most likely be to show the printer and printer ink categories as related categories on the page or within the existing anchor text.

Example #2:

How about linking Hair Care to:

  • Hair Treatments
  • Hair Straighteners
  • Hair Curlers
  • Hair Brushes
  • Hair Clippers and
  • Hair Dryers?

Syntactically those words are far away from each other, but using the magic of Sentence Transformers the app is able to understand that these pages are related.

Linking those related pages would be good for users, bots and ultimately the bottom line.
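
To make the idea concrete, here is a minimal sketch of how related pages can be ranked once each title has been turned into a vector. It assumes cosine similarity as the scoring measure, which is a common choice with Sentence Transformers but not confirmed as what the app uses. The three-number vectors are made up for illustration; real ones would come from a model, for example `SentenceTransformer("paraphrase-MiniLM-L3-v2").encode(titles)`. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_related(source, embeddings):
    """Return (title, score) pairs for every other page, best match first."""
    source_vec = embeddings[source]
    scores = [
        (title, cosine_similarity(source_vec, vec))
        for title, vec in embeddings.items()
        if title != source  # a page should not be recommended to itself
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up 3-dimensional vectors, for illustration only.
# Real embeddings from a transformer model have hundreds of dimensions.
toy_embeddings = {
    "Hair Care": [0.9, 0.1, 0.0],
    "Hair Dryers": [0.8, 0.3, 0.1],
    "Hair Brushes": [0.7, 0.2, 0.2],
    "Printer Paper": [0.0, 0.2, 0.9],
}

for title, score in rank_related("Hair Care", toy_embeddings):
    print(f"{title}: {score:.2f}")
```

With these toy vectors the hair-related pages rank well above the printer page, which is the behaviour the examples above describe.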

New Features

Looking back, the initial version was quite basic, with a codebase that, although functional, left much to be desired.

In fact, I noticed a huge oversight that prevented around 90% of matches from being returned!

Clustering

  • Minimum Similarity Cutoff
  • Choice of Sentence Transformers
  • Multi-Lingual Transformer Trained in 50+ Languages

Visualization Options

  • Choose between Tree Chart or Radial Chart Visualisations
  • Full Chart Customization Options: Nodes, Chart Size, Formatting
  • Real Time Chart Updates: See the Impact of Changing Values on Link Relationships
  • Automatic Source and Destination URL Mapping

File Handling

  • Supports Uploads in Excel or CSV formats (UTF-8 encoding only)
  • Excel Download with Mapping Between Pages and Similarity Threshold Cutoff
  • Source File Filtering Options: Match Only Indexable Pages, Remove Parameters, etc.

Code Base / Performance / UX

  • Completely Rewritten from the Ground Up
  • Cleaner, More Efficient Code Base
  • Fixed Bug Preventing Most Matches from Being Returned
  • Automatic Mapping of Screaming Frog’s Address and H1-1 Columns in Five Languages
  • Improved UX and Error Handling Including Tool Tips and Instructions

BERT Interlinker V2 Clustering Options

Instructions and Getting Started

screaming frog crawl file example

You need a crawl file that at a minimum contains a URL column and a column to match.

Just crawl the site you want to run the interlinker on, export the HTML pages as a CSV or Excel file, and upload it to the Streamlit app.

While I recommend matching on the H1 tag for best results, you’re free to match on any element you prefer, such as the page title, blog content, etc.

In testing I found the H1 works best because it’s a description of the content that can be found on the page, rather than the content itself – which gave mixed results.

The app is set up to automatically detect and populate the address column and keyword column using the default URL and H1 column names of Screaming Frog (in five languages) for convenience, but you’re free to change the mapping using the drop-down menus.

Please note: You are free to use any crawler you like as long as the output file contains a URL column and the column containing the content you would like to interlink. (You just may need to manually select the correct column, rather than have it automatically mapped for you).
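
For anyone curious how this kind of automatic mapping can work, here is a rough sketch, not the app’s actual code. It looks for known header names in the uploaded file and returns nothing when no match is found, which is the case where you would pick the column manually. “Address” and “H1-1” are the English Screaming Frog names mentioned above; the other candidate names and both function names are hypothetical placeholders.

```python
import csv
import io

# "Address" and "H1-1" are the English Screaming Frog defaults named in this post.
# The remaining entries are unverified placeholders standing in for other languages.
URL_CANDIDATES = ["Address", "Adresse", "Indirizzo"]
TEXT_CANDIDATES = ["H1-1"]


def detect_column(headers, candidates):
    """Return the first header that matches a candidate (case-insensitive), else None."""
    lookup = {header.strip().lower(): header for header in headers}
    for name in candidates:
        if name.lower() in lookup:
            return lookup[name.lower()]
    return None


def map_columns(csv_text):
    """Return (url_column, text_column) detected from the header row of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    return (
        detect_column(headers, URL_CANDIDATES),
        detect_column(headers, TEXT_CANDIDATES),
    )


sample = "Address,H1-1,Status Code\nhttps://example.com/,Home,200\n"
print(map_columns(sample))
```

If either value comes back as `None`, the file came from a crawler with different column names and the mapping needs to be chosen by hand.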

Settings Deep Dive

Clustering Options

I used sensible defaults for this app which should make it easy to just jump in and start using the app with little to no research.

If you want to get the best out of the app though, it’s worth your time drilling into the various options at your disposal.

Transformer Model

There are three transformer models available. They were selected, respectively, for speed, for a balance of speed and semantic performance, and for multilingual capability.

  • paraphrase-MiniLM-L3-v2 (Fastest – default model)
  • all-MiniLM-L6-v2 (Good balance between speed and semantic performance)
  • paraphrase-multilingual-MiniLM-L12-v2 (Multilingual, trained in 50+ languages)

Minimum Similarity Score

This is the threshold above which results are considered similar. The higher the score, the stricter the matching. The chart displays the top results in real time, so you can get a good idea of which cutoff works well for your website.

This option affects the Excel file download. If you want to include all the results, set the similarity score to the lowest option (50%).
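
As a small illustration of how the cutoff behaves, the sketch below filters a list of matches by a minimum score. It assumes scores on a 0 to 1 scale and that a match exactly at the cutoff is kept; neither detail is confirmed for the app. The scores and the function name are made up for the example.

```python
def filter_by_min_score(matches, min_score):
    """Keep (source, destination, score) rows scoring at or above min_score."""
    if not 0.0 <= min_score <= 1.0:
        raise ValueError("min_score must be between 0 and 1")
    return [row for row in matches if row[2] >= min_score]


# Made-up scores, for illustration only.
example_matches = [
    ("Hair Care", "Hair Dryers", 0.82),
    ("Hair Care", "Hair Brushes", 0.67),
    ("Hair Care", "Printer Paper", 0.51),
]

# Raising the cutoff makes matching stricter, so fewer rows survive.
for cutoff in (0.5, 0.6, 0.8):
    kept = filter_by_min_score(example_matches, cutoff)
    print(f"cutoff {cutoff:.0%}: {len(kept)} match(es)")
```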

BERT Interlinker Clustering Options
BERT Semantic Interlinker File Operations

File Operations

Upload File

Valid options are .csv or .xlsx.

Select Columns

Select both the Address and Content columns. By default, they are mapped automatically to Screaming Frog’s default URL and H1-1 column names, but they can be mapped to any column you specify.

Includes automatic validation for the URL column.

Filtering Options

The following filters are selected by default:

  • Drop Duplicate Pages
  • Filter Out Paginated URLs
  • Filter Out URLs with Parameters
  • Keep Only Indexable Status URLs

Filtered URLs are removed from the total URL count when processing the file.

If an Indexability column is found, the option to keep only indexable URLs will drop any non-indexable URLs before processing.
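
The four filters above can be sketched roughly as follows. This is an illustration under stated assumptions, not the app’s code: pagination is assumed to look like `/page/2/` in the path, any query string counts as a parameter, and the Indexability column is assumed to hold the value “Indexable” for pages to keep. The function names and the `url_key` and `index_key` parameters are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumed pagination pattern: a path ending in /page/<number>.
PAGINATION_PATH = re.compile(r"/page/\d+/?$")


def is_paginated(url):
    """True when the URL path ends in the assumed /page/<number> pattern."""
    return bool(PAGINATION_PATH.search(urlsplit(url).path))


def has_parameters(url):
    """True when the URL carries a query string, e.g. ?sort=price."""
    return bool(urlsplit(url).query)


def prefilter(rows, url_key="Address", index_key="Indexability"):
    """Drop duplicate, paginated, parameterised and non-indexable rows."""
    seen = set()
    kept = []
    for row in rows:
        url = row[url_key]
        if url in seen:
            continue  # duplicate page
        if is_paginated(url) or has_parameters(url):
            continue
        # Only applied when the crawl file actually has the column.
        if index_key in row and row[index_key] != "Indexable":
            continue
        seen.add(url)
        kept.append(row)
    return kept


example_rows = [
    {"Address": "https://example.com/hair-care", "Indexability": "Indexable"},
    {"Address": "https://example.com/hair-care", "Indexability": "Indexable"},
    {"Address": "https://example.com/blog/page/2/", "Indexability": "Indexable"},
    {"Address": "https://example.com/hair-care?sort=price", "Indexability": "Indexable"},
    {"Address": "https://example.com/old", "Indexability": "Non-Indexable"},
    {"Address": "https://example.com/hair-dryers", "Indexability": "Indexable"},
]

print([row["Address"] for row in prefilter(example_rows)])
```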

Chart Creation Options

Tree Layout Options

Choice of Tree layout or radial layout. This option can be changed on the fly after processing is completed, so you can see which you prefer. (The default is Tree).

Charts can be saved by right-clicking and choosing the ‘Save image’ option in your browser.

Number of Level 1 / Level 2 Nodes to Preview

This option restricts the number of nodes visible in the graph to the top X nodes at level 1 and 2 (It does not impact the final output).
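
One way to picture this preview limit is as a trimmed copy of the full results. The sketch below assumes level 1 holds source pages and level 2 holds their matched destination pages, and that the chart takes a nested name/children structure; both are guesses about the app’s internals. The function name and sample data are hypothetical.

```python
def build_preview_tree(matches, root_name, level1_limit, level2_limit):
    """Build a nested dict limited to the first N sources and M destinations each.

    matches maps a source title to its destination titles, best match first.
    The input is only sliced, never modified, so the full results stay intact.
    """
    children = []
    for source, destinations in list(matches.items())[:level1_limit]:
        children.append(
            {
                "name": source,
                "children": [{"name": dest} for dest in destinations[:level2_limit]],
            }
        )
    return {"name": root_name, "children": children}


example_matches = {
    "Hair Care": ["Hair Dryers", "Hair Brushes", "Hair Clippers"],
    "Printer Ink": ["Printer Paper"],
    "Blog": ["Buyers’ Guides"],
}

preview = build_preview_tree(example_matches, "example.com", level1_limit=2, level2_limit=2)
print(preview)
```

Because only a copy is trimmed, the Excel download still contains every match that passes the similarity cutoff.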

Tree Chart Node Size

Increase or decrease the Tree Chart node size. Useful for fine tuning visualisations.

Font Size of Labels

Increase or decrease the label font size. Useful for fine tuning visualisations.

Chart Height

Increase or decrease the height of the tree and radial charts. Useful for fitting more nodes on screen, making charts easier to read, or exporting to a presentation.

Source and Destination Path Filters

Note: Filters are applied to the saved Excel file.

Streamlit Web App Limitations – (Contact Me if You Need More)

    Input File Limited to a Maximum of 1,000 Rows

    The Streamlit app currently limits imports to 1,000 rows. This may change in the future depending on how the Streamlit app copes. Sentence Transformers are very resource intensive.

    You can work around this to an extent by cleaning up the crawl file ahead of time (removing non-indexable pages or matching different templates individually).

    Note: This value is calculated after the pre-filtering options are applied.

    Should you require more rows, please don’t hesitate to get in touch – I’m available to customise this app further or provide it as a managed service, tailored to your needs.

    15 Matches per URL

    Limited to the first 15 matches per URL. In testing on large eCommerce websites, this seemed to be more than enough related pages. The app needs to be constrained for situations where someone uploads thousands of boilerplate “[Keyword] in [Location]” type pages that would endlessly match one another.

    Limited to Three Different Sentence Transformers

    I tried to pick a good balance between speed and performance, whilst still ensuring that there is a multi-lingual option. For performance reasons the fastest transformer is selected by default, but there is an option to choose a slower model with a higher semantic score.
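
The 15 matches per URL limit described above can be sketched like this. It assumes “first” means the highest-scoring matches and that a page matching itself is discarded; both are my reading of the behaviour rather than confirmed details. The function name and constant are hypothetical.

```python
from collections import defaultdict

MAX_MATCHES_PER_URL = 15  # the limit stated above


def cap_matches(pairs, limit=MAX_MATCHES_PER_URL):
    """Group (source, destination, score) rows by source and keep the top `limit`."""
    by_source = defaultdict(list)
    for source, destination, score in pairs:
        if source != destination:  # ignore a page matching itself
            by_source[source].append((destination, score))
    capped = {}
    for source, candidates in by_source.items():
        # Highest score first, then trim to the limit.
        candidates.sort(key=lambda item: item[1], reverse=True)
        capped[source] = candidates[:limit]
    return capped


# Twenty made-up near-duplicate location pages matching one source page.
example_pairs = [("/plumber-in-london", f"/plumber-in-town-{i}", 1 - i / 100) for i in range(20)]
example_pairs.append(("/plumber-in-london", "/plumber-in-london", 1.0))

capped = cap_matches(example_pairs)
print(len(capped["/plumber-in-london"]))
```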