GOV.UK Document Clustering

Introduction

This website describes document clustering experiments that were run against the GOV.UK Topic Taxonomy with the Python BERTopic library, using the code in this GitHub repository.

The Taxonomy Health page in the Content Tagger application indicates that a significant number of taxons have “too much content tagged to them”, i.e. they are associated with more than 300 content items.

The BERTopic library was used to generate candidate clusters of the content items tagged to such a taxon, with each cluster a candidate for a new child taxon (i.e. sub-topic) of the taxon in question. If enough content items could be pushed down into new sub-topics, the taxon would become healthy according to the current heuristic.
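The heuristic can be sketched as a simple check on cluster sizes. This is a minimal illustration rather than code from the repository: the function names are hypothetical, and it assumes that outliers (labelled -1, as BERTopic does) stay tagged to the parent taxon and that new child taxons are held to the same 300-item threshold.

```python
from collections import Counter

# Threshold quoted above: a taxon with more than 300 content items is unhealthy.
MAX_ITEMS_PER_TAXON = 300


def is_healthy(item_count, max_items=MAX_ITEMS_PER_TAXON):
    """A taxon is healthy if it has no more than max_items content items."""
    return item_count <= max_items


def summarise_split(cluster_labels, outlier_label=-1):
    """Summarise what happens if each cluster becomes a new child taxon.

    Assumption: items labelled as outliers remain tagged to the parent.
    """
    counts = Counter(cluster_labels)
    remaining = counts.pop(outlier_label, 0)
    return {
        "parent_remaining": remaining,
        "parent_healthy": is_healthy(remaining),
        "child_sizes": dict(counts),
        "unhealthy_children": sorted(
            label for label, size in counts.items() if not is_healthy(size)
        ),
    }


# Made-up labels for 1,000 items: 250 outliers and two oversized clusters.
example = summarise_split([-1] * 250 + [0] * 400 + [1] * 350)
```

In this made-up example the parent would become healthy, but both candidate clusters would themselves exceed the threshold, so pushing content down is not sufficient on its own.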

Topic selection

Two unhealthy topic taxons, each with approximately 1,000 content items, were identified. Both were level 2 “leaf” taxons, so there was clearly “room” to create new sub-topics without exceeding the maximum depth of 5.

Common configuration

The BERTopic clustering script was run against the data for these example taxons as follows:

Configuration variations

The clustering script was run with the following variations:

  1. No topic reduction.
  2. Automatic topic reduction.
  3. Manual topic reduction with 12 topics specified. 12 topics is the maximum number of child taxons allowed by another taxonomy health metric. The outliers are counted as a topic, so only 11 topics are actually generated.
  4. Manual topic reduction with 6 topics specified. The outliers are counted as a topic, so only 5 topics are actually generated.
  5. Manual topic reduction with 12 topics specified, but with “three words at most” in the LLM prompt changed to “five words at most”. The average number of words in a topic name across the existing Topic Taxonomy is approximately 5.
  6. Manual topic reduction with 12 topics specified, but with “three words at most” in the LLM prompt changed to “five words at most” and with the following extra context included in the prompt: “The topic already has a parent topic, “$parent-topic-name”, which itself has a parent topic, “$grandparent-topic-name”. The topic label can assume this context, i.e. it not necessary to repeat these terms in the topic label.”
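
The six variations can be written down as data, which makes the differences between them easier to compare. The sketch below is illustrative only and is not the clustering script itself: the Variation class and helper names are hypothetical, and it assumes the three reduction modes correspond to the values None, "auto" and an integer for BERTopic's nr_topics parameter. The context wording is copied from variation 6, with underscores in the placeholder names because string.Template does not accept hyphens.

```python
from dataclasses import dataclass
from string import Template
from typing import List, Optional, Union

# Wording copied from variation 6; placeholder names are adapted (assumption).
PARENT_CONTEXT = Template(
    'The topic already has a parent topic, "$parent_topic_name", which itself '
    'has a parent topic, "$grandparent_topic_name". The topic label can assume '
    "this context, i.e. it not necessary to repeat these terms in the topic label."
)


@dataclass(frozen=True)
class Variation:
    number: int
    # Assumed to be the value passed as BERTopic(nr_topics=...).
    nr_topics: Union[int, str, None]
    label_length: str = "three words at most"
    parent_context: bool = False


VARIATIONS = [
    Variation(1, None),    # no topic reduction
    Variation(2, "auto"),  # automatic topic reduction
    Variation(3, 12),      # manual reduction, 12 topics
    Variation(4, 6),       # manual reduction, 6 topics
    Variation(5, 12, "five words at most"),
    Variation(6, 12, "five words at most", parent_context=True),
]


def generated_topic_count(variation: Variation) -> Optional[int]:
    """Topics left once the outliers take one slot; None if not fixed in advance."""
    if isinstance(variation.nr_topics, int):
        return variation.nr_topics - 1
    return None


def prompt_additions(variation: Variation, parent: str, grandparent: str) -> List[str]:
    """Prompt fragments that differ between variations (the base prompt is not shown)."""
    parts = [variation.label_length]
    if variation.parent_context:
        parts.append(
            PARENT_CONTEXT.substitute(
                parent_topic_name=parent, grandparent_topic_name=grandparent
            )
        )
    return parts
```

For example, generated_topic_count(VARIATIONS[2]) gives 11 and generated_topic_count(VARIATIONS[3]) gives 5, matching the counts stated in the list above.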

Results

The results for each of the selected topics can be viewed here:

Further work