Introduction
This website documents a set of document clustering experiments run against the GOV.UK Topic Taxonomy using the Python BERTopic library, via the code in this GitHub repository.
The Taxonomy Health page in the Content Tagger application indicates that a significant number of taxons have “too much content tagged to them”, i.e. they are associated with more than 300 content items.
The BERTopic library was used to generate candidate clusters for splitting the content items into new child taxons (i.e. sub-topics) of the taxon in question. If a sufficient number of content items could be pushed down into new sub-topics, this would make the topic healthy according to the current heuristic.
Topic selection
A couple of unhealthy topic taxons with approx 1,000 content items each were identified. These were both level 2 “leaf” taxons, so there was clearly “room” for the creation of new sub-topics without exceeding the maximum depth of 5.
Common configuration
The BERTopic clustering script was run against the data for these example taxons as follows:
- Using a database dump from the Content Store application from January 2026 as the source of content items.
- A fixed random number generator seed was used to improve reproducibility across runs, making comparisons easier.
- Using the title (repeated twice for emphasis) and the body (stripped of HTML tags) of each content item.
- No content was included from PDF/HTML attachments associated with each content item.
- English stop words were excluded from the keywords generated for each cluster.
- The keywords for each cluster were constrained to be either one or two words.
- The keywords for each cluster were converted into human-friendly topic names using the built-in LLM integration (OpenAI’s gpt-4o-mini model), with 4 example documents per cluster, each limited to 2,000 tokens.
- The default LLM prompt was simplified, but the “three words at most” constraint (for the topic name) was initially left in place.
- The documents not in a cluster were listed under an “Outliers” topic.
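The text fed to BERTopic for each content item follows the title-and-body recipe described above. As a rough, self-contained sketch (the function name and the simple regex-based tag stripping are illustrative; the real preprocessing lives in the repository and may use a proper HTML parser):

```python
import re


def build_document(title: str, body_html: str) -> str:
    """Build the text passed to BERTopic for one content item.

    Illustrative sketch: the title is repeated twice to give it extra
    weight, and the body has its HTML tags stripped.
    """
    body_text = re.sub(r"<[^>]+>", " ", body_html)      # strip HTML tags
    body_text = re.sub(r"\s+", " ", body_text).strip()  # collapse whitespace
    return f"{title} {title} {body_text}"


doc = build_document(
    "Import duty relief",
    "<p>Guidance on <em>relief</em> from import duty.</p>",
)
# doc starts with the title twice, followed by the tag-free body text.
```

Stop-word removal and the one-or-two-word keyword constraint are handled later, at the keyword-extraction stage, rather than in this document-building step.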
Configuration variations
The clustering script was run with the following variations:
- No topic reduction.
- Automatic topic reduction.
- Manual topic reduction with 12 topics specified. 12 topics is the maximum number of child taxons allowed by another taxonomy health metric. The outliers are counted as a topic, so only 11 topics are actually generated.
- Manual topic reduction with 6 topics specified. The outliers are counted as a topic, so only 5 topics are actually generated.
- Manual topic reduction with 12 topics specified, but with “three words at most” in the LLM prompt changed to “five words at most”. The average number of words in a topic name across the existing Topic taxonomy is approx 5.
- Manual topic reduction with 12 topics specified, but with “three words at most” in the LLM prompt changed to “five words at most” and with the following extra context included in the prompt: “The topic already has a parent topic, “$parent-topic-name”, which itself has a parent topic, “$grandparent-topic-name”. The topic label can assume this context, i.e. it not necessary to repeat these terms in the topic label.”
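The last variation splices extra parent-topic context into the LLM prompt. A minimal sketch of how that substitution might work (the base prompt wording and placeholder tokens are assumptions loosely modelled on BERTopic's OpenAI representation defaults; the example topic names are invented):

```python
# Base prompt sketch: "[KEYWORDS]" and "[DOCUMENTS]" stand in for the
# cluster keywords and example documents that BERTopic fills in.
BASE_PROMPT = (
    "I have a topic described by the following keywords: [KEYWORDS]\n"
    "Here are some example documents from the topic: [DOCUMENTS]\n"
    "Give a label of five words at most for this topic."
)

# Extra context from the final configuration variation above.
PARENT_CONTEXT = (
    'The topic already has a parent topic, "{parent}", which itself has '
    'a parent topic, "{grandparent}". The topic label can assume this '
    "context, i.e. it is not necessary to repeat these terms in the "
    "topic label."
)


def build_prompt(parent: str, grandparent: str) -> str:
    """Combine the base prompt with the parent-topic context."""
    return BASE_PROMPT + "\n" + PARENT_CONTEXT.format(
        parent=parent, grandparent=grandparent
    )


prompt = build_prompt("Customs declarations", "Import and export")
```

The intent of the extra context is to stop the LLM wasting its five-word budget restating terms the taxonomy hierarchy already implies.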
Results
The results for each of the selected topics can be viewed here:
Further work
- As recommended in the warning in the BERTopic documentation for topic reduction, it might be better to tune the min_cluster_size & nr_topics parameters to reduce the number of clusters at an earlier stage of the process. Apparently HDBSCAN (used by BERTopic by default) is quite sensitive to the values of these two parameters relative to the text being clustered. It might be worth using TopicTuner to optimize the values.
- Allow the user to modify the suggested topic names and then re-run the script using the new values as seed topics.
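One way to feed user-edited names back in would be via BERTopic's guided topic modelling, which accepts a list of seed keyword lists. A hypothetical sketch (the edited names are invented examples, and splitting each name into its words is just one simple way to derive seed keywords):

```python
# User-edited topic names from a previous run (invented examples).
edited_names = [
    "Import duty relief",
    "Customs declarations guidance",
]

# BERTopic's seed_topic_list expects one list of seed keywords per topic;
# lower-casing and splitting each edited name is a simple way to get there.
seed_topic_list = [name.lower().split() for name in edited_names]

# The re-run would then look something like:
#   topic_model = BERTopic(seed_topic_list=seed_topic_list, ...)
```

Whether word-splitting the names gives good seeds would need experimentation; curated keyword lists per topic might work better.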
- Include content from PDF/HTML attachments. The script already supports this with the --with-pdf-attachments & --with-html-attachments command line options.
- Modify the LLM prompt to generate topic descriptions as well as names.