Tutorial: data-driven news discourse analysis with Python (Part 1)

Karlis Kanders
Data science at Nesta
11 min read · Sep 25, 2023


Image credit: Jana Dobreva for Fine Acts

At Nesta, we wish to understand public narratives around issues related to our three missions in education, health and sustainability. Previously, as part of our Innovation Sweet Spots projects, we have characterised news and political discourses around low-carbon heating technologies and novel food technologies. In these projects, we aimed to cut through the hype and compare the trends in news coverage and parliamentary debates with trends in research and investment linked to technologies that could have an impact on Nesta’s mission goals. We are curious about both the intensity of the discourse, meaning how often one technology is mentioned compared with other comparable innovations, and how the technology is being talked about.

Others have used discourse analysis to characterise the debates around climate change legislation and the discourse of obesity in news media according to different political leanings, and to compare how the media writes about men versus women working in the creative industries, among many other examples.

In this tutorial, I will show you how you can get started with news discourse analysis using Python, and begin piecing together the narrative around a topic of your interest. I’ll cover how to collect news articles using The Guardian API, look at the number of news articles across years, and perform topic analysis with sentence embeddings.

As our main example throughout this article, we will analyse the discourse around heat pumps, a very efficient low-carbon heating technology seen as the future of domestic heating. Nesta’s sustainable future mission is focussed on decarbonising UK homes, including facilitating the adoption of this technology.

Setting up and importing requirements

We will use Python modules that we’ve developed while working on Nesta’s Innovation Sweet Spots projects, which leverage a variety of natural language processing methods and packages, such as spaCy and TextaCy for text processing, and BERTopic for clustering text embeddings into topics.

While the code is always a work in progress, we have used these modules in a couple of projects already and we hope it might be useful for your projects as well. Feel free to fork and adapt the code according to your requirements.

In parallel to reading this tutorial, you can also replicate this analysis in this Google Colab notebook.

Once you’re all set up, import the public discourse analysis utils package.

from innovation_sweet_spots.utils.pd import pd_analysis_utils as au

Getting the data: Using The Guardian Open Platform

First, we need to get some discourse data to analyse! We will use news articles published by The Guardian newspaper and made available on their Open Platform. While this news source can be seen as somewhat politically left-leaning, it has the advantage of covering a wide range of technologies and innovations, and hence in the past we’ve found it useful as a proxy for the wider news discourse. The Guardian is also, to the best of our knowledge, the only major UK newspaper to make its text freely available for research. Nonetheless, you can apply the same methods to analyse other text data, for example, transcripts of policy debates in parliament.

To access Guardian news articles, you should set up a Guardian API key. Setting your key to "test" might work, but you should apply for your own developer key.

# Replace "test" with your key
API_KEY = "test"

We will fetch news articles from The Guardian about heat pumps, the low-carbon heating technology. However, before we download all the articles mentioning heat pumps, let’s check how many articles there are in total. We’ll use the search_content function from the guardian helper module, which is a simple wrapper we wrote for The Guardian API, and set only_first_page=True.

test_articles = au.guardian.search_content(
    search_term="heat pumps",
    api_key=API_KEY,
    only_first_page=True,
    use_cached=False,
    save_to_cache=False,
)
Collected 100 (14%) results

This will yield a list test_articles of the 100 most recent articles and, at the time of writing this tutorial, report that this amounts to about 14% of the total number of articles. This means that there are more than 700 articles mentioning “heat pumps”.

The parameter save_to_cache controls whether the data is saved locally, and use_cached controls whether you’re accessing the locally saved data (by default, the articles will be stored in inputs/data/guardian/api_results). By caching the query results locally, you can reduce the number of API calls. Note, however, that your cached version might become out of date and hence, when updating the analysis, you should set use_cached to False.
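For example, on a subsequent run you can read the cached results instead of querying the API again. Here is a minimal sketch, reusing the same search_content call as above:

# A sketch of a follow-up run that reads the locally cached results
# (stored by default in inputs/data/guardian/api_results)
cached_articles = au.guardian.search_content(
    search_term="heat pumps",
    api_key=API_KEY,
    only_first_page=True,
    use_cached=True,  # read the locally cached query results if available
    save_to_cache=False,
)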

Each article is a dictionary with multiple fields (see the code snippet below), such as a unique id, headline and body, as well as information about the author. The fields returned by the API can be specified in the API call using the parameters show-fields and show-tags. For this purpose, you can define your default fields using a config file (see an example of our config file here).

{
  "id": "environment/2023/jul/17/uk-installation-heat-pumps-report",
  "type": "article",
  "sectionId": "environment",
  "sectionName": "Environment",
  "webPublicationDate": "2023-07-17T04:00:18Z",
  "webTitle": "UK installations of heat pumps 10 times lower than in France, report finds",
  "webUrl": "https://www.theguardian.com/environment/2023/jul/17/uk-installation-heat-pumps-report",
  "apiUrl": "https://content.guardianapis.com/environment/2023/jul/17/uk-installation-heat-pumps-report",
  "fields": {
    "headline": "UK installations of heat pumps 10 times lower than in France, report finds",
    "trailText": "Analysts call on government to make pumps mandatory for all new homes and scale up grants for installation in existing properties",
    "body": "<p>The UK is lagging far behind France and other EU countries in installing heat pumps, research has shown, with less than a tenth of the number of installations despite having similar markets.</p> <p>Only 55,000 heat pumps...
  },
  "tags": [
    {
      "id": "profile/fiona-harvey",
      "type": "contributor",
      "webTitle": "Fiona Harvey",
      "webUrl": "https://www.theguardian.com/profile/fiona-harvey",
      "apiUrl": "https://content.guardianapis.com/profile/fiona-harvey",
      "references": [],
      "bio": "<p>Fiona Harvey is an environment editor at the Guardian</p>",
      "bylineImageUrl": "https://uploads.guim.co.uk/2022/12/08/Fiona_Harvey_old_image.jpg",
      "bylineLargeImageUrl": "https://uploads.guim.co.uk/2022/12/08/Fiona_Harvey_old_image.png",
      "firstName": "Fiona",
      "lastName": "Harvey"
    }
  ],
  "isHosted": false,
  "pillarId": "pillar/news",
  "pillarName": "News"
}
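Since test_articles is simply a list of such dictionaries, you can inspect individual results directly, for example:

# Inspect one of the returned articles (field names as in the snippet above)
article = test_articles[0]
print(article["webTitle"])
print(article["sectionName"])
print(article["fields"]["body"][:200])  # first 200 characters of the HTML body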

Now let’s get all the articles mentioning heat pumps. For this purpose, we’ll use the get_guardian_articles function which wraps the search_content function used above and filters the articles based on their Guardian news categories (if required), separates article text data and metadata, and saves the filtered articles.

In this example, we also specify the following article categories to reduce the possibility of including irrelevant articles. All category names can be found by checking The Guardian API sections endpoint (see this example of an API call listing the sections).

# Define allowed article categories
CATEGORIES = [
    "Environment",
    "Technology",
    "Science",
    "Business",
    "Money",
    "Cities",
    "Politics",
    "Opinion",
    "UK news",
    "Life and style",
]

You can also specify multiple search terms to be included in your query. For example, in my experience it’s best to use both singular and plural forms with The Guardian API, and hence we will specify both “heat pump” and “heat pumps” as our search terms here. In this implementation, each search term is queried separately and repeated hits (i.e. the same article featuring multiple search terms) are deduplicated.

# List of search terms
SEARCH_TERMS = ["heat pump", "heat pumps"]

articles_df, articles_metadata = au.get_guardian_articles(
    # Specify the search terms
    search_terms=SEARCH_TERMS,
    # To fetch the most recent articles, set use_cached to False
    use_cached=False,
    # Specify the API key
    api_key=API_KEY,
    # Specify which news article categories we'll consider
    allowed_categories=CATEGORIES,
)

This will output an articles_df dataframe with article ids, texts and publishing dates, as well as a separate dictionary articles_metadata storing the data associated with each unique article ID, such as the title, URL and author.

articles_df.head(3)
An example of the first three rows of the news articles table, featuring columns for the unique article id, article text and publishing date and year.
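As a quick sanity check, you can also count the articles per year with plain pandas, using the columns shown in the table above:

# Count unique articles per year (assumes the 'id' and 'year' columns shown above)
articles_df.groupby("year")["id"].nunique()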

Measuring the hype: Number of news articles across years

Now that we have our news data, we can start analysing the discourse! First, we can specify the path to the analysis outputs directory, which will come in handy when revisiting the analysis in the future. Note that we are storing the analysis outputs separately from the cached search results (discussed above), in order to separate the analysis process, which is agnostic to the data sources, from the data fetching process.

# Specify the location for analysis outputs
from innovation_sweet_spots import PROJECT_DIR
OUTPUTS_DIR = PROJECT_DIR / "outputs/data/discourse_analysis_outputs"

We can then specify the name ANALYSIS_ID for this specific analysis session — all the output tables will be stored in a subfolder of OUTPUTS_DIR with the same name.

ANALYSIS_ID = "guardian_heat_pumps_tutorial"

We will then define a couple of additional filtering criteria to keep the results most relevant to our geographical context, by specifying a (non-exhaustive) list of UK-related geographic terms. As an example, we will also exclude any article that mentions Australia. The filtering at this stage is very simple and looks for exact matches of the provided strings in the original article text (see the conceptual sketch after the code below).

# Terms required to appear in the articles,
# for the articles to be considered in the analysis
REQUIRED_TERMS = [
    "UK",
    "Britain",
    "Scotland",
    "Wales",
    "England",
    "Northern Ireland",
    "Britons",
    "London",
]

# Articles with these terms will be removed from the analysis
BANNED_TERMS = ["Australia"]
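Conceptually, this filtering is equivalent to something like the sketch below; note that this is only an illustration, not the library's exact implementation, which the DiscourseAnalysis class handles for you.

# A conceptual sketch of the filtering: keep articles whose text contains
# at least one required term and none of the banned terms
has_required = articles_df["text"].apply(
    lambda text: any(term in text for term in REQUIRED_TERMS)
)
has_banned = articles_df["text"].apply(
    lambda text: any(term in text for term in BANNED_TERMS)
)
filtered_df = articles_df[has_required & ~has_banned]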

We then use the DiscourseAnalysis class, which allows us to run different analyses, from assessing search term mentions, to detecting words commonly collocated with our search terms, to topic analysis. Importantly, you can use this class for analysing not just The Guardian news articles as shown here, but in fact any time-stamped text data such as parliamentary debates, web forum discussions or online comments. The only requirement is that the input Pandas DataFrame needs to have the columns id, text, date, year.
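For instance, a hypothetical set of parliamentary debate transcripts could be prepared for the class like this (the ids, texts and dates below are invented for illustration):

import pandas as pd

# A hypothetical example of preparing other time-stamped text data for DiscourseAnalysis
my_documents = pd.DataFrame({
    "id": ["debate_001", "debate_002"],
    "text": ["...transcript mentioning heat pumps...", "...another transcript..."],
    "date": pd.to_datetime(["2022-03-01", "2022-06-15"]),
})
my_documents["year"] = my_documents["date"].dt.year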

# Set up the discourse analysis class
pda = au.DiscourseAnalysis(
    search_terms=SEARCH_TERMS,
    outputs_path=OUTPUTS_DIR,
    query_identifier=ANALYSIS_ID,
    required_terms=REQUIRED_TERMS,
    banned_terms=BANNED_TERMS,
)
# Add the input data
pda.load_documents(document_text=articles_df)

As our first analysis, we can simply consider the number of articles published per year that contain our search terms (the results for each search term are combined and deduplicated).

pda.plot_mentions(use_documents=True)
Number of articles in The Guardian featuring the terms “heat pump” or “heat pumps” across different years

While this is a very simple result, it can already reveal interesting patterns of waxing and waning interest in a particular topic. In our previous research, we interpreted these trends using hype dynamics models, which reflect the typical life cycle of peak, disappointment and recovery of expectations associated with emerging technologies. One popular example is Gartner’s hype cycle, in which the initial phase of ‘inflated expectations’ is followed by a ‘trough of disillusionment’, where interest wanes as the initial implementations fail to deliver value.

The peak around 2008, the subsequent decrease up until late 2017 and the following recovery appear to reflect this type of pattern. This prompted us to suggest back in 2021 that heat pumps might be proceeding to the so-called ‘plateau of productivity’ phase of Gartner’s hype cycle, where the initial faults of the technology have been addressed and mainstream adoption can start to take off.

Diagram of the hype cycle model. Adapted from Wikipedia

In our previous research on green technologies, we compared the trends around heat pumps with other, competing alternatives such as heating homes with hydrogen gas, which at that time showed a steep rise in news mentions. We noted the risk of consumers potentially being distracted from adopting the more mature heat pump technology by what appeared to be an early-stage hype around hydrogen.

Such interpretations, however, need to be cross-referenced with more in-depth analysis of the discourse content (as shown, for example, in the next sections of this tutorial) and sense-checked with domain experts. Hence, this type of result should be seen as a data-informed hypothesis about the innovation trends rather than a comprehensive analysis.

Finally, when considering the growth trends of news mentions, another important element is a baseline growth trend that we can use as a reference. For example, there might be a remote chance that the peaks and troughs in search term mentions are driven by some more general changes in the publisher’s yearly output. One simple option to account for this is to normalise the number of articles mentioning your search terms by the total number of articles published in any given year. This can be obtained by using the function get_total_article_counts and specifying the same article categories that we used previously.

total_counts = au.get_total_article_counts(sections=CATEGORIES, api_key=API_KEY)

The function calls The Guardian API with an empty search term, and then uses the “total” field of the response to infer the reference number of articles (which you can see in this example of an API call).
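To illustrate the underlying idea, a minimal sketch of such a query with the requests library might look as follows; the section and date range here are illustrative, and the helper function takes care of all of this for you.

import requests

# Query The Guardian API without a search term and read the "total" field
response = requests.get(
    "https://content.guardianapis.com/search",
    params={
        "api-key": API_KEY,
        "section": "environment",
        "from-date": "2022-01-01",
        "to-date": "2022-12-31",
        "page-size": 1,
    },
)
total_articles_2022 = response.json()["response"]["total"]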

After dividing the number of articles mentioning heat pumps by the total number of reference articles, we find that the shape of the trend is preserved.

Number of articles mentioning search terms normalised by the total number of articles published in a given year
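For reference, the normalisation itself is just a per-year division. The sketch below assumes that total_counts behaves like a mapping from year to total article count; the exact return format of get_total_article_counts may differ.

import pandas as pd

# A sketch of the normalisation step, assuming total_counts maps year -> total article count
mentions_per_year = articles_df.groupby("year")["id"].nunique()
normalised = mentions_per_year / pd.Series(total_counts)
normalised.plot()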

Characterising discourse topics using BERTopic

Further insight into the discourse can, of course, be gained by considering the actual text of the articles. For example, one could start by inspecting the sentences from a specific year that contain our search terms. For this purpose, you can use the combined_term_sentences dictionary, which is organised by year and contains all the sentences featuring any of the provided search terms, together with the unique id of the corresponding news article.

pda.combined_term_sentences["2022"].head(5)
Example of five sentences mentioning heat pumps in 2022

Such inspection is also essential to catch erroneous mentions that are irrelevant to your particular query — such as one of the examples above mentioning ‘heat pump tumble dryers’, which are not a home heating technology.

Nonetheless, we can also gain some insights in a way that doesn’t immediately require reading decades’ worth of articles. We can use topic modelling or clustering to automatically find themes within our news data. There are many approaches that we could use for this purpose, but here we show BERTopic, which is a convenient package with many ‘out of the box’ functionalities such as clustering, dimensionality reduction and visualisation.

Our DiscourseAnalysis class wraps around BERTopic, so creating the topic model is as simple as running the fit_topic_model() function. By default, it uses the sentences featuring our search terms (as shown above).

topic_model, docs = pda.fit_topic_model()

In the background, BERTopic is embedding the sentences using a sentence transformer model (typically the all-MiniLM-L6-v2 model), reducing their dimensionality with UMAP, and then clustering the reduced embeddings with HDBSCAN. Then, to characterise the different clusters, it finds a representational set of keywords by using a weighted version of TF-IDF.
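If you want to experiment outside our wrapper, the equivalent standalone pipeline with BERTopic itself takes only a few lines. This sketch reuses the docs list of sentences returned by fit_topic_model() above:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# A standalone sketch of the same pipeline: embed the sentences and fit BERTopic
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
standalone_model = BERTopic(embedding_model=embedding_model)
topics, probs = standalone_model.fit_transform(docs)
standalone_model.get_topic_info().head()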

In my experience, the most interesting result from BERTopic is the visualisation of the sentence embeddings and clusters, obtained by running visualize_documents():

topic_model.visualize_documents(docs)
BERTopic visualisation, where each dot is a sentence mentioning our search terms and colours indicate distinct clusters (themes); similar sentences are located closer to each other in this visualisation. Each cluster is described by three characteristic words. Grey dots between the clusters are considered noise and are not assigned to any cluster.

In this example, we find clusters related to the main topics of conversation around heat pumps, such as the different types of heat pump technology (ground source and air source), alternative heating technologies (gas boilers, hydrogen heating, biomass boilers), solar panels (which are sometimes used together with heat pumps), as well as Government grants and targets (600,000 heat pump installations by 2028). The points in colour indicate cluster cores, whereas the grey points in between are considered too noisy by the HDBSCAN algorithm to be assigned to a cluster.

Note that BERTopic employs non-deterministic algorithms, and hence you will get slightly different results between runs (you can make the output deterministic by following the steps here). Due to the nature of HDBSCAN, you might also occasionally end up with only one large cluster — in that case, simply rerun the fit_topic_model() function.
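When working with BERTopic directly, the standard way to make the results reproducible is to pass in a UMAP model with a fixed random seed; whether the fit_topic_model() wrapper exposes this option is not covered here.

from bertopic import BERTopic
from umap import UMAP

# Fix the UMAP random seed to make BERTopic's output deterministic
# (the other parameter values mirror BERTopic's defaults)
umap_model = UMAP(
    n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42
)
deterministic_model = BERTopic(umap_model=umap_model)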

Conclusion

So far we have tried a couple of ways of gaining an overview of the intensity and content of news discourse using Python. In the second part of this tutorial, we will analyse the discourse in finer detail by considering the vocabulary used together with our search terms. We will review approaches for measuring the importance of collocations, use spaCy to extract collocated linguistic features associated with our search terms, and finally consider the role of large language models in speeding up our analysis.

Thank you to Sofia Pinto, Zayn Meghji and Emily Mathieson for reviewing the article, and I’m very grateful to Jyldyz Djumalieva and Jack Rasala for their major contributions to the public discourse analysis Python modules.
