
Mon 12 October 2020

Case Study: Processing Historical Weather Pattern Data

Posted by Chris Moffitt in articles   


Introduction

The main purpose of this blog is to show people how to use Python to solve real world problems. Over the years, I have been fortunate enough to hear from readers about how they have used tips and tricks from this site to solve their own problems. In this post, I am extremely delighted to present a real world case study. I hope it will give you some ideas about how you can apply these concepts to your own problems.

This example comes from Michael Biermann from Germany. He had the challenging task of trying to gather detailed historical weather data in order to do analysis on the relationship between air temperature and power consumption. This article will show how he used a pipeline of Python programs to automate the process of collecting, cleaning and processing gigabytes of weather data in order to perform his analysis.

Problem Background

I will turn it over to Michael to give the background for this problem.

Hi, I’m Michael, CEO of a company providing services to energy providers, especially focusing on electrical power and gas. I wanted to do an ex-post analysis to get deeper insights into the deviation of the power consumption of electrical heating systems in comparison to the air temperature. Since we provide power to other companies, we need to have a good grasp on the power consumption, which correlates to the air temperature. In short, I needed to know how well I can predict the long term temperatures and how much deviation is to be expected.

To be able to do this analysis, I needed historical temperatures, which are supplied by the German weather service, DWD. Since it would be really time consuming to download all the files and extract them by hand, I decided to give this a shot with Python. I know a few things about programming, but I am pretty far from a professional programmer. The process was a lot of trial and error, but this project turned out to be exactly the right fit for this approach. I do a lot of hardcore Excel analysis, fetching and munging data with Power Query M, but this was clearly beyond the limit of what can be done in Excel.

I am really happy with the results. There is hardly anything as satisfying as letting the computer do the hard work for the next 20 min, while grabbing a cup of coffee.

I am also really happy to have learned a few more things about web scraping, because I can use it in future projects to automate data fetching.

Here is a visual to help understand the process Michael created:

[Diagram: Data Processing Pipeline]

If you are interested in following along, all of the code examples are available here.

Downloading the Data

The first notebook in the pipeline is 1-dwd_konverter_download. This notebook pulls historical temperature data from the German Weather Service (DWD) server and formats it for future use in other projects.

The data is delivered at an hourly frequency, in one .zip file per available weather station. To use the data, we need everything in a single .csv file with all stations side by side. We also need the daily average.

To reduce computing time, we also crop all data earlier than 2007.

For the purposes of this article, I have limited the download to only 10 files but the full data set is over 600 files.

import requests
import re
from bs4 import BeautifulSoup
from pathlib import Path

# Set base values and make sure the download folder exists
download_folder = Path.cwd() / 'download'
download_folder.mkdir(exist_ok=True)
base_url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/'

# Initiate Session and get the Index-Page
with requests.Session() as s:
    resp = s.get(base_url)

# Parse the Index-Page for all relevant <a href>
soup = BeautifulSoup(resp.content, 'html.parser')
links = soup.find_all("a", href=re.compile("stundenwerte_TU_.*_hist.zip"))

# For testing, only download 10 files
file_max = 10
dl_count = 0

# Download the .zip files to the download_folder
for link in links:
    # Limit the downloads while testing (check before requesting the file)
    dl_count += 1
    if dl_count > file_max:
        break
    zip_response = requests.get(base_url + link['href'], stream=True)
    with open(download_folder / link['href'], 'wb') as file:
        for chunk in zip_response.iter_content(chunk_size=128):
            file.write(chunk)

print('Done')

This portion of code parses the download page, finds all of the zip files whose names contain stundenwerte_TU and saves them to the download directory.
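
As a quick illustration of the link filter, the regex matches archive names like the one below (the file name is only an illustrative example), and listing the download folder afterwards confirms which files actually arrived:

import re
from pathlib import Path

# The same pattern used above; the file name is only an illustrative example
pattern = re.compile("stundenwerte_TU_.*_hist.zip")
print(bool(pattern.search("stundenwerte_TU_00003_19500401_20110331_hist.zip")))  # True

# Quick sanity check: list the .zip files that actually landed in the download folder
for zip_path in sorted((Path.cwd() / 'download').glob('stundenwerte_TU_*_hist.zip')):
    print(zip_path.name)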

Extracting the Data

After the first step is completed, the download directory contains multiple zip files. The second notebook in the process is 2-dwd_konverter_extract, which searches each zip file for a .txt file that contains the actual temperature values.

The program then extracts each matching file and moves it to the import directory for further processing.

from pathlib import Path
import glob
import re
from zipfile import ZipFile

# Folder definitions
download_folder = Path.cwd() / 'download'
import_folder = Path.cwd() / 'import'

# Find all .zip files and generate a list
unzip_files = glob.glob('download/stundenwerte_TU_*_hist.zip')

# Set the name pattern of the file we need
regex_name = re.compile('produkt.*')

# Open all files, look for files that match the regex pattern, extract to 'import'
for file in unzip_files:
    with ZipFile(file, 'r') as zipObj:
        list_of_filenames = zipObj.namelist()
        extract_filename = list(filter(regex_name.match, list_of_filenames))[0]
        zipObj.extract(extract_filename, import_folder)

display('Done')

After running this script, the import directory will contain text files that look like this:

STATIONS_ID;MESS_DATUM;QN_9;TT_TU;RF_TU;eor
        3;1950040101;    5;   5.7;  83.0;eor
        3;1950040102;    5;   5.6;  83.0;eor
        3;1950040103;    5;   5.5;  83.0;eor
        3;1950040104;    5;   5.5;  83.0;eor
        3;1950040105;    5;   5.8;  85.0;eor
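
The MESS_DATUM column above packs the full timestamp into a single number (year, month, day, hour), which is why the next notebook parses it with format="%Y%m%d%H". A minimal illustration:

import pandas as pd

# 1950040101 -> 1 April 1950, 01:00
print(pd.to_datetime("1950040101", format="%Y%m%d%H"))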

Building the DataFrame

Now that we have isolated the data we need, we must format it for further analysis.

There are three steps in the notebook 3-dwd_konverter_build_df:

Process Individual Files

Each file is read into a DataFrame, stripped of unnecessary columns and filtered by date. Then we set a DateTimeIndex and concatenate it into the main_df. Because the loop takes a long time, we output status messages to confirm the process is still running.

Process the concatenated main_df

Then we display some info about the main_df so we can confirm there are no errors, mainly that all data types are recognized correctly. We also drop duplicate entries, in case some of the files were accidentally duplicated during the development process.

Unstack and export

For the final step, we unstack the main_df and save it to a .csv and a .pkl file for the next step in the analysis process. Also, we display some output to get a grasp of what is going on.

Now let’s look at the code:

import numpy as np
import pandas as pd
from IPython.display import clear_output

from pathlib import Path
import glob


import_files = glob.glob('import/*')
out_file = Path.cwd() / "export_uncleaned" / "to_clean"

obsolete_columns = [
    'QN_9',
    'RF_TU',
    'eor'
]

main_df = pd.DataFrame()
i = 1

for file in import_files:

    # Read in the next file
    df = pd.read_csv(file, delimiter=";")

    # Prepare the df before merging (Drop obsolete, convert to datetime, filter to date, set index)
    df.drop(columns=obsolete_columns, inplace=True)
    df["MESS_DATUM"] = pd.to_datetime(df["MESS_DATUM"], format="%Y%m%d%H")
    df = df[df['MESS_DATUM']>= "2007-01-01"]
    df.set_index(['MESS_DATUM', 'STATIONS_ID'], inplace=True)

    # Merge to the main_df
    main_df = pd.concat([main_df, df])

    # Display some status messages
    clear_output(wait=True)
    display('Finished file: {}'.format(file), 'This is file {}'.format(i))
    display('Shape of the main_df is: {}'.format(main_df.shape))
    i+=1

# Check if all types are correct
display(main_df['TT_TU'].apply(lambda x: type(x).__name__).value_counts())

# Make sure no files or observations are duplicated by scanning the index for duplicate entries.
# The ~ negates the boolean mask, so only rows whose index is not duplicated are kept.
main_df = main_df[~main_df.index.duplicated(keep='last')]


# Unstack the main_df
main_df = main_df.unstack('STATIONS_ID')
display('Shape of the main_df is: {}'.format(main_df.shape))

# Save main_df to a .csv file and a pickle to continue working in the next step
main_df.to_pickle(Path(out_file).with_suffix('.pkl'))
main_df.to_csv(Path(out_file).with_suffix('.csv'), sep=";")

display(main_df.head())
display(main_df.describe())

As this program runs, here is some of the progress output:

'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt'
'This is file 10'
'Shape of the main_df is: (771356, 1)'
float    771356
Name: TT_TU, dtype: int64
'Shape of the main_df is: (113952, 9)'

Here is what the final DataFrame looks like:

TT_TU
STATIONS_ID 3 44 71 73 78 91 96 102 125
MESS_DATUM
2007-01-01 00:00:00 11.4 NaN NaN NaN 11.0 9.4 NaN 9.7 NaN
2007-01-01 01:00:00 12.0 NaN NaN NaN 11.4 9.6 NaN 10.4 NaN
2007-01-01 02:00:00 12.3 NaN NaN NaN 9.4 10.0 NaN 9.9 NaN
2007-01-01 03:00:00 11.5 NaN NaN NaN 9.3 9.7 NaN 9.5 NaN
2007-01-01 04:00:00 9.6 NaN NaN NaN 8.6 10.2 NaN 8.9 NaN

At the end of this step, we have the file in a condensed format we can use for analysis.
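
If the unstack step is unfamiliar, here is a toy version of what it does to the (MESS_DATUM, STATIONS_ID) index, using a couple of the values from the table above:

import pandas as pd

# Two timestamps x two stations in long format, like the concatenated main_df
toy = pd.DataFrame(
    {"TT_TU": [11.4, 9.7, 12.0, 10.4]},
    index=pd.MultiIndex.from_product(
        [pd.to_datetime(["2007-01-01 00:00", "2007-01-01 01:00"]), [3, 102]],
        names=["MESS_DATUM", "STATIONS_ID"],
    ),
)

# unstack() pivots the station ids into columns: one row per timestamp,
# one TT_TU column per station
print(toy.unstack("STATIONS_ID"))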

Final Processing

The data contains some errors, which need to be cleaned. You can see from the output of main_df.describe() that the minimum temperature at some stations is -999, which means there is no plausible measurement for that particular hour. We change these values to np.nan so that we can safely calculate the average daily value in the next step.

Once these values are corrected, we need to resample to daily measurements. Pandas resample makes this really simple.

import numpy as np
import pandas as pd
from pathlib import Path

# Import and export paths
pkl_file = Path.cwd() / "export_uncleaned" / "to_clean.pkl"
cleaned_file = Path.cwd() / "export_cleaned" / "cleaned.csv"

# Read in the pickle file from the last cell
cleaning_df = pd.read_pickle(pkl_file)

# Replace all -999 values, which indicate missing data, with np.nan
cleaning_df.replace(to_replace=-999, value=np.nan, inplace=True)

# Resample to daily frequency
cleaning_df = cleaning_df.resample('D').mean().round(decimals=2)

# Save as .csv
cleaning_df.to_csv(cleaned_file, sep=";", decimal=",")

# Show some results for verification
display(cleaning_df.loc['2011-12-31':'2012-01-04'])
display(cleaning_df.describe())
display(cleaning_df)

Here is the final DataFrame with daily average values for the stations:

TT_TU
STATIONS_ID 3 44 71 73 78 91 96 102 125
MESS_DATUM
2011-12-31 NaN 3.88 2.76 1.19 4.30 2.43 NaN 3.80 NaN
2012-01-01 NaN 10.90 8.14 4.03 10.96 10.27 NaN 9.01 NaN
2012-01-02 NaN 7.41 6.18 4.77 7.57 7.77 NaN 6.48 4.66
2012-01-03 NaN 6.14 3.61 4.46 6.38 5.28 NaN 5.63 3.51
2012-01-04 NaN 5.80 2.48 4.45 5.46 4.57 NaN 5.85 1.94
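
If you want to pick the cleaned file up again in a later analysis step, it can be read back with matching separator and decimal settings. This is a minimal sketch; the header and index arguments are an assumption based on the two-level column header (TT_TU / STATIONS_ID) that pandas writes for the unstacked DataFrame:

import pandas as pd
from pathlib import Path

cleaned_file = Path.cwd() / "export_cleaned" / "cleaned.csv"

# The file was written with ";" as separator and "," as decimal mark,
# with a two-level column header and MESS_DATUM as the index
daily_df = pd.read_csv(
    cleaned_file,
    sep=";",
    decimal=",",
    header=[0, 1],
    index_col=0,
    parse_dates=True,
)

print(daily_df.describe())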

Summary

There are several aspects of this case study that I really like.

  • Michael was not an expert programmer and decided to dedicate himself to learning the Python necessary for solving this problem.
  • It took some time for him to learn how to accomplish multiple tasks but he persevered through all the challenges and built a complete solution.
  • This was a real world problem that would be difficult to solve with other tools but could be automated with very few lines of Python code.
  • The process could be time consuming to run, so it’s broken down into multiple stages. This is a great idea to apply to other problems. This previous article actually served as the inspiration for many of the techniques used in the solution.
  • This solution brings together many different concepts including web scraping, downloading files, working with zip files and cleaning & analyzing data with pandas.
  • Michael now has a new skill that he can apply to other problems in his business.

Finally, I love this quote from Michael:

There is hardly anything as satisfying as letting the computer do the hard work for the next 20 min, while grabbing a cup of coffee.

I agree 100%. Thank you Michael for taking the time to share such a great example! I hope it gives you some ideas to apply to your own projects.
