Case Study: Processing Historical Weather Pattern Data
Posted by Chris Moffitt in articles
Introduction
The main purpose of this blog is to show people how to use Python to solve real world problems. Over the years, I have been fortunate enough to hear from readers about how they have used tips and tricks from this site to solve their own problems. In this post, I am extremely delighted to present a real world case study. I hope it will give you some ideas about how you can apply these concepts to your own problems.
This example comes from Michael Biermann from Germany. He had the challenging task of trying to gather detailed historical weather data in order to do analysis on the relationship between air temperature and power consumption. This article will show how he used a pipeline of Python programs to automate the process of collecting, cleaning and processing gigabytes of weather data in order to perform his analysis.
Problem Background
I will turn it over to Michael to give the background for this problem.
Hi, I’m Michael, CEO of a company providing services to energy providers, especially focusing on electrical power and gas. I wanted to do an ex-post analysis to get deeper insights into the deviation of the power consumption of electrical heating systems in comparison to the air temperature. Since we provide power to other companies, we need to have a good grasp on the power consumption, which correlates to the air temperature. In short, I needed to know how well I can predict the long term temperatures and how much deviation is to be expected.
To be able to do this analysis, I needed historical temperatures, which are supplied by the German weather service, DWD. Since it would be really time consuming to download all the files and extract them by hand, I decided to give this a shot with Python. I know a few things about programming, but I am pretty far from a professional programmer. The process was a lot of trial and error, but this project turned out to be exactly the right fit for this approach. I use a lot of hardcore Excel analysis, fetching and munching data with Power Query M, but this was clearly over the limit to what can be done in Excel.
I am really happy with the results. There is hardly anything as satisfying as letting the computer do the hard work for the next 20 min, while grabbing a cup of coffee.
I am also really happy to have learned a few more things about web scraping, because I can use it in future projects to automate data fetching.
Here is a visual to help understand the process Michael created:
If you are interested in following along, all of the code examples are available here.
Downloading the Data
The first notebook in the pipeline is
1-dwd_konverter_download
. This notebook pulls
historical temperature data from the German Weather Service (DWD) server and formats it for
future use in other projects.
The data is delivered in hourly frequencies in a .zip file for each of the available weather stations. To use the data, we need everything in a single .csv file with all stations side-by-side. Also, we need the daily average.
To reduce computing time, we also crop all data earlier than 2007.
For the purposes of this article, I have limited the download to only 10 files but the full data set is over 600 files.
import requests
import re
from bs4 import BeautifulSoup
from pathlib import Path
# Set base values
download_folder = Path.cwd() / 'download'
base_url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/'
# Initiate Session and get the Index-Page
with requests.Session() as s:
resp = s.get(base_url)
# Parse the Index-Page for all relevant <a href>
soup = BeautifulSoup(resp.content)
links = soup.findAll("a", href=re.compile("stundenwerte_TU_.*_hist.zip"))
# For testing, only download 10 files
file_max = 10
dl_count = 0
#Download the .zip files to the download_folder
for link in links:
zip_response = requests.get(base_url + link['href'], stream=True)
# Limit the downloads while testing
dl_count += 1
if dl_count > file_max:
break
with open(Path(download_folder) / link['href'], 'wb') as file:
for chunk in zip_response.iter_content(chunk_size=128):
file.write(chunk)
print('Done')
This portion of code will parse the download page and find all of the zip files with the name
studenwerte_TU
and save them in a
download
directory.
Extracting the Data
After the first step is completed, the download directory contains multiple zip files.
The second notebook in the process is
2-dwd_konverter_extract
which will search each
zip file for a .txt file that contains the actual temperature values.
The program will then extract each file and move to the
import
directory for further processing.
from pathlib import Path
import glob
import re
from zipfile import ZipFile
# Folder definitions
download_folder = Path.cwd() / 'download'
import_folder = Path.cwd() / 'import'
# Find all .zip files and generate a list
unzip_files = glob.glob('download/stundenwerte_TU_*_hist.zip')
# Set the name pattern of the file we need
regex_name = re.compile('produkt.*')
# Open all files, look for files that match ne regex pattern, extract to 'import'
for file in unzip_files:
with ZipFile(file, 'r') as zipObj:
list_of_filenames = zipObj.namelist()
extract_filename = list(filter(regex_name.match, list_of_filenames))[0]
zipObj.extract(extract_filename, import_folder)
display('Done')
After running this script, the
import
directory will contain text files that
look like this:
STATIONS_ID;MESS_DATUM;QN_9;TT_TU;RF_TU;eor
3;1950040101; 5; 5.7; 83.0;eor
3;1950040102; 5; 5.6; 83.0;eor
3;1950040103; 5; 5.5; 83.0;eor
3;1950040104; 5; 5.5; 83.0;eor
3;1950040105; 5; 5.8; 85.0;eor
Building the DataFrame
Now that we have isolated the data we need, we must format it for further analysis.
There are three steps in this notebook
3-dwd_konverter_build_df
:
Process Individual Files
The files are imported into a single DataFrame, stripped of unnecessary columns and filtered by date.
Then we set a
DateTimeIndex
and concatenate them into the
main_df
. Because the loop takes a
long time, we output some status messages, to ensure the process is still running.
Process the concatenated main_df
Then we display some info of the
main_df
so we can ensure that there are no errors, mainly
to ensure all data-types are recognized correctly. Also, we drop duplicate entries, in case
some of the .csv files were accidentally duplicated during the development process.
Unstack and export
For the final step, we unstack the
main_df
and save it to a .csv and a .pkl file for the
next step in the analysis process. Also, we display some output to get a grasp of what is
going on.
Now let’s look at the code:
import numpy as np
import pandas as pd
from IPython.display import clear_output
from pathlib import Path
import glob
import_files = glob.glob('import/*')
out_file = Path.cwd() / "export_uncleaned" / "to_clean"
obsolete_columns = [
'QN_9',
'RF_TU',
'eor'
]
main_df = pd.DataFrame()
i = 1
for file in import_files:
# Read in the next file
df = pd.read_csv(file, delimiter=";")
# Prepare the df before merging (Drop obsolete, convert to datetime, filter to date, set index)
df.drop(columns=obsolete_columns, inplace=True)
df["MESS_DATUM"] = pd.to_datetime(df["MESS_DATUM"], format="%Y%m%d%H")
df = df[df['MESS_DATUM']>= "2007-01-01"]
df.set_index(['MESS_DATUM', 'STATIONS_ID'], inplace=True)
# Merge to the main_df
main_df = pd.concat([main_df, df])
# Display some status messages
clear_output(wait=True)
display('Finished file: {}'.format(file), 'This is file {}'.format(i))
display('Shape of the main_df is: {}'.format(main_df.shape))
i+=1
# Check if all types are correct
display(main_df['TT_TU'].apply(lambda x: type(x).__name__).value_counts())
# Make sure that to files or observations a duplicates, eg. scan the index for duplicate entries.
# The ~ is a bitwise operation, meaning it flips all bits.
main_df = main_df[~main_df.index.duplicated(keep='last')]
# Unstack the main_df
main_df = main_df.unstack('STATIONS_ID')
display('Shape of the main_df is: {}'.format(main_df.shape))
# Save main_df to a .csv file and a pickle to continue working in the next step
main_df.to_pickle(Path(out_file).with_suffix('.pkl'))
main_df.to_csv(Path(out_file).with_suffix('.csv'), sep=";")
display(main_df.head())
display(main_df.describe())
As this program runs, here is some of the progress output:
'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt' 'This is file 10' 'Shape of the main_df is: (771356, 1)' float 771356 Name: TT_TU, dtype: int64 'Shape of the main_df is: (113952, 9)'
Here is what the final DataFrame looks like:
TT_TU | |||||||||
---|---|---|---|---|---|---|---|---|---|
STATIONS_ID | 3 | 44 | 71 | 73 | 78 | 91 | 96 | 102 | 125 |
MESS_DATUM | |||||||||
2007-01-01 00:00:00 | 11.4 | NaN | NaN | NaN | 11.0 | 9.4 | NaN | 9.7 | NaN |
2007-01-01 01:00:00 | 12.0 | NaN | NaN | NaN | 11.4 | 9.6 | NaN | 10.4 | NaN |
2007-01-01 02:00:00 | 12.3 | NaN | NaN | NaN | 9.4 | 10.0 | NaN | 9.9 | NaN |
2007-01-01 03:00:00 | 11.5 | NaN | NaN | NaN | 9.3 | 9.7 | NaN | 9.5 | NaN |
2007-01-01 04:00:00 | 9.6 | NaN | NaN | NaN | 8.6 | 10.2 | NaN | 8.9 | NaN |
At the end of this step, we have the file in a condensed format we can use for analysis.
Final Processing
The data contains some errors, which need to be cleaned. You can see, by looking at the
output of
main_df.describe(),
that the minimum temperature on some
stations is -999. That means that there is no plausible measurement for this particular
hour. We change this to
np.nan,
so that we can safely calculate the average daily value
in the next step.
Once these values are corrected, we need to resample to daily measurements. Pandas
resample
makes this really simple.
import numpy as np
import pandas as pd
from pathlib import Path
# Import and export paths
pkl_file = Path.cwd() / "export_uncleaned" / "to_clean.pkl"
cleaned_file = Path.cwd() / "export_cleaned" / "cleaned.csv"
# Read in the pickle file from the last cell
cleaning_df = pd.read_pickle(pkl_file)
# Replace all values with "-999", which indicate missing data
cleaning_df.replace(to_replace=-999, value=np.nan, inplace=True)
# Resample to daily frequency
cleaning_df = cleaning_df.resample('D').mean().round(decimals=2)
# Save as .csv
cleaning_df.to_csv(cleaned_file, sep=";", decimal=",")
# Show some results for verification
display(cleaning_df.loc['2011-12-31':'2012-01-04'])
display(cleaning_df.describe())
display(cleaning_df)
Here is the final DataFrame with daily average values for the stations:
TT_TU | |||||||||
---|---|---|---|---|---|---|---|---|---|
STATIONS_ID | 3 | 44 | 71 | 73 | 78 | 91 | 96 | 102 | 125 |
MESS_DATUM | |||||||||
2011-12-31 | NaN | 3.88 | 2.76 | 1.19 | 4.30 | 2.43 | NaN | 3.80 | NaN |
2012-01-01 | NaN | 10.90 | 8.14 | 4.03 | 10.96 | 10.27 | NaN | 9.01 | NaN |
2012-01-02 | NaN | 7.41 | 6.18 | 4.77 | 7.57 | 7.77 | NaN | 6.48 | 4.66 |
2012-01-03 | NaN | 6.14 | 3.61 | 4.46 | 6.38 | 5.28 | NaN | 5.63 | 3.51 |
2012-01-04 | NaN | 5.80 | 2.48 | 4.45 | 5.46 | 4.57 | NaN | 5.85 | 1.94 |
Summary
There are several aspects of this case study that I really like.
- Michael was not an expert programmer and decided to dedicate himself to learning the Python necessary for solving this problem.
- It took some time for him to learn how to accomplish multiple tasks but he persevered through all the challenges and built a complete solution.
- This was a real world problem that would be difficult to solve with other tools but could be automated with very few lines of Python code.
- The process could be time consuming to run so it’s broken down into multiple stages. This is a great idea to apply to other problems. This previous article actually served as the inspiration for many of the techniques used in the solution.
- This solution brings together many different concepts including web scraping, downloading files, working with zip files and cleaning & analyzing data with pandas.
- Michael now has a new skill that he can apply to other problems in his business.
Finally, I love this quote from Michael:
There is hardly anything as satisfying as letting the computer do the hard work for the next 20 min, while grabbing a cup of coffee.
I agree 100%. Thank you Michael for taking the time to share such a great example! I hope it gives you some ideas to apply to your own projects.
Comments