DATA SCIENCE / TIME-SERIES FORECASTING

HCrystalBall — a unified interface to time-series forecasting

Python’s time-series forecasting ecosystem under one scikit-learn-compatible umbrella.

Michal Chromčák


HCrystalBall logo
A Python time-series forecasting library

Developed by the Data Science Team at HeidelbergCement

A producer of cement and other building materials since 1874, HeidelbergCement may not be your usual suspect when it comes to open-sourcing software. But don’t be fooled by the dusty cover — HC is gearing up its digital transformation, including mobile apps, data science, and data-driven production.

After having used HCrystalBall successfully in our internal projects, we decided it’s mature enough to be shared with users and developers outside the company. While our team celebrates the first open-source application in HeidelbergCement’s history, this is also a great opportunity for us to give back a tiny bit to the community from which we benefit on a daily basis.

HCrystalBall started like many other packages — scratching our own itch after we realized how cumbersome it is to compare time-series models from different packages in Python’s ecosystem.

There are fbprophet, arima / autoarima, exponential smoothing from statsmodels, and (t)bats, just to name a few.

Each of them offers a different way of interacting with a model and its results, which makes it hard to run cross-validation and compare outputs across packages.

Over time, a Jupyter notebook that translated between the interfaces of different libraries turned into what HCrystalBall is today: a library that unifies the interfaces of the above-mentioned packages to be scikit-learn compatible, enabling the use of pipelines, grid search, and many other useful features from the scikit-learn ecosystem. This is what we call the “wrapper” layer. For even greater convenience, we added a second layer for automated model selection on top and made it possible to parallelize the selection process.
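To make the wrapper idea concrete, here is a minimal sketch in plain Python. It is not HCrystalBall's actual code: NaiveLastValueModel stands in for a hypothetical third-party model with its own train/forecast methods, and LastValueWrapper is an illustrative adapter that exposes the fit/predict convention known from scikit-learn.

```python
class NaiveLastValueModel:
    """Hypothetical third-party model with its own, non-scikit-learn API."""

    def train(self, series):
        # Remember the most recent observation
        self.last_value_ = series[-1]

    def forecast(self, steps):
        # Repeat the last observation for every future step
        return [self.last_value_] * steps


class LastValueWrapper:
    """Illustrative adapter exposing fit(X, y) and predict(X)."""

    def __init__(self):
        self.model = NaiveLastValueModel()

    def fit(self, X, y):
        # X carries the time index (and optional features); unused by this model
        self.model.train(list(y))
        return self  # returning self allows call chaining

    def predict(self, X):
        # One forecast per row of X
        return self.model.forecast(len(X))


history_index = ["2020-01-01", "2020-01-02", "2020-01-03"]
history_values = [10.0, 12.0, 11.0]
future_index = ["2020-01-04", "2020-01-05"]

predictions = LastValueWrapper().fit(history_index, history_values).predict(future_index)
```

Once every model speaks fit/predict in this way, the same cross-validation and comparison code can be reused across packages.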

If you want to try HCrystalBall right away, see our GitHub repository, read through the docs, or try examples with the prebuilt environment. If you want to learn more, just keep reading…

GitHub of HCrystalBall

HCrystalBall in action

To showcase the capabilities of HCrystalBall’s high-level convenience interface, let’s take a subset of Rossmann store sales data and predict sales for different drugstores. If you’re more interested in using the unified model API directly, please skip to the section on wrappers further down.

Loading the data

HCrystalBall offers some convenience functions to load the data in the required format, one of them being get_sales_data.

Loading sales data in the required format

The resulting dataframe contains several columns that indicate holidays and promotions or are used for slicing the data into subsets (e.g. for different stores). Apart from that, we require a datetime index and a numeric target column.
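The two requirements can be checked on any dataframe with plain pandas. The sketch below builds a small toy frame by hand rather than calling get_sales_data; the column names Store, Promo and Sales and all values are assumptions chosen for illustration.

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=3, freq="D")

# Toy data: two stores observed on the same three days (made-up values)
df = pd.DataFrame(
    {
        "Store": ["A"] * 3 + ["B"] * 3,   # column used for slicing into subsets
        "Promo": [0, 1, 0, 0, 0, 1],      # promotion indicator
        "Sales": [120.0, 150.0, 110.0, 80.0, 85.0, 95.0],  # numeric target
    },
    index=dates.append(dates),            # datetime index, repeated per store
)

has_datetime_index = isinstance(df.index, pd.DatetimeIndex)
has_numeric_target = pd.api.types.is_numeric_dtype(df["Sales"])
```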

A peek at the data

Defining search space

The next step is to define a ModelSelector object. Several points should be considered here:

  • to which frequency will the data be resampled?
  • how many time-steps ahead do we want to forecast?
  • do we have a column with ISO country/region codes from which information about public holidays can be extracted automatically? (optional)
Defining model selector
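What the first two choices mean for the data can be shown with plain pandas. This is only a sketch of the concepts, not the ModelSelector call itself: the variable names frequency and horizon and the toy values are assumptions, and the holiday lookup is left out.

```python
import pandas as pd

# Toy sales recorded at irregular times (made-up values)
raw = pd.Series(
    [5.0, 7.0, 4.0, 6.0, 3.0],
    index=pd.to_datetime(
        [
            "2015-07-01 09:00",
            "2015-07-01 17:00",
            "2015-07-02 10:00",
            "2015-07-03 08:30",
            "2015-07-03 19:15",
        ]
    ),
    name="Sales",
)

frequency = "D"  # target frequency after resampling (daily)
horizon = 2      # number of time steps to forecast ahead

# Aggregate everything that falls into the same day
daily = raw.resample(frequency).sum()

# The time steps a forecast with this horizon would cover
future_index = pd.date_range(daily.index[-1], periods=horizon + 1, freq=frequency)[1:]
```

With daily frequency and a horizon of 2, the forecast covers the two days following the last observed day.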

Once this is done, the next step is to define a grid_search, adding exogenous variables (optional) and/or extending it with custom models. The following example code returns 18 combinations of different pipelines with scikit-learn models, while the full grid with other model families and ensembles would contain roughly 50.

Defining search space
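How such a search space grows can be illustrated with scikit-learn's ParameterGrid. The grid below is purely hypothetical: the keys model, n_estimators and lags and their values are made up for this sketch and do not correspond to the 18 combinations mentioned above.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space: each dict describes one model family
param_grid = [
    {"model": ["random_forest"], "n_estimators": [50, 100], "lags": [3, 7]},
    {"model": ["linear_regression"], "lags": [3, 7, 14]},
]

# Every dict is expanded into the cross product of its value lists
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)  # 2 * 2 + 3 combinations
```

Each additional option multiplies the size of its model family, which is why a full grid with several families and ensembles grows quickly.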

Running model selection

By default, the model selection will partition the data according to the values in the partition_columns (e.g. countries, stores) and run sequentially for all partitions.

If your dataset is large, consider using the parallel_columns keyword: pass a subset of partition_columns, and the jobs are distributed along those columns using Prefect.

The results of the model selection can be stored on disk at different levels of granularity for later inspection.

Running model selection
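The sequential, per-partition logic can be sketched with a pandas groupby. This is a simplified illustration, not what HCrystalBall runs internally: it uses a single partition column, one holdout split instead of full cross-validation, and two naive candidate forecasters (last_value and mean_value) that are assumptions of this example.

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=8, freq="D")
df = pd.DataFrame(
    {
        "Store": ["A"] * 8 + ["B"] * 8,
        "Sales": [10, 12, 8, 11, 9, 14, 10, 10,   # store A: fluctuates around 10
                  1, 2, 3, 4, 5, 6, 7, 8],        # store B: steady upward trend
    },
    index=dates.append(dates),
)

horizon = 2  # number of held-out time steps per partition


def last_value(train, steps):
    # Repeat the most recent observation
    return [train.iloc[-1]] * steps


def mean_value(train, steps):
    # Repeat the average of the training data
    return [train.mean()] * steps


candidates = {"last_value": last_value, "mean": mean_value}

winners = {}
for store, part in df.groupby("Store"):
    y = part["Sales"]
    train, test = y.iloc[:-horizon], y.iloc[-horizon:]
    # Mean absolute error of each candidate on the held-out steps
    errors = {
        name: sum(abs(p - t) for p, t in zip(forecast(train, horizon), test)) / horizon
        for name, forecast in candidates.items()
    }
    winners[store] = min(errors, key=errors.get)
```

Because every partition is evaluated independently, the loop body is also the natural unit of work to distribute when running in parallel.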

Visualize results

Once the selection is completed, ms.plot_results(plot_from="2015-06-01") can be used to plot the predictions of the selected models for all partitions and data splits.

Performance of cross-validation winning models for 2 partitions

If you want to supply your own plotting functionality, you can either try running with a different plotting backend or use ms.results[n].df_plot as the input for your custom code.

Content of ms.results[0].df_plot

Using wrappers

Using the lower-level interface of HCrystalBall, one can directly interact with the model wrappers.

Data format

The data format on this level roughly follows the scikit-learn convention, separating the target y (pandas.Series or numpy.array) and the feature matrix X (pandas.DataFrame with datetime index and exogenous variables).

Data format on the wrapper layer
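A minimal sketch of this layout with plain pandas follows. The column name Promo, the target name Sales and all values are toy assumptions; the last two days are held out while keeping chronological order.

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=5, freq="D")

# Feature matrix: datetime index plus exogenous variables
X = pd.DataFrame({"Promo": [0, 1, 0, 0, 1]}, index=dates)

# Target: one numeric value per time step, aligned with X
y = pd.Series([100.0, 130.0, 98.0, 102.0, 135.0], index=dates, name="Sales")

# Chronological split: never shuffle time-series data
X_train, X_test = X.iloc[:-2], X.iloc[-2:]
y_train, y_test = y.iloc[:-2], y.iloc[-2:]
```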

Pipelines

Defining multiple steps of data processing can be done via scikit-learn pipelines. Scikit-learn transformers should be wrapped inside TSColumnTransformer and applied to specific columns. This ensures compatibility with HCrystalBall’s dataframe-first approach. HCrystalBall’s own transformers can be used directly within a pipeline.

HCrystalBall provides several wrappers and ensemble methods that can be combined with models and/or transformers. Availability may depend on the installed dependencies.

Using wrappers, ensembles and transformers in pipelines
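The general pattern can be sketched with scikit-learn alone. The example below uses scikit-learn's own ColumnTransformer, not TSColumnTransformer, and a plain LinearRegression instead of an HCrystalBall wrapper; column names and values are toy assumptions. It also shows why a dataframe-preserving variant is useful: with default settings, the scikit-learn transformer returns a NumPy array and the datetime index is lost.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

dates = pd.date_range("2015-01-01", periods=6, freq="D")
X = pd.DataFrame(
    {"Promo": [0, 1, 0, 1, 0, 1], "Customers": [50, 80, 55, 85, 52, 90]},
    index=dates,
)
y = pd.Series([100.0, 160.0, 110.0, 170.0, 104.0, 180.0], index=dates)

pipeline = Pipeline(
    [
        # Scale only the "Customers" column, pass the rest through unchanged
        (
            "scale",
            ColumnTransformer(
                [("scaler", StandardScaler(), ["Customers"])],
                remainder="passthrough",
            ),
        ),
        ("model", LinearRegression()),
    ]
)

pipeline.fit(X, y)
preds = pipeline.predict(X)

# With default settings the transformer output is a plain array without index
transformed = pipeline.named_steps["scale"].transform(X)
index_is_lost = not isinstance(transformed, pd.DataFrame)
```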

Fit, predict, visualize

With your pipeline completely defined, you can now run fit and predict. In the example below, we’re also merging results for convenient plotting.

Fit, predict, merge, visualize
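The same steps can be sketched with pandas and scikit-learn only. LinearRegression stands in for a fully defined pipeline here, and the data and the column names actuals and prediction are toy assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

dates = pd.date_range("2015-01-01", periods=8, freq="D")
X = pd.DataFrame({"Promo": [0, 1, 0, 1, 0, 1, 0, 1]}, index=dates)
y = pd.Series(
    [100.0, 150.0, 100.0, 150.0, 100.0, 150.0, 100.0, 150.0],
    index=dates,
    name="actuals",
)

# Hold out the last two days for evaluation
X_train, X_test = X.iloc[:-2], X.iloc[-2:]
y_train, y_test = y.iloc[:-2], y.iloc[-2:]

# Fit on the past, predict the held-out days
model = LinearRegression().fit(X_train, y_train)
preds = pd.Series(model.predict(X_test), index=X_test.index, name="prediction")

# Merge actuals and predictions on the shared datetime index
result = pd.concat([y_test, preds], axis=1)

# result.plot() would draw both columns over time (requires matplotlib)
```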
Pipeline’s performance on the last 10 days

What’s next?

If HCrystalBall caught your attention, the easiest way to get started is to try the package and go through some more elaborate examples on mybinder (a pre-built environment with full dependencies). Feel free to create new notebooks and use your own data.

If you don’t need the interactivity, pre-executed notebooks are part of our docs (quickstart, tutorial).

Finally, you can always build an environment with custom dependencies locally and use HCrystalBall in one of your projects.

Final word

Whatever your experience with HCrystalBall is, we would be glad to hear about it! Leave a comment here or open an issue on GitHub. You can also consider contributing, for example by adding your favorite time-series model that is not covered yet.
