Building pyarrow with CUDA support

The other day I was looking to read an Arrow buffer on GPU using Python, but as far as I could tell, none of the provided pyarrow packages on conda or pip are built with CUDA support. Like many of the packages in the compiled-C-wrapped-by-Python ecosystem, Apache Arrow is thoroughly documented, but the number of permutations of how you could choose to build pyarrow with CUDA support quickly becomes overwhelming.

In this post, I’ll show how to build pyarrow with CUDA support on Ubuntu using Docker and virtualenv. These directions are approximately the same as the official Apache Arrow docs, except that I explain them step-by-step and show only the single build toolchain I used.

Step 1: Docker with GPU support

Even though I use Ubuntu 18.04 LTS on a workstation with an NVIDIA GPU, whenever I undertake a project like this, I like to use a Docker container to keep everything isolated. The last thing you want is to debug environment errors because changing a dependency for one project broke something else. Thankfully, NVIDIA Docker developer images are available via DockerHub:

docker run -it --gpus=all --rm nvidia/cuda:10.1-devel-ubuntu18.04 bash

Here, the -it flag puts us inside the container at a bash prompt, --gpus=all allows the Docker container to access my workstation’s GPUs, and --rm deletes the container after we’re done to save space.
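
Before going any further, it’s worth confirming that the container can actually see the GPU. A quick smoke test (the NVIDIA container runtime makes nvidia-smi available inside the container when the --gpus flag is used):

# Should print a table listing every GPU visible inside the container
nvidia-smi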

Step 2: Setting up the Ubuntu Docker container

Docker images pulled from DockerHub are frequently bare-bones in terms of the libraries they include, and the packages they do ship are usually out of date. For building pyarrow, it’s useful to update the system and install the following:

apt update && apt upgrade -y

apt install git \
wget \
libssl-dev \
autoconf \
flex \
bison \
llvm-7 \
clang \
cmake \
python3-pip \
libjemalloc-dev \
libboost-dev \
libboost-filesystem-dev \
libboost-system-dev \
libboost-regex-dev  \
python3-dev -y

In a later step, we’ll use the Arrow third-party dependency script to ensure all needed dependencies are present, but these are a good start.

Step 3: Cloning Apache Arrow from GitHub

Cloning Arrow from GitHub is pretty straightforward. The git checkout apache-arrow-0.15.0 line is optional; I needed version 0.15.0 for the project I was exploring, but if you want to build from the master branch of Arrow, you can omit that line.

git clone https://github.com/apache/arrow.git /repos/arrow
cd /repos/arrow
git submodule init && git submodule update
git checkout apache-arrow-0.15.0
export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
export ARROW_TEST_DATA="${PWD}/testing/data"

Step 4: Installing remaining Apache Arrow dependencies

As mentioned in Step 2, some of the dependencies for building Arrow are system-level and can be installed via apt. To ensure that we have all the remaining third-party dependencies, we can use the provided script in the Arrow repository:

pip3 install virtualenv
virtualenv pyarrow
source ./pyarrow/bin/activate
pip install six numpy pandas cython pytest hypothesis
mkdir dist
export ARROW_HOME=$(pwd)/dist
export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH

cd cpp
./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty

The script downloads all of the necessary third-party tarballs and prints the environment variable exports that point the build at them, which is amazingly helpful.
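
If you prefer not to copy the printed export statements into your shell by hand, one option is to capture the script’s output and eval it; a minimal sketch, assuming the export lines are the only ones you need from stdout:

# Filter the script output down to the export lines and apply them
eval "$(./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty | grep '^export')"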

Step 5: Building Apache Arrow C++ library

pyarrow links against the Arrow C++ libraries, so they need to be built and present before we can build the pyarrow wheel:

mkdir build && cd build

cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DARROW_FLIGHT=ON \
-DARROW_GANDIVA=ON \
-DARROW_ORC=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_PARQUET=ON \
-DARROW_PYTHON=ON \
-DARROW_PLASMA=ON \
-DARROW_BUILD_TESTS=ON \
-DARROW_CUDA=ON \
..

make -j
make install

This is a pretty standard workflow for building a C or C++ library: we create a build directory, call cmake from inside that directory to set the options we want, then run make to compile and make install to install the library. I copied all of the -DARROW_* options above directly from the Arrow documentation; Arrow doesn’t take long to build with these options, but it may be that only -DARROW_PYTHON=ON and -DARROW_CUDA=ON are truly necessary to build pyarrow.
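
Before moving on, a quick sanity check that the CUDA pieces made it into the install; a minimal sketch, assuming make install placed the libraries under $ARROW_HOME/lib as configured above:

# A CUDA-enabled build should have produced libarrow_cuda.so
# alongside libarrow.so and the other Arrow libraries
ls $ARROW_HOME/lib | grep arrow_cuda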

Step 6: Building pyarrow wheel

With the Apache Arrow C++ libraries built, we can now build the Python wheel:

cd /repos/arrow/python
export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_CUDA=1
python setup.py build_ext --build-type=release --bundle-arrow-cpp bdist_wheel

As cmake and make run, you’ll eventually see the following in the build logs, which shows that we’re getting the CUDA behavior we want:

cmake --build . --config release --
[  5%] Compiling Cython CXX source for _cuda...
[  5%] Built target _cuda_pyx
Scanning dependencies of target _cuda
[ 11%] Building CXX object CMakeFiles/_cuda.dir/_cuda.cpp.o
[ 16%] Linking CXX shared module release/_cuda.cpython-36m-x86_64-linux-gnu.so
[ 16%] Built target _cuda

When the process finishes, the final wheel will be in the /repos/arrow/python/dist directory.
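
For example (the exact wheel filename will differ based on your checkout, build date and Python version):

# List the freshly built wheel
ls /repos/arrow/python/dist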

Step 7 (optional): Validate build

If you want to validate that your pyarrow wheel was built with CUDA support, you can run the following:

(pyarrow) root@9260485caca3:/repos/arrow/python/dist# pip install pyarrow-0.15.1.dev0+g40d468e16.d20200402-cp36-cp36m-linux_x86_64.whl
Processing ./pyarrow-0.15.1.dev0+g40d468e16.d20200402-cp36-cp36m-linux_x86_64.whl
Requirement already satisfied: six>=1.0.0 in /repos/arrow/pyarrow/lib/python3.6/site-packages (from pyarrow==0.15.1.dev0+g40d468e16.d20200402) (1.14.0)
Requirement already satisfied: numpy>=1.14 in /repos/arrow/pyarrow/lib/python3.6/site-packages (from pyarrow==0.15.1.dev0+g40d468e16.d20200402) (1.18.2)
Installing collected packages: pyarrow
Successfully installed pyarrow-0.15.1.dev0+g40d468e16.d20200402
(pyarrow) root@9260485caca3:/repos/arrow/python/dist# python
Python 3.6.9 (default, Nov  7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyarrow import cuda
>>>

If the line from pyarrow import cuda runs without error, we know that our CUDA-enabled pyarrow build was successful.
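
For a slightly stronger smoke test than the import alone, you can allocate a small buffer on the GPU; a minimal sketch using the pyarrow.cuda API, assuming device 0:

# Create a CUDA context on device 0 and allocate a 1KB device buffer
python -c "from pyarrow import cuda; ctx = cuda.Context(0); buf = ctx.new_buffer(1024); print(buf.size)"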
