Our Python Monorepo

Dan Hipschman · Open House · Jun 2, 2020

At Opendoor we have quite a few Python services. Originally they were spread across several Git repos, which caused some issues I will describe shortly. To eliminate these issues, we decided to put all our new Python services in a single Git repo, although the services remain independent. This article explains what worked well for us, including repo organization, dependency management, code sharing, and CI/CD.

Background

First, it’s worth mentioning our original Python services setup and what issues we encountered with it. Originally, we had several Git repos containing Python services and libraries. Generally, libraries were in separate repos from the services that relied on them. They were packaged as wheel files and pushed to an internal package repository — essentially an internal PyPI — where services could then pip install them. While most of these services were in their own Git repos, one repo had grown rather large and contained several services.

We ran into several issues with this setup:

  • It was cumbersome to update libraries. The library and the service lived in separate repos, so a change required two sequential PRs: engineers had to merge the library change first, wait for the new package to be published to our internal package repository, and then bump the version in their service.
  • It was difficult to test library changes. It was basically impossible to test that a library change didn’t break any services or other libraries that used it. An engineer could, in theory, track down every service that used the library and run all the tests, but practically speaking, nobody has time for that.
  • Because of the difficulty of changing libraries, engineers wouldn’t abstract their code into libraries. Instead, they’d put all their code into their services, and if somebody else needed similar code, they would often copy and paste it.
  • The repo with several services became disorganized as people tried to share code. We inadvertently created cyclical dependencies, lost visibility into code ownership, and misunderstood what code was meant to be shared. Sharing code prematurely sometimes broke abstractions or led to disputes between teams over how the code should be abstracted.
  • Searchability was low. Searching Python code across repos required GitHub search, which isn’t nearly as convenient as searching in an IDE.
  • Every repo had its own linting and formatting configuration, which made it more difficult for developers to navigate and contribute across repos.

To avoid these issues, we decided to put all our new Python services in a single Git repo.

Monorepo Structure

Jumping directly to our solution, let’s start with an overview of how we organize our projects, libraries, and tools in the monorepo:

py
├── projects                 # Each project contains code for a service and/or ETL
│   ├── project1
│   │   ├── Dockerfile
│   │   ├── pyproject.toml   # Each project has its own dependencies
│   │   ├── project1/        # Project code (Python modules) goes here
│   │   └── tests/
│   └── project2...
├── lib                      # Each lib is a Python package installable with poetry (or pip)
│   ├── lib1
│   │   ├── pyproject.toml   # Each lib specifies its dependencies
│   │   ├── opendoor/lib1/   # All internal packages live in the opendoor namespace
│   │   └── tests/
│   └── lib2...
└── tools
    ├── pippy                # Our service generator
    ├── pyfmt                # Code formatter used across all Python code
    ├── ci/                  # Common CI/CD infrastructure
    └── other tools...

Projects cannot import code from other projects, which eliminates import cycles and unintended sharing and keeps the code organized. Code intended to be shared must go in an internal library. Each project also has its own CI/CD pipeline, although the pipelines are all based on a common template to keep them from diverging. When a PR is created in a service, it runs that service’s tests. When the PR is merged, it builds that service’s Docker image and deploys it to our Kubernetes cluster. If a PR spans multiple services, each included service’s CI/CD pipeline runs.

As for libraries, one major improvement over our old system is that if a PR is made in a library, then the tests for all services and other libraries that use it will run in CI. You can’t merge library changes until all the tests pass. This is only possible because all the code is in a single repo. This makes it much easier to update libraries. I’ll get back to how we do this in more detail later.

Our tools directory has our service generator (named pippy), which uses cookiecutter to generate services from a standard template. It’s basically a command-line wizard that asks for the service name, whether it needs a database, what Slack channel to send related notifications to, etc. E.g., if your service needs a database, pippy will generate a Terraform script to create it and will run migrations in the deployment pipeline. It also generates the CI/CD resources needed, which I’ll get into a bit more below. The tools directory also contains common tools like pyfmt, our formatter, which is basically a script that runs isort and black; a minimal sketch of such a wrapper follows.
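For illustration only (the real pyfmt is internal, so the structure and flags here are assumptions), a wrapper in that spirit could look like:

#!/usr/bin/env python3
"""Hypothetical sketch of a pyfmt-style wrapper: run isort, then black."""
import subprocess
import sys

def main() -> int:
    # Format the paths given on the command line, or the whole tree.
    paths = sys.argv[1:] or ["."]
    for cmd in (["isort", *paths], ["black", *paths]):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode  # stop on the first failure
    return 0

if __name__ == "__main__":
    sys.exit(main())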

This structure eliminated some of the most basic issues we’d encountered but, at the same time, triggered new strategic decisions, specifically around how we manage dependencies.

Dependency Management

This is where things get really interesting. Two decisions in particular were quite a divergence from our previous system, and required a lot of thought and testing:

  • We manage dependencies with poetry rather than pip.
  • We install our internal libraries as editable “path” dependencies rather than as versioned packages from an internal package repository.

Let’s start with the first point. poetry is a modern dependency management tool for Python that provides a number of benefits over pip:

  • poetry does complete constraint satisfaction on dependency versions, unlike pip, which happily installs packages that violate version constraints (though this is changing soon).
  • poetry generates a lock file, so we can ensure that everybody tests with the same versions of dependencies that we run in production. If you have a constraint like foo>=1.0 in your requirements.txt, then you might get different versions locally than in CI or production. Lock files solve this: foo>=1.0 remains the declared requirement, but the resolved version is pinned in the lock file until someone runs poetry update (see the example after this list).
  • poetry maintains a standard pyproject.toml file with direct dependencies and a separate poetry.lock file with transitive dependencies. pip’s requirements.txt files, by contrast, bloat easily. People often run pip freeze > requirements.txt, which lists every package version installed in the Python environment. The problem is that this includes dependencies of dependencies, making your requirements appear larger than they are, and it can include packages you happened to install but that aren’t actually required for your project.
  • Another problem with requirements.txt is that it’s only used for development. When packaging a wheel file, dependencies must be specified in setup.py instead, and this duplication is hard to keep in sync.
  • Lastly, poetry has some other features above what pip provides, such as improved integration with virtual environments and nudges towards best practices. It’s also more intuitive to developers familiar with other package managers (like yarn).
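To make the split concrete, here’s a minimal, hypothetical example of the two files (the package names, versions, and author are made up for illustration):

# pyproject.toml: direct dependencies only, maintained by hand
[tool.poetry]
name = "project1"
version = "0.1.0"
description = "Example service"
authors = ["Jane Doe <jane@example.com>"]

[tool.poetry.dependencies]
python = "^3.8"
requests = ">=2.23"    # the loose constraint we declare

# poetry.lock (generated): pins the exact resolved version of every
# dependency, including transitive ones, until `poetry update` runs
[[package]]
name = "requests"
version = "2.23.0"
...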

The second interesting choice we made was to use editable installs for libraries. Using poetry, this is done with “path” dependencies: poetry add ../../lib/grpc. When we install an internal library this way, everything generally works the same as if we installed it from a package repository — we can import from it, and its dependencies get installed — but it also links back to the local library directory. If you edit the library locally, there’s now no need to reinstall it, solving the problem of library changes being difficult. You can test the library and service changes locally in a very natural way as well as push and test your changes in the same PR.
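For reference, a path dependency like that ends up in the project’s pyproject.toml roughly as follows (poetry also supports an explicit develop flag to force editable behavior; exact output varies by poetry version):

[tool.poetry.dependencies]
python = "^3.8"
# Editable install of an internal library, referenced by relative path:
grpc = { path = "../../lib/grpc", develop = true }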

When you make a PR including a library change, CI will also run the tests for all other code that uses that library. The way we accomplish that is as follows. The CI task looks at what files have changed in a PR. It then filters that down to which libraries have changed. It then looks at the poetry.lock file for each service and library (including those not in the PR) to see if it includes a changed library. Since poetry.lock includes recursive dependencies, this also identifies when a changed library is a dependency of a dependency. It then runs the tests for any service or library affected by the PR.
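Here’s a simplified sketch of that detection logic against the repo layout above; the git invocation and the assumption that a lib’s directory name matches its package name are illustrative, not our actual CI code:

"""Sketch: find the services/libs affected by a PR via poetry.lock files."""
from pathlib import Path
import subprocess

REPO = Path("py")

def changed_libs(base: str = "origin/master") -> set:
    """Names of libs under py/lib touched by this PR."""
    diff = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return {Path(f).parts[2] for f in diff
            if f.startswith("py/lib/") and len(Path(f).parts) > 2}

def affected_targets(libs: set) -> set:
    """Directories whose poetry.lock pins any changed lib.

    poetry.lock lists transitive dependencies too, so this also catches
    targets that depend on a changed lib only through another lib.
    Assumes each lib's package name matches its directory name.
    """
    targets = set()
    for lock in REPO.glob("*/*/poetry.lock"):
        text = lock.read_text()
        if any(f'name = "{lib}"' in text for lib in libs):
            targets.add(lock.parent)
    return targets

if __name__ == "__main__":
    for target in sorted(affected_targets(changed_libs())):
        print(target)  # CI runs this target's test suite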

Another benefit of editable installs is that internal library versions will always be up to date. This can greatly reduce support costs. E.g., if the team that owns, say, our gRPC framework library wants to update it, they need to make sure they do it in a non-breaking way, or fix services as they go. Every service will use the same version. At other companies I’ve seen the scenario where libraries were versioned, so teams would make breaking changes to them. Because dedicated time was needed to fix the unavoidable update problems, services wouldn’t update quickly (or sometimes ever), leading to services using many different versions of the same library, some quite stale. This increased support costs for the team owning the library, as they had to help with multiple versions, and it slowed down development overall.

poetry can also build and publish wheel files. So we still publish our internal libraries to our internal package repository. That way they can be installed with pip in other repos that we haven’t moved into our monorepo yet. In that case, they are versioned. There’s more detail on this in this comment.

CI/CD

As mentioned before, each service and library has its own CI/CD pipeline. The tests are triggered when a file in the service or library is directly changed in a PR, or when a library it depends on is changed in a PR, but the services themselves are only deployed if a file in the service is changed (i.e., we don’t automatically redeploy services if only an internal library changes).

All the CI/CD pipeline code and configuration is shared across libraries and services as a single template. Each library and service defines a YAML file with values that populate the template. This allows us to make improvements to the pipelines across all services at once.
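As a rough illustration (the real keys are internal, so everything here is hypothetical), a service’s values file might look something like:

# ci.yaml: per-service values fed into the shared pipeline template
name: project1
slack_channel: "#project1-alerts"
needs_database: true
deploy:
  cluster: primary
  run_migrations: true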

Libraries

All our libraries use namespace packaging, which simply means that they’re separately installable but share an opendoor package prefix (e.g., opendoor.grpc for our gRPC framework; there’s a layout sketch after this list). Some notable libraries include:

  • A gRPC framework. This includes interceptors for automatically logging service requests, latency metrics, and capturing exceptions to send to Sentry.
  • A Kafka framework.
  • Postgres utilities.
  • Protobufs. This is basically all the Python types and stubs automatically generated from our protocol buffer schemas, so any service can install this lib and use it to make RPCs to other services. E.g., from opendoor.protobuf.foo.foo_pb2 import FooRequest and from opendoor.protobuf.foo.foo_pb2_grpc import FooStub.
  • S3 utilities.
  • Testing utilities.
  • We have other more niche libraries as well, such as for ML.
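To make the namespace packaging concrete, here’s roughly how two libs sit side by side. With native namespace packages (PEP 420), the shared opendoor/ directory deliberately has no __init__.py, so the libs can be installed independently yet imported under the same prefix (the module names below are hypothetical):

lib/lib1/opendoor/lib1/__init__.py    # note: no lib/lib1/opendoor/__init__.py
lib/lib2/opendoor/lib2/__init__.py    # note: no lib/lib2/opendoor/__init__.py

# After installing both packages:
from opendoor.lib1 import client
from opendoor.lib2 import helpers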

Lessons Learned

All told, our Python monorepo has felt like a huge success. It’s much more organized. Updating and testing libraries is much better. It’s also been much easier to consolidate CI/CD pipelines to make global improvements.

That said, there have been some pain points:

  • Python developers familiar with pip had to get acquainted with poetry. We rolled poetry out on a single service first to shake out major pain points before adopting it more widely, but it’s never easy to learn new tools when you’re trying to get things done quickly.
  • We bet on poetry early, before its 1.0 release. While it’s generally been stable, we occasionally ran into issues on bleeding-edge releases. It was a tough call to trade off the stability of pip against the more modern features of poetry, but we felt poetry was headed in the right direction quickly enough.
  • Some other open-source tools aren’t well adapted to monorepos. E.g., linters are intended to be run on one project, in one virtual environment, at a time. If you try to run, say, pylint project1 project2, where each project has its own virtual env, it won’t work correctly. So we had to write wrappers around many of these tools to make sure they always run as expected (see the sketch after this list).
  • Likewise, many CI/CD systems aren’t set up to handle monorepos, so you’ll need to invest in tooling to make them work. Similarly, building Docker images optimized for layer caching required extra tooling. So basically, if you have a monorepo, expect to invest in tooling around it.
  • We lowered the bar for writing libraries so much that it’s important to keep an eye on them and make sure the code is tested, high quality, and actually belongs in a shared library. For that reason we have a group of Python reviewers for shared code (not a formal team within the company, just designated code reviewers).
  • As I mentioned, although I’d love to say all our Python code is in our monorepo, some code still lives outside it. We moved almost all of the single-service repos into the monorepo, but we left out repos that were doing “unusual” things (multi-service, for example). This hasn’t been an issue, but it is hard to fully migrate all of our code to a consistent structure.
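For instance, here’s a minimal sketch of the kind of wrapper that runs a linter once per project, each inside that project’s own poetry-managed virtual environment; the paths and the default pylint invocation are assumptions, not our actual tooling:

"""Sketch: run a tool in every project's own virtual env via `poetry run`."""
from pathlib import Path
import subprocess
import sys

def run_per_project(cmd: list) -> int:
    """Run `poetry run <cmd>` in each project directory; return worst exit code."""
    worst = 0
    for pyproject in Path("py/projects").glob("*/pyproject.toml"):
        result = subprocess.run(["poetry", "run", *cmd], cwd=pyproject.parent)
        worst = max(worst, result.returncode)
    return worst

if __name__ == "__main__":
    # e.g.: python run_per_project.py pylint .
    sys.exit(run_per_project(sys.argv[1:] or ["pylint", "."]))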

I think our monorepo setup should work very well for almost any small to medium-sized company using Python. It might even work well for huge companies! We currently have about a dozen internal libraries and two dozen services in our monorepo, and I think it could scale to at least 10x that. I hope this has been helpful to other teams thinking about how to organize their Python code. As always, we’re looking for passionate and thoughtful developers if you’re interested in working with us.
