
Python Monorepo: an Example. Part 1: Structure and Tooling

4 April 2023 — by Guillaume Desforges, Clément Hurlin

For a software team to be successful, you need excellent communication. That is why we want to build systems that foster cross-team communication. Using a monorepo is an excellent way to do that. A monorepo provides:

  • Visibility: by seeing the pull requests (PRs) of colleagues, you are easily informed of what other teams are doing.

  • Uniformity: by working in one central repository, it is easier to share the configuration of linters, formatters, etc. This makes it easy to use the same code style and documentation standards.

    Uniformity smooths the onboarding of newcomers as well as the reassignment of engineers to different internal projects.

  • Easier continuous integration: new code is picked up automatically by CI, without any manual interaction, ensuring uniformity and best practices.

  • Atomic changes: because all libraries and projects are in one place, a large change can be implemented in one PR. This avoids the usual workflow of cascading updates, which causes mistakes to be caught later, rather than sooner, and causes friction in development.

    Atomic changes are possible in our setup thanks to living at HEAD. Living at HEAD is a term popularized by Titus Winters, from Google. It means that all code in a monorepo depends on the code that is next to it on disk, as opposed to depending on released versions of the code in the same repository.

Designing a monorepo can be challenging, as it impacts the development workflow of all engineers. In addition, monorepos come with their own scaling challenges. Special care for tooling is required for a monorepo to stay performant as a team grows.

In this post, we describe a design for a Python monorepo: how we structure it; which tools we favor; alternatives that were considered; and some possible improvements. Before diving into the details, we would like to acknowledge the support we had from our client Kaiko, for which we did most of the work described in this series of blog posts.

Python environments: one global vs many local

Working on a Python project requires a Python environment (a.k.a. a sandbox), with a Python interpreter and the right Python dependencies (packages). When working on multiple projects, one can either use a single shared sandbox for all projects, or many specific ones, for each project.

On the one hand, a single sandbox for all projects makes it trivial to ensure that all developers and projects use a common set of dependencies. This is desirable as it reduces the scope of things to manage when implementing and debugging. Also, it ensures that all team members are working towards a shared, common knowledge about their software.

On the other hand, it makes it impossible for different projects to use different versions of external dependencies. It also forces everyone to install the dependencies of all projects, even when they only need a subset of them to work on a single project. These two facts can create friction among developers and reduce throughput.

To avoid losing flexibility, we decided to use multiple sandboxes, one per project. We will later improve the consistency of external dependencies across Python environments with dedicated tooling.

A sandbox can be created with Python’s standard venv module:

> python3 -m venv .venv
> source .venv/bin/activate
(.venv) > which python
/some/path/.venv/bin/python

We will later describe how this is put to use.

Choosing a Python package manager

In our scenario, we chose to stick with pip to install dependencies in sandboxes, because Poetry still doesn’t work well with PyTorch, a core library in the data ecosystem.

Over the years, pip has undergone many important changes, such as support for editable installs of pyproject.toml-based packages (PEP 660). To improve reproducibility, we pin the version of pip in a top-level pip-requirements.txt file, with the exact version of pip to use:

# Install a specific version of pip before installing any other package.
pip==22.2.2

It will be important to install pip with this exact version before installing anything else.

Creating projects and libraries

In an organization, each team will be the owner of its own projects. For instance, there could be a web API, a collection of data processing jobs, and machine learning training pipelines. While each team is working on its own projects, it is most likely that a portion of their code is shared. Following the DRY (Don’t Repeat Yourself) principle, it is best to refactor those shared portions into libraries and make it a common effort that can benefit from everyone’s work.

In Python, there is no significant difference between projects and libraries; they are all Python packages. Because of that, we make no distinction between the two. However, for the sake of clarity, we split the monorepo structure into two top-level folders, one for projects and one for libraries:

├── libs/
└── projects/

This top-level organization highlights that libraries are shared across the entire organization.

To create a project or a library, a folder needs to be created in one or the other. It should then be populated with the following:

  • A pyproject.toml file which defines the Python package. It contains its metadata (name, version, description) and the list of its dependencies, for dependency resolution.

  • A requirements.txt file which serves as the basis for creating local sandboxes for developers and also as the default environment in continuous integration (CI). It has to list all direct dependencies, frozen at a specific version, in pip’s requirement file format.

    By freezing the versions, we don’t mean using pip freeze, because our requirements.txt files are manually maintained. We mean that we require the version numbers of dependencies to use the == specifier. This is sometimes also called pinning the dependencies, which goes a long way towards reproducibility.

    We explain below in more detail why we use both pyproject.toml and requirements.txt. In a nutshell, pyproject.toml is used as the central place for configuration and for deployment, while requirements.txt files are used for reproducibility in local environments and in the CI.

  • A README.md file. This file’s purpose is to list the owners of this package; the people to contact if the package needs to evolve or is broken. This file also contains a short description of what the package is about and example commands to run the code or test it. It’s supposed to be a gentle introduction to newcomers to this package.

    Library owners are also specified in the top-level CODEOWNERS file, to tame the amount of notifications. We recommend configuring a repository so that reviewers are automatically chosen based on a pull request’s changes, by using CODEOWNERS to map changes to reviewers.
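
For illustration, a minimal CODEOWNERS sketch could look as follows (the paths and team handles are hypothetical, not part of the setup described here):

# CODEOWNERS at the repository's top-level: each line maps a path to its owners,
# who are automatically requested for review on PRs touching that path.
libs/some-library/       @mycorp/platform-team
projects/some-project/   @mycorp/some-project-team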

Formatting and linting

For formatting source code, we chose Black, because it is easy to use and is easily accepted by most developers. Our philosophy for choosing a formatter is simple: pick one and don’t discuss it for too long.

For linting source code, we chose Flake8, isort, and Pylint. We use Flake8 and isort without any tuning; again, the rationale is that the default checks are good and easily accepted. Regarding Pylint, we use it only for checking that public symbols are documented. Because Pylint is more intrusive, activating more checks would have required lengthier discussions, which we decided were not worthwhile in the early days of our monorepo.

To make all the tools work well together, we need a little configuration:

> cat pyproject.toml  # From the repository's top-level
[tool.black]
line-length = 100
target-version = ['py38']

[tool.pylint."messages control"]
ignore = ["setup.py", "__init__.py"]
disable = "all"
enable = [
  "empty-docstring",
  "missing-class-docstring",
  "missing-function-docstring",
  "missing-module-docstring"
]

[tool.isort]
profile = "black"
known_first_party = ["mycorp"]  # see package configuration below
> cat .flake8  # From the repository's top-level
[flake8]
max-line-length = 100
# required for compatibility with Black:
extend-ignore = E203
exclude = .venv

We need to use both pyproject.toml and .flake8 because, as of writing, Flake8 doesn’t support pyproject.toml.1

In addition, we need pyproject.toml files in each project and library, because pyproject.toml files are used to list direct dependencies of a package. We will use the CI to ensure that nested pyproject.toml files and the top-level pyproject.toml file agree on the configuration of tools that are common to both.
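
To make this concrete, here is a minimal sketch of what such a check could look like (the script, its name, and the list of shared tools are assumptions, not the actual CI code; tomllib requires Python 3.11, the tomli backport can be used on older interpreters):

# check_tool_config.py -- hypothetical helper, for illustration only.
# Fails if a nested pyproject.toml configures a shared tool differently
# from the top-level pyproject.toml.
import sys
import tomllib  # Python 3.11+; on older interpreters: `import tomli as tomllib`

SHARED_TOOLS = ["black", "isort", "pylint", "pyright"]  # assumed list


def tool_sections(path: str) -> dict:
    with open(path, "rb") as f:
        return tomllib.load(f).get("tool", {})


def main(top_level: str, nested: str) -> int:
    top, sub = tool_sections(top_level), tool_sections(nested)
    exit_code = 0
    for tool in SHARED_TOOLS:
        if tool in top and tool in sub and top[tool] != sub[tool]:
            print(f"{nested}: [tool.{tool}] differs from {top_level}")
            exit_code = 1
    return exit_code


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))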

To pin the versions of all these tools for the entire monorepo, we have a top-level dev-requirements.txt file that contains the following:

black==22.3.0
flake8==4.0.1
isort==5.10.1

To recap what we’ve described so far, at this point our monorepo’s structure is:

├── .flake8
├── dev-requirements.txt
├── pip-requirements.txt
├── pyproject.toml
├── libs/
└── projects/

With this setup, you format and lint code deterministically locally and on the CI with:

python3 -m venv .venv

# Make the sandbox active in the current shell session
source .venv/bin/activate

# Install pinned pip first
pip install -r pip-requirements.txt

# Install shared development dependencies, in a second step
# to use the pinned pip version
pip install -r dev-requirements.txt

# Black, Flake8, and isort are now available. Use them as follows:
black --check .
flake8 .
isort --check-only .

Note that it is possible to do this at the top-level, because Black, Flake8, and isort don’t need the external dependencies of the monorepo’s libraries and projects to be installed.

Typechecking

To typecheck code, we chose Microsoft’s Pyright. In our benchmarks, it proved noticeably faster than mypy and seems more widely used than Facebook’s Pyre. Compared to mypy, Pyright also has the advantage that it can execute as you type: it gives feedback without requiring you to save the file being edited. Because mypy has a noticeable startup time, this made for a significant difference in user experience, consolidating our choice in favor of Pyright.

Pyright has different levels of checking. We stick to the default settings, called basic. These settings make the tool easily accepted: if your code is not annotated, Pyright will mostly remain silent. If your code is annotated, in our experience, Pyright reports only errors that are relevant. In the rare cases where it reported false positives (i.e. reported an error where there isn’t one), there was an understandable reason, for example when type-correctness depends on a computation that cannot be statically analyzed.

Pyright also works really well regardless of how annotated external dependencies (i.e. libraries outside the monorepo) are: it relies on external annotations when available, and falls back to type inference by crawling source code otherwise. Either way, we observed that Pyright uses the correct types, even for data-science libraries with a lot of union types (such as pandas).
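
For illustration, here is the kind of mistake Pyright flags out of the box on annotated code (a toy example, not taken from the monorepo):

from typing import List

def mean(values: List[float]) -> float:
    return sum(values) / len(values)

mean([1.0, 2.0, "3.0"])  # Pyright flags this call: "str" is not assignable to "float"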

To enable typechecking with Pyright, we specified a pinned version in the shared top-level dev-requirements.txt as follows:

pyright==1.1.239

and we configure it in the pyproject.toml file of every project and library:

> cat pyproject.toml
...
[tool.pyright]
reportMissingTypeArgument = true  # Report generic classes used without type arguments
strictListInference = true  # Use union types when inferring types of lists elements, instead of Any

As with pip and other tools, pinning Pyright’s version helps make local development and the CI deterministic.

Testing

To test our code, we chose pytest. This was an obvious choice, because all concerned developers have experience with it and it wasn’t challenged.

Among pytest’s qualities, we can cite its good progress reporting while the tests are running and easy integration with test coverage.

To make pytest available, we again specified a pinned version in the shared top-level dev-requirements.txt as follows:

pytest==7.0.1
pytest-cov==3.0.0  # Coverage extension
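
As an illustration, a library’s tests can live in a tests/ folder next to its code; the snippet below is a minimal sketch, not taken from the monorepo:

# tests/test_addition.py -- illustrative only
def add(x: int, y: int) -> int:  # stand-in for a function from the library under test
    return x + y


def test_add() -> None:
    assert add(2, 3) == 5

Running pytest --cov from a package’s sandbox (described in the next section) then executes the tests and reports coverage, thanks to the pytest-cov plugin pinned above.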

Sandboxes

With all of the above in place, we are now able to create sandboxes to obtain comfortable development environments. For example, suppose we have one library named base, yielding the monorepo structure as follows:

├── .flake8
├── dev-requirements.txt
├── pip-requirements.txt
├── pyproject.toml
├── libs/
│   └── base/
│       ├── README.md
│       ├── pyproject.toml
│       └── requirements.txt
└── projects/

To create base’s development environment, go to directory libs/base and execute:

python3 -m venv .venv

# Make the sandbox active in the current shell session
source .venv/bin/activate

# Install pinned pip first
pip install -r $(git rev-parse --show-toplevel)/pip-requirements.txt

# Install shared development dependencies and project/library-specific dependencies
pip install -r $(git rev-parse --show-toplevel)/dev-requirements.txt -r requirements.txt

# With project-specific dependencies installed, typecheck your code as follows:
pyright .

This could be shortened by using a tool like Nox.
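
For example, a noxfile.py along the following lines could automate these steps; this is a minimal sketch assuming Nox is added to the shared dev-requirements.txt, not part of the setup described in this post:

# noxfile.py -- a minimal sketch, placed in a project or library
import subprocess

import nox

# Locate the repository's top-level, where the shared requirements files live.
TOP_LEVEL = subprocess.run(
    ["git", "rev-parse", "--show-toplevel"],
    capture_output=True, text=True, check=True,
).stdout.strip()


@nox.session(venv_backend="venv")
def checks(session):
    """Install pinned pip, shared dev tools and package dependencies, then typecheck."""
    session.install("-r", f"{TOP_LEVEL}/pip-requirements.txt")
    session.install("-r", f"{TOP_LEVEL}/dev-requirements.txt", "-r", "requirements.txt")
    session.run("pyright", ".")

Running nox -s checks from the package’s directory would then create the sandbox and run the typechecker in one command.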

Configuration of a package

We use a common namespace in all projects and libraries. This spares one level of nesting by avoiding the src folder (the historical way of doing things, called the src layout).

Supposing that we choose mycorp as the namespace, this means that the code of library libs/base lives in directory libs/base/mycorp and the pyproject.toml of the library must contain:

[tool.poetry]
...
packages = [
  { include = "mycorp" }
]
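
Concretely, assuming the base library exposes a mycorp.base package (the exact subpackage name is an assumption), its layout would look like this; note that mycorp/ itself has no __init__.py, so that it acts as an implicit namespace package (PEP 420) shared by all libraries:

libs/base/
├── README.md
├── pyproject.toml
├── requirements.txt
└── mycorp/
    └── base/
        ├── __init__.py
        └── ...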

We now come to an important topic: the difference between pyproject.toml and requirements.txt for declaring dependencies.

requirements.txt

requirements.txt files are meant to be used to install dependencies in sandboxes both on the developers’ machines and in the CI. requirements.txt files specify both local dependencies (packages developed in the monorepo itself) and external dependencies (dependencies which are usually hosted on PyPI, such as NumPy and pandas).

To install local dependencies in our development sandboxes, we use editable installs. If a library A depends on a library B, this makes changes to B immediately available to developers of A: A depends on the code of B that is next to it in the monorepo, not on a released version. This allows the implementation of a live at HEAD workflow, as detailed below.

The requirements.txt file of a library should include all direct dependencies and they should be pinned, i.e. specify exact version numbers, using the == operator. By using pinned dependencies, we achieve a good level of reproducibility. We don’t list the transitive dependencies, because that would amount to maintaining a lockfile manually, which would be very tedious. We used neither pip-compile nor pipenv, because they don’t work well in multi-platform scenarios, as their respective issue trackers show.

On this topic and others, we would have loved to use Poetry, which provides a dedicated CLI for managing lockfiles. That would have allowed us to pin both direct and transitive dependencies. However, Poetry doesn’t play well with an essential data-science package: PyTorch.2 If Poetry and PyTorch start working well together in the future, our setup can transition smoothly, because we use the pyproject.toml file as the central place for storing configuration, as we discuss below.

Note that our requirements.txt files don’t specify the hashes of dependencies (usually done with --hash in pip). Hashes are used to defend against supply chain attacks. As our setup is meant for starting a monorepo, we intentionally don’t delve into this more advanced topic in this post.

pyproject.toml

Generally speaking, pyproject.toml files are used to configure a project or library. In this section we only deal with specifying dependencies; i.e. how we write the [tool.poetry.dependencies] section.3

In our setup, pyproject.toml files specify dependencies for deployment. Because of that, dependencies in pyproject.toml files should be loose, to avoid preventing your code from being installed in a variety of environments.4

Taking NumPy as an example, a simple rule to specify dependencies in this scenario is to use numpy = "^X.Y.Z" where either (1) X.Y.Z is the version number of NumPy used when starting to use it, or (2) X.Y.Z is the version of NumPy introducing a feature depended upon. Poetry’s documentation provides good guidance on possible specifiers.

Example

To demonstrate how this setup works in practice, let’s introduce a new library libs/fancy that depends on the local dependency libs/base and the external dependency numpy:

...as above...
└── libs/
    ├── base/
    │   └── ...as above...
    └── fancy/
        ├── README.md
        ├── pyproject.toml
        └── requirements.txt

libs/fancy/requirements.txt is as follows:

-e ../base  # local dependency, use editable install
numpy==1.22.3  # external dependency, installed from PyPI

libs/fancy/pyproject.toml is as follows:

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

[tool.poetry]
name = "mycorp-fancy"
version = "0.0.1"
description = "Mycorp's fancy library"
authors = [ "Tweag <[email protected]>", ]
packages = [
  { include = "mycorp" }
]

[tool.poetry.dependencies]
python = ">=3.8"
numpy = "^1.22.3"
mycorp-base = "*"

# tooling configuration, omitted (same as top-level pyproject.toml)

In the spirit of our explanations above:

  • requirements.txt uses an editable install to specify the dependency to libs/base, with -e ../base.
  • pyproject.toml uses the very loose "*" qualifier to specify the dependency to libs/base.
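
To make this concrete, here is a sketch of what code in fancy could look like (the file path, module, and function names are assumptions); thanks to the editable install, the mycorp.base import resolves to the code of libs/base on disk:

# libs/fancy/mycorp/fancy/core.py -- illustrative only
import numpy as np  # external dependency, pinned in requirements.txt

from mycorp.base import some_helper  # hypothetical function provided by libs/base


def fancy_computation(values: np.ndarray) -> np.ndarray:
    # Combine the local library with the external dependency.
    return np.asarray(some_helper(values)) * 2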

Living at HEAD

As mentioned above, our monorepo obeys the living at HEAD workflow made popular by Google. In our simple example above, it means that the library fancy depends on the code of the library base that is next to it on disk (hence the Git term at HEAD), not on a released version of base.

This makes it possible to perform atomic updates of the entire monorepo in a single PR. In a traditional polyrepo setup with separate releases, by contrast, to use a new version of base one would have to update base first (in its own PR), then release it, then update fancy to make it use the latest version of base (in yet another PR). In the wild, such a setup creates cascading PRs, increasing the time it takes to perform updates that cross various libraries, or causes code duplication.

In this vein, the use of editable installs in our setup is our Python-specific implementation of living at HEAD.

Updating dependencies

In our setup, both the top-level dev-requirements.txt file and each library’s requirements.txt file are manually maintained. Admittedly, maintaining these files manually is not going to scale in the long term, but it makes for a simple start while providing a good level of reproducibility. Here is how updates occur:

  1. A library developer needs to update a dependency, because they need a new feature that was released recently. In this case, we recommend updating this dependency in the concerned library, but also in all other places where this dependency is used, to maintain a high level of consistency within the monorepo (see the sketch after this list for a simple way to find those places).

    Because we advocate for good test coverage, we assume that, if tests of modified libraries still pass after the update of the dependency, the PR updating the dependencies can be safely merged.

  2. A number of teams reach the end of their sprints and prepare for the next iteration, or it’s a low-intensity period (e.g. summer vacation). In this case, it’s a good time to update dependencies that are old, while keeping friction low, because work happening in parallel is limited.

    This scenario aligns well with the fact that our monorepo’s setup minimizes surprises: by pinning dependencies, we minimize the chances of unplanned breakage (which could occur if we were pulling the latest versions of dependencies). As a consequence, we can separate the periods where features are being rolled out from the periods where maintenance (such as dependency updates) happens.

  3. A bot checking for vulnerabilities, such as Dependabot, creates a PR to update a specific dependency. If the dependency is used with the same version everywhere in the monorepo, the bot will create a PR that updates the entire monorepo at once.

    Again here, if working under the assumption that test coverage is good, such PRs can be merged quickly if tests pass.
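
As a simple aid for scenario 1, a plain git grep can locate every place where a given dependency (numpy in this illustrative example) is declared, so that a single PR can bump it everywhere; the output shown is illustrative:

> git grep "numpy==" -- "*requirements.txt"
libs/fancy/requirements.txt:numpy==1.22.3
> git grep "numpy" -- "*pyproject.toml"
libs/fancy/pyproject.toml:numpy = "^1.22.3"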

Conclusion

So far we have seen a monorepo structure that features:

  • a streamlined structure for libraries and projects,
  • unified formatting, linting, and typechecking, and
  • a Python implementation of the live at HEAD workflow.

We showed a setup that is both simple and achieves a great level of reproducibility, while being easy to communicate about; all the tools mentioned in this post are well known to seasoned Python developers.

In post two of this series, we describe how to implement a CI for this monorepo using simple GitHub Actions, and how templating can be used to ease onboarding and maintain consistency as more developers start working in the monorepo.


  1. See Flake8 issue #234.
  2. We are not going to detail it here, but multi-platform lockfile support for projects that depend on PyTorch is still an issue today. See Poetry issues #4231, #4704, and #6939 for recent developments. In this post, we hence stick to a simpler pip-based approach, that can be easily amended to use Poetry instead.
  3. Despite using [tool.poetry] in pyproject.toml here, we don’t use Poetry the tool; we use Poetry the backend. This was made possible by PEP 517, which made pip capable of consuming information in pyproject.toml. This is enabled by the stanza build-backend = "poetry.core.masonry.api" above. We did so to circumvent a bug in setuptools and because it made pyproject.toml the central place for configuration.
  4. pip has been using a new dependency resolver since v20.3 (October 2020), which throws an error if version bounds are incompatible between the packages to install. If version bounds are too tight for a dependency, a version conflict can occur and block the installation. When code is deployed into an environment with many other packages, it may have to coexist with versions of dependencies that haven’t been used so far. Specifying exact version numbers in pyproject.toml would make such deployments impossible and as such is not desirable.
About the authors

Guillaume Desforges: Guillaume is a versatile engineer based in Paris, with fluency in machine learning, data engineering, web development and functional programming.

Clément Hurlin: Clément is a Senior Software Engineer who straddles the management/engineering boundary. Clément studied Computer Science at Telecom Nancy and received his PhD from Université Nice Sophia Antipolis, where he proved multithreaded programs correct using linear logic.

