Zato: A successful Python 3 migration story

2019-02-11, by Dariusz Suchojad

Now that Python 3 support is available as a preview for developers, this post summarizes the effort that went into making sure that Zato works smoothly using both Python 2.7 and 3.x.

In fact, the works required were remarkably straightforward and trouble-free and the article discusses the thought process behind it, some of the techniques applied or tools used.

Background

Zato is [an enterprise API integration platform and backend application server]/en/docs/3.2/intro/esb-soa.html). We support a couple dozen of protocols, data formats, several sorts of IPC and other means to exchange messages across applications.

In other words, on the lowest level, passing bytes around, transforming, extracting, changing, collecting, manipulating, converting, encoding, decoding and comparing them, including support for all kinds of natural languages from around the world, is what Zato is about at its core when it is considered from the perspective of the programming language it is implemented in.

The codebase is around 130,000 lines of code, out of which Python and Cython are 60,000 lines. This is not everything, though, because we also have 170+ external dependencies that need to work with Python 2.7 and 3.x.

The works took two people a total of 80 hours. They were spread over a longer calendar time, except for the final sprint that required more attention for several days in a row.

Preparations

Since the very beginning, it was clear that Python 3 will have to be supported at one day so the number one thing that each and every Python module has always had is this preamble:

from __future__ import absolute_import, division, print_function, unicode_literals

This is what every Python file contains and it easily saved 90% of any potential work required to support Python 3 because, among other less demanding things, it enforced a separation, though still not as strict as in Python 3, between byte and Unicode objects. The separation is a good thing and the more one works with Python 3 the clearer it becomes.

In Python 2, it was sometimes possible to mix the two. Imagine that there is a Python-derived language where JSON dicts and Python dicts can be sometimes used interchangeably.

For instance, this is a JSON object: {"key1": "value1"} and it so happens that it is also a valid Python dict so in this hypothetical language, this would work:

json = '{"key1": "value1"}'
python = {'key2': 'value2'}

result = json + python

Now the result is this:

{'key1': 'value1', 'key2': 'value2'}.

Or wait, perhaps it should be this?

'{"key1": "value1", "key2": "value2"}'

This is the central thing - they are distinct types and they should not be mixed merely because they may be related or seem similar.

Conceptually, just like upon receiving a JSON request from the network a Python application will decode it into a canonical representation, such as a dict, list or another Python object, the same should happen to other bytes, including ones that happen to represent text or similar information. In this case, the canonical format is called Unicode, and that is the whole point of employing it in one's application.

All of this was clear from the outset and the from __future__ statements helped in its execution, even if theoretically one could have been still able to mix bytes and Unicode - it was simply a matter of using the correct canonical format in a given context, i.e. a case of making sure the architecture was clean.

This particular __future__ statement was first announced in 2008 so there was plenty of time to prepare to it.

As part of the preparations, it is good to read a book about Unicode. Not just a 'Unicode for overburdened developers' kind of an article but an actual book that will let one truly appreciate the standard's breadth and scope. While reading it, do not resist the temptation to learn at least basics of two or more natural languages that you never knew about before. It will only help you develop into a better person and this is not a joke.

While programming with bytes and Unicode, it is convenient simply to forget about whether it is a 'str', 'bytes' or 'unicode' object - it is easier simply to think about bytes and text. There are bytes that can mean anything and there is text whose native, canonical form is Unicode. This is not always 100% accurate because Unicode can represent marvelous gems such as Byzantine musical notation and more but if a given application's scope is mostly constrained to text then this will work - there are bytes and there is text.

This is all fine with our own code but there are still the external libraries that Zato uses and some of them will want bytes, not text, or the other way around, in seemingly similar situations. There can be even cases like a library expecting for protocol header keys to be text and protocol header values to be bytes for rather unclear reasons. Simply accept it as a fact of life and move on with your works, there is no need to pause even for a moment to think about it.

Side projects

It was good to try out Python 3 first in a few new, smaller side-projects, GUI or command-line tools that are not part of the core yet they are important in the overall picture. The most important part of it was that creating a Python 3 application from scratch was in no way different than in Python 2, this served as a gentle introduction to Python 3-specific constructs and this knowledge was easily transferred later on to the main porting job.

Dependencies

Out of a total of 170+ dependencies, around 10 were not Python 3-compatible. All of them had not been updated in eight, twelve or more years. At this point, it is safe to assume that if there is a dependency that was last updated in 2009 and it has no Python 3 support then it never will.

What to do next depended on a particular case, each of them was some kind of a convenience library - sometimes they had to be dropped and sometimes forked. Most complex changes required in a fork were on the level of updating 'print' to 'print()' or doing away with complex installation setups that predated contemporary pip-based configuration options.

Other than that, there were no issues with dependencies, all of them were ready for Python 3.

Idioms and imports

Most of the reference information needed to make use of Python 2 and 3 was available via the python-future project which itself is a great assistance. Installing this library, along with its dependencies, sufficed for 99% of cases. There were some lesser requirements that were incorporated into a Zato-specific submodule directly, e.g. sys.maxint is at times useful as a loop terminator but ints in Python 3 have no limits so an equivalent had to be added to our own code.

Note that the page above does not show all the idioms and some changes were not always immediately obvious, like modifications to __slots__, or the way metaclasses can be declared, but there were no really impossible cases, just different things to use, either built in to Python 3 or available via future or six libraries.

A nice thing is that one is not required to immediately change all the imports in one go - they can be changed in smaller increments, e.g. 'basestring' is still available in the form of 'from past.builtins import basestring'.

Testing

A really important aspect during the migration was the ability to test sub-components of an application in isolation. This does not only include unittests, which may be too low-level, but also things such as starting only selected parts of Zato without a requirement to boot up whole servers which in turn meant each change could be tested within one second rather than ten. To a degree, this was an unexpected but really useful test of how modular our design was.

Intellectually, this was certainly the most challenging part because it required maintaining and traversing several trains of thought at once, sometimes for several days on end. This, in turn, means that it really is not a job for late afternoons only and it cannot be an afterthought, things can simply get complex very quickly.

String formatting

There is one thing that was not expected - the way str.format works with bytes and text.

For instance, this will fail in Python 3:

>>> 'aaa' + b'bbb'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly
>>>

Just for reference, in Python 2 it does not fail:

>>> 'aaa' + b'bbb'
'aaabbb'
>>>

Still under Python 2, let's use string formatting:

>>> template = '{}.{}'
>>> template.format('aaa', b'bbb')
'aaa.bbb'
>>>

In Python 3, this is the result:

>>> template = '{}.{}'
>>> template.format('aaa', b'bbb')
"aaa.b'bbb'"
>>>

In the context of a Python 3 migration, it would have been probably more in line with other changes to the language if this had been special-cased to reject such constructs altogether.

Otherwise, it initially led to rather inexplicable error messages because the code that produces such string constants may be completely unaware of where they are used further on. But witnessed once or twice, it was apparent later on what the root cause was and this could be easily dealt with.

Things that are missed

One small, yet convenient, feature of Python 2 was the availability of some of the common codecs directly in string objects, e.g.:

>>> u'abc'.encode('hex')
'616263'
>>> u'abc'.encode('base64')
'YWJj\n'
>>> u'ελληνική'.encode('idna')
'xn--jxangifdar'
>>>

This will not work as-is in Python 3:

>>> u'abc'.encode('hex')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: 'hex' is not a text encoding; use codecs.encode() to handle arbitrary codecs
>>>

Naturally, the functionality as such is still available in Python 3, just not via the same means.

Python 2.7

On the server side, Python 2.7 will be around for many years. After all, this is a great language that let thousands and millions of people complete amazing projects and most of enterprise applications do not get rewritten solely because one of the technical components (here, Python) changes in a way that is partly incompatible with previous versions.

Both RHEL and Ubuntu ship with Python 2.7 and both of them have long-term support well into the 2020s so the language as such will not go away. Yet, piece by piece, all the applications will be changed, modified, modularized or rewritten and gradually Python 2.7's usage will diminish.

In Zato, Python 2.7 will be supported for as long as it is feasible and one of the current migration's explicit goals was to make sure that existing user Zato environments based on Python 2.7 will continue to work out-of-the-box with Python 3 so there is no difference which Python version one chooses - both are supported and can be used.

Summary

An extraordinary aspect of the migration is that it was so unextraordinary. There were no really hard-won battles, no true gotchas and no unlooked-for hurdles. This can be likely attributed to the facts that:

Core Python developers offered information what to expect during such a job
Unicode was not treated as an afterthought
Zato reuses common libraries that are all ported to Python 3 already
Internet offers guides, hints and other pieces of information about what to do
It was easy to test Zato components in isolation
Time was explicitly put aside for the most difficult parts without having to share it with other tasks