Using Data Validation for Robust APIs

Over the past few years, I have worked on two types of API projects: those that implemented proper data validation, and those that did not. Believe me, it made a huge difference!
I mostly worked on HTTP APIs and backends, where validating the body of a POST/PUT/PATCH request is a common step.

Handling unexpected input is quite a challenge when implementing an API. You need to check that the input is well-formed JSON/XML/... (easy), and then you have to ensure that the fields make sense: no missing mandatory field, correct types, reasonable values, ...

I have seen too much software with no proper input validation. That caused many bugs that were sometimes very hard to find. For code written in PHP or JS (in non-strict mode), the lack of validation led to subtle bugs where some fields were implicitly cast to null or undefined and sneakily propagated into the rest of the code. Sometimes, wrong values managed to reach the database and caused quite a mess.
In stricter languages like Python, you end up with two kinds of runtime errors. There are the easy ones, where something like x['field'] raises a KeyError exception right away. And there are the more difficult ones, where x.get('field') returns None and causes a TypeError later in the code.
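
To make the two failure modes concrete, here is a minimal sketch. The `payload` dictionary and the helper functions are hypothetical and only serve the illustration.

```python
# Hypothetical payload where the mandatory 'name' field is missing.
payload = {"uid": 42}


def lookup_strict(data):
    # Fails immediately, on the very line that reads the field.
    return data["name"]


def lookup_lenient(data):
    # Returns None silently; the failure is deferred to whoever uses the value.
    return data.get("name")


def greet(name):
    # Far away from the lookup: concatenating str and None raises TypeError.
    return "Hello, " + name


try:
    lookup_strict(payload)
except KeyError as exc:
    early_error = exc  # easy to trace back to the missing field

name = lookup_lenient(payload)  # None, and no error yet
try:
    greet(name)
except TypeError as exc:
    late_error = exc  # the traceback points at greet(), not at the lookup
```

The second error is the harder one to debug because the traceback shows where the None was used, not where it came from.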

It is crucial to reject any unintended value before any processing that could break the program or compromise the integrity of the database. (Side note: if your DB supports it, use constraints as a last line of defense.)
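
As an illustration of the side note on constraints, here is a minimal sketch using SQLite from Python's standard library. The `student` table, its columns, and the `insert_student` helper are assumptions made up for this example; other databases offer similar CHECK constraints with their own syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE student (
        uid   INTEGER PRIMARY KEY,
        age   INTEGER NOT NULL CHECK (age BETWEEN 18 AND 35),
        name  TEXT    NOT NULL CHECK (length(name) <= 50),
        grade TEXT    NOT NULL DEFAULT 'A' CHECK (grade IN ('A', 'B', 'C'))
    )
    """
)


def insert_student(uid, age, name, grade="A"):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO student (uid, age, name, grade) VALUES (?, ?, ?, ?)",
                (uid, age, name, grade),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised by SQLite when a CHECK or NOT NULL constraint fails.
        return False


stored = insert_student(42, 25, "John Doe", "B")
too_young = insert_student(43, 17, "Jane Doe", "A")
bad_grade = insert_student(44, 25, "Jim Doe", "Z")
```

Even if a bug lets a bad value slip past the API layer, the database refuses to store it.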

Input Validation

Let's suppose we are building a platform allowing students to subscribe to a service. Our API expects a POSTed JSON document which looks like the following:

# uid: integer, mandatory.
# age: integer, mandatory, must be a legit age (between 18 and 35, inclusive)
# name: string, mandatory, max length is 50
# grade: If provided, should be A, B, or C. By default, A

{
  "uid": 42,
  "age": 25,
  "name": "John Doe",
  "grade": "B"
}

Here is how we could validate it by hand:

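import json

try:
    # Assumes req.post.body holds the raw request body as str or bytes
    data = json.loads(req.post.body)
except ValueError:  # includes json.JSONDecodeError
    raise HttpError.BadRequest('Ill-formed JSON')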

try:
    uid = int(data['uid'])
except (KeyError, TypeError, ValueError):
    raise HttpError.BadRequest('Expecting uid:int')

try:
    age = int(data['age'])
    if not 18 <= age <= 35:
        raise ValueError('Bad age')
except (KeyError, TypeError, ValueError):
    raise HttpError.BadRequest('Expecting age:int within the range [18;35]')

try:
    name = str(data['name'])
    if len(name) > 50:
        raise ValueError('Max limit 50')
except (KeyError, TypeError, ValueError):
    raise HttpError.BadRequest('Expecting name:str of at most 50 characters')

try:
    # grade is optional and defaults to 'A'
    grade = data.get('grade')
    if grade is None:
        grade = 'A'
    if grade not in ('A', 'B', 'C'):
        raise ValueError('A, B or C expected')
except ValueError:
    raise HttpError.BadRequest('Expecting grade: A, B or C')

backend.do_something(uid, age, name, grade)

However, this ad-hoc approach has some problems.

  • It is tedious and error-prone. Here, our use case is still simple. But what if we have to introduce a new rule saying that a person over 30 cannot subscribe with a C grade?
  • The code is not easy to read, and the intent is not clear. As the validation rules grow more complex, it becomes more and more difficult to tell what the code expects (assuming there is no bug).
  • How do we reuse this validation process elsewhere in our code? We can imagine that another endpoint accepts a list of students, each one being subject to the same validation rules.

Introducing Cerberus

Fortunately, we can do better and use a schema-based data validation tool. In Python, there is the Cerberus library, which uses an easy-to-read schema definition based on Python's dict. We are using Cerberus in this article, but the idea applies whatever the language or library.

The idea of Cerberus is to provide a declarative schema definition of what we expect. We define a dictionary for that, but it is also possible to import a schema from a JSON file or a YAML file. Then, the engine makes sure any submitted data complies with our schema.

The example above can be rewritten like this:

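import json

from cerberus import Validator

try:
    # Assumes req.post.body holds the raw request body as str or bytes
    data = json.loads(req.post.body)
except ValueError:  # includes json.JSONDecodeError
    raise HttpError.BadRequest('Ill-formed JSON')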

student_schema = {
    'uid': {
        'required': True,
        'type': 'integer',
    },
    'age': {
        'required': True,
        'type': 'integer',
        'min': 18,
        'max': 35,
    },
    'name': {
        'required': True,
        'type': 'string',
        'maxlength': 50,
    },
    'grade': {
        'required': False,
        'type': 'string',
        'allowed': ['A', 'B', 'C'],
    }
}

student_validator = Validator(student_schema)

if not student_validator.validate(data):
    msg = student_validator.errors
    raise HttpError.BadRequest(msg)

# Now, we can safely process our data

uid = data['uid']
age = data['age']
name = data['name']
grade = data.get('grade', 'A')

backend.do_something(uid, age, name, grade)

By using a data validation framework, we gain several benefits. Now, our code is:

  • Documented: The code is self-documenting because it is explicit and declarative about what is expected. Any developer can read it at a glance. The schema can also be pretty-printed and included in the API docs.

  • Decoupled: Validation is separated from the application logic. Testing a set of JSON inputs against a set of schemas does not require running the application, which makes testing easier.

  • Reusable: The schema can be reused elsewhere in the code, and in the frontend too. We can send the validation schema (in JSON-encoded form) to the application frontend, which can use it to generate and validate a form dynamically with JavaScript, for example.

What about errors when the input does not comply with the schema? Cerberus returns a dictionary pointing out the errors. Suppose we provide non-compliant data such as the following:

{
  "uid": 42,
  "name": "John Doe",
  "age": 17,
  "grade": "Z"
}

We get the following error dictionary from Cerberus (the msg variable):

{
    'grade': ['unallowed value Z'],
    'age': ['min value is 18']
}

Cerberus reports every error it has encountered, not just the first one. The dictionary can be sent back (in JSON-encoded form) to the client, which can display precise and helpful error messages to the user/API client.
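
Here is a minimal sketch of turning that error dictionary into a response body. The `error_response` helper and the envelope keys (`error`, `details`) are assumptions; use whatever shape your API already follows.

```python
import json

# Error dictionary as returned by Cerberus for the non-compliant payload above.
errors = {
    "grade": ["unallowed value Z"],
    "age": ["min value is 18"],
}


def error_response(errors):
    """Build a (status, body) pair for a 400 Bad Request response."""
    body = json.dumps(
        {"error": "validation_failed", "details": errors},
        sort_keys=True,
    )
    return 400, body


status, body = error_response(errors)
```

Because the details are keyed by field name, a frontend can attach each message to the matching form input.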

Final Words about Cerberus

As for the Cerberus library itself, it supports many complex validation rules, such as interdependent fields, mutually exclusive fields, regex matching, nesting one schema inside another, and much more!

Of course, Cerberus can be extended with user-defined functions to provide custom rules that match things such as IP addresses, valid email addresses, and so on.

Using such a tool is easy, and it provides excellent benefits to improve software quality. So, let's use it more in our software!