thoughts and learnings in software engineering by Rotem Tamir

Analyzing Python Code with Python

A first dip into DIY static code analysis

One of the special things about software engineering as a profession is the possibility of ars-poetic work: part of our work is building tools that target our own work. Perhaps a few surgeons around the globe can design and forge their own scalpel, but for software engineers, building our own tooling is a day-to-day reality.

Recently, I’ve been working on migrating a large repository of code to be built with Bazel. To do this properly I had to generate well over a hundred different BUILD files. Doing so by hand would have been slow, error-prone and exhausting, so instead I opted to write a tool that would do this automatically for me.

When describing a BUILD module for Bazel, each test file must be defined as a *_test rule, which explicitly declares any dependencies it has. For example, in order to get Bazel to run a Python test, one would write something similar to:

py_test(
    name = "test_context",
    srcs = ["test_context.py"],
    main = "test_context.py",
    tags = [],
    deps = [
        ":context",
        "//libs/config",
    ],
)

The first step in writing a program that will generate this block for each test file in our repository is to figure out which of the files in our repository are test files. In the context of the repository I was working on, a test file could be defined as one that meets all of the following:

  1. The file has a *.py suffix
  2. The file contains a class which inherits from unittest.TestCase
  3. The class contains at least one method whose name begins with test_

Using ASTs for Fun and Profit

Sounds simple enough, surely something we can do with some regular expressions, right? Well, we could, but there are a lot of edge cases to consider, and wouldn’t it be great if we could see our code the same way the Python interpreter sees it?

Turns out there’s an easy way to do that: the Python standard library contains a neat little package named ast, short for Abstract Syntax Tree. Wikipedia defines ASTs as:

In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code.

and the ast package documentation describes it as:

The ast module helps Python applications to process trees of the Python abstract syntax grammar. The abstract syntax itself might change with each Python release; this module helps to find out programmatically what the current grammar looks like.

In software engineering, the practice of building programs that analyze another program’s source code without actually executing it is called static code analysis. By using an AST parser for the language we are interested in, we transform that source code into data that we can process just like any other. Using the Python standard library’s ast package, we can parse a block of Python code into a data structure which we can traverse and analyze. This will be very handy in answering whether a file contains a unit test!
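As a rough illustration of the kind of edge case that trips up pattern matching, here is a small sketch comparing a naive regular expression with an AST-based check. The regex, the sample source, and the helper names (regex_says_test, ast_says_test) are made up for this example and are not part of the tool described in this post. The sample module contains a class definition only inside a string literal.

```python
import ast
import re

# A module whose only "test class" lives inside a string literal.
SOURCE = '''
EXAMPLE = """
class Test(unittest.TestCase):
    def test_hello(self):
        pass
"""
'''

# Hypothetical heuristic: any line that looks like a TestCase subclass.
TESTCASE_RE = re.compile(r"^\s*class\s+\w+\(.*TestCase\):", re.MULTILINE)


def regex_says_test(source: str) -> bool:
    """Return True if the naive regex finds something that looks like a test class."""
    return TESTCASE_RE.search(source) is not None


def ast_says_test(source: str) -> bool:
    """Return True if the parsed module really contains a class definition."""
    tree = ast.parse(source)
    return any(isinstance(node, ast.ClassDef) for node in ast.walk(tree))


regex_result = regex_says_test(SOURCE)
ast_result = ast_says_test(SOURCE)
print(regex_result)  # True: the regex is fooled by the string contents
print(ast_result)    # False: the parser sees only an assignment
```

The regex could of course be hardened, but each fix tends to uncover another corner of the grammar, which is the work the parser already does for us.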

Finding all test files in the repo

The first step is walking the filesystem to find all candidate files:

import os

def is_test_file(path) -> bool:
	# TODO: impl
	pass

def find_test_files(repo_root):
	test_files = []
	for root, dirs, files in os.walk(repo_root):
		for file in files:
			if not file.endswith(".py"):
				continue
			path = os.path.join(root, file)
			if is_test_file(path):
				test_files.append(path)
	return test_files

Next, let’s see what we can do with ast.

Python 3.7.7 (default, Jul 15 2020, 21:51:02)
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> source = """
... class Test(unittest.TestCase):
...   def test_hello(self):
...     pass
... """
>>> import ast
>>> tree = ast.parse(source)
>>> print(tree)
<_ast.Module object at 0x10379c6d0>

Viewed through an AST visualizer, the Python AST for this snippet has the shape described below.

So when parsing a block of source code with ast.parse we get back a tree-like data structure that has:

  • Module at its root
  • With a body attribute which is a list of elements
  • In our file there is only one element, a ClassDef object with a name of Test
  • This ClassDef has two attributes we care about, body and bases
  • body has a single FunctionDef (our test method definition) which has a name of test_hello
  • bases is a list of base-classes our ClassDef inherits from, in our case, something that if we squint a little we can see is unittest.TestCase
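The structure described above can also be checked directly from code. The following is a minimal sketch that parses the same snippet and reads the attributes mentioned in the list; the variable names (class_def, method_def, base) are chosen just for this example. The exact text printed by ast.dump varies between Python versions, so no output is shown for it.

```python
import ast

source = """
class Test(unittest.TestCase):
  def test_hello(self):
    pass
"""

tree = ast.parse(source)

# The module body holds a single element: the class definition.
class_def = tree.body[0]
# The class body holds a single element: the test method.
method_def = class_def.body[0]
# The only base class is the attribute access unittest.TestCase.
base = class_def.bases[0]

print(type(tree).__name__)       # Module
print(class_def.name)            # Test
print(method_def.name)           # test_hello
print(base.value.id, base.attr)  # unittest TestCase

# A textual rendering of the whole subtree, handy when no visualizer is around.
print(ast.dump(class_def))
```

Note that the base class is an ast.Attribute node here. Had the class been written as class Test(TestCase), it would have been an ast.Name node with an id of TestCase instead.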

Great, there’s everything we need in here to make our decision. Our final method will look like:

import ast

def is_test_file(abspath) -> bool:
    # Read the file and parse it into an AST
    with open(abspath, "r") as f:
        data = f.read()
    source_ast = ast.parse(data)

    # Look for a top-level TestCase subclass with at least one test_ method
    for node in source_ast.body:
        if isinstance(node, ast.ClassDef) and _is_testcase_class(node):
            for class_node in node.body:
                if isinstance(class_node, ast.FunctionDef) and class_node.name.startswith("test_"):
                    return True
    return False

def _is_testcase_class(classdef: ast.ClassDef) -> bool:
    for base_class in classdef.bases:
        # if the test class looks like class Test(TestCase)
        if isinstance(base_class, ast.Name):
            base_name = base_class.id
        # if the test class looks like class Test(unittest.TestCase):
        elif isinstance(base_class, ast.Attribute):
            base_name = base_class.attr
        else:
            continue
        if base_name == 'TestCase':
            return True
    return False

In short:

  • We read the source file into a string, then parse it with ast.parse
  • We then scan the module’s body looking for ast.ClassDef (class definition) nodes.
  • If we find one, we check whether it inherits from unittest.TestCase by looking at the class definition’s bases attribute.
  • If we find a TestCase class, we iterate through that ClassDef’s body looking for a FunctionDef whose name begins with test_

Conclusion

Sure, we probably could have gotten OK results using a shell script with some heuristic-based regular expressions. But by using ASTs to parse the source code as the interpreter sees it, we get answers that hold up against edge cases such as odd formatting or code placed inside comments or strings. Being able to create the tools that make your job easier is one of the best things about being a software engineer, and being able to programmatically parse and analyze your own source code takes your tool-building possibilities to the next level!