Why you should still read the docs

May 2023 ∙ nine minute read

Do you feel you're fighting your tools?

Do you feel you're relying too much on autocomplete and inline documentation? ...always kinda guessing when using libraries?

Or maybe not, but getting things done just seems harder than it should be.

This can have many causes, but I've repeatedly seen junior developers struggle with a specific one, not even fully aware they're struggling in the first place.

This is a story about:

  • why you should still read the docs
  • finding the right way of doing things
  • command-line interfaces

tl;dr: Most good documentation won't show up in your IDE – rather, it is about how to use the library, and the problem the library is solving.

An example #

OK, instead of telling you, let me show you – we'll go through one possible path of developing something, and stop along the way to point out how we're doing things.


Say you're writing a command-line tool that gets data from an API and saves it in a CSV file. It might look something like this:

import csv
import time
import click

def get_data():
    # pretend some API calls happen here
    time.sleep(2)
    return [
        {"name": "Chaos", "short_name": "Chs"},
        {"name": "Discord", "short_name": "Dsc"},
        {"name": "Confusion", "short_name": "Cfn"},
        {"name": "Bureaucracy", "short_name": "Bcy"},
        {"name": "The Aftermath", "short_name": "Afm"},
    ]

def write_csv(data):
    with open("calendar.csv", 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'short_name'])
        for row in data:
            writer.writerow([row['name'], row['short_name']])

@click.command()
def main():
    """Retrieve the names of the Discordian seasons to calendar.csv."""
    data = get_data()
    write_csv(data)

if __name__ == '__main__':
    main()

Here it is in action:

$ python seasons.py --help
Usage: seasons.py [OPTIONS]

  Retrieve the names of the Discordian seasons to calendar.csv.

Options:
  --help  Show this message and exit.
$ python seasons.py
$ cat calendar.csv
name,short_name
Chaos,Chs
Discord,Dsc
Confusion,Cfn
Bureaucracy,Bcy
The Aftermath,Afm

You're using Click because it’s powerful but comes with sensible defaults.

This is pretty good code – instead of calling get_data() from write_csv(), you're passing its result to write_csv() – in other words, instead of hiding the input, you've decoupled it.1

As a result, write_csv() doesn't need to change when adding new arguments to get_data(), and we can test it without monkeypatching get_data():

import seasons

DATA = [
    {"name": "one", "short_name": "1"},
    {"name": "two", "short_name": "2"},
]
CSV_BYTES = b"name,short_name\r\none,1\r\ntwo,2\r\n"

def test_write_csv(tmp_path, monkeypatch):
    monkeypatch.chdir(tmp_path)
    seasons.write_csv(DATA)
    assert tmp_path.joinpath('calendar.csv').read_bytes() == CSV_BYTES
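As an aside, the \r\n in CSV_BYTES is no accident: csv.writer terminates rows with \r\n by default (per RFC 4180), which is also why write_csv() opens the file with newline='' – without it, Python would translate those into \r\r\n on Windows. A quick check:

```python
import csv
import io

# csv.writer ends rows with "\r\n" by default,
# regardless of the platform's line separator.
buf = io.StringIO()
csv.writer(buf).writerow(["name", "short_name"])
assert buf.getvalue() == "name,short_name\r\n"
```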

The path is hardcoded #

But, what if you need to run it twice, and save the output in different files? This becomes useful as soon as the output can change. Also, having to be in the right directory when running it is not cool.

The obvious solution is to pass the path in:

@click.command()
@click.option('--path', required=True)
def main(path):
    """Retrieve the names of the Discordian seasons."""
    data = get_data()
    write_csv(data, path)

def write_csv(data, path):
    with open(path, 'w', newline='') as f:

We're definitely on the right track here – the test is shorter, and we don't even need to monkeypatch the current working directory anymore:

def test_write_csv(tmp_path):
    out_path = tmp_path.joinpath('out.csv')
    seasons.write_csv(DATA, out_path)
    assert out_path.read_bytes() == CSV_BYTES

The path might not be valid #

But, what if you pass a directory as the path?

$ time python seasons.py --path .
Traceback (most recent call last):
  ...
IsADirectoryError: [Errno 21] Is a directory: '.'
python seasons.py --path .  0.07s user 0.02s system 4% cpu 2.104 total

OK, it fails with a traceback. Worse, we waited for the API calls only to throw the output away. Thankfully, Click has fancy parameter types that can take care of that:

@click.command()
@click.option('--path', type=click.Path(exists=False, dir_okay=False, writable=True), required=True)
def main(path):

$ time python seasons.py --path .
Usage: seasons.py [OPTIONS]
Try 'seasons.py --help' for help.

Error: Invalid value for '--path': File '.' is a directory.
python seasons.py --path .  0.08s user 0.02s system 92% cpu 0.112 total

Not only do we get a nice error message, we get it instantly!
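Under the hood, click.Path is doing roughly this kind of check before the command body ever runs. Here's a simplified sketch of the idea – not Click's actual implementation, and validate_output_path is a made-up name:

```python
import os

def validate_output_path(path):
    # Simplified sketch: reject directories up front,
    # the way click.Path(dir_okay=False) does,
    # so we fail before making any API calls.
    if os.path.isdir(path):
        raise ValueError(f"File {path!r} is a directory.")
    return path
```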

The path to what, exactly? #

Thing is, --path isn't very descriptive. I wonder if there's a better name for it...

One thing I like to do is to look at what others are doing.

$ curl --help
Usage: curl [options...] <url>
 -o, --output <file> Write to file instead of stdout
 -O, --remote-name   Write output to a file named as the remote file
 ...
$ wget --help
GNU Wget 1.21.2, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...
  -o,  --output-file=FILE          log messages to FILE
  -O,  --output-document=FILE      write documents to FILE
$ sort --help
Usage: sort [OPTION]... [FILE]...
  -o, --output=FILE         write result to FILE instead of standard output
$ pandoc --help
pandoc [OPTIONS] [FILES]
  -o FILE               --output=FILE

Better yet, sometimes there's a comprehensive guide on a topic, like Command Line Interface Guidelines ...and under Arguments and flags, we find this:

Use standard names for flags, if there is a standard. If another commonly used command uses a flag name, it’s best to follow that existing pattern. That way, a user doesn’t have to remember two different options (and which command it applies to), and users can even guess an option without having to look at the help text.

Here’s a list of commonly used options: [...]

  • -o, --output: Output file. For example, sort, gcc.

Standard output #

Further down below, there's another guideline:

If input or output is a file, support - to read from stdin or write to stdout. This lets the output of another command be the input of your command and vice versa, without using a temporary file. [...]

I wonder how we could achieve this.

One way would be to check the value and use sys.stdout if it's -.
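That manual check might look something like this (a sketch; open_output is a hypothetical helper):

```python
import sys

def open_output(path):
    # Hypothetical helper: treat "-" as "write to stdout",
    # otherwise open the named file for writing.
    if path == "-":
        return sys.stdout
    return open(path, "w", newline="")
```

It works, but now someone has to remember not to close stdout, handle open() errors nicely, and so on – exactly the kind of bookkeeping a library could do for us.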

But let's pause and look at the Path docs a bit:

allow_dash (bool) – Allow a single dash as a value, which indicates a standard stream (but does not open it). Use open_file() to handle opening this value.

Encouraging; following along, open_file() says:

Open a file, with extra behavior to handle '-' to indicate a standard stream, lazy open on write, and atomic write. Similar to the behavior of the File param type.

What's this File, eh?

Declares a parameter to be a file for reading or writing. [...] The special value - indicates stdin or stdout depending on the mode. [...]

See File Arguments for more information.

Which finally takes us away from the API reference to the actual documentation:2

Since all the examples have already worked with filenames, it makes sense to explain how to deal with files properly. Command line tools are more fun if they work with files the Unix way, which is to accept - as a special file that refers to stdin/stdout.

Click supports this through the click.File type which intelligently handles files for you. [...]

OK, OK, I get it, let's use it:

@click.command()
@click.option('-o', '--output', type=click.File('w', lazy=False), required=True)
def main(output):
    """Retrieve the names of the Discordian seasons."""
    data = get_data()
    # click.File doesn't have a newline argument
    output.reconfigure(newline='')
    write_csv(data, output)

By passing a file object to write_csv(), the output side of the I/O is decoupled too:

def write_csv(data, file):
    writer = csv.writer(file)
    writer.writerow(['name', 'short_name'])
    for row in data:
        writer.writerow([row['name'], row['short_name']])

Interestingly enough, csv.writer already takes an open file.
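In hindsight, that's a hint we could have taken earlier – csv.writer is happy with anything that has a write() method:

```python
import csv
import sys

# csv.writer accepts any object with a write() method –
# a file on disk, an in-memory io.StringIO, or sys.stdout.
writer = csv.writer(sys.stdout)
writer.writerow(["name", "short_name"])
```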

Anyway, it works:

$ python seasons.py --output -
name,short_name
Chaos,Chs
Discord,Dsc
Confusion,Cfn
Bureaucracy,Bcy
The Aftermath,Afm

Once again, the test gets simpler too – instead of writing anything to disk, we can use an in-memory stream:

def test_write_csv():
    file = io.StringIO()
    seasons.write_csv(DATA, file)
    assert file.getvalue().encode() == CSV_BYTES

...by default #

Some of the help texts we looked at say "write to FILE instead of stdout"...

Why instead?

Using commands in a pipeline is common enough that most Unix commands write to stdout by default; some don't even bother with an --output option, since you can always redirect stdout to a file:

$ tail -n+2 calendar.csv | head -n2
Chaos,Chs
Discord,Dsc
$ tail -n+2 calendar.csv > calendar-no-heading.csv
$ head -n2 calendar-no-heading.csv
Chaos,Chs
Discord,Dsc

It is so common that CLI Guidelines makes it the third thing in The Basics:

Send output to stdout. The primary output for your command should go to stdout. Anything that is machine readable should also go to stdout—this is where piping sends things by default.

We could remove --output, but with Click, doing both is trivially easy:

@click.option('-o', '--output', type=click.File('w', lazy=False), default='-')
$ python seasons.py | tail -n+2 | head -n2
Chaos,Chs
Discord,Dsc

Discussion #

Looking at our journey, I can't help but notice a few things:

  • The Click API reference seems to constantly funnel readers to the user guide part of the documentation, the part that tells you how and why to use stuff.
  • Click mentions supporting - for stdout, and more than once.
  • The Click user guide discusses File arguments before Path (in part, because it's more commonly used, but also because it's more useful out of the box).
  • csv.writer takes an open file, just like our own write_csv() does at the end.

These all hint that it may be a good idea to structure code a certain way; the more we use things the way they "want" to be used, the cleaner the code and the tests get.

Good libraries educate #

I am tempted to call Click a great library.

Yes, it does its job more than well, and has lots of features.

But, more importantly, its documentation educates the reader about the subject matter – it goes beyond API docs into how to use it, how to get specific things done with it, and the underlying problem it is solving.

Here are a few more examples:

  • Classics – other libraries in the Flask/Pallets ecosystem (like Werkzeug and Jinja), Requests, pytest, and many others – popular and old enough libraries usually have great documentation.
  • Python itself – you would be amazed at how many developers use Python every day without having gone through the tutorial even once.
  • feedparser teaches you about all kinds of Atom and RSS quirks, encodings on the web, HTTP, and how to be nice to servers you're making requests to.
    • feedparser is close to my heart because I'm using it for my feed reader library, reader, where I too am doing my best to have docs worth reading.
  • Stepping away from Python a bit, I've learned a lot from both Dagger and Mockito (and for the latter, the API docs for the "main" class pack way more than just API, and that will show up in your IDE).
  • Beancount, a command-line accounting tool, does a great job of teaching double-entry bookkeeping (it taught me almost all I know about the topic).

Read the docs #

In a way, this whole article is a ruse to tell you to RTFM, but with a bit of nuance.

Read the documentation to find out how, and more importantly, why – a mental model of the thing you're using will make you more productive, and will improve any guesses you make.

Aside from the user guide part, I also like to skim the reference for libraries I use a lot; this gives me an idea of what's available, and often I remember later that "hey, there was something in the docs about X".

But the docs suck #

Sometimes, documentation sucks or is missing entirely (doubly so for internal code). It's easy to get into the habit of not bothering to look when, most of the time, it's not there.

Still, always check. Worst case, you'll have wasted a minute.

On the other hand, docs should definitely be one of the criteria when picking a library.

But I'm using lots of libraries #

Quoting a friend:

Well, for you, it might be "a library", but for them it's 10 new libraries, and a 2-day deadline.

Like with most things, the Pareto principle applies:

  • Read the docs for big stuff you use often – language, standard library, test framework, web framework or database of choice, and so on; these have an outsized return on investment, and start paying out quite quickly.
  • Don't read all the docs; skimming is often enough.

On the other hand, there is value in not using that many libraries.

Conclusion #

To sum up:

  • Good libraries educate on the subject matter.
  • Read the documentation of the things you are using a lot.
  • If you feel you're fighting your tools, maybe you're using them wrong (read the docs!), but maybe they're not all that good; it's OK to look for better ones.

I'll end with a related idea from a post called Why I Don’t Use Autocomplete:

But be sure that if you do use autocomplete that you’re not leaning on it as a crutch. Be sure that you’re not deciding what to write as you write it [...] and not just using a method because the IDE tells you it’s available.


  1. For a detailed example of decoupling I/O, how it affects testing, plus interesting historical background, check out Brandon Rhodes' great talk The Clean Architecture in Python. [return]

  2. Is it not written, "A few hours of trial and error can save you five minutes of reading the docs"? [return]