Hacker News
Ask HN: How do you deal with large Python code bases?
8 points by StefanWestfal on May 27, 2023 | 14 comments
Currently I am working on a Python code base with 10k - 100k loc. The service does not depend on Python SciPy tooling or similar but is a web backend on top of a large db. I have worked with Python for several years and it is still my go-to language for a lot of tasks. However, I feel that Python's advantages, like dynamic typing, increasingly become disadvantages as the code base grows. How do you deal with Python in large code bases? We use MyPy, black, Tox, pytest etc. Any tips? I feel like (for the next project) moving to a typed language like Kotlin/Go etc. might be a better choice than using Python when not depending on the SciPy/ML/DL stack.



Python _is_ typed. Python is Dynamically Typed. The word "Dynamically" does not suddenly erase the word "Typed". Even CPython is Dynamically Typed.

You presumably mean Statically, Manifestly Typed.

Assembly Language and Forth are Untyped. The closest thing they have to types are bytes and cells.

For examples of a language that is Statically but not Manifestly Typed, have a look at Shedskin (which is a dialect of Python) or Haskell (which is a fascinating Functional, Lazy language with a regrettably poor compiler).

I continue to see mypy, or something like it, as the best way of taking a medium sized Python project into the world of large projects. For smaller projects, ruff or pylint are sufficient, but for large projects, mypy and similar are the way to go.
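
To make the dynamic/static distinction above concrete, here is a minimal sketch. The function names `add_untyped` and `add_annotated` are made up for illustration and are not from any project mentioned in this thread.

```python
def add_untyped(a, b):
    # No annotations: nothing is checked until this line actually runs.
    return a + b


def add_annotated(a: int, b: int) -> int:
    # CPython does not enforce annotations at runtime; a checker such as
    # mypy reads them and can flag a bad call before the code runs.
    return a + b


# Dynamically typed is still typed: values carry types, and mixing
# incompatible ones fails at runtime instead of being silently coerced.
try:
    add_untyped("1", 1)
except TypeError as exc:
    runtime_error = exc
else:
    runtime_error = None

print(type(runtime_error).__name__)
```

A call such as `add_annotated("1", 1)` would fail at runtime in the same way, but mypy should report the argument type mismatch without running the code, which is the part that matters as a project grows.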


If databases are in play, instead of trying to understand the business logic, have a go at understanding the database structure first.

What are your tables, how are they related, etc.

Then look at the modules that interact with the data layer directly and move up.

This is a language agnostic approach though, but it has worked out well for me.
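
A rough sketch of that first step, assuming SQLite purely so the example is self-contained (other databases expose the same information through their own catalog views). The `customers`/`orders` schema and the helper names are hypothetical.

```python
import sqlite3

# Hypothetical two-table schema, built in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total_cents INTEGER NOT NULL
    );
    """
)


def list_tables(conn):
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def list_foreign_keys(conn, table):
    """Return (column, referenced_table, referenced_column) for each foreign key."""
    # The table name comes from sqlite_master in this sketch, not from user input.
    # PRAGMA foreign_key_list columns: id, seq, table, from, to, ...
    rows = conn.execute(f"PRAGMA foreign_key_list({table})")
    return [(row[3], row[2], row[4]) for row in rows]


for table in list_tables(conn):
    print(table, list_foreign_keys(conn, table))
```

Once the relationships are printed out, the modules that touch `orders` and `customers` directly are the natural next place to read.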


That method works well for initially building the system too. Start in the db and model the relationships and then move up.


What exactly are your problems? It is hard to give advice without knowing. To be honest I don't think it makes things easier if you switch to Go, since less verbose.


It gives you a lot of freedom and doesn't force many rules like other non-dynamic languages do. This makes it great for trying out ideas and working with data and machine learning. However, when you're writing more complex programs, you need to be more careful. That's why I feel like I encounter more bugs during runtime compared to other languages. I mentioned Go because it makes you stick to a smaller number of ways to do things, which feels more "pythonic".


Hard to tell, but that sounds a bit like a test problem. If you have high test coverage (which you must have in a language like Python), most remaining bugs should only arise from complex interactions.


Regression tests.

Static typing still doesn't help for the more insidious issues - errors of value.
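
A small sketch of what an "error of value" can look like, and how a regression test pins it down. The function `apply_discount` and the bug described in the comment are hypothetical.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Return the price after a percentage discount, in cents."""
    # A hypothetical buggy version returned `price_cents * percent // 100`,
    # i.e. the discount amount rather than the discounted price. Both versions
    # take and return ints, so a type checker sees no difference.
    return price_cents - price_cents * percent // 100


def test_discount_regression() -> None:
    # Pins the behaviour the hypothetical bug broke, so it cannot return unnoticed.
    assert apply_discount(1000, 10) == 900
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(999, 50) == 500


test_discount_regression()
```

With pytest, the explicit call on the last line would be dropped and the test function collected automatically.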


I used to use dataclasses for value assertions, but have switched to using https://www.attrs.org/en/stable/. It's like the one tool you will need for 90% of the cases.
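
For readers unfamiliar with the idea, this is roughly what a value assertion looks like with a standard-library dataclass. The `Discount` class is a made-up example. attrs offers field validators for the same purpose; it is left out here only to keep the sketch dependency-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discount:
    percent: int

    def __post_init__(self) -> None:
        # The type says "int"; the assertion says which ints are acceptable.
        if not 0 <= self.percent <= 100:
            raise ValueError(
                f"percent must be between 0 and 100, got {self.percent}"
            )


valid = Discount(percent=10)

try:
    Discount(percent=150)
except ValueError as exc:
    rejected = str(exc)
else:
    rejected = None
```

The point is that an invalid value is rejected where the object is created, not several calls later where the symptom shows up.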


How are you finding mypy for your project?

I find that I waste a lot of time telling mypy to ignore some library that it doesn't understand, much more than I'm saving by catching type errors (which is never). People add it to projects because it seems like a best practice and the proper thing to do, but I'm kind of unconvinced that it's making things better.


> telling mypy to ignore some library that it doesn't understand

Such as?


For me it helps with smaller mistakes like typos, naming collisions etc. But it can be more verbose.


Have a look at www.github.com/odoo/odoo . It's a full-blown ERP and more, in Python.


The larger your codebase gets the more bazel becomes a requirement. Bazel is really not negotiable for large python code bases. The more bazel is put off, the more pain you will endure before you eventually are forced to use bazel. You will be forced to use bazel or a system like it because eventually your good devs will not tolerate your codebase and leave without it.

https://bazel.build/

Other than bazel you will have to start hacking away at dependency problems.

No inline imports, no circular imports. Imports all sorted at the top of your file. You will have to start enforcing good hygiene with linters.

You will need to create warnings against using the global scope.

You will need to construct the clients for all your dependencies in main()

You will need to discourage calling non-trivial functions in constructors. (This property largely encourages dependency injection.)

There are exceptions to every rule, but if you are going to violate scope or not dependency inject, those things need to be done very mindfully.

As the structure of your code improves via good scoping and injected dependencies, it will become easier to change and easier to test.
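
A minimal sketch of the rules above, under the assumption of a single dependency. All names (`UserStore`, `InMemoryUserStore`, `Greeter`) are hypothetical; the in-memory store stands in for a real database client.

```python
from typing import Protocol


class UserStore(Protocol):
    def get_name(self, user_id: int) -> str: ...


class InMemoryUserStore:
    """Stand-in for a real database client; also usable as a test fake."""

    def __init__(self, names: dict[int, str]) -> None:
        # The constructor only stores what it is given: no I/O, no heavy work.
        self._names = names

    def get_name(self, user_id: int) -> str:
        return self._names[user_id]


class Greeter:
    def __init__(self, store: UserStore) -> None:
        # The dependency is injected, not imported from a global or built here.
        self._store = store

    def greet(self, user_id: int) -> str:
        return f"Hello, {self._store.get_name(user_id)}!"


def main() -> str:
    # Every dependency client is constructed in one place.
    store = InMemoryUserStore({1: "Ada"})
    greeter = Greeter(store)
    return greeter.greet(1)


if __name__ == "__main__":
    print(main())
```

Because `Greeter` never reaches for a global, a test can hand it any object with a `get_name` method, without patching imports.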

You will have to devote some serious consideration to how to quarantine business logic from server code. Generally, your product developers shouldn't be doing much outside of defining their data and altering business logic from within a route. If the place where business logic is executed is commingled with how data-stores are manipulated, you're going to have a bad time. Likewise if the place business logic is executed is commingled with the presentation of it to customers, you're going to have a bad time.
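
One way to picture that separation, as a sketch only. The `Order` type, `mark_shipped`, and `ship_order_route` are invented for illustration, and a plain dict stands in for the data store; no particular web framework is assumed.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Order:
    order_id: int
    total_cents: int
    shipped: bool = False


def mark_shipped(order: Order) -> Order:
    # Business rule only: no database access, no HTTP, no formatting.
    if order.total_cents <= 0:
        raise ValueError("cannot ship an order with no value")
    return replace(order, shipped=True)


def ship_order_route(order_id: int, orders: dict[int, Order]) -> dict:
    # Thin route: load, delegate, save, present.
    order = orders[order_id]        # data access
    updated = mark_shipped(order)   # business logic
    orders[order_id] = updated      # data access
    return {"id": updated.order_id, "shipped": updated.shipped}  # presentation
```

`mark_shipped` can be tested with nothing but an `Order` instance, which is the practical payoff of keeping the three concerns apart.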

Python does not have a culture of dependency injection because it's so easy to import antigravity and fly away. This makes writing tests hard and promotes spaghetti code. Lack of dependency injection (which means violating scoping) is the entropic force that makes codebases miserable as time increases.

Additionally, you will have to think hard about state. If you can't restart a process trivially, or balance traffic to a different machine trivially, you are going to make your operational people's lives hard. State belongs in state storage. Put it in an RDBMS, put it in redis, put it in memcached, put it in anything but a Python process's memory (or disk). This means that any two requests should be able to be sent to any two machines. This is a deeply important property for scaling.
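
A toy illustration of that property. `VisitCounter` is hypothetical, and the shared dict is only a stand-in for an external store such as redis or an RDBMS, so that the sketch runs without any services.

```python
from typing import MutableMapping


class VisitCounter:
    """Request handler that keeps no state of its own."""

    def __init__(self, store: MutableMapping[str, int]) -> None:
        self._store = store

    def handle(self, user: str) -> int:
        # Read-modify-write is fine for a sketch; a real external store
        # would use an atomic increment to stay correct under concurrency.
        count = self._store.get(user, 0) + 1
        self._store[user] = count
        return count


# Stand-in for external state storage (an assumption of this sketch).
shared_store: dict[str, int] = {}

# Two "processes" that could be on different machines.
worker_a = VisitCounter(shared_store)
worker_b = VisitCounter(shared_store)

first = worker_a.handle("alice")
second = worker_b.handle("alice")
print(first, second)
```

Either worker can be restarted or replaced between the two requests and the count is still right, because nothing lives in the worker itself.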

Lastly, if you do not have good answers for observability, in terms of time series data, log data, exception data, and event data (for observability only), you will have a bad time. These are generally the things it is ok to violate scope to use.
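
As a sketch of the "ok to violate scope" point, using only the standard library: a module-level logger plus a decorator that records timing and exceptions. The `timed` decorator and `divide` function are made up; a real setup would likely ship these records to a dedicated metrics and logging backend.

```python
import functools
import logging
import time

# Module-level logger: one of the few globals treated as acceptable above.
logger = logging.getLogger(__name__)


def timed(func):
    """Log the duration of each call, and log exceptions before re-raising."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.exception("call to %s failed", func.__name__)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.2f ms", func.__name__, elapsed_ms)

    return wrapper


@timed
def divide(a: float, b: float) -> float:
    return a / b
```

The decorated function does not take a logger as an argument, which is exactly the scope violation being tolerated here.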


I thought bazel was only needed for C++ builds? I haven't seen it used in Python projects. Is the utility of bazel for the C++ binaries that scientific libraries like numpy and pandas use?



