Python 3.12.0 from a supply chain security perspective (sethmlarson.dev)
120 points by zdw 7 months ago | 46 comments



This is great. On a related note, we recently compiled Python 3.12.0 to WebAssembly: running it in that sandbox should shield you from malicious behavior in almost any use case, even with external untrusted modules.

https://wasmer.io/python/python

Run it locally and see that filesystem and network access are completely sandboxed by default! :)

  wasmer run python/python


What’s the benefit of this over running Python in a container like Docker?

I know Docker is not 100% watertight, but it’s very unlikely a normal user encounters trojaned Python code that tries to break out of a container.

The run-time penalty for using WebAssembly for Python is pretty severe at the moment, at least in what I have tried.


I’ve personally never been hit by a trojaned Python project, so why use any container or jail at all?


Is this sarcasm? Hard to tell on the internet sometimes. Anyway:

* It's scary enough to be hit one time. It's no fun to have all one's passwords, accounts, and keys stolen, and sometimes to have files encrypted with a ransom demand on top of that.

* Ideally we'd like to have a secure-by-default world, where the mere act of installing a dependency or running a helper script doesn't compromise your whole system. We're almost there with our phones; we just need to rethink our desktops too.

Of course I'm not saying you must use containers/sandboxes. It's a tradeoff everyone has to make, but there are many reasons some people (me included) prefer to sandbox everything as much as reasonably possible.


Shouldn't the OS already do this kind of containment for its processes (based on the user)?

That's kind of the [functional context] approach I take with: https://github.com/matrixApi/encapsule


The typical user in Linux is pretty unconstrained. They can view a lot of global resources, perform arbitrary system calls, see other processes and other users, etc. It's not really a tight sandbox.

That's why Linux offers other, more powerful mechanisms for sandboxing, such as namespaces, which are the backbone of containers.
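
For a sense of what that looks like from Python itself: Python 3.12 happens to expose the underlying syscall as os.unshare(). A minimal sketch (Linux only, and it assumes unprivileged user namespaces are enabled on your distro):

    import os
    import socket

    # Enter fresh user + network namespaces. Afterwards the process
    # sees no network interfaces except an isolated, down loopback.
    # CLONE_NEWUSER is what lets an unprivileged process do this.
    os.unshare(os.CLONE_NEWUSER | os.CLONE_NEWNET)

    try:
        socket.create_connection(("example.com", 443), timeout=2)
    except OSError as exc:
        print(f"network unreachable, as intended: {exc}")

Container runtimes do the same with more namespace types (mount, PID, IPC, ...) plus cgroups and seccomp on top.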


Desktop and server operating systems do not prevent any process from accessing the file system. Any process can effectively steal any work on your file system, and can also mount a ransomware attack by encrypting your files.

This is not true for mobile operating systems like Android and iOS, where processes cannot see other processes or freely access the file system.



Sandboxing is always good. However, the problem is figuring out accurate per-module policies (what to allow/disallow). For example, a default block-all network policy would also block legitimate functionality.
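
You can feel that tension from inside CPython with an audit hook (sys.addaudithook, Python 3.8+) that vetoes all socket connections. To be clear, this only illustrates the policy problem; audit hooks are not a real sandbox, since native extensions and ctypes can bypass them entirely:

    import socket
    import sys

    def deny_network(event, args):
        # Veto every outbound connection attempt: the blunt
        # "default block-all" policy that also breaks legitimate
        # features such as update checks or API clients.
        if event == "socket.connect":
            raise PermissionError(f"network access denied: {args[1]}")

    sys.addaudithook(deny_network)

    # Any connection attempt now fails with PermissionError.
    socket.create_connection(("example.com", 443))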


Hugged to death?

    500: INTERNAL_SERVER_ERROR
    Code: FUNCTION_INVOCATION_FAILED
    ID: arn1::tpcjw-1696548167206-a333e98b1ba8


That's just a matter of perspective. My laptop is already perfectly isolated from your laptop, thank you very much.


> recently compiled Python 3.12.0 to WebAssembly

Any chain is as strong as its weakest link.


But not all composition is a chain.

A sandboxed solution is at least as strong as the sandbox.


As a curiosity, what would it entail to make the two tgz files byte-for-byte identical? There was/is some discussion in setuptools about how to normalize the tarball (https://github.com/pypa/setuptools/issues/2133#issuecomment-...) -- could something similar be applied to building Python itself?


The suggestion there (uid = gid = 1000; uname = user; gname = users) isn't great.

Just use uid = gid = 0, and omit uname/gname.

If you're distributing software via a tarball, the uid/gid bits are meaningless. They only make sense when you archive or back up a directory and plan to extract it on the same system.

If you set them to anything other than 0, it may happen that when the tarball is extracted as the root user, ownership is changed to the uid/gid in the tarinfo, provided those exist on the system. That's a lot of fun!

Python itself in fact tries to chown files when extracting a tarfile (under sudo).

If you set uid = gid = 0, then at least when extracting as root, the files remain owned by root.
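
FWIW, with Python's own tarfile module that normalization is a small filter (paths here are hypothetical; for fully byte-identical output you'd also have to pin file mtimes and the gzip header timestamp):

    import tarfile

    def scrub(info: tarfile.TarInfo) -> tarfile.TarInfo:
        # uid/gid 0 with empty uname/gname: extracting as root
        # leaves files owned by root, and nothing gets chowned
        # to whatever local user happens to share the ids.
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        return info

    with tarfile.open("Python-3.12.0.tgz", "w:gz") as tar:
        tar.add("Python-3.12.0", filter=scrub)

On the extraction side, Python 3.12's new tarfile extraction filters (PEP 706) deal with the chown behavior mentioned above: extractall(filter="data") drops ownership information entirely.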


Thanks for the advice, and I assume you are the one who commented on the upstream issue. This shows it is not trivial, and it would be nice for it to be done automatically by default.


I believe the only differences between the two tarballs were the uid/gid and username/groupname values. One had the information of Thomas Wouters, the release manager of 3.12, and the other had generic GitHub Actions usernames/groups.

Normalizing these values to something known like 0/0 would have done the trick.


Thanks for the article and taking the time to reply here.


> As a curiosity, what would it entail to make the two tgz byte-for-byte identical ?

It can't be that complicated. The tarballs autogenerated by GitHub (using `git archive`) were byte-for-byte identical for years, until GitHub upgraded git and things broke because entire ecosystems had started to rely on that.

[1] https://news.ycombinator.com/item?id=34586917


I suspect you're looking for pristine-tar(1)?

https://manpages.debian.org/stretch/pristine-tar/pristine-ta...

It's intended to solve exactly this problem, but in reverse -- a tarball is extracted to source, and we want to ensure that the sources we've extracted can be traced back to the original tarball.


Hmm, that is interesting. I'm thinking more that, in a perfect world, the pristine-tar delta file should be empty. (Assuming I understand correctly what pristine-tar is doing.)

For example, I tend to set SOURCE_DATE_EPOCH to the timestamp of the commit, to make sure that anything that embeds a time is reproducible without extra instructions or a manual, process-specific step.
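
Concretely, that convention looks something like this (a sketch; %ct is the committer timestamp of HEAD):

    import os
    import subprocess

    # Pin SOURCE_DATE_EPOCH to the commit being built so anything
    # that embeds a build time becomes reproducible by default.
    epoch = subprocess.check_output(
        ["git", "log", "-1", "--format=%ct"], text=True
    ).strip()
    os.environ["SOURCE_DATE_EPOCH"] = epoch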


My biggest takeaway from this article is the Supply-chain Levels for Software Artifacts (SLSA) security framework: https://github.com/slsa-framework/slsa-verifier


> biggest takeaway from this article is the Supply-chain Levels for Software Artifacts (SLSA) security framework

See also GUAC from Kusari, Google, Citi, and others:

“GUAC (Graph for Understanding Artifact Composition) aims to fill in the gaps by ingesting software metadata, like SBOMs, and mapping out relationships between software. When you know how one piece of software affects another, you’ll be able to fully understand your software security position and act as needed.”

https://guac.sh

https://www.kusari.dev


Is there any effort to integrate SLSA with PyPI? GitHub recently announced[1] that npm support for SLSA is GA now.

[1] https://github.blog/changelog/2023-09-26-npm-provenance-gene...


Great question! PyPI already supports Trusted Publishers [1], which gets you most of the benefits of SLSA build provenance (a provable link between artifacts and a public software repository). Implementing Trusted Publishers is the recommended first step for ecosystems looking to implement build provenance [2].

[1] https://docs.pypi.org/trusted-publishers/

[2] https://github.com/ossf/wg-securing-software-...

I don't think there's a big effort /right now/ to implement complete SLSA build provenance for PyPI and expose it for users to verify.


I have to say it makes sense to me that the binaries would be built from the source tarballs rather than from git; it seems a bit odd to have separate pipelines, with binaries produced from a (slightly) different source.


Didn't Anaconda show that the whole pip approach is flawed and that we need a language-agnostic/multi-language approach?


npm works great for JavaScript.



You can even use npm to manage and publish Python code, or at least you could a few years ago, if they haven't added constraints on the registry since then.


Correct me if I'm wrong, but SLSA would only prevent artifact tampering (e.g. an account takeover on PyPI), not, for example, typosquatting or build-script abuse?


You've got it right: SLSA build provenance in particular only tells you that the artifact you have came from X software repo, at Y commit/tag, built using Z workflow. SLSA doesn't say anything about what is actually in the artifact (but you can now safely verify the correct commit, knowing it was used as input).

Typosquatting is an interesting one, because if you've made a typo in one place but not the other (i.e. installing the package name "requestss" while the repo is "psf/requests"), then SLSA would "save" you by erroring on the mismatch. But that doesn't stop you from typoing in /both/ parameters.


This sounds great. If you want the PSF to make these changes, I think the next step would be a PEP that describes it.


This doesn't change the Python language or packaging, so it wouldn't require a PEP. I'm working with the release managers in this GitHub repo: https://github.com/python/release-tools


In what situation, when it comes to deployed products, is any of this relevant?

Having used Python for decades, across multiple organizations ranging from mega-corps down to five-programmer shops, I've never used prebuilt Python binaries for any project that required Python.

It's not hard to build your own, and it gives you better control of what's included (Python has a handful of optional compile-time dependencies, which most projects don't need).

----

Also, on a personal level, I don't think I've ever used prebuilt binaries from Python.org. I either build them myself or use whatever the distro maintainers built. Maybe it matters if you develop on Mac / Windows... but there you already chose to suffer by not having control of your tools -- another drop in the bucket, does it even matter?

NB. The official Docker images of Python also don't use binaries from Python.org. So anyone who deploys in containers is likely either building Python themselves or using something other than the Python.org binaries.


I think it is. There are still millions of devices which either use their own distribution's provided Python package or some form of prepackaged container.


Why would what was described in the article be of any use to these devices you describe?

If it's built by the distribution -- the article is irrelevant. If they use a prepackaged container -- the article is irrelevant :|


> but you already chose to suffer by not having control of your tools

~Gaslight much?~

Edit:

How about not insulting people who don’t share your point of view?


> In contemporary language, gaslighting is a colloquialism describing the subjective experience of having one's reality repeatedly questioned

I understand it's even more contemporary to start using the word for anything where you could otherwise respond "don't be a dick", but I'm not sure what alternative word we could still use to mean gaslighting if gaslighting gets hijacked for a general negative meaning.


Fair point! I was basing it more on this definition:

> to grossly mislead or deceive (someone) especially for one's own advantage

But even that doesn’t fit ideally, so I edited the post.


That's not a point of view; it's a consequence of a choice. If you use an operating system that denies you the right to verify its source, then why do you care whether an individual component allows you to do that? It doesn't make sense. Your system is compromised because you agreed to use a compromised system.

This doesn't mean that whoever does this is an idiot. It's about making choices that make sense. If you sign up for a boxing club but then start complaining about being hit in the face, you aren't being consistent. If you use Windows and then complain that some component doesn't have a good verification process, you are just as consistent as the guy who complains about being hit in the face at a boxing club.


Thanks for confirming your true colors. Have a nice life!


I have worked with Python for well over a decade, and the only time I have built my own is to test an upcoming version before it is readily available for my distro.

On Windows I just download from python.org and on Mac I get whatever homebrew gets me.

Never have I even thought about needing to optimise my Python executable.


> whatever homebrew gets me

Then you aren't using what's described in the article.

> On Windows

I've already commented on this. This is not a serious deployment. It doesn't matter if it's done with more supervision or less. It's for "recreational" use.


As someone that’s done both, you are certainly overstating the suffering endured by not compiling your own Python. I very much believe that this is a consequence of your ideology rather than any indication of frequency.


Suffering doesn't come from not compiling. Suffering comes from the inability to trust your system. Not everyone suffers from this; most don't care. But then the question is: why would they care in the case described by the article, if they don't give a rat's ass in general?



