
Insecurity and Python pickles


By Daroc Alden
March 12, 2024

Serialization is the process of transforming Python objects into a sequence of bytes which can be used to recreate a copy of the object later — or on another machine. pickle is Python's native serialization module. It can store complex Python objects, making it an appealing prospect for moving data without having to write custom serialization code. For example, pickle is an integral component of several file formats used for machine learning. However, using pickle to deserialize untrusted files is a major security risk, because doing so can invoke arbitrary Python functions. Consequently, the machine-learning community is working to address the security issues caused by widespread use of pickle.

It has long been clear that pickle can be insecure. LWN covered a PyCon talk ten years ago that described the problems with the format, and the pickle documentation contains the following:

Warning:

The pickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

That warning might give the impression that the creation of malicious pickles is difficult, or relies on exploiting flaws in the pickle module, but executing arbitrary code is actually a core part of pickle's design. pickle has supported running Python functions as part of deserializing a stored structure since 1997.

Objects are serialized by pickle as a list of opcodes, which will be executed by a custom virtual machine in order to deserialize the object. The pickle virtual machine is highly restricted, with no ability to execute conditionals or loops. However, it does have the ability to import Python modules and call Python functions, in order to support serializing classes.

When writing a Python class, the programmer can define a __reduce__() method that gives pickle the information it needs to store an instance of that class. __reduce__() returns a tuple of information needed to save and restore the object; the first element of the tuple is a callable — a function or a class object — that will be called in order to reconstitute the object. Only the names of callable objects are stored in the pickle, which is why pickle doesn't support serializing anonymous functions.
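
As a brief sketch of the mechanism (the Interval class is an invented example, not from the pickle documentation):

    import pickle

    class Interval:
        def __init__(self, start, stop):
            self.start = start
            self.stop = stop

        def __reduce__(self):
            # First element: a callable, stored in the pickle by name.
            # Second element: the arguments to call it with when unpickling.
            return (Interval, (self.start, self.stop))

    data = pickle.dumps(Interval(3, 7))
    restored = pickle.loads(data)  # unpickling calls Interval(3, 7)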

The ability to customize the pickling of a class is the secret to pickle's ability to store such a wide variety of Python objects. For objects without special requirements, the default object.__reduce__() method — which just stores the object's instance variables — usually suffices. For objects with more complicated requirements, having a hook available to customize pickle's behavior gives the programmer complete control over how the object is serialized.

Restricting pickle to named callable objects is a deliberate design choice with two advantages: it allows code upgrades, and it decreases the size of pickled objects, both before and after deserialization. The fact that pickle loads classes by name allows a programmer to serialize an object with a custom class, edit their program, and deserialize the object with the new semantics. This also ensures that unpickled objects don't come with an extra copy of their classes (and all the objects that those reference, etc.), which significantly reduces the amount of memory required to store many small unpickled objects.

pickle does support restricting which named callables can be accessed during unpickling, but finding a set of functions to allow without introducing the potential to run arbitrary code can be surprisingly difficult. Python is a highly dynamic language, and Python code is often not written with the security of unpickling in mind — both because security is not a goal of the pickle module, and because programmers often don't need to think about pickling at all.
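
The mechanism for restricting callables is to subclass pickle.Unpickler and override its find_class() method; here is a minimal sketch along the lines of the example in the pickle documentation:

    import builtins
    import io
    import pickle

    class RestrictedUnpickler(pickle.Unpickler):
        def find_class(self, module, name):
            # Every GLOBAL opcode passes through here; allow only a
            # small set of harmless built-ins and reject everything else.
            if module == "builtins" and name in {"range", "complex", "set"}:
                return getattr(builtins, name)
            raise pickle.UnpicklingError(
                "global '%s.%s' is forbidden" % (module, name))

    def restricted_loads(data):
        return RestrictedUnpickler(io.BytesIO(data)).load()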

A malicious pickle

The pickle documentation gives this example of a malicious pickle:
    import pickle
    pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
This pickle imports the os.system() function, and then calls it with "echo hello world" as an argument. This particular example is not terribly malicious; real-world malware using this technique usually executes Python code to set up a reverse shell, or downloads and executes the next stage of the malware. The built-in pickletools module shows how this byte stream is interpreted as instructions for the pickle machine.
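
Running pickletools.dis() over the same bytes disassembles them without executing anything:
    import pickletools
    pickletools.dis(b"cos\nsystem\n(S'echo hello world'\ntR.")
It produces the following listing: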
     0: c    GLOBAL     'os system'
    11: (    MARK
    12: S        STRING     'echo hello world'
    32: t        TUPLE      (MARK at 11)
    33: R    REDUCE
    34: .    STOP
GLOBAL is the instruction used to import functions and classes. REDUCE calls a function with the given arguments.

Widespread use

Because pickle is so convenient, it is used in many different applications. Programs that use pickle to send data to themselves — such as programs that use multiprocessing — mostly have little to worry about on the security front. But it is common, especially in the world of machine learning, to use pickle to share data between programs developed by different people.

There are several directories of machine-learning models, such as Hugging Face, PyTorch Hub, or TensorFlow Hub, that allow users to share the weights of pre-trained models. Since Python is a popular language for machine learning, many of these models are shared in the form of either raw pickle files or other formats that have pickled components.

Security researchers have found models on these platforms that embed malware delivered via unpickling. Security company Trail of Bits recently announced an update to its LGPL-licensed tool — fickling — for detecting these kinds of payloads. Fickling disassembles pickle byte streams without executing them to produce a report about suspicious characteristics. It can also recognize polyglots — files that appear to use one file format, but can be interpreted as pickles by other software.

The machine-learning community is certainly aware of these problems. The fact that loading a model is insecure is noted in PyTorch's documentation. Hugging Face, EleutherAI, and Stability AI collaborated to design a new format — called safetensors — for securely sharing machine-learning models. Safetensors files use a JSON header to describe the contained data: the shape of each layer of the model, the numeric format used for the weights, etc. After the header, a safetensors file includes a flat byte-buffer containing the packed weights. Safetensors files can only store model weights without any associated code, making it a much simpler format. The safetensors file format has been audited (also by Trail of Bits), suggesting that it might prove to be a secure alternative.
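
As a sketch of how simple that layout is (the file name here is hypothetical; the 8-byte little-endian length prefix and the JSON header are part of the published format):

    import json
    import struct

    def read_safetensors_header(path):
        with open(path, "rb") as f:
            # The first 8 bytes are a little-endian unsigned integer
            # giving the length of the JSON header that follows.
            (header_len,) = struct.unpack("<Q", f.read(8))
            header = json.loads(f.read(header_len))
        # The header maps each tensor name to its dtype, shape, and byte
        # offsets into the flat buffer occupying the rest of the file.
        return header

    # read_safetensors_header("model.safetensors")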

Even with safetensors becoming the new default format to save models for several libraries, there are still many older pickle-based models in regular use. As with any transition to a new technology, it seems likely that there will be a long tail of pickle-based models.

Hugging Face has started including security warnings on files that contain pickle data, but this information is only visible if users click through to view the files associated with a model, not if they only look at the "model card". Other sources of machine-learning models, such as PyTorch Hub and TensorFlow Hub, merely host pointers to weights stored elsewhere, and therefore do not do even that small check.

pickle's compatibility with many kinds of Python objects and its presence in the standard library make it an attractive choice for developers wishing to quickly share Python objects between programs. Using pickle within a single application can be a good way to simplify communication of complex objects. Despite this, using pickle outside of its specific use case is dangerously insecure. Once pickle has made its way into an ecosystem, it can be difficult to remove, since any alternative will have a hard time providing the same flexibility and ease of use.





Insecurity and Python pickles

Posted Mar 12, 2024 18:50 UTC (Tue) by pwfxq (subscriber, #84695)

You use an interface that is clearly labeled as being insecure and then are surprised to discover people can do bad things?

Colo[u]r me shocked.

Insecurity and Python pickles

Posted Mar 12, 2024 19:38 UTC (Tue) by atnot (subscriber, #124910)

My lesson would rather be that if you provide something useful in a way that cannot be used safely, it will inevitably expand in usage until it becomes a problem.

Pickle is used a lot by machine-learning folks because it's an easy way to checkpoint long-running jobs. I've used it for that myself. You *could* hook up custom JSON serialization or something, but it's a pretty huge pain to do in Python. And remember, most of these people are researchers first and programmers second. And yes, sure, the pickles might be insecure, but you're pulling down megabytes of Python code you don't understand to actually run the model anyway, so does it really matter?

And so, lacking better alternatives, the usage expands into new use cases until suddenly the theoretical issue becomes a practical one.

See also: PyYAML, which had all of these convenient functions for writing Python inline in your local configuration files. Until people started using it for data interchange. Or the naive file-format parsers of a nice convenient tool to resize your images. That then accidentally became the standard library people hooked up to their PHP sites.

Insecurity and Python pickles

Posted Mar 12, 2024 21:59 UTC (Tue) by Karellen (subscriber, #67644)

> And yes, sure, the pickles might be insecure, but you're pulling down megabytes of Python code you don't understand to actually run the model anyway, so does it really matter?

The thing is, a lot of non-computer people understand the difference between "code" and "data". They might understand it with varying levels of sophistication, but they often understand that running random programs from arbitrary websites can be inherently dangerous, in a way that viewing random pictures, or random videos, or listening to random mp3s, or reading random web pages, "shouldn't" be (genuine bugs aside).

Sure, they don't understand the code that runs a model, in the same way they don't understand the code that makes up a media player. But if they trust the entity that wrote the code, they don't need to. And they probably can (and should) trust Huggingface, or Microsoft.

And with that trusted code, they should be able to try out data files from anywhere, with relative safety, right? Right?

Because, as we've known from auto-executing macros in Office documents for over a quarter of a century now, making data executable is a bad idea. So no-one would be stupid enough to make that mistake again, in the mid-2020s, surely? You, as a regular user who's not a software dev, shouldn't have to check if data is safe to load, should you?

Insecurity and Python pickles

Posted Mar 13, 2024 17:43 UTC (Wed) by NYKevin (subscriber, #129325)

I'm not sure it's that easy. Most people know the difference between a .exe and all other file formats (assuming a Windows context, because let's face it, that's where the users are). They may or may not be able to recognize any of the following as potentially dangerous:

*.dll (dynamic linked library, analogous to *.so on Linux) - Windows automatically loads DLLs in the same directory as an executable before searching other directories, on the theory that C:\Program Files\ is supposed to be root-owned and you're not supposed to have random *.exe files lying around your homedir.
*.reg (Windows Registry entries, exported as files) - to be fair, Windows does display a warning message on these when you double-click, but if clicked through, it will import them back into your registry.
*.vbs (VBScript, a scripting language supported by ~all versions of Windows) and various other extensions associated with VBScript.
*.bat (MS-DOS batch files, another scripting language supported by ~all versions of Windows)
*.ps1 (PowerShell script, supported by modern versions of Windows) - to be fair, Windows defaults to the "edit" verb on double-click, and unsigned scripts won't run unless a system-wide setting is adjusted.
*.hlp (WinHelp file, supported by everything older than Windows 10) - deprecated primarily because it is hideously insecure, but when I Google it, different websites seem to identify different specific problems (macros, embedded DLLs, possibly others?).

And probably other formats that I don't know about.

Insecurity and Python pickles

Posted Mar 14, 2024 15:41 UTC (Thu) by smoogen (subscriber, #97)

We have been making this mistake for a lot longer than Office macros, and we will keep making it, because people's brains are wired, consciously and subconsciously, to choose convenience. Think about how everyone is supposed to drive 100% of the time following all the rules exactly... and think about the number of times something happens where you drive over a limit, forget to use a turn signal, don't check your mirrors every 5-10 seconds, etc. We all know we should be doing these things all the time, but there is some part of most people's brains that says "Oh, I can get away with it for a short bit", and 99.999% of the time, that part is right. The same goes for choosing formats which are used to build things.
You start off with a 'this is a fast checkpoint method' and then at the next deadline/problem cycle you go from 'oh, let's just take that checkpoint and see if we can restart this' to 'oh, we can send you the checkpoint' to 'everyone is sending everyone pickles, so it's a standard.'

In the end, this is where standards get forced into being. People blow up enough towns around factories and eventually someone makes an ISO about valve safety.

Insecurity and Python pickles

Posted Mar 14, 2024 16:30 UTC (Thu) by farnz (subscriber, #17727)

> You start off with a 'this is a fast checkpoint method' and then at the next deadline/problem cycle you go from 'oh, let's just take that checkpoint and see if we can restart this' to 'oh, we can send you the checkpoint' to 'everyone is sending everyone pickles, so it's a standard.'

This sort of creep in scope is also problematic by itself, even without "choosing convenience", because you change the context around.

The first developer, writing the "fast checkpoint" method, probably had a context of "and anyone who can tamper with the checkpoint can attach a debugger to the running code and replace it".

The second set were aware of that context, but didn't see it as a problem if you're using it as a suspend work + resume later mechanism - after all, if you could suspend work and resume it later on the same system, you can attach the debugger instead of suspending, changing the checkpoint, and resuming.

And the third set, who send checkpoints, aren't aware that the first developer made their choices in the context of "checkpoints are equivalent in threat to a debugger attached to the process loading them, and that's OK", so they simply adopt the existing checkpoint format, because surely it's secure enough, right?

Insecurity and Python pickles

Posted Mar 14, 2024 17:18 UTC (Thu) by adobriyan (subscriber, #30858)

This is easily fixable by casting a spell called "Criminal Negligence", readily available in most countries.

Whoever enables eval() equivalent by default goes to prison.

All it would take is 1 sacrificial cow.

I'd suggest the Microsoft PM who authorized Outlook executing attachments by default, which clearly caused so much damage in the 90s and 2000s.
But the statute of limitations may have expired for the man.

Developers, both proprietary and open source, would probably revolt (they won't).
It is important to absorb all the feedback from the constituents but ignore all their whining in the end,
which politicians can be very good at.

Insecurity and Python pickles

Posted Mar 14, 2024 18:28 UTC (Thu) by pizza (subscriber, #46)

> This is easily fixable by casting a spell called "Criminal Negligence", readily available in most countries.

Except that criminal laws are not retroactive.

> Whoever enables eval() equivalent by default goes to prison.

Ok, so users have to change the default setting to achieve common legitimate use cases.

Or they're prompted "do <potentially dangerous thing> Y/N?" so often that they automatically say "Yes" without thinking about it any more.

....That which makes computers useful also makes them dangerous. And the definition of each varies on an individual and/or situational basis.

Insecurity and Python pickles

Posted Mar 14, 2024 19:01 UTC (Thu) by adobriyan (subscriber, #30858)

> > This is easily fixable by casting a spell called "Criminal Negligence", readily available in most countries.
> Except that criminal laws are not retroactive.

I'm not proposing a new law. I believe all the laws are in place already; it is just that governments choose not to exercise them.

I remember the Nimda pandemic while at university. We Linux users were laughing at the Windows suckers.
But our machine was dual-booting, so we were laughing at ourselves too.

It is unthinkable how Microsoft was not crucified for those stunts. It was so easy politically.

> > Whoever enables eval() equivalent by default goes to prison.
> Ok, so users have to change the default setting to achieve common legitimate use cases.

Yes, and then it is not on the manufacturer.

> Or they're prompted "do <potentially dangerous thing> Y/N?" so often that they automatically say "Yes" without thinking about it any more.

Police officers don't even get prompted by their gun's safety lock; they just disable it and shoot the criminal if necessary.
Somehow, society lives with it, and this situation is considered OK by the general public, gun manufacturers, police, and soldiers.
Nobody is saying "hey, a police officer will have overridden the safety lock so many times in his career that he would do it without thinking".
Maybe it is time to stop saying that when talking about software.

Insecurity and Python pickles

Posted Mar 14, 2024 19:28 UTC (Thu) by pizza (subscriber, #46)

> Police officers don't even get prompted by their gun's safety lock, they just disable it and shoot the criminal if necessary.

So? Police weaponry is specifically designed to kill or otherwise incapacitate; they have no other legitimate uses. [1]

Whereas these "dangerous" software features are used as intended by untold millions of office drones every single day.

[1] They are also used to project the implicit and explicit threat of potentially lethal force should you not submit to their authority.

Insecurity and Python pickles

Posted Mar 14, 2024 20:34 UTC (Thu) by dvdeug (subscriber, #10998)

> Maybe it is time to stop saying that when talking about software.

Do you want actual security, or are you just trying to shift the blame?

Every one of us has run into an interface that demands "are you sure you want to do this?", and you hit yes without thinking because you've been asked that question over and over when you're trying to do exactly that. There is some user blame where the user understands the question (e.g. "are you sure you want to delete this file?"). However, "do I want this spreadsheet to work?" -- when 99% don't have a clue what that means and the remaining 1% _could_ spend the next hour doing a forensic examination of the guts of the spreadsheet but aren't going to waste their time without reason -- is just blame-shifting.

Even a question of "this spreadsheet is trying to write to an external file. This is a warning sign of a malicious spreadsheet; do you want it to continue?" is going to be auto-yessed by 50% of the users, and it's going to be a pain using the office spreadsheet that does do that for non-malicious purposes, because 15% of the users are going to auto-no it, even when they're prewarned. That's not completely security theater, but the fail cases on both sides make it pretty ineffective.

Insecurity and Python pickles

Posted Mar 12, 2024 21:33 UTC (Tue) by flussence (subscriber, #85566)

Maybe nVidia had a point all those years ago when they were screaming from the rooftops that WebGL was a mistake, and weren't just saying it because it exposed how embarrassingly bad their own driver code was. There should be some barrier to entry, a minimum level of competence and common sense before you can even _know about_ the big toys, let alone wield them. We've taken all the molly guards away for the sake of convenience, assuming anyone in the control room would at least be smart enough to read the warning labels, and now here we are.

To be fair to Python, the language is absolutely not at fault. Pretty much every high level scripting language has an equivalent feature and the world hasn't ended because of it... though evidently not for lack of trying.

Insecurity and Python pickles

Posted Mar 14, 2024 8:45 UTC (Thu) by taladar (subscriber, #68407)

Python, the language, is absolutely at fault here. JSON serialization is trivially easy these days in languages like Rust.

Insecurity and Python pickles

Posted Mar 14, 2024 9:44 UTC (Thu) by aragilar (subscriber, #122569)

JSON isn't exactly ideal for serialising complex data losslessly (nor is it exactly fast), both of which are key aspects for pickle (and recent versions have been leaning into making it faster by being able to pass around raw buffers).

There are numerous serialisation libraries in Python which are just as widely used as libraries like serde (and can dump JSON just as easily), but they typically are designed for web-apps, rather than multiprocessing (which is a very common use of pickles).

Insecurity and Python pickles

Posted Mar 16, 2024 16:29 UTC (Sat) by notriddle (subscriber, #130608)

> JSON serialization is trivially easy these days in languages like Rust.

Not really. You have to put derive(Serialize) on every intermediate data structure. If the library you’re using doesn’t have that, you have to write the serde impl yourself. You can’t serialize Voldemort types (closures, async blocks) at all.

That is intentional, because Serialize is an API. It’s not supposed to be there unless the library author promises not to change it. For publicly downloadable data, stable file formats are a good thing.

Pickle, in its original use as a way to checkpoint jobs, sounds less like an alternative to serde and more like “I wish I was running out of a Smalltalk image.”

Insecurity and Python pickles

Posted Mar 13, 2024 19:39 UTC (Wed) by rav (subscriber, #89256)

> Safetensors files use a JSON header to describe the contained data: the shape of each layer of the model, the numeric format used for the weights, etc. After the header, a safetensors file includes a flat byte-buffer containing the packed weights.

Such a simple format - exactly how it should be. Maybe I'm a bit biased, because I came up with the same idea independently (and in another context - astrophysics rather than ML), although mine is a multi-file format, where it sounds like safetensors is a single-file format. https://github.com/Mortal/bintable/blob/main/bintable.py

Insecurity and Python pickles

Posted Mar 13, 2024 21:35 UTC (Wed) by NYKevin (subscriber, #129325)

I'm a bit confused about why people keep reinventing the "tabular data in a binary file" wheel. Are SQLite and/or DuckDB unsatisfactory for this purpose?

Looking through your code, it appears that one of the things you're doing is moving data between the file and NumPy. I'm guessing that a major sticking point here is that SQLite provides no obvious way to efficiently move large quantities of data between itself and NumPy. Interestingly, it looks like DuckDB does actually try to support that use case to some extent.[1] But I don't know enough about your use case to say whether that's actually good enough.

[1]: https://duckdb.org/docs/api/python/overview.html and pages linked from there.

Insecurity and Python pickles

Posted Mar 13, 2024 22:05 UTC (Wed) by Wol (subscriber, #4433)

> I'm a bit confused about why people keep reinventing the "tabular data in a binary file" wheel. Are SQLite and/or DuckDB unsatisfactory for this purpose?

Maybe because a two-dimensional table is unusual, unnatural, and constricting?

Is SQLite capable of storing a 4th-normal-form structure in a single row?

Cheers,
Wol

Insecurity and Python pickles

Posted Mar 13, 2024 22:31 UTC (Wed) by NYKevin (subscriber, #129325)

SQLite can store any individual thing you want, as long as you can serialize it to bytes. Of course, a table full of TEXT objects is much less useful than a properly normalized table, but it does provide all of the supporting infrastructure (i.e. actually writing data out to files, shared and exclusive locking, write-ahead logging, etc.) for free, so it's still better than hand-rolled code.

Insecurity and Python pickles

Posted Mar 14, 2024 1:09 UTC (Thu) by intelfx (subscriber, #130118)

> shared and exclusive locking, write-ahead logging, etc.

I don't quite see how any of this is useful in the context of a data _interchange_ file format? All of these files are written exactly once and then read or distributed. If something happens during the writing of such a file, the partial result is simply discarded and recomputed, because it has no meaning.

Insecurity and Python pickles

Posted Mar 14, 2024 3:55 UTC (Thu) by NYKevin (subscriber, #129325)

There is a problem with applying YAGNI here: YAGNI is supposed to *reduce* the amount of work you have to do, not increase it.

With SQLite: import sqlite3, then write a few lines of SQL. Done.

Without SQLite: You have to write out this JSON stuff by hand, make sure your format is unambiguous, parse it back in, etc., and probably you also want to write tests for all of that functionality.
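
A minimal sketch of the SQLite route described above (the table layout and file name are invented for illustration):

    import sqlite3

    raw_bytes = b"\x00\x01\x02\x03"  # stand-in for serialized array data

    conn = sqlite3.connect("checkpoint.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blobs (name TEXT PRIMARY KEY, data BLOB)")
    # Parameterized statements; the BLOB column holds the packed bytes.
    conn.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)",
                 ("weights", raw_bytes))
    conn.commit()
    (restored,) = conn.execute(
        "SELECT data FROM blobs WHERE name = ?", ("weights",)).fetchone()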

Insecurity and Python pickles

Posted Mar 14, 2024 3:58 UTC (Thu) by NYKevin (subscriber, #129325)

(And before anyone asks: sqlite3 is a standard library module. It is already installed in every reasonably modern version of Python. You do not have to download it, take a dependency on it, or faff about with pip.)

Insecurity and Python pickles

Posted Mar 14, 2024 7:36 UTC (Thu) by Wol (subscriber, #4433)

> Without SQLite: You have to write out this JSON stuff by hand, make sure your format is unambiguous, parse it back in, etc., and probably you also want to write tests for all of that functionality.

You're assuming your JSON doesn't have a schema/definition.

There's a whole bunch of JSON-like stuff (XML/DTD, Pick/MultiValue) where having a schema is optional but enforceable.

If you *declare* that JSON/XML/MV etc without a schema is broken, then all this stuff can be automated extremely easily.

Cheers,
Wol

Insecurity and Python pickles

Posted Mar 14, 2024 9:14 UTC (Thu) by atnot (subscriber, #124910)

> With SQLite: import sqlite3, then write a few lines of SQL. Done.

That's not quite it. You need to laboriously map all of the objects you have into an SQL model first. Then learn about prepared statements, etc. if you don't already know all of this stuff, which as the average scientist you don't. That's easily a dozen lines of code.

> Without SQLite: You have to write out this JSON stuff by hand, make sure your format is unambiguous, parse it back in, etc., and probably you also want to write tests for all of that functionality.

All of this needs to be done for SQL too. You don't just magically get the valid Python objects you put in back out again. Even if you use a third-party ORM-like thing, what about third-party objects that were never intended for this? And tests are needed for all this stuff.

It's not like Rust etc., where there's a de facto standard for ser/des that everything implements; all of this is real work.

Meanwhile with pickle: You import pickle and just give it the python object you want to save and it works. One line. And it's just built into the language. Sure it's insecure, but you'll fix that maybe once this paper is out.

Insecurity and Python pickles

Posted Mar 14, 2024 10:32 UTC (Thu) by aragilar (subscriber, #122569)

It depends what you're working on/what libraries you're using, but tools like pandas make it fairly easy to dump out an sqlite file (see https://pandas.pydata.org/docs/user_guide/io.html#sql-que...). The larger python web frameworks either provide serialisation support, or recommend third-party libraries and demonstrate their use in their docs. There isn't a universal library like serde, but I personally wouldn't use serde for HPC (wrong design), so I'm not sure this is the actual reason (my expectation is that people are using notebooks, and want to pick up where they left off, and so while pickle is fine as a "dump the current state of my work to a file" tool, people then start sending this state around, and it gets embedded in workflows).

Insecurity and Python pickles

Posted Mar 14, 2024 16:58 UTC (Thu) by NYKevin (subscriber, #129325)

> That's not quite it. You need to laboriously map all of the objects you have into an SQL model first. Then learn about prepared statements, etc. if you don't already know all of this stuff, which as the average scientist you don't. That's easily a dozen lines of code.

This is a game of "don't read the thread." I made that comment in response to an assertion that some data could not be mapped into SQL because it was not 2D. In that case, you already have to turn it into bytes anyway (e.g. with numpy.ndarray.tofile() into a BytesIO object, which was already being done in the code I was commenting on in the first place). My point is that you can put metadata and other such stuff into "real" SQL columns, and store anything that doesn't easily map to SQL objects as TEXT, and then you can skip the nonsense with JSON. You have not meaningfully responded to that assertion, you've simply talked past me.

Insecurity and Python pickles

Posted Mar 14, 2024 9:28 UTC (Thu) by aragilar (subscriber, #122569)

My understanding of "tabular" implies 2D (i.e. an array of records)? In my experience, tabular/catalogue data makes sense in a database. Naturally, higher-dimensional data requires different tools (e.g. HDF5).

Insecurity and Python pickles

Posted Mar 14, 2024 16:54 UTC (Thu) by Wol (subscriber, #4433)

> My understand of "tabular" implies 2D (i.e. array of records)? In my experience, tabular/catalogue data makes sense in a database.

Actually, I would argue that tabular data does NOT make sense in a database. It starts with the very definition of data.

Relational data is defined as "coming in rows and columns". If it doesn't fit that definition, it's not data. Relational dictates what is acceptable.

My definition (the one I use in my databases) is "data is what the user gives you". I'm happy with whatever I'm given.

Now let's define metadata. I don't know what the Relational definition of metadata is, probably "it's the schema". My definition (that I use in my databases) is "metadata is data I have inferred about the data the user gave me". And one of my cardinal rules is NEVER EVER EVER MIX data and metadata in the same table !!!!!!

But that's exactly what the job of a relational data analyst is - to infer data about the user data, then promptly mix both data and metadata up in the same table. How else would you represent a list in a 2D table?

> Naturally, higher dimensional data requires different tools (e.g. HDF5).

No. Higher dimensional data requires a multi-dimensional database. Preferably with a query language that is multi-dimensional-aware.

SQL contains huge amounts of database functionality, because it was designed to query unstructured data. So the query had to be able to express the structure. Get a database where the schema can DEscribe the structure, and the query language can be simultaneously more expressive, more powerful, and simpler, because you've pushed a lot of the complexity into the database where it belongs.

SQL puts huge amounts of complexity in the wrong place. Some complexity is unavoidable, but dealing with it in the wrong place causes much AVOIDABLE complexity.

Just look back at my rants here about relational. Just don't set off another one, the regulars here will have their heads in their hands :-)

The best way to let you see roughly where I'm coming from, is I see everything similar to an XML/DTD pair. That is EASY to manipulate with automated tools. And those tools are heavily optimised for fast efficient processing. Okay, that's not an exact description of MultiValue, but it's close. Oh - and if I store one object per XML table, the tool makes it dead easy to link different objects together.

Cheers,
Wol

Insecurity and Python pickles

Posted Mar 15, 2024 8:43 UTC (Fri) by aragilar (subscriber, #122569)

I think we're using the same words to mean different things. The data I've worked with has come in two different forms:
* arrays of records (and collections of these arrays): generally having a db makes it easier and faster to do more complex queries over these vs multiple files (or a single file with multiple arrays), and formats designed for efficient use of "tabular" data (e.g. parquet) are better than random CSV/TSV.
* n-dimensional arrays: these represent images/cubes/higher moments of physical data (vs metadata), and so are different in kind to the arrays of records. This is where HDF5, netCDF, and FITS (if you're doing observational astronomy) come in.

I think the data you're talking about is more graph-like, right (and feels like the kind of thing where you want to talk about the structure of how data is related)? That feels different in kind to both of the above, and so naturally tools designed for other types of data don't match?

My understanding of ML/AI is that the data is generally pushed into one of the two bins above, but that may be a bias based on the data I encounter.

Insecurity and Python pickles

Posted Mar 15, 2024 9:52 UTC (Fri) by Wol (subscriber, #4433)

> I think we're using the same words to mean different things. The data I've worked with has come in two different forms:

No surprise ...

> * arrays of records (and collections of these arrays): generally having a db makes it easier and faster to do more complex queries over these vs multiple files (or a single file with multiple arrays), and formats designed for efficient use of "tabular" data (e.g. parquet) are better than random CSV/TSV.

So are your records one-dimensional? That makes your "arrays of records" two-dimensional - what I think of as your typical relational database table.

And what do you mean by "a complex query"? In MV that doesn't make sense. Everything that makes SQL complicated, belongs in an MV Schema - joins, case, calculations, etc etc. Pretty much ALL MV queries boil down to the equivalent of "select * from table".

> * n-dimensional arrays: this represent images/cubes/higher moments of physical data (vs metadata), and so are different in kind to the arrays of records. This is is where HDF5, netCDF, FITS (if you're doing observational astronomy) come in.

And if n=2? That's just your standard relational database aiui.

It's strange you mention astronomy. Ages back there was a shoot-out between Oracle and Cache (not sure whether it was Cache/MV). The acceptance criteria were to hit 100K inserts/hr or whatever - I don't know what these speeds are, I'm generally limited by the speed people can type. Oracle had to struggle to hit the target - all sorts of optimisations like disabling indices on insert and running an update later, etc. Cache won, went into production, and breezed through 250K within weeks ...

> I think the data you're talking about is more graph-like right (and feels like the kind of thing where you want to talk about the structure of how data is related)? That feels different in kind to both the above, and so naturally tools designed for other types of data don't match?

Graph-like? I'm not a visual person so I don't understand what you mean (and my degree is Chemistry/Medicine).

To me, I have RECORDs - which are the n-dimensional 4NF representation of an object, and the equivalent of a relational row!

I then have FILEs which are a set of RECORDS, and the equivalent of a relational table.

All the metadata your relational business analyst shoves in the data, I shove in the schema.

With the result that all the complexities of a SQL query, and all the multiple repetitions across multiple queries, just disappear because they're in the schema! (And with a simple translation layer defined in the schema, I can run SQL over my FILEs.)

I had cause to look up the "definition" of NoSQL recently. Version 1 was the name of a particular database. Version 2 was the use I make of it - defined by the MV crowd, "Not only SQL" - databases that can be queried by SQL but it's not their native language (in MV's case because it predates relational). Version 3 is the common one now, web stuff like JSON that doesn't really have a schema, and what there is is embedded with the data.

So I understand talking past each other with the same language is easy.

But all I have to do is define N as two (if my records just naturally happen to be 1NF), and I've got all the speed and power of your HDF5-whatever, operating like a relational database. But I don't have the complexity, because 90% of SQL has been moved into the database schema.

Cheers,
Wol

Insecurity and Python pickles

Posted Mar 14, 2024 7:16 UTC (Thu) by gspr (subscriber, #91542)

> I'm a bit confused about why people keep reinventing the "tabular data in a binary file" wheel. Are SQLite and/or DuckDB unsatisfactory for this purpose?

For ML model weights as discussed here, I've always been baffled that HDF5 isn't used more. It's an established, efficient and flexible standard with mature implementations in pretty much every language.

Insecurity and Python pickles

Posted Mar 14, 2024 9:09 UTC (Thu) by aragilar (subscriber, #122569)

HDF5 is used by at least some of the AI/ML projects (I'm guessing, though, that most people working in the space are fairly new and making it up as they go).

I'm not sure, though, that you could say there are multiple mature implementations. Most libraries wrap libhdf5, and those that don't tend to be pretty limited (because why reinvent the wheel). That's not saying it's a bad format, but usually using libhdf5 is more than sufficient.

ML vs HDF5

Posted Mar 17, 2024 11:25 UTC (Sun) by summentier (subscriber, #100638)

My view is certainly somewhat contrarian, but I don't think HDF5 is a good file format for scientific data.

First of all, the HDF5 spec is enormous for a file format. This is not just a set of tensors organized in a tree structure; there is all sorts of additional stuff: attributes, compression, data layout, custom data types, you name it. For this reason, I disagree that there are “mature implementations in pretty much every language”; there really is only one feature-complete implementation: libhdf5, written in C, which pretty much every other language wraps around. (Yes, there is jHDF for Java, which can only read; there is JLD2 and some Rust crates, but none of them supported the full spec last time I checked.)

Because the HDF5 spec is so large and complex, you essentially have to tool around libhdf5 or use an HDF5 viewer every time you want to look at the data – hex dumps are of no use to you. This also means that things like mem-mapping parts of a large dataset become a problem – libhdf5 to this day does not support this properly. Writing and reading HDF5 files is thus quite cumbersome and tends to be slow. Compare that with a simple binary file format like numpy's, where you simply have a text header followed by some binary data, and this becomes trivial.
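
For instance, a minimal sketch with numpy's own .npy format (the file name is invented):

    import numpy as np

    a = np.arange(6, dtype=np.float32).reshape(2, 3)
    np.save("weights.npy", a)   # short text header (dtype, shape), then raw bytes
    b = np.load("weights.npy")  # mem-mapping: np.load("weights.npy", mmap_mode="r")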

What about HDF5 as an archiving format, then? Well, OK, there is a spec, but what use is that if, say, libhdf5 ceases to be maintained? In this case, how on Earth are we going to get the data out of a compressed data set with custom data types nestled deep into a tree hierarchy without reimplementing the spec? And even then, since there is essentially only one implementation that everyone uses, we have to pray that libhdf5 actually followed the HDF5 spec ...

In summary, I consider a tarball of binary files with text headers a la numpy a vastly superior file format to HDF5. It is clear, universally understood, and easy to view and tool around. (Of course, HDF5 being not a good format does not excuse using pickle ...)

Separate process

Posted Mar 18, 2024 19:23 UTC (Mon) by jhumphries (subscriber, #129504)

What is the overhead for unpickling an ML model? If the overhead is sufficiently high, wouldn't it make sense to just isolate the unpickle code in a separate process since the context switch overhead is negligible?

Separate process

Posted Mar 26, 2024 16:27 UTC (Tue) by sammythesnake (guest, #17693)

Getting the data from the "unpickling process" to the process you want it in would involve serialising/deserialising all over again, though!

You could potentially use this to do various sanity checks/sanitisation before re-pickling for interprocess transfer, but it would probably make more sense to do that as a preprocessing step before the data gets to your code at all.

I wonder if a safe-unpickle library could be written that does some magic on the code-execution part of the unpickling process to disable access to any variables outside of the unpickled objects, and ensures the methods of the created objects match the definitions in the loaded modules. Come to think of it, why wouldn't this be part of the built-in pickle functionality already :-/

Insecurity and Python pickles

Posted Mar 19, 2024 15:15 UTC (Tue) by lobachevsky (subscriber, #121871)

I wonder why the ML community came up with safetensors when there already is the npy format that NumPy uses to save arrays for similar reasons. The header there isn't JSON, but it's broadly similar. The differences I see are that safetensors adds arbitrary key-value pairs in the header and stores multiple tensors in one file, which NumPy does by zipping up multiple npy files, though I'm not sure whether that is a strong enough motivation.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds