Pandas operations usually create a copy of the original dataframe. As some answers on SO point out, even when using inplace=True, a lot of operations still create a copy to operate on.
Now, I think I'd be called a madman if I told my colleagues that every time I want to, for example, apply +2 to a list, I copy the whole thing before doing it. Yet, it's what Pandas does. Even simple operations such as append always reallocate the whole dataframe.
Having to reallocate and copy everything on every operation seems like a very inefficient way to go about operating on any data. It also makes operating on particularly large dataframes impossible, even if they fit in your RAM.
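To make the behaviour concrete, here is a minimal sketch of what I mean. The column name a and the sample values are just placeholders, and I use pd.concat as the appending operation since that is the call I'm sure exists across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Arithmetic returns a brand-new DataFrame; the original is untouched.
result = df + 2
print(result is df)           # False
print(df["a"].tolist())       # [1, 2, 3]
print(result["a"].tolist())   # [3, 4, 5]

# Appending a row also builds a new DataFrame rather than growing df.
combined = pd.concat([df, pd.DataFrame({"a": [4]})], ignore_index=True)
print(len(df), len(combined))  # 3 4
```

So in both cases the result has to be reassigned if I want to keep it, and the original object sticks around until nothing references it.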
Furthermore, this does not seem to be a problem for Pandas developers or users, so much so that there's an open issue #16529 discussing the removal of the inplace parameter entirely, which has received mostly positive responses; some inplace parameters started getting deprecated in 1.0. It seems like I'm missing something. So, what am I missing?
What are the advantages of always copying the dataframe on operations, instead of executing them in-place whenever possible?
Note: I agree that method chaining is very neat, I use it all the time. However, I feel that "because we can method chain" is not the whole answer, since Pandas sometimes copies even in inplace=True methods, which are not meant to be chained. So, I'm looking for some other answers for why this would be a reasonable default.
inplace mentions the reason it's being removed is that it is a misnomer. It does create a copy; it just hides away the reassignment. There is almost no difference between df = df.some_operation() and df.some_operation(inplace=True).
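One way to poke at the "it does create a copy" claim is to compare the underlying NumPy buffer before and after an inplace call. This is only a sketch: sort_values stands in for some_operation, the data is made up, and whether the buffer is shared afterwards is an implementation detail that may differ between pandas versions, so I print the result rather than promise it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})
same_object = df  # keep a second reference to check object identity later

# Hold on to the array backing column "a" before the operation.
before = df["a"].to_numpy()

df.sort_values("a", inplace=True)

# Array backing column "a" after the operation.
after = df["a"].to_numpy()

# The Python object is the same one we started with...
print(df is same_object)  # True

# ...but this reports whether the data still lives in the old buffer.
shares = np.shares_memory(before, after)
print("buffer shared with the original:", shares)
```

If that last line reports no sharing, the inplace call allocated fresh storage and swapped it into the existing object, which is what the comment describes.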
There are (almost) no true inplace operations. In my opinion, this question is a great reason for removing the inplace parameter, because it makes people think they're not making copies when they are.
inplace=True" from the linked answer by cs95
df = df.some_operation() is not the same as df.some_operation(inplace=True), because with the latter, every other place that refers to the dataframe sees the change; with the former, it doesn't. Of course, the underlying buffer may or may not be re-allocated.
self of a class instance like the inplace operations do)
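The difference between reassigning and inplace=True for other references can be sketched like this. Again sort_values is just a stand-in for some_operation, and alias / alias2 are hypothetical names for "every other place the dataframe is referred to".

```python
import pandas as pd

# Case 1: inplace=True mutates the object that both names point to.
df = pd.DataFrame({"a": [3, 1, 2]})
alias = df
df.sort_values("a", inplace=True)
print(alias["a"].tolist())   # [1, 2, 3]
print(alias is df)           # True

# Case 2: reassignment rebinds only the name df2; alias2 keeps the old object.
df2 = pd.DataFrame({"a": [3, 1, 2]})
alias2 = df2
df2 = df2.sort_values("a")
print(alias2["a"].tolist())  # [3, 1, 2]
print(alias2 is df2)         # False
```

So the two spellings differ in what other holders of the dataframe observe, regardless of whether a new buffer was allocated underneath.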