Hacker News
Pandas vs. Julia – cheat sheet and comparison (datascientyst.com)
123 points by xgdgsc 11 months ago | 133 comments



The thing that keeps me coming back to Julia is the ability to pipe (or whatever you want to call it). It makes DataFrame operations a lot cleaner since I don't need to modify in place or create new DFs at intermediate steps in a process. Here's a video showing this sort of workflow in R:

https://youtu.be/W3e8qMBypSE


Julia has a pipe syntax (|>). But I think the bigger part here is the APIs built around it more generally; people are working to port tidy syntax (https://github.com/TidierOrg/Tidier.jl).


This is very similar to DataFramesMeta:

https://github.com/JuliaData/DataFramesMeta.jl


I'm still not entirely convinced that pipes aren't an anti-pattern. Absolutely an improvement over nested function calls:

a(b(c(d))) vs d |> c |> b |> a

but I'm not convinced pipes are better than more verbose code that explains each step:

step1 = c(d)

step2 = b(step1)

result = a(step2)

I've written a lot of tidy R and do understand the specific use cases where it really doesn't make sense to use the more verbose format, but generally find when I'm building complex mathematical models the verbose method is much easier to understand.
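For illustration, here's a toy Python sketch (the step names are made up) showing that the nested/piped form and the named-steps form compute the same thing, with the trade-off being exactly the one discussed above:

```python
# Three illustrative pipeline steps (hypothetical names).
def clean(xs):
    return [v for v in xs if v is not None]

def square(xs):
    return [v * v for v in xs]

def total(xs):
    return sum(xs)

data = [1, 2, None, 3]

# Nested calls: read inside-out.
nested = total(square(clean(data)))

# Verbose form: each intermediate step is named and inspectable.
step1 = clean(data)
step2 = square(step1)
result = total(step2)

assert nested == result == 14
```

The named form is easier to inspect mid-pipeline; the chained form is easier to rearrange or comment out.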


I think having intermediate variables is sort of 'littering', and requires extra work in the naming which might not be necessary. Also, with pipes, you can just take out any intermediate step by commenting out a line or deleting it. You cannot do this with your method above without then going and rewriting many different arguments. I also like piping because you can quickly increment and build a solution - quicker than naming intermediate steps anyway.


Naming intermediate steps requires some non-trivial effort. It can even distract from the main task of getting the results.

In programming, code will be read multiple times and good names help future readers. But in data science the calculation will most likely not be reused, so the effort spent naming things would be wasted.


I suggest trying to lean into it more

I suggest strictly binding output to a symbol only if it will be used in multiple places.

So when I read code and see some "intermediary" value bound, it tells me immediately "this thing will be used in several spots". Bindings thereby actually start to convey extra information.

Anyway, it's just something that's worked for me. In all other scenarios I will use threading/pipelines (maybe Clojure specific). If steps are confusing/complex then you make a local named lambda or, in the extreme case, add comments.


If nothing else, you can just pipe the code and then write comments explaining what's left after each step. But the verbose code can be substantially slower (which happens when piping can be used to perform all these operations lazily).


> The thing that keeps me coming back to Julia is the ability to pipe

> Provides link to R.

Is there an example of this in Julia? I use R now, and every time I give Julia a shot I go back to R because of the insane TTFP. I don't use anything remotely close to big data, and the 90-120s compile times just to replot my small data (using AlgebraOfGraphics.jl in a Pluto notebook) just kill me.


Did you try v1.9 or v1.10 yet? From others I'm hearing that the code caching changed Makie from about 70 seconds down to 10 in v1.9, and then the loading time improvements brought it to like 5 (unreleased of course, though v1.10 should be branching in a few weeks). Makie load times were of course one of the ones highlighted in the release notes of v1.9: https://julialang.org/blog/2023/04/julia-1.9-highlights/. So while Makie won't be "instant" by v1.10 (<1 second), it was one of the worst offenders before and has gone from "wtf" to "bad but manageable".


I haven't! I didn't realize that code caching was part of 1.9. Looks like I'll have to check it out. Thanks


I just use a simple chaining function for python like so https://sr.ht/~tpapastylianou/chain-ops-python/


A neat solution, but you can’t alter the position of the argument per function.


Of course you can. In fact I'm doing just that in two places in the example.

(Yes I know what you mean, but yes you know what I mean!)

In the end, chains are about readability and logical flow; even if you don't like pre-wrapping in more meaningfully named functionals like the example, and accept the slight readability cost of using the occasional in-spot lambda or partial, I feel that this still becomes a lot more readable than "treat this symbol unconventionally in this context as a positional placeholder" hacky syntax stuff.



Yes, but try using this and then try Julia's way. I tried this pandas implementation once and never touched it again.


In pandas you can chain commands by wrapping the whole expression in (). Personally, IMO it looks far 'cleaner' than all of the ugly %>% everywhere.


The flip side is that in pandas, chaining is less uniform because it is based on methods.

In R you can pipe a data frame into any function from any package or one you just wrote, so you use %>% for any piping that happens. In pandas, you have special pandas methods that don't need the pipe, but to pipe with any other function, you have to write .pipe.

The comparison is not really between %>% and ., it's between "you just use %>% for everything" and "you use . for a bloated, somewhat arbitrary collection of special pandas methods, and .pipe for everything else".
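As a sketch of the distinction (column names and the helper function here are made up): built-in pandas methods chain directly with `.`, an arbitrary function needs `.pipe`, and wrapping the chain in parentheses lets you put one step per line:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

def add_total(d, col="total"):
    # An ordinary function (not a DataFrame method), so it needs .pipe to chain.
    return d.assign(**{col: d["x"] + d["y"]})

out = (
    df
    .query("x > 1")        # pandas method: chains directly with .
    .pipe(add_total)       # arbitrary function: chains via .pipe
    .reset_index(drop=True)
)
# out keeps the rows with x = 2 and 3, with a total column of 22 and 33
```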


The sad thing about the conventional object-oriented programming paradigm is how it put the really cool syntactic idea of piping/chaining in the straitjacket of classes and objects.

The ability to pipe shouldn't be tied to whether a function is a method of a class.


What do you mean by wrapping the command in ()? I haven't seen this before. Do you have a link to where they mention this in the docs?


Yeah, this is basically why I keep trying and bouncing off Julia. I understand the real performance reasons why you'd choose Julia, but the syntax is the perfect distance from Python to make it extremely difficult for me: it's just close enough to get constantly confused. So if I really wanted to do much work in it I'd have to swear off Python, and I can't do that because for trivial stuff Python is more convenient.


Pretty sure DataFrames.jl isn't the fastest dataframes library out there; I think it's Polars, which is written in Rust with Python bindings. If I remember correctly the runner-up is data.table. Similarly, SQL/SQLite can often beat all of these. So switching to Julia for speed in this context may not even make sense anyway...


I agree with your conclusion but want to add that switching from Julia may not make sense either.

According to these benchmarks: https://h2oai.github.io/db-benchmark/, DF.jl is the fastest library for some things, data.table for others, polars for others. Which is fastest depends on the query and whether it takes advantage of the features/properties of each.

For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.


Indeed DataFrames.jl isn't and won't be the fastest way to do many things. It makes a lot of trade offs in performance for flexibility. The columns of the dataframe can be any indexable array, so while most examples use 64-bit floating point numbers, strings, and categorical arrays, the nice thing about DataFrames.jl is that using arbitrary precision floats, pointers to binaries, etc. are all fine inside of a DataFrame without any modification. This is compared to things like the Pandas allowed datatypes (https://pbpython.com/pandas_dtypes.html). I'm quite impressed by the DataFrames.jl developers given how they've kept it dynamic yet seem to have achieved pretty good performance. Most of it is smart use of function barriers to avoid the dynamism in the core algorithms. But from that knowledge it's very clear that systems should be able to exist that outperform it even with the same algorithms, in some cases just by tens of nanoseconds but in theory that bump is always there.

In the Julia world the one which optimizes to be fully non-dynamic is TypedTables (https://github.com/JuliaData/TypedTables.jl) where all column types are known at compile time, removing the dynamic dispatch overhead. But in Julia the minor performance gain of using TypedTables vs the major flexibility loss is the reason why you pretty much never hear about it. Probably not even worth mentioning but it's a fun tidbit.

> For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.

I would be interested to hear what about the ergonomics of data.table you find useful. If there are some ideas that would be helpful for DataFrames.jl to learn from data.table directly, I'd be happy to share them with the devs. Generally when I hear about R, people talk about the tidyverse. Tidier (https://github.com/TidierOrg/Tidier.jl) is making some big strides in bringing a tidy syntax to Julia, and I hear it has had some rapid adoption and happy users, so there are ongoing efforts to use the learnings of R APIs, but I'm not sure if anyone is looking directly at the data.table parts.


> Indeed DataFrames.jl isn't and won't be the fastest way to do many things

Agreed, and the DF.jl developers are aware and very open about this fact - the core design trades off flexibility and user friendliness over speed (while of course trying to be as performant as possible within those constraints).

One thing that hasn't been mentioned so far is InMemoryDatasets.jl, which as far as I know is the closest to polars in Julia-land in that it chooses a different point on the flexibility-performance curve more towards the performance end. It's not very widely used as far as I can tell but could be interesting for users who need more performance than DF.jl can deliver - some benchmarks from early versions suggested performance is on par with polars: https://discourse.julialang.org/t/ann-a-new-lightning-fast-p...


> Tidier

I have not tried it. I like that the project makes broadcasting invisible, I dislike that it tries to completely replicate R's semantics and Tidyverse's syntax. Two examples: firstly, the tuples vs scalars thing doesn't seem very Julia to me. Secondly, I love that DF.jl has :column_name and variable_name as separate syntax. Tidier.jl drops this convention (from what I see in the readme).

> I'm not sure if someone is looking directly at the data.table parts

I believe there was some effort to make an i-j-by syntax in Julia but it fell through or stopped getting worked on. By this syntax I mean something like:

  # An example of using i, j, and by
  @dt flights [
    carrier == "AA",
    (mean(:arr_delay), mean(:dep_delay)),
    by = (:origin, :dest, :month)]

  # An example of expressions in by
  @dt flights [_, nrows, by = (:dep_delay > 0, :arr_delay > 0)]
The idea of ijby (as I understand it) is that it has a consistent structure: row selection/filtering comes before column selection/filtering, and is optionally followed by "by" and then other keyword arguments which augment the data that the core "ij" operations act upon.

data.table also has some nifty syntax like

  data[, x := x + 1] # update in place
  data[, x := x/nrows(.SD), by = y] # .SD references the data subset currently being worked on
which make it more concise than dplyr.

The conciseness and structure that comes from data.table and its tendency to be much less code than comparable tidyverse transformations through some well-informed choices and reservations of syntax make it nicer for me to use.


> I would be interested to hear what about the ergonomics of data.table you find useful. if there are some ideas that would be helpful for DataFrames.jl to learn from data.table directly I'd be happy to share it with the devs.

Personally, my main usability gripe is that it's difficult to do row-wise transformations that try to combine multiple columns by name. I know one can do `transform(df, AsTable() => foo ∘ Tables.NamedTupleIterator)`,

but this is 1) kind of wordy and 2) can come with enormous compile times (making it unusable) for wide tables.


I really hope people don't come from R to Julia. People who use R are not good programmers, and will degrade the core of the language and its principles. It would be a shame to see the equivalent of tacking on 6 different object-oriented systems to a base language and fragmenting the community completely.


I'm not sure I'd have the same take. Yes, R as a language is kind of wonky and people who use R tend to not be good programmers. However, the APIs of some packages are designed well enough that even with all of those barriers it can still be easy to use for many scientists. I wouldn't copy the language, 6 different object systems and non-standard evaluation is weird. But there is a lot to learn from the APIs of the tidyverse and how it has somehow been able to cover for all of those shortcomings. It would be great to see those aspects with the data science libraries of the Julia language.


It might surprise you to learn that Julia actively relies on code written in/for R to perform computations. You might be surprised to find out that people who can write R can also write C++, C, and other languages of their choosing. You also might be surprised to learn that some of the most vetted statistical code exists in the R ecosystem. If I were recruiting for a niche language with a weak ecosystem, I'd personally take all the help I could get. You can learn Julia with a background in any other programming language in a few weeks... The same can't be said about martingales... But you get to choose your strategy here...


And thus we who transitioned to Julia from R and know a bit about martingales and less about programming have long been trying to degrade the core of the language and its principles by making `mean` a Base function.


R users in the form of statisticians should definitely come around to Julia. More high quality packages never hurt. But I agree with fragmentation and 'object systems', yet I don't think this is a huge danger for Julia.


duckdb's fork, updated 2023.04 (h2oai is 2021.06): https://duckdblabs.github.io/db-benchmark/

repo: https://github.com/duckdblabs/db-benchmark


I have done both complex and trivial stuff in both languages and Julia isn't more inconvenient for trivial things.


Just make sure you find the appropriate documentation, because the package has changed its syntax a lot over the past four years or so, and there are lots of tutorials, videos, and blogs that don't apply anymore.

Similarly, make sure you research the ecosystem, because everything in Julia is very fragmented; e.g. pandas.read_csv will require two or more packages in its Julia equivalent.


To be clear on this: DataFrames, like most of the Julia ecosystem, follows SemVer. DataFrames 1.0 was released over two years ago (March 2021), and the API has been stable ever since.

Furthermore, Bogumil Kaminski, one of the main developers behind DataFrames, makes sure that the DataFrames tutorials he has created here (https://github.com/bkamins/Julia-DataFrames-Tutorial) are updated on every new release.


Beside the point. Old information with the old syntax abounds on the internet. It's a common hiccup for beginners in Julia: they'd google something, try the syntax, and it errors out; they google it another way, find it was updated, try that; nope, that's outdated now too; etc. So it's worth mentioning.


I notice you coming into every single thread about Julia to criticize the language and the community. Do you have a vendetta or something?


Not a vendetta, just sick of people talking about Julia like it's this flawless thing to bring in users to make $ off of the community, only for the newcomers to enter, find problems, and get gaslit over their existence. It's a gross cycle and someone has to say something about it.


The Rust Evangelism Strike Force normalized the practice. This is what we do now.


??? What's evangelical about pointing out that anyone interested in using DataFrames.jl should be aware that there is a lot of outdated content on how to use it and to be mindful of that??? It's honest and helpful.


Back in my more perl-ish days, I recall pythonistas doing this. Sad to see little has changed.


Is it that I have done something wrong here, or is it possible the Julia community can't handle its own reality? If you reread what I wrote here, you'd actually see it's advice: advice from someone who has helped people learn Julia for years, about common stumbling points.


This is clouded by personal preference far too much. Acting as if Python is void of issues in change over time? We all know the incredible pain of trying to get some ML package written 3 months ago (let alone 3 years ago) running, and how much time is spent remaking some conda env inside a docker inside qemu inside... just to get the stupid thing to load. So don't act like Python doesn't have its problems with change over time.


No clue why python was even mentioned here. I hardly if ever use python. Strawman arguments are logical fallacies.


> Julia is very fragmented

No, it isn't. I'm using Pandas, DF.jl, and even polars at work. DF.jl is by far the best/easiest/quickest to use, as its syntax is consistent. Polars is a bit more annoying, as its syntax is further along the learning curve than I have gotten yet.

Pandas ... what to say about a library that will happily return a pd.Series in one moment, and a pd.DataFrame in another, for the same function call. This means you need extra code like

if isinstance(ret_thing, pd.Series): # then do something to coax it back to a df.

lest your actual code break.

This is of course the same language that has API differences that make no sense in, say, re.match vs re.findall vs re.search. I've been burned by all of those.
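For anyone who hasn't hit this: a small sketch of the `re` differences being referred to (the string and patterns are arbitrary):

```python
import re

s = "ab12cd34"

# re.match only matches at the *start* of the string.
assert re.match(r"\d+", s) is None
assert re.match(r"[a-z]+", s).group() == "ab"

# re.search scans the whole string for the first hit.
assert re.search(r"\d+", s).group() == "12"

# re.findall returns plain strings, not match objects...
assert re.findall(r"\d+", s) == ["12", "34"]
# ...and with a capture group it returns only the group's text.
assert re.findall(r"(\d)\d", s) == ["1", "3"]
```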

So, look, we get you hate Julia. That's fine. Go live your python life to its best. But really, stop with the misinformation/FUD. This speaks volumes about you, and tends to make the case precisely the opposite of what you think.

And yes, I use Python, Julia, C++, and many other languages in the $day_job.


Would it surprise you to know I don't use python almost ever?

Yes the ecosystem is very fragmented. It's done so by design. One of the major contributors to the language wrote a paper about it.


Any examples? I've found Julia far easier for simple things than Python. Most modern problems are mathematical in nature, and I think it's pretty objective that Julia looks closer to the mathematics. ----Some tests of very simple tasks in both languages.

At its most basic, the obvious things become different:

Let's try to get a very simple object: a 3x3 matrix of random booleans in both languages. Julia:

A = rand(Bool,3,3)

Python: No standard support for matrices. I could really do it a disservice and compare the "core language", but that's obviously stupid, so we'll bring in some external libraries to make it easier. Of many ways to skin the cat, here's one. Python:

import numpy as np; gen = np.random.default_rng(); B = gen.choice([True, False], size=(3, 3))

BTW julia has this choice function built into the command as well, so rand(["Which", "Word", "Will", "I", "get?"]) produces exactly what you'd expect.

----

Actually I can't think of any cases at all off the top of my head. Sorry, I mean my np.somenamespace.another.namespace.sparse head :) I mean, just going down the list of things that make code easier in Julia..

*Python requires a third-party library for any kind of linear algebra and matrix multiplication:*

  A = [1 4; 6 7]; B = [2; 3]; A*B

In Python this doesn't work out of the box, i.e. you literally can't even multiply a matrix! This is madness. You'll need, yet again, a third-party library.
Python doesn't have broadcasting built in. Let's apply sin to an array the Pythonic way (plus the required third-party library):

  import numpy as np
  x = np.array([1, 2, 3, 4, 5])
  y = np.sin(x)  # numpy's functions apply elementwise

Now in Julia (notice the . after sin): x = 1:5; sin.(x), or more explicitly we could write broadcast(sin, x)

Even basic string interpolation in Julia is a much nicer "trivial task" than in python alone

No special brackets, just clean "$myvar"

# Reading files is easier

Julia

    readlines("my_test.txt")
Python

open("my_test.txt").readlines()

Why require two function calls? If I call readlines on a filename, 99% of people, 99% of the time, want to read the lines of the file at that path.


It’s funny that people used to get on Python’s case for not being object oriented enough, and now we’ve come around to folks thinking Python should just throw a function for everything into the default namespace …


In Julia, with multiple dispatch, there is no problem adding more things to the global namespace. But that does not work for Python, so its developers must be very conservative with global names.


> Sorry, I mean my np.somenamespace.another.namespace.sparce head :)

I really don't understand what you are trying to complain about. Namespaces are nice. Dumping everything into the global namespace sucks.


This is a problem for non-dispatch or singular dispatch languages, it's significantly different in the context of multiple dispatch. Namespaces are, well, not bad, but sometimes they are a solution to a problem that does not necessarily exist.


Is it really that different? It's fine in languages with multiple dispatch to have its builtin functions in the default namespace, but that doesn't mean that we'd want functions for everything in the default namespace. Namespacing is still useful.

For one thing, I've found it very useful when folks use syntax like `import numpy as np`, because then when I see `np.foo` I can trace back where `foo` comes from and look up the relevant documentation.

The complaint about `open("foo.txt").readlines()` vs `readlines("foo.txt")` is a red herring IMO, because nothing stops anyone from implementing a generic `readlines()` function in Python that can take a string file name or a file handler object. It's just that nobody really cares enough to because it's a complete non-issue.
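A minimal sketch of such a generic `readlines` (the helper name and the dispatch-on-type design are just one possible approach):

```python
from pathlib import Path

def readlines(src):
    # Accept either a path (str/Path) or an already-open file object.
    if isinstance(src, (str, Path)):
        with open(src) as f:
            return f.readlines()
    return src.readlines()
```

With this, `readlines("my_test.txt")` works Julia-style, while file handles still work too.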


But Julia _does_ have namespaces: you can bring in everything a package exports with the `using` statement, or you can retrieve only the functions you need. The latter is generally done during package development, while dumping everything is for interactive use.


I have not done anything even remotely significant in Julia, but the little I played with didn't seem to indicate to me that it would be bad for trivial stuff...what trivial stuff is hard in Julia but easy in Python?


I responded to him in detail above, but honestly I can't think of many examples that answer your question. It's almost always the opposite. I like to present this as a stereotypical example of the types of differences you find between the two, IMO: generating a 3x3 matrix of random booleans.

>Julia

  A = rand(Bool,3,3)
Python: No standard support for matrices. I could really do it a disservice and compare the "core language", but that's obviously stupid, so we'll bring in some external libraries to make it easier. Of many ways to skin the cat, here's one..

>Python

   import numpy as np
   gen = np.random.default_rng()
   B = gen.choice([True, False], size=(3, 3))

I find myself just using `rand(["Msg1",.....,"MsnN"])` to get a random string often. And small things like this are why you see people say Julia is so nice to write and become so defensive of it.


I think they just mean that with trivial projects, it's not worth trying them in a new language since the performance benefits are probably going to be minimal, and wouldn't really show off Julia's strengths.


That's fair; I have done my fair share of scripting in Node.js just because I'm familiar with it and it's fast enough to do most anything.


DataFrames.jl was for me one of the primary reasons to use Julia. I personally find the syntax way easier than pandas, especially for more complex operations. I find this cheat sheet doesn't do it justice.


Tip: If you want these capabilities and your favorite language doesn't have them natively then you can consider embedding DuckDB for something comparable.


This seems very poor - the comparison is between pandas and DataFrames.jl, not Julia; syntax comparison is very surface-level; cheatsheats are low resolution; the learning curve section says nothing about the learning curve; and the conclusion is "do whatever you like".


Well, Pandas is a framework, not a language, so it only makes sense to compare it to DataFrames.jl and not to Julia as a language.

But I agree this should have been reflected in the title of the article.


I both agree and disagree. It does look weird as library vs language. However at the same time, in Pandas everything tends to be in the Pandas library, whereas when using DataFrames.jl you tend to mix it with a lot of features that are external. Most of the calls just use overloaded functions from Julia's Base library (mean, first, last, findall). The Pandas model is to look at the Pandas docs and find the Pandas dataframe function that does your job. The DataFrames.jl model is to do whatever you would have done normally in Julia, like use the sort function, but now just use it on a DataFrame. The idea of DataFrames.jl is that you know the language and so it extends/adds as few functions as possible (joins, groupby, split-apply-combine I think are it?). This plus many other calls use functions from the more general Julia data science ecosystem (CSV.jl, JSON.jl, ...). So the title ends up being a bit apples and oranges, but the usage is also quite apples and oranges and the cheat sheet does accurately reflect that.


But there's CSV.jl so I didn't change title. I don't see any low resolution, maybe a font choice issue? I'd say the conclusion is the right thing to say for such a short comparison.


You have many errors in the tables, for example the pandas indexing is obviously wrong.


It's not my website.


Sorry, thought you were the author.


The cheatsheet goes wrong already for the first example of declaring a df:

- you could do a range in python (range(11, 14))
- columns are called col_1 & col_2 vs a & b (both sets are horrible names)
- pandas defines an index of 0, 1, 3, while Julia would most likely have 0, 1, 2?


Also df.loc[1:3, :] doesn't get the first N rows. First because of 0-indexing, second because when your index isn't ordered integers, you'll get completely unexpected results with .loc.
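A small sketch of that pitfall, using the cheat sheet's index=[0, 1, 3] (the column name here is assumed):

```python
import pandas as pd

df = pd.DataFrame({"col_1": [11, 12, 13]}, index=[0, 1, 3])

# .loc slices by *label*, inclusive of both endpoints:
sel = df.loc[1:3]         # the rows labelled 1 and 3 -- two rows, not "rows 1 to 3"

# For the first N rows regardless of labels, positional .iloc is what you want:
first_two = df.iloc[0:2]  # the rows labelled 0 and 1
```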


What do you mean? Julia is 1-indexed.


Ok, in that case Python would be 0, 1, 3 and Julia 1, 2, 3. My point is that the example explicitly skips an index in the definition of a data frame for Python, but it doesn't for Julia.


Surely Python would not skip the 2.


The code snippet explicitly defines the index to be 0, 1, 3. It may just be a typo, but if even simple examples are written sloppily, how can I trust the sheet:

  df = pd.DataFrame({'col_1': [11, 12, 13], 'col_2': [21, 22, 23]}, index=[0, 1, 3])


I've just started getting into Julia for one of its best use cases: it's super easy to do arbitrary-precision math. But you have to be very careful when using literals with BigInt or BigFloat:

  julia> setprecision(1024)
  julia> a=BigFloat(1.0E-300)
  1.000000000000000025059091835208759685696146807703705249925342319900466043184051484676302812181950100894962306270278254148910311464998804130812246091606190182719426627934584275510414782787015070222639260603793613924359775094030143866141479125513590882591017341692222921220404918621822029155619541859418525883262e-300
Notice that without quotes on the literal you only get ~15 decimal digits of precision, because the parser treats the literal as a double and then passes that to the BigFloat constructor.

  julia> a=BigFloat("1.0E-300")
  9.999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999988e-301
With quotes we get the full ~308 decimal digits of precision for the configured 1024-bit binary precision.

Now we can add it to 1.0 to validate the precision of a calculation and use the @printf macro for C-style formatting to round the output to 308 decimal digits:

  julia> b=BigFloat("1.0")
  julia> using Printf
  julia> @printf("%.308f\n", (a+b))
  1.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000
I'm not sure why this is the default behavior, it seems like a really easy way for people to screw up their calculations, especially scientists that don't do a lot of programming.
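For what it's worth, Python's decimal module has exactly the same trap, for the same reason: the literal is rounded to a binary double before the constructor ever sees it.

```python
from decimal import Decimal

# Float literal: already rounded to the nearest double by the parser.
a = Decimal(0.1)
# String: parsed exactly by Decimal itself.
b = Decimal("0.1")

assert a != b
assert str(b) == "0.1"
# str(a) begins "0.1000000000000000055511151231257827..."
```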


I actually find the string macro syntax even more convenient: `big"1e-300"`.


> I'm not sure why this is the default behavior,

Um. You said the answer earlier:

> because the parser treats the literal as a double

BigFloats aren't built in to the syntax of the language (and probably shouldn't be), so you need to escape the parser somehow: either pass a string to the constructor, or use the @big_str macro to get a non-standard string literal parsed into a BigFloat.


This makes perfect sense from the perspective of a language designer/computer scientist who is trying to keep their design clean and consistent.

It makes no sense to an end user that expects an argument you pass to BigFloat to be treated as a BigFloat. As an end-user I would rather have a warning or even error than to have my argument silently treated as a double.


It's a tricky case because they do provide the `big""` macros for literals, and mention in BigFloat's docs that:

      BigFloat(x::AbstractString) is identical to parse. This is provided for convenience since decimal literals are converted to
      Float64 when parsed, so BigFloat(2.1) may not yield what you expect.

      ...
      Examples
      ≡≡≡≡≡≡≡≡≡≡

      julia> BigFloat(2.1) # 2.1 here is a Float64
      2.100000000000000088817841970012523233890533447265625
      
However, saying RTFM is not a solution, especially for not-too-frequent parts of the language like BigFloats. It's still a trap many people are going to fall for.

The solution here is a good linter though, not adding more work to the already overstressed compiler. It comes back to the issue of Julia needing more mature, easy-to-work-with tooling, that could say "hey, this is technically allowed, but you probably didn't mean this".


The argument isn't silently treated as a double, it is explicitly and loudly treated as a double, because it is a literal double.

And this is not an advantage to the designer exclusively, it is very much an advantage for the end user that the treatment is explicit, consistent and predictable, instead of 'magically' reinterpreting the meaning of literals based on guessing the intent of the user.

Basically, you seem to be saying that when passing x to BigFloat, x should not be treated as the value x, but as some nearby value that might be the one the caller intended (based on some rounding logic perhaps?) Or are you perhaps saying that

    x = 1e-300
    y = BigFloat(x)
should be different from

    y = BigFloat(1e-300)
? In other words, completely discarding referential transparency?


I think the argument here is that parsing of the string '1e-300' maybe should be context dependent. In this case, 1e-300 is being parsed as a double, and then forwarded to the function. Maybe it could be parsed as a bigfloat whenever it is an argument of a function expecting a bigfloat.


Yes, I got that argument, and that is exactly what I was arguing against. You cannot and should not parse the literal double `1e-300` differently dependent on which function it is later passed to. This is what `big"1e-300"` or `BigFloat("1e-300")` is for, where the BigFloat constructor parses the string.
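The same distinction exists in Python's `decimal` module, which also keeps "parse the literal as a double" separate from "parse the string exactly", and preserves referential transparency:

```python
from decimal import Decimal

x = 1e-300
# Referential transparency: passing the variable and passing the
# literal must give the same value, since both are the same double.
assert Decimal(x) == Decimal(1e-300)

# The string form is parsed exactly, and differs from the double,
# because 10^-300 is not representable in binary floating point.
print(Decimal("1e-300"))                     # 1E-300
print(Decimal(1e-300) == Decimal("1e-300"))  # False
```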


Julia does this for exponentiation, i.e. a literal exponent is parsed differently than exponentiation by a variable: `2^-1` works and returns 0.5, but `n = -1; 2^n` throws a DomainError.

I think it was a mistake. Invariably, a new user discovers the discrepancy and is thoroughly confused. Let's not repeat the same mistake with bignums.


I don't think this has anything to do with whether 1e-300 and 10.0^-300 is parsed differently (and perhaps that is a mistake). The poster seems to want to parse 1e-300 directly as a BigFloat in the call `BigFloat(1e-300)`, because of the function it is passed to.


Yeah what I mean is - we already have "magic parsing" with literal_pow, and it's confusing and unnecessary. It was a mistake. We should not make the same mistake when parsing float literals


The site's cookie banner doesn't give an option to reject tracking cookies. Isn't that illegal?


Anxiously waiting for pandas to get its mojo. If there is any Python library that needs it, this is it.


I wish Python would catch up to Julia in performance. There's no sense rewriting a trillion lines of code when the syntax & ecosystem are already really pleasant.

But this is a language flamewar thing, probably not a constructive comment, sorry.


You can just use Polars[0] instead of Pandas and easily beat both Pandas and DataFrames.jl

Pure Julia is faster than pure Python, but the Python ecosystem has non-pure tools available for a ton of things.

[0] -- https://www.pola.rs/


Polars definitely doesn't "easily" beat DF.jl on all tasks.

Yes, I agree, on average Polars is a bit faster for many simple workflows, but I certainly don't think that's unconditionally true. It's especially less true when you want to do something out of the ordinary with your series: in Julia it's trivial to extract it as a vector and loop over it (fast!), whereas in Polars you'd have to make sure your function can be appropriately vectorized.


It's strange, I would think that Julia's multiple dispatch would make something more like this desirable:

    df = DataFrame(CSV(File("name.csv")))
    data = JSON(File("name.json"))

instead of the usual hodgepodge of methods:

    df = CSV.read("file.csv", DataFrame)
    data = JSON.parsefile("file.json")


Actually, that works too :) I think `DataFrame(CSV.File("name.csv"))` is what you're looking for


Unreadable on mobile


You do a lot of software development on mobile?


Look up the termux community. There are actually people in third world countries who are learning programming using their phones. I've seen awesome builds which use Android phones as their primary CPU unit and cobble on scavenged monitors and keyboards and mice. It's honestly fucking awesome


How’s that related to the article not being mobile friendly?


It's a reference sheet for a programming language. Like it's clearly designed to be read as you're writing code. I don't understand in which circumstance you'd optimize such a document for mobile use.


You have never read documentation on something you’re not currently writing code with _right_ _now_? Or looked at documentation for something you worked on in the day as you’re on your way home from work?

I opened it on mobile because I was interested in seeing how they differ, even though I’m not using either right now.


Well, I typically don't read a lot on mobile. Screen is too small and you can't block ads, which makes the screen even smaller.


Bet your code always works on your machine too ;)


For what it's worth this is literally how the Julia community deals with feedback.


Well it's not my website. Although the feedback does seem very "this toaster doesn't work when I try to cook soup in it! 1/5"

Not everything needs to cater to mobile.


It seems more like you're saying "It's ok that your toaster doesn't cook pop-tarts, you shouldn't be having those anyway--the sugar's bad for you." Your way of doing things might not be everybody's.

Nothing has to cater to anything but I'm still allowed to be annoyed that I can't read it on mobile.


Ah, it's a post mentioning Julia, and there you are.


What can I say, I am a reliable person trying to make sure fewer people get sucked into Julia without learning of its downsides. To be fair, the same could be said to you ;).


What's that got to do with mobile phones?


You totally can block ads.


I sometimes read the docs for something while on the train, and in those cases, it’s through the mobile phone.

It happens when I cannot stop thinking of a problem and really want to find a solution.


We should nurture more accessibility, in this case, mobile compatibility.

For instance, consider someone who has limited access to desktop computers and has to get by with a mobile device. These individuals do exist, and their access is as legitimate as anyone else's.


Mobile compatible websites are strictly worse though. It's why almost all desktop websites are just a hideous jumble of boxes these days.

It's virtually impossible to make a website that is well designed on both desktop and mobile. As long as the affordances of mouse+keyboard and touchscreen are as different as they are, one of the user groups needs will suffer a detrimental compromise.


An empty HTML page works on both. Add some margin and spacing and it keeps working.

After that, what you can do is ruin it by targeting the page at a specific screen size.


Right, but if you aren't able to make assumptions about screen size, you can't use for example a table, which is very useful if you want to convey data.

A phone screen just isn't big enough to display tables with more than two or three columns. There's no reason desktop users should need to be crippled in this fashion.

Tables are a very powerful tool for conveying lots of structured data in a way that's useful and intuitive. Avoiding them means losing out on this.

This is the exact problem the website we're discussing is having. You just can't show the relatively small amount of information side-by-side on mobile the way they are trying to. The only solution to it is to make it less intuitive and show them on top of each other in a complete jumble.


I guess you could find the cheat sheet useful on a phone/tablet while you are learning the correspondences.


Nah, I'll do it with SQL


One nice thing about solutions like pandas or Julia is that they’re much easier to write tests for or otherwise validate. I can’t tell you how many times I’ve been handed a big ball of SQL which doesn’t behave like its author thinks it does, diverging in subtle or not-so-subtle ways.
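As a toy illustration (the names and logic here are invented), a transformation written as an ordinary function gets a one-line regression test, which is much harder to bolt onto an equivalent windowed SQL query:

```python
# Hypothetical example: assign session ids to event timestamps,
# starting a new session when the gap between events exceeds `gap`.
def sessionize(times, gap=30):
    out, session, last = [], 0, None
    for t in sorted(times):
        if last is not None and t - last > gap:
            session += 1
        out.append((t, session))
        last = t
    return out

# A plain assertion pins down the intended behavior.
assert sessionize([0, 10, 100, 110]) == [(0, 0), (10, 0), (100, 1), (110, 1)]
```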


ibis in Python is a really nice middle ground: a nice API in a real programming language, but it executes on database backends (which could be Polars or DuckDB on in-memory Arrow tables).


I agree, and interestingly so does the author of the article so it's a bit weird that you're receiving downvotes.

Julia's DataFrames library is more consistent than Pandas by a mile, but it's still a bit weird.

JuliaDB's IndexedTable and NDTable was a really awesome API design, it's quite a pity that JuliaDB is now unmaintained. :(


Same here. But have you tried duckdb? You can do sql in the pandas dfs and it is fast af.

https://duckdb.org/2021/05/14/sql-on-pandas.html


  mydf = pd.DataFrame({'a' : [1, 2, 3]})
  print(duckdb.query("SELECT SUM(a) FROM mydf").to_df())
I can see the appeal, but if you're working in Python, something doesn't sit right with me when having to write out variable names as strings. E.g., if I want to refactor the code, my LSP or parser won't pick up those references.

> The SQL table name mydf is interpreted as the local Python variable mydf [...] Not only is this process painless, it is highly efficient.

It might be painless and convenient at first, but I feel like this could get you in trouble down the line. Is there a way to avoid this?



Duckdb is sick. You can also do queries on parquet, etc.


SQLite is often much faster than DataFrames.jl and pandas.


I highly doubt that SQLite is faster than pandas, let alone dataframes.jl, for analytical workloads.


Might surprise you how many people are using pandas or dataframes for OLTP on a daily basis because they don't know better.


SQLite is good for a bunch of stuff, but it's terrible for analytic workloads. Not even in the same ballpark as pandas, let alone Julia.


Same here, I don't get the point of this other than "don't want to learn SQL".


A lot of data scientists don't know SQL and don't understand why people use it. That said, there are cases where in memory workloads and certain manipulations are less encumbered by dataframes APIs. But in a lot of cases... They get abused by people who really do need to push themselves a little bit to learn something new.


You use both. Once your data fits comfortably in memory it's naive to try to build histograms, pivots and charts using pure SQL.
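A stdlib-only sketch of that split (toy data): let SQL do the fetching, then do the binning in Python, where a histogram is a one-liner instead of a CASE/GROUP BY contortion:

```python
import sqlite3
from collections import Counter

# Toy in-memory table standing in for a real database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (value REAL)")
con.executemany("INSERT INTO events VALUES (?)",
                [(v,) for v in [0.1, 0.4, 0.5, 0.9, 1.7, 2.2]])

# SQL pulls the rows; Python bins them by integer part.
values = [row[0] for row in con.execute("SELECT value FROM events")]
hist = Counter(int(v) for v in values)
print(dict(hist))  # {0: 4, 1: 1, 2: 1}
```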


You're saying all Pandas usage (an incredibly popular library) is because people don't want to use SQL?


It's a broken argument with some truth to it. I.e., you can run a SQL query and put the result in a dataframe to dump it to an interchange format. But in the same breath: if you learn to use SQL, a great deal of the workloads people reach for dataframe APIs for simply disappear, and learning a new language just to use a new dataframe API isn't really worth it.


As far as I am aware, plenty of use cases can also be done via OLAP.


Thanks, but I'm sticking with R.


Apples vs Oranges


Which are both fruit!



