Statistics with Julia [pdf]

(people.smp.uq.edu.au)

470 points by aapeli 6y ago | 131 comments

I'd really recommend anyone doing mildly numerical / data-ey work in python to give Julia a patient and fair try.

I think the language is really solidly designed, and gives you ridiculously more power AND productivity than python for a whole range of workloads. There are of course issues, but even in the short time I've been following & using the language these are being rapidly addressed. In particular: generally less rich system of libraries (but some Julia libraries are state of the art across all languages, mainly due to easy metaprogramming and multiple dispatch) + generally slow compile times (but this is improving rapidly with caching etc). I would also note that you often don't really need as many "libraries" as you do in python or R, since you can typically just write down the code you want to write, rather than being forced to find a library that wraps a C/C++ implementation like in python/r.

opportune 6y ago

>you can typically just write down the code you want to write, rather than being forced to find a library that wraps a C/C++ implementation like in python/r.

I don't think this is really a feature. It's nice that you can write more performant code in Julia directly and don't need to wrap lower level languages, without question, but the lack of libraries or library features is not a good thing. It's always better to use a general purpose library that's been battle tested than to write your own numerical mathematics code (because bugs in numerical code can take a long time to get noticed).

For specialized scientific computing applications, which would normally be written in C/C++, I would absolutely look into using Julia instead (though not sure what the openmp/mpi support is like). But I would also recommend against rolling your own numerical software unless you need to.

jjoonathan 6y ago

I don't just think it's a feature, I think it's a killer feature.

You are much less likely to reinvent the wheel if you can add your one critical niche feature / bugfix to an existing library. In python, learning C and C build systems and python's C API are gigantic barriers to doing that.

More importantly, if every fast data manipulation needs to be written in C, a few of them can be profitably shared, but you need more than a few of them. Pretty soon you wind up with a giant dumping ground of undiscoverable API bloat. See: pandas.

tomrod 6y ago

Maybe I don't understand what API bloat is in this context -- can you give some more detail regarding your thoughts on pandas?

ChrisRackauckas 6y ago

While Python has good libraries in general computing, and it has good ML libraries, it's really lacking in scientific computing (numerical linear algebra, differential equations, etc.). For example, what's a Newton-Krylov IMEX integrator in Python? Boundary value DAEs? I know of libraries for these things in Fortran, C++, and Julia... but not Python. It's also well-known that Python lacks a lot of the statistics libraries of R. When you chart it out, Python tends to just have the bare minimum of support in every area (except ML, it has good ML libraries), which if it's what you need, great! But...

cauthon 6y ago

Are all the plotting/visualization options still half baked?

spacedome 6y ago

I've found Plots.jl and PyPlot.jl to work well for most basic things, though they are not always pleasant to use (the compilation time issue, for example, which should hopefully improve). The only real problem I had is that they are not quite sufficient for plots to be published in a paper: many visual tweaks you might want are broken or terribly documented, so I end up using matplotlib or R. They are generally great for Jupyter notebooks though. I see the current deficiencies as highlighting just how much work went into matplotlib and others to get where they are today (and even mpl is in some ways still lacking, for example 3D surfaces and meshes). It is unfortunate though, as plotting is a core functionality for their main target of computational science. But to answer your question, mostly yes. Everything seems to be slowly improving though.

kmundnic 6y ago

No matter the tool I use these days for plotting, I export it as a .tex file to use PGFPlots. matlab2tikz, matplotlib2tikz, and the savefig function in Plots.jl (with the pgfplots backend) all do the job. This way you can tweak the figure in the final document, which I prefer. You can adjust all of the properties of the plot in LaTeX.

3jckd 6y ago

Yes, they are. Slow and hardly as expressive or rich as python/r counterparts.

ViralBShah 6y ago

One can use matplotlib in Julia by PyCall'ing it. So it is at least as good as anything else.

jointpdf 6y ago

This looks like a good reference for the fundamentals of both statistics and Julia, as claimed. I have a small critique, since the authors asked for suggestions.

The format for the code samples goes like (code chunk —> output/plots —> bullet points explaining the code line-by-line). This creates a bit of a readability issue. The reader will likely follow a pattern like: (Skim past the code chunk to the explanation —> Read first bullet, referencing line X —> Go back to code to find line X, keeping the explanation in mental memory —> Read second bullet point —> ...). In other words, too much switching/scrolling between sections that can be pages apart. Look at the example on pages 185-187 to see what I mean.

I’m not sure what the optimal solution is. Adding comments in the code chunks themselves adds clutter and is probably worse (not to mention creates formatting nightmares). I think my favorite format is two columns, with the code on the left side and the explanations on the right.

Here’s what I have in mind (doesn’t work on mobile): https://allennlp.org/tutorials. Does anyone know of a solution for formatting something like this?

ynazarathy 6y ago

Thank you. Indeed not sure how to optimize it. Perhaps in the next version of the book. Note that the book is to be Springer published (once finished) - this puts some limitations as well.

Happy for more feedback (Yoni Nazarathy).

j88439h84 6y ago

I'm not sure how that allennlp site is doing it, but source is here: https://github.com/allenai/allennlp/blob/b0ea7ab6be2787495fa...

j88439h84 6y ago

Here's what they're doing: https://github.com/allenai/allennlp/blob/master/tutorials/ta...

psychometry 6y ago

Not using PDF would be a good start. Bookdown texts tend to be good for mixed code/prose sections.

jointpdf 6y ago

Yeah but PDF itself isn’t really the problem. Bookdown is nice, but if you’re using Bookdown then you’re using RMarkdown, so you can easily output the same .Rmd file as HTML/PDF/Reveal.js/EPub/etc. I’m trying to find well-executed examples or templates for what I have in mind (two column layout of code/text, maybe in landscape orientation for better spacing), but I’m drawing up blanks so far. Specifically I’m looking for either LaTeX or Reveal.js packages/templates for this.

xvilka 6y ago

Note that Julia 1.2[1] is on the verge[2] of being released. Also, it is interesting to see the list[3] of GSoC and JSoC (Julia's own Summer of Code) projects. A lot of them target ML/AI applications. Personally, I am waiting for proper GNN support[4] in FluxML, but there seems to be little interest in it.

[1] https://github.com/JuliaLang/julia/milestone/30

[2] https://discourse.julialang.org/t/julia-v1-2-0-rc2-is-now-av...

[3] https://julialang.org/blog/2019/05/jsoc19

[4] https://github.com/FluxML/Flux.jl/issues/625

caiocaiocaio 6y ago

Julia looked interesting to me, so I tried 1.0 after it came out. I have an oldish laptop (fine for my needs), and every time I tried to do seemingly anything, it spent ~5 minutes recompiling libraries or something. So I've been waiting for newer versions that hopefully stop doing that, or for me to buy a better computer.

anonova 6y ago

Yes, this is one of my problems with Julia. It seems to be optimized for long runs and REPL/notebook usage.

Take, for example, a simple program that creates a line plot (https://docs.juliaplots.org/latest/tutorial/):

    using Plots
    x = 1:10
    y = rand(10)
    plot(x, y)

After installing the package, the first run has to precompile(?), and subsequent runs use the package cache. But ~25 s to create a simple plot is incredibly slow and frustrating to work with.

    $ julia --version
    julia version 1.1.1
    $ time julia plot.jl
    julia plot.jl  73.71s user 4.45s system 110% cpu 1:11.04 total
    $ time julia plot.jl
    julia plot.jl  24.41s user 0.39s system 100% cpu 24.633 total
    $ time julia plot.jl
    julia plot.jl  23.38s user 0.36s system 100% cpu 23.519 total

improbable22 6y ago

While this probably isn't a practical way to do any real work, running it with --compile=min gives some idea what might be possible soon:

    $ julia --compile=min -e '@time (using GR; plot(rand(20)))'
      0.375836 seconds (368.83 k allocations: 20.190 MiB, 1.65% gc time)
    $ julia --compile=min -e '@time (using Plots; plot(rand(20)))'
      4.302867 seconds (6.41 M allocations: 371.485 MiB, 5.07% gc time)

ViralBShah 6y ago

The time to second plot will be a few milliseconds, in the same process - in the same Julia session. So, while the time to first plot is frustrating, it is ok if your interactive session times are longer.

Of course, we continue to work on improving compile times. About half of the time is spent in LLVM compilation, which has actually become slower over time.

tomrod 6y ago

What prevents the plot compilation from being pre-compiled at install?

jebej 6y ago

The Plots package adds a significant overhead right now. Try using PyPlot (matplotlib) directly. These days you can use exactly the same syntax (dot-call) as in Python.

    $ time julia -e "using PyPlot;x=1:10;y=rand(10);plot(x,y);"
    real    0m5.676s

SolarNet 6y ago

This is a core part of the design. It's part of why Julia is so useful for scientific computing, where one often has a large job that requires a lot of processing time, such that an intensive JIT cycle is worth it every time. Part of that cycle is the analysis that takes Python-esque code and turns it into C levels of performance.

aurelian15 6y ago

I just looked into Julia (1.1) for scientific use (simulation of very simple dynamical systems) a few days ago. I have to admit that by the end of the day I was surprisingly frustrated. I felt that type annotations were insufficient (one of the reasons to move away from Python); in particular, I didn't find a way to specify statically sized array types as you can do with Eigen, a feature that I find incredibly useful to find mistakes at compile time. Furthermore, just plotting something (using Gadfly) took about 30 seconds the first time after Julia was started and about 20 seconds every consecutive call (on a high-end workstation, mind you).

The next day I just ended up using C++/Eigen with a simple matplotlib binding [1]. The code is nearly indistinguishable from Python/Julia (except for having more verbose types where it makes sense, using "auto" otherwise), and the entire compile+run cycle takes less time for some short runs than it takes Julia to print "Hello World".

That being said, I'm not advocating for people to use C++. I would love to use Julia, and applaud the developers for their hard work and contribution to scientific computing, but as it stands right now, it doesn't seem to be the right tool for me, since I'm relying on fast editing/execution cycles.

[1] https://github.com/lava/matplotlib-cpp

ddragon 6y ago

>since I'm relying on fast editing/execution cycles

While you can't do it from the shell very well right now (rerunning the program at each step like you would with an interpreted language), that kind of fast cycle is very common in Julia development, but with a particular REPL-based workflow [1]: you use a tool like Revise.jl [2] to automatically update definitions whenever you save a file in your project (the only restriction is that it doesn't automatically update new type definitions) and interact with the program directly in the REPL. This way it will only recompile what you just altered, and it's very fast to actually run the code. Other interesting tools are Rebugger.jl (debugger for the REPL) [3] and OhMyREPL (coloring for the REPL) [4], which you can add to your startup.jl to always load them automatically.

[1] https://docs.julialang.org/en/v1/manual/workflow-tips/index....

[2] https://github.com/timholy/Revise.jl

[3] https://github.com/timholy/Rebugger.jl

[4] https://github.com/KristofferC/OhMyREPL.jl

adamnemecek 6y ago

> I didn't find a way to specify the exact size of arrays in the type as you can do with Eigen

https://github.com/JuliaArrays/StaticArrays.jl

improbable22 6y ago

The default Array does not have a size as part of its type. There's a package StaticArrays which does this, typically faster below about 100 elements. But this isn't useful for catching mistakes before you run it, obviously.

Plotting is indeed slower than ideal, have not used Gadfly but Plots is more like 15s after restarting, then 10ms each time after. GR is faster, 5s or so the first.

zmk_ 6y ago

About Gadfly. If you were plotting it in a notebook then the default is to make it an interactive figure. In that case a lot of overhead comes from the browser trying to render it. You can explicitly invoke e.g., draw(SVG(10cm,10cm), plot(...)) to make it much faster.

ChrisRackauckas 6y ago

The DiffEqTutorials has a tutorial which covers how to use statically-sized arrays to optimize the simulation of dynamical systems.

http://juliadiffeq.org/DiffEqTutorials.jl/html/introduction/...

orbifold 6y ago

Julia is what happens if you let amateurs develop a compiler. The few times I’ve tried it, it produced gigabytes worth of stuff super slowly. The majority of packages are half baked, and the only way to discover any type error is to let the program run, which, coupled with multi-method dispatch and hellishly slow compile times for trivial amounts of code, makes the whole experience super unpleasant. Modern C++ plus some Python feels more pleasant to work with (lightning fast compiles).

ddragon 6y ago

While the aggressive JIT is a core part of the current approach, it's still an implementation detail and not a property of the language design itself. Other compilation strategies are being developed, such as interpretation or less aggressive JIT for when you only want to run something simple a few times (like JuliaInterpreter.jl and the --compile=min flag), better sharing of precompiled code between sessions (like PackageCompiler.jl and its variants), and possibly even AoT with reduced functionality (which will be useful for writing Julia libs for other languages and stuff like WASM).

caiocaiocaio 6y ago

Yeah, a dev mode would be nice.

fundamental 6y ago

Sure, though it does seem like there's still work to be done on the side of decreasing package load time since parsing/compiling/etc does not necessarily need to be done the second/third/etc time you load the same package. It's gotten better with past releases, though it still seems to have a ways to go.

mlevental 6y ago

my bigger problem is how unstable all of the apis are. every single time i try to follow a guide/tutorial i get compilation errors because packages have shifted.

eigenspace 6y ago

Now that 1.0 is out, APIs have stabilized a ton, even in the package ecosystem. But depending on your stability needs, packages might still be changing too fast.

I’d say for most people, there’s so much great progress and improvements happening that the breakages are well worth it.

ChrisRackauckas 6y ago

This is a very good resource. The one thing I would ask is that I would like to see examples of using DifferentialEquations.jl when you get to the section on dynamical systems, especially when doing discrete event simulation and stochastic differential equations. I opened an issue in the repo and we can continue discussing there (I'll help write the code, I want to use this in my own class :P)!

Cybiote 6y ago

I agree it's a wonderful resource, which is exactly why I disagree with your suggestion. The book is uncommonly clear in how it explains fundamentals, and bringing in such a powerful library moves quite a bit away from that. On one hand it will no longer be just about the fundamentals of Julia, and on the other the algorithm implementations will no longer be language invariant. Losing that invariance IMO makes it less of a text on fundamentals.

ChrisRackauckas 6y ago

I would say calling an ODE solver is pretty fundamental to a lot of real scientific workflows, but I am pretty biased on that.

iamcreasy 6y ago

I do not remember using much calculus other than using it to pass college courses. Can you point me to some resources that would teach me how to use calculus (or ODEs, if that's more interesting) to solve interesting problems?

ynazarathy 6y ago

We actually use the DifferentialEquations.jl package in one of the examples: https://github.com/h-Klok/StatsWithJuliaBook/blob/master/10_...

adamnemecek 6y ago

I invite everyone to check out Julia. The language is pleasant and gets out of the way. The interop is nuts. To call, say, NumPy's fft, you just do

    using PyCall
    np = pyimport("numpy")
    np.fft.fft(rand(ComplexF64, 10))

That's it. You call it with a Julia native array, and the result is a Julia native array as well.

Same with cpp

https://github.com/JuliaInterop/Cxx.jl

Or matlab

https://github.com/JuliaInterop/MATLAB.JL

It's legit magic

fny 6y ago

How does Julia handle typing for interop?

StefanKarpinski 6y ago

If I understand your question correctly, the answer is that there are a fixed number of native types supported by Python and NumPy, all of which correspond naturally to Julia types and are converted bidirectionally by PyCall. Julia and NumPy arrays are memory-compatible and Julia knows how to handle arrays with memory allocated by other systems, so conversion back and forth between Julia arrays and NumPy arrays is zero-copy. Other types like Python dicts are proxied in Julia as special types that Julia knows how to work with as dictionaries (user-defined data types are common in Julia), while general Python objects are just proxied transparently and `obj.method` calls are passed through to the embedded Python runtime. You can even define a function object `f` in Python and call it using `f()` syntax in Julia and vice versa. It's all highly transparent and smooth.
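To make "zero-copy" concrete, here is a rough Python sketch using only the standard library's buffer protocol. It is an analogy, not PyCall's actual mechanism: `array` and `memoryview` stand in for two runtimes holding handles on the same memory, and all variable names are made up for illustration.

```python
from array import array

# One owner of a contiguous buffer of float64 values.
data = array('d', [1.0, 2.0, 3.0])

# A second handle on the same memory; creating it copies nothing.
view = memoryview(data)

# Writing through the view is visible to the owner.
view[0] = 42.0
first_after_write = data[0]

# An explicit copy, for contrast: changes here do not touch the original.
copied = array('d', data)
copied[1] = -1.0
second_in_original = data[1]
```

The point is only that sharing a buffer means both sides see the same bytes, while a conversion that copies would leave the original untouched.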

adamnemecek 6y ago

What do you mean?

bdod6 6y ago

Can someone explain how this is more powerful than a Python/R-based workflow? E.g., I currently use a combination of .ipynb notebooks, Python scripts, and RStudio, and this feels like it covers everything I need for any data science project.

jointpdf 6y ago

I think Julia has a cleaner focus on scientific and mathematical computing than either R or Python (both for performance and understanding). i.e. the language is designed in such a way that corresponds more directly to mathematical notation and ways of thinking. If you’ve been in a graduate program that’s heavily mathematical, where you spend equal time doing pen and paper proofs and hacking together simulations and such (and frantically trying to learn a language like R/MATLAB/Python while staying afloat in your courses), you’ll appreciate the advantage of this. To my eyes, Python is too verbose and “computer science-y” and R is too quirky to fulfill this niche (I say this as someone that bleeds RStudio blue, and enjoys using Python+SciPy). I don’t think Julia is aimed at garden-variety / enterprise data science workflows. Caveat—I’m not a Julia user currently, so this is sort of a hot take.

The “Ju” in Jupyter is for Julia, so it’s designed to be used as an interactive notebook language also. The Juno IDE is modeled after RStudio.

anthony_doan 6y ago

> R is too quirky to fulfill this niche

I'd like to offer a counter point or add on to this.

It's quirky enough to have many packages backed by some expert statistician.

I hope Julia gets to be successful in this regard too.

jointpdf 6y ago

The way I wrote that comes off as more dismissive than I intended. I think it’s quirky in the sense that there is a wide variance in styles of accomplishing things in (base) R, so something that appears perfectly natural to me can look foreign to someone else. I think this is partly the user base and partly the language itself, and of course the two are interdependent. To me, it’s a joy to write R code because of its flexibility and power, but I often have dreaded sharing it with others (especially as a beginner). It’s easy to look at someone else’s R scripts and think “this is horrifying”. By the way, this is referring more to scientific/statistical workflows—for more general purpose data science in R, the Tidyverse (or even just the pipe operator %>% around which the Tidyverse is built) goes a long, long ways towards helping people write expressive but readable code.

By contrast, Python feels a bit too rigid/standardized. Everyone’s code looks like it was copy+pasted from a book of truth somewhere. This is good for sharing and engineering, not as good for expressing mathematical ideas.

So whereas R has evolved organically over decades and Python is for everyone (and alternatives like MATLAB or SAS are first and foremost software for industry rather than languages), Julia seems to be thoughtfully purpose-built to be a modern language for numerical/scientific computing. It polishes off the rough edges and blends some of the best features of each language. Again, this is just an impression from someone who already thinks in R but is learning both Python/Julia.

More to your point, maybe Julia is at a stage of development where it’s good for both students (for developing computational and mathematical thinking) and experts (for slinging concise but performant code), but not yet the rank-and-file users looking to just get things done.

snicker7 6y ago

Fast for-loops, the ability to micro-optimize numerical code (skipping bounds checking in array access, SIMD optimizations), and GPU vector computing that can use the exact same code as the CPU thanks to Julia functions being highly polymorphic. Your research code is your production code.

Also the macro system allows one to define powerful DSLs (see Gen.jl for AI).

aapeli (OP) 6y ago

Accompanying code here: https://github.com/h-Klok/StatsWithJuliaBook

Merrill 6y ago

In section "1.2 Setup and Interface" there is a very short description of the REPL and how it can be downloaded from julialang.org, as well as a much longer description of JuliaBox and how Jupyter notebooks can be run from juliabox.com for free.

Although JuliaBox has been provided for free by Julia Computing, there has been discussion that this may not be possible in the future. However, Julia Computing does provide a distribution of Julia, the Juno IDE, and supported packages known as JuliaPro for free.

For new users, would the free JuliaPro distribution be a good alternative to JuliaBox and/or downloading the REPL and kernel from julialang.org?

improbable22 6y ago

No, I think you should simply download the ordinary version. Jupyter, Juno, etc. are easy enough to install locally. I forget the precise details, but I think JuliaPro comes with certain versions of packages, and it's less confusing just to get the latest of what you need (using the built-in package manager).

JuliaBox (and https://nextjournal.com/) are cloud services, but if you have a real computer and want to do this for more than a few minutes, just install it. (There's also no need for virtualenv etc.)

cwyers 6y ago

For people who have more Julia experience -- is this (thinking mainly of chapter 4) representative of how most Julia users do plotting? It looks like a lot of calling out to matplotlib via PyPlot. I know Julia has a ggplot-inspired library called Gadfly.jl, is PyPlot more commonly used?

chrispeel 6y ago

There is not yet a universally-used package for plotting. One recent tool is Makie.jl [1]. Many use Plots.jl [2] as an interface to PyPlot, GR [3], and other backends. I.e. you can change the backend with a single command.

[1] https://github.com/JuliaPlots/Makie.jl

[2] https://github.com/JuliaPlots/Plots.jl

[3] https://github.com/jheinen/GR.jl

StefanKarpinski 6y ago

Plots.jl seems to be the most popular plotting package these days: https://github.com/JuliaPlots/Plots.jl

thetwentyone 6y ago

I bounce back and forth, usually using Gadfly for most plotting but Plots.jl is convenient for some stats plots (see StatsPlots.jl, which extends Plots.jl with nice built in functions for working with stats).

dlphn___xyz 6y ago

whats the selling point with Julia? why would i use it over something like R?

cwyers 6y ago

In R, most of the high performance code isn't written in R, it's written in Fortran or C or C++ (R has really good C++ integration via Rcpp). Python has something similar. The value prop of Julia is supposed to be that you have a language flexible enough to do the high-level stuff you'd normally do in R/Python, plus the ability to write high-performance code without having to drop into another language.

I remain skeptical that this solves a lot of real-world problems (I know a lot of users of R/Python who never need to resort to writing their own C/C++ code), but that's the sales pitch.

superdimwit 6y ago

I think if you're just plugging together reasonably "vanilla" components from python / R libraries, and only using vectorised operations, those languages are fine and you can get away with using vectorised libraries wrapping C++.

Julia shines when your workloads can't be phrased by stringing together the limited set of vectorised verbs that Python / R libraries give you: anything stateful and loopy like reinforcement learning, systematic trading, Monte Carlo simulations etc. It's also useful if you really care about performance and are doing "vanilla" computations at a truly large scale. If you want to avoid copying memory (as vectorised operations tend to do), or want to tightly optimise / fuse some numerical operations, it's great.

The other issue with Python / R wrapping C++ libraries is that different libraries will generally not play well together (without coming out into Python / R space, and doing a lot of copying / allocation). This tends to encourage large monolithic C/C++ codebases like numpy and pandas, that are pretty impenetrable and difficult to extend / modify.
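As a rough illustration of what "stateful and loopy" means in this comment, here is a small Python sketch of a simulation where each step depends on the state left by the previous one. The inventory model, function name, and parameters are all made up for the example and are not from the thread.

```python
def simulate_inventory(demands, capacity=10, reorder_below=3):
    """Hypothetical stateful simulation: count how often demand exceeds stock.

    The stock level after each step feeds into the next step, so the loop
    cannot be replaced by a single elementwise operation over `demands`.
    """
    stock = capacity
    stockouts = 0
    for demand in demands:
        if demand > stock:
            # Demand cannot be met: record a stockout and empty the shelf.
            stockouts += 1
            stock = 0
        else:
            stock -= demand
        # Restock to full capacity once the level drops below the threshold.
        if stock < reorder_below:
            stock = capacity
    return stockouts


no_stockouts = simulate_inventory([1, 2, 3])
one_stockout = simulate_inventory([7, 5])
```

In pure Python a loop like this runs slowly on large inputs, which is the kind of case where people reach for C/C++ or, per the comment above, Julia.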

improbable22 6y ago

One more advantage of these libraries being written in Julia is that, if they almost do what you need but not quite, it's often pretty easy to reach inside and patch the function which needs changing. You already speak the language and don't need to stop the world to do this. The barrier to doing this to (say) numpy is just much higher.

j88439h84 6y ago

It's supposed to be faster

Buttons840 6y ago

It's a bit more nuanced than that. It's "as fast" without having to write any C.

I tried to recreate something like AlphaGo in Python using Keras, I never got the learning to work (probably because I was impatient and training on a laptop CPU), but a lot of the CPU time was simply being spent on manipulating the board state.

So I ported my "Board" object to Rust, and it was a lot faster. Things like counting liberties or removing dead stones were a lot faster, which was important.

Then I rewrote the whole thing in Julia and it was just as fast as my Python / Rust combo.

So I saw for myself that Julia does solve the two language problem. It is as pleasant to write as Python (and I like it better actually), and performed as well as Rust, based on my informal benchmarks.
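For a sense of the board-state work described above, here is a rough Python sketch of counting a group's liberties with a flood fill. It is not the commenter's code: the board representation (a list of strings with '.', 'B', 'W') and the function name are assumptions made for illustration.

```python
def count_liberties(board, row, col):
    """Count empty points adjacent to the group containing (row, col).

    `board` is a hypothetical representation: a list of equal-length strings
    where '.' is an empty point and 'B' / 'W' are stones.
    """
    color = board[row][col]
    if color == '.':
        raise ValueError("no stone at the given point")
    seen = {(row, col)}
    stack = [(row, col)]
    liberties = set()
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < len(board) and 0 <= nc < len(board[nr])):
                continue  # off the board
            if board[nr][nc] == '.':
                liberties.add((nr, nc))          # empty neighbour is a liberty
            elif board[nr][nc] == color and (nr, nc) not in seen:
                seen.add((nr, nc))               # same-colour stone joins the group
                stack.append((nr, nc))
    return len(liberties)


corner_group = count_liberties([".B.", "BB.", "..."], 0, 1)
captured = count_liberties(["BW", "W."], 0, 0)
```

A search engine calls routines like this a very large number of times, so interpreter overhead in the inner loop dominates, which matches the experience reported in the comment.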

j88439h84 6y ago

What's the nuance? It's much faster?

jbee618 6y ago

Would love to see chapter exercises to test comprehension and reinforce learning objectives.

chakerb 6y ago

I was going to ask whether there is a Kindle version of this, but then I skimmed the book, and I don't think it would be readable on a Kindle. Even if it were, the reading experience would definitely be inferior.

ynazarathy 6y ago

The book will be published by Springer (at which point the online draft will be removed).

Yoni Nazarathy.

mruts 6y ago

Julia is everything python could have been, and much more. I'm stuck with python right now as a lot of people in the data science/ML community are, but it's becoming increasingly viable to use Julia for "real" work. The Python-Julia interop story is pretty strong as well, which allows you to (somewhat) easily convert pandas/pytorch/sklearn code into Julia using Python wrappers. Julia has some unconventional things in it but they are all growing on me:

1. Indices by default start with 1. This honestly makes a ton of sense and off by one errors are less likely to happen. You have nice symmetry between the length of a collection and the last element, and in general just have to do less "+ 1" or "- 1" things in your code.

2. Native syntax for creation of matrices. Nicer and easier to use than ndarray in Python.

3. Easy one-line mathematical function definitions: f(x) = 2*x. Also being able to omit the multiplication sign (f(x) = 2x) is super nice and makes things more readable.

4. Real and powerful macros ala lisp.

5. Optional static typing. Sometimes when doing data science work static typing can get in your way (more so than for other kinds of programs), but it's useful to use most of the time.

6. A simple and easy to understand polymorphism system. Might not be structured enough for big programs, but more than suitable for Julia's niche.

Really the only thing I don't like about the language is the begin/end block syntax, but I've mentioned that before on HN and don't need to get into it again.

kbd 6y ago

I can't believe I'm jumping into the inevitable 1-based indexing discussion, but I'm surprised to see you say that one-based indexing results in "less "+ 1" or "- 1" things in your code". Most arguments I've seen come out to "it's fine" (certainly) or "it's more comfortable for mathematicians" (which I can't speak to).

Besides Dijkstra's classic paper[1] showing why 0-based indexing is superior, in practice I find myself grateful for 0-based indexing in Python because of how slices and things just work out without needing +1/-1.

I'd like to understand. Could you give an example of when 1-based indexing works out better than 0-based?

[1] http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD831.PDF

mruts 6y ago

The classic example is getting the last element of an array. With 1-based indexing the length of the array is the index of the last element. It has a nice symmetry to it.

Also I find it elegant that for 1-indexing that the start and end value for slices are both inclusive, instead of the first one being inclusive and the last being exclusive.

Also, isn’t it just weird that the index of an element is one less than its “standard” index? Like if I take the first nth elements of a list, it would stand to reason that the nth element should be the last element, right?

The reason for zero indexing is historical, related to pointer offsets. I don’t think anyone chose them to be easier for people. They just made them that way because it maps closer to how contiguous values in arrays are accessed.

Also, with 1-indexing I can multiply numbers by arrays and get reasonable offsets. 3 x 1 is three, so I would get the third element of the list. But with 0-indexing, I have 0 x 3 which gives me the same element, clearly inconsistent.

There are some good reasons for 0-indexing and I have been using it in every language for my entire career. The amount of code I’ve written in Julia is marginal compared to my 0-indexing experience, so I might be missing something.

One nice thing about 0-indexing is that I can slice a list in half with the same midpoint. For example a Python array with 10 elements:

    fst, snd = arr[0:5], arr[5:10]

A little nicer than:

    fst, snd = arr[1:5], arr[6:10]

You could have inclusive slices with 0-indexing too, but it would be inconvenient and suffer from the same problem as 1-indexing.
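A minimal sketch of the two conventions side by side, in plain Python. The `inclusive_slice` helper is hypothetical (not a Python built-in); it only emulates 1-based slicing with both ends included, as in Julia or R.

```python
arr = list(range(10, 20))  # ten elements: 10, 11, ..., 19


def inclusive_slice(seq, first, last):
    """Hypothetical helper: 1-based slice with both ends included."""
    return seq[first - 1:last]


mid = len(arr) // 2

# 0-based, half-open: the same midpoint closes one half and opens the other.
fst0, snd0 = arr[0:mid], arr[mid:len(arr)]

# 1-based, inclusive: the second half has to start at mid + 1.
fst1, snd1 = inclusive_slice(arr, 1, mid), inclusive_slice(arr, mid + 1, len(arr))

# Last element: index len(arr) - 1 when 0-based, index len(arr) when 1-based.
last0 = arr[len(arr) - 1]
last1 = inclusive_slice(arr, len(arr), len(arr))[0]
```

Both conventions produce the same halves and the same last element; the only difference is where the `+ 1` or `- 1` ends up.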

kbd6y ago

> The classic example is getting the last element of an array.

Good point, in Python I don't notice that the last element is arr[len(arr)-1] because Python provides arr[-1]. I think in general your point is that it's natural for the nth element to be arr[n].

> The reason for zero indexing is historical, related to pointer offsets.

There is that, but Dijkstra's paper makes the case from first-principles that the closed, open interval of [0,n) for sequences is the most appropriate.
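A rough sketch of the properties that argument leans on, using Python's `range`, which is already half-open. The name `half_open` is only illustrative and does not come from the paper.

```python
def half_open(lo, hi):
    """Indices lo, lo + 1, ..., hi - 1 (lower bound included, upper excluded)."""
    return list(range(lo, hi))


n = 5
whole = half_open(0, n)

# The length is just the difference of the bounds, with no +1 correction.
length_matches = len(half_open(2, 5)) == 5 - 2

# Adjacent ranges share a bound and concatenate without overlap or gap.
joined = half_open(0, 2) + half_open(2, n)

# The empty range needs no bound smaller than the start: [2, 2) is empty.
empty = half_open(2, 2)
```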

> with 1-indexing I can multiply numbers by arrays... 3 x 1 is three...

Sorry, I don't understand this. It makes sense that the point I don't understand is probably most-related to why Julia chose its indexing scheme and why Matlab et al. do the same.

> One nice thing about 0-indexing is that I can slice a list in half with the same midpoint.

Yeah, arr[0:index] + arr[index:len(arr)] is the full list. And to your point earlier ("if I take the first nth elements of a list"), len(arr[:n]) == n seems natural.

Edit: I've been trying to formalize why Python's indexing scheme, along with its negative-indexing, is optimal:

    l = ['a','b','c']
    n = len(l)
    for i in range(-n, n):
        print(l[i])

prints "a b c a b c". That code makes no reference to any bound but 'n', nor any constants (1,0) or offsets, yet it iterates over the list twice through its range (first negative indices then positive).
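One way to state the same observation, as a sketch: over the valid index range [-n, n), Python's `l[i]` agrees with `l[i % n]`. This relies on Python's `%` returning a non-negative result for a positive modulus, which is not true of every language.

```python
l = ['a', 'b', 'c']
n = len(l)

# For every valid index i in [-n, n), Python's l[i] matches l[i % n],
# because % with a positive modulus always returns a value in [0, n).
visited = [l[i] for i in range(-n, n)]
wrapped = [l[i % n] for i in range(-n, n)]
```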

jpeloquin6y ago

> Also, with 1-indexing I can multiply numbers by arrays and get reasonable offsets. 3 x 1 is three, so I would get the third element of the list. But with 0-indexing, I have 0 x 3 which gives me the same element, clearly inconsistent.

This is interesting. Suppose the task is to use this approach (index * stride) to pick every third item from a list of 9 items: [1, 2, 3, 4, 5, 6, 7, 8, 9].

With 1-indexing: Multiply the sequence of valid indices (1, 2, 3, ...) by the stride (3) and use the result to 1-index into the given list. Returns [3, 6, 9].

With 0-indexing: Multiply the sequence of valid indices (0, 1, 2, ...) by the stride (3) and use the result to 0-index into the given list. Returns [1, 4, 7].

0-indexing has the start point of the return values fixed to the origin. 1-indexing has its start point float around depending on the stride. Both work, but have different emergent properties in the given example.
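The example above can be sketched in Python. The function names are hypothetical, and the 1-based variant subtracts one only because Python lists themselves are 0-based.

```python
items = [1, 2, 3, 4, 5, 6, 7, 8, 9]
stride = 3


def pick_zero_based(seq, stride):
    """Use k * stride for k = 0, 1, 2, ... as 0-based indices."""
    return [seq[k * stride] for k in range(len(seq)) if k * stride < len(seq)]


def pick_one_based(seq, stride):
    """Use k * stride for k = 1, 2, 3, ... as 1-based indices."""
    return [
        seq[k * stride - 1]  # convert the 1-based index for a Python list
        for k in range(1, len(seq) + 1)
        if k * stride <= len(seq)
    ]


zero_based = pick_zero_based(items, stride)
one_based = pick_one_based(items, stride)
```

In slice notation these correspond to `items[::3]` and `items[2::3]`: the 0-based version always starts at the origin, while the 1-based version starts at `stride - 1`.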

eigenspace6y ago

Shouldn’t Dijkstra’s paper be your 0th reference?

nikhilsimha6y ago

^ This is why I love hackernews!

wodenokoto6y ago

Looking at teaching materials, indexing in R hardly gets a mention. Students are told they can get the first element with a[1] and the twelfth element with a[12], and if you want the 4th, 5th and 6th, you just ask for that range, a[4:6].

For Python teaching this is almost a whole chapter, with people sharing cheat sheets and building graphics to show how slicing works and whatnot. You don't see these things in R teaching materials.

I'm sure that for the implementation of algorithms, things might be easier with zero indexing, but for a user asking for element 4,5 and 6, 1-indexing is much, much easier on the user.

stabbles6y ago

In C++ you typically access arrays with unsigned integers (size_t), and a common pitfall is:

    // Pitfall: i is unsigned, so i >= 0 is always true and the loop never ends.
    for (size_t i = v.size() - 1; i >= 0; --i) {
      std::cout << i << ": " << v[i] << std::endl;
    }

To fix the infinite loop you could write:

    // Count from size() down to 1 and index with i - 1.
    for (size_t i = v.size(); i > 0; --i) {
      std::cout << i - 1 << ": " << v[i - 1] << std::endl;
    }

Neither is great. Switching to signed integers might make your compiler throw warnings at you.

However, 1-based indexing does not work out well with modular arithmetic:

    # 1 based
    v[1 + (i - 1) % v.size()]

    # 0 based
    v[i % v.size()]

There's pros and cons with both schemes.
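A small Python sketch of the two wrap-around formulas, to make the difference concrete. The helper names are made up for illustration.

```python
v = ['a', 'b', 'c']
n = len(v)


def wrap_zero_based(i, n):
    """Map any integer i onto the 0-based index range 0..n-1."""
    return i % n


def wrap_one_based(i, n):
    """Map any integer i onto the 1-based index range 1..n."""
    return 1 + (i - 1) % n


# Walk past the end of the list twice in each scheme.
zero_based = [v[wrap_zero_based(i, n)] for i in range(0, 2 * n)]
# The trailing "- 1" only converts a 1-based index for Python's 0-based list.
one_based = [v[wrap_one_based(i, n) - 1] for i in range(1, 2 * n + 1)]
```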

ddragon6y ago

Yes, and in both cases the language should provide tools so you don't have to deal directly with those edge cases. For example in Julia, for modular arithmetic with 1-based indexing there is mod1 [1], and for iterating you should use eachindex [2], which will always work for both 0- and 1-indexed arrays.

[1] https://docs.julialang.org/en/v1/base/math/#Base.mod1

[2] https://docs.julialang.org/en/v1/base/arrays/index.html#Base...

goto116y ago

I'm not really convinced by Dijkstra's paper. He is basically saying indexing from zero is more natural because if you have an array of natural numbers including zero, then the range of numbers [0..n] is denoted by the index [0..n] which is logical. With 1-indexing you have to write [1..n+1] to get the values [0..n] which is weird and ugly. Sure, but this assumes that the array in question is starting with 0 in the first place! The whole argument is begging the question.

cshenton6y ago

One reason I really like the 1 based indexing is that I can have a UInt index and 0 can act as a sentinel value. Really nice for writing things like vector embedded linked lists.
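A minimal sketch of that layout in Python. Python has no unsigned integer type, so this only mimics the idea: slot 0 is a dummy, real nodes start at index 1, and 0 means "no node". The class and method names are hypothetical.

```python
NIL = 0  # index 0 is reserved as the "no node" sentinel


class EmbeddedList:
    """Singly linked list stored in two parallel arrays, with 1-based node indices."""

    def __init__(self):
        # Slot 0 is a dummy so that real nodes start at index 1.
        self.values = [None]
        self.next = [NIL]
        self.head = NIL

    def push_front(self, value):
        """Insert a value at the front and return its node index."""
        self.values.append(value)
        self.next.append(self.head)
        self.head = len(self.values) - 1
        return self.head

    def to_list(self):
        """Follow next-links from the head until the sentinel is reached."""
        out = []
        node = self.head
        while node != NIL:
            out.append(self.values[node])
            node = self.next[node]
        return out
```

With 0-based indices the sentinel would have to be something like -1 or a maximum value, which is what rules out a plain unsigned index.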

kgwgk6y ago

> Julia is everything python could have been

The goals of Python were quite different from the goals of Julia.

mruts6y ago

I’m not sure what Python’s goals are to be honest. It seems to me that the language is outclassed in every way by better, more consistent, more powerful, and more performant languages.

Python programmers seem content implementing the same things over and over again. Like, for example, flattening a list/monad.

List of things python doesn’t have but should: pattern matching, multi-line lambdas, more data structures (look at Scala for an example of what kind of data structures a standard library should provide), real threading, options, monads, futures, better performance, and more.

kgwgk6y ago

"In a 1999 report, Van Rossum highlighted the following as his goals for Python:

It should be an easy and intuitive language, just as powerful as major competitors.

It should be open source, so anyone can contribute to its development.

Its code should be understandable as plain English.

It should be suitable for everyday tasks, allowing for short development times."

https://www.computerhistory.org/fellowawards/hall/guido-van-...

“The first sound bite I had for Python was, "Bridge the gap between the shell and C."

So I never intended Python to be the primary language for programmers, although it has become the primary language for many Python users. It was intended to be a second language for people who were already experienced programmers, as some of the early design choices reflect.”

https://www.artima.com/intv/pyscaleP.html

j88439h846y ago

Interesting list, thanks. FWIW, my view on these..

- I agree about pattern matching, that'd be nice.

- Multi-line lambdas haven't been important, but maybe I'm missing something.

- The list of data structures in the stdlib doesn't matter to me, since the 200k libraries on PyPI make up for it, and since packaging is easy nowadays with Poetry, they are as good as built-in but they get more frequent fixes and improvements than would be possible for the stdlib. Maybe there are some good side-effects of having extra types built in, based on a community of people using these types?

- Threading, I suppose, though Python isn't really the right language overall for that stuff anyway.

- Options, and monads, yeah that'd be nice.

- Futures are an idea whose time has come and gone IMO, but they're in asyncio anyway :\.

- For performance, PyPy is quite fast for many use cases.

Is there a language that has all these things built in and has a repl?

superdimwit6y ago

In my opinion, it's an unfortunate accident that Python became popular for numerical / data-ey workloads. It's good for some things, but fast low-overhead loopy code is definitely not one of them!

j88439h846y ago

CPython, yeah :\

I wish the pypy team had finished numpypy. Then fast numerical programs could be written in python instead of relying on all kinds of C extensions and stuff. Python would be great for numerics then.

j88439h846y ago

Does it have a type checker like mypy? Python's protocols (https://www.python.org/dev/peps/pep-0544/) are "structured enough for large programs".
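For anyone who hasn't used them, a small sketch of what a PEP 544 protocol looks like. The class names are invented for illustration; `Protocol` and `runtime_checkable` are in the standard `typing` module from Python 3.8.

```python
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class SupportsArea(Protocol):
    """Structural type: anything with an area() method returning a float."""

    def area(self) -> float: ...


class Square:
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        return self.side * self.side


class Label:
    def __init__(self, text: str) -> None:
        self.text = text


def total_area(shapes: Iterable[SupportsArea]) -> float:
    """Square is accepted here without inheriting from SupportsArea."""
    return sum(shape.area() for shape in shapes)
```

A static checker such as mypy should accept `total_area([Square(2.0)])` and flag `total_area([Label("x")])` before the program runs, since `Label` has no `area` method.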

mruts6y ago

Julia has static typing.

I wouldn’t go that far and say Python is suitable for large programs. It’s clearly not. Working on a large python code base is hell.

j88439h846y ago

> Julia has static typing.

Julia has its own type system, which doesn't conform to the traditional static/dynamic divide. But AFAIK it doesn't have a compile-time type checker like mypy to help me catch type errors early.

j88439h846y ago

Old Python, before mypy, attrs/dataclasses, etc., is a pain. Nowadays with modern tooling, it's terrific.

abakus6y ago

I find Julia's .> , .==, .*, ./ (dots for element-by-element ufunc)... really ugly. Numpy's design is cleaner and better.

ddragon6y ago

Why? When I see the '.' I immediately know it's a broadcasted function (for example * for matrix multiplication vs .* for the Hadamard product), and I get the vectorized version of any function I write for free with no extra boilerplate (and the compiler will even automatically fuse them together if I chain them to avoid wasting allocations). You can even customize the broadcasting and the fusion.
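To spell out the distinction for readers who don't know Julia, here is a sketch in plain Python with nested lists (no NumPy). The function names are illustrative; `matmul` stands in for what `A * B` computes on matrices in Julia and `hadamard` for `A .* B`.

```python
def matmul(a, b):
    """Matrix product of two nested-list matrices (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hadamard(a, b):
    """Element-by-element product of two equally shaped nested-list matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

matrix_product = matmul(A, B)
elementwise_product = hadamard(A, B)
```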

plouffy6y ago

Commenting to find later.

grzm6y ago

You can effectively bookmark submissions by using the "favorite" link or just upvoting. The submission will show up in your profile under "favorite submissions" or "upvoted submissions", respectively.

the_duke6y ago

In addition, I hear that modern browsers support a ground-breaking functionality called "Bookmarks".

iamcreasy6y ago

I would not rely on browser bookmarks too much. Recently I lost a large number of bookmarks when Google Chrome sync overwrote my local copy. The bookmarks.bak file was missing too.

lalaithion6y ago

To be fair, many modern social media sites break bookmarks.

6gvONxR4sf7o6y ago

Please don't do this.

tomrod6y ago

Maybe I don't understand what API bloat is in this context -- can you give some more detail regarding your thoughts on pandas?

ChrisRackauckas6y ago

cauthon6y ago

Are all the plotting/visualization options still half baked?

spacedome6y ago

kmundnic6y ago

3jckd6y ago

Yes, they are. Slow and hardly as expressive or rich as python/r counterparts.

ViralBShah6y ago

One can use matplotlib in Julia by PyCall'ing it. So it is at least as good as anything else.

jointpdf6y ago

This looks like a good reference for the fundamentals of both statistics and Julia, as claimed. I have a small critique, since the authors asked for suggestions.

Here’s what I have in mind (doesn’t work on mobile): https://allennlp.org/tutorials. Does anyone know of a solution for formatting something like this?

ynazarathy6y ago

Thank you. Indeed, I'm not sure how to optimize it. Perhaps in the next version of the book. Note that the book is to be published by Springer (once finished), which imposes some limitations as well.

Happy for more feedback (Yoni Nazarathy).

j88439h846y ago

I'm not sure how that allennlp site is doing it, but source is here: https://github.com/allenai/allennlp/blob/b0ea7ab6be2787495fa...

j88439h846y ago

Here's what they're doing: https://github.com/allenai/allennlp/blob/master/tutorials/ta...

psychometry6y ago

Not using PDF would be a good start. Bookdown texts tend to be good for mixed code/prose sections.

jointpdf6y ago

xvilka6y ago

[1] https://github.com/JuliaLang/julia/milestone/30

[2] https://discourse.julialang.org/t/julia-v1-2-0-rc2-is-now-av...

[3] https://julialang.org/blog/2019/05/jsoc19

[4] https://github.com/FluxML/Flux.jl/issues/625

caiocaiocaio6y ago

anonova6y ago

Yes, this is one of my problems with Julia. It seems to be optimized for long runs and REPL/notebook usage.

Take, for example, a simple program that creates a line plot (https://docs.juliaplots.org/latest/tutorial/):

    using Plots
    x = 1:10
    y = rand(10)
    plot(x, y)

After installing the package, the first run has to precompile(?), and subsequent runs use the package cache. But ~25 s to create a simple plot is incredibly slow and frustrating to work with.

    $ julia --version
    julia version 1.1.1
    $ time julia plot.jl
    julia plot.jl  73.71s user 4.45s system 110% cpu 1:11.04 total
    $ time julia plot.jl
    julia plot.jl  24.41s user 0.39s system 100% cpu 24.633 total
    $ time julia plot.jl
    julia plot.jl  23.38s user 0.36s system 100% cpu 23.519 total

improbable226y ago

While this probably isn't a practical way to do any real work, running it with --compile=min gives some idea what might be possible soon:

    $ julia --compile=min -e '@time (using GR; plot(rand(20)))'
      0.375836 seconds (368.83 k allocations: 20.190 MiB, 1.65% gc time)
    $ julia --compile=min -e '@time (using Plots; plot(rand(20)))'
      4.302867 seconds (6.41 M allocations: 371.485 MiB, 5.07% gc time)

ViralBShah6y ago

Of course, we continue to work on improving compile times. About half of the time is spent in LLVM compilation, which has actually become slower over time.

tomrod6y ago

What prevents the plot compilation from being pre-compiled at install?

jebej6y ago

The Plots package adds a significant overhead right now. Try using PyPlot (matplotlib) directly. These days you can use exactly the same syntax (dot-call) as in Python.

    $ time julia -e "using PyPlot;x=1:10;y=rand(10);plot(x,y);"
    real    0m5.676s