But Nim is only one of a whole suite of languages that easily cruise to a 10x performance win over Python. And that isn't counting multicore - if you count that you quickly get to a 100x improvement.
Personally I use Groovy for much of what I do for similar reasons (which is somewhat unusual), but it's just a placeholder for "use anything except Python".
From my experience in using Python at my last job, I'll also add that Python is decent at tasks that aren't CPU-bound.
I wrote a lot of scripts that polled large numbers of network devices for information and then did something with it (typically upsert the data into a database, either via direct SQL or a REST API to whatever service owns the database). All these tasks were heavily network-bound. The amount of time the CPU was doing any work was minuscule compared to the amount of time it was waiting to get data back from the network. I doubt Nim or any other language would have been a significant performance improvement in this case.
For what it's worth, that made these scripts excellent candidates for multithreading. I'd run them with 20+ threads, and it was glorious. At first I did multiprocessing, because of all the GIL horror stories, but multiprocessing made it very difficult to cache data, so eventually I said "well, all this is network-bound so the GIL doesn't even apply" and switched over to multiprocessing.dummy (which implements pools using the same API as multiprocessing but with threads instead of processes), and I never looked back.
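The pattern is tiny; here's a minimal sketch (the hostnames and the `poll_device` body are made up for illustration - `multiprocessing.dummy.Pool` is real and mirrors the `multiprocessing.Pool` API with threads):

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but threads

def poll_device(host):
    # Placeholder for the real network call (SNMP, SSH, REST, ...);
    # while a thread waits on the network, the GIL is released.
    return (host, len(host))

hosts = ["sw1.example.net", "sw2.example.net", "rtr1.example.net"]

# 20 worker threads is fine for network-bound work despite the GIL.
with Pool(20) as pool:
    results = pool.map(poll_device, hosts)  # preserves input order

print(results)
```

Because the threads share one process, any cache built in one place is visible everywhere - the thing that was painful with separate processes.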
Edit: For what it's worth, Nim sounds like a really cool language, and it's right up my alley in several ways, I just don't think Python is particularly slow at network-bound tasks that use very little CPU.
And suddenly you need to introduce quite a bit more technical complexity into this story, which is going to be hard to explain to management - all they see is that you can now insert a couple of million DB rows, and their Big Data consultants[TM] told them that nowadays this isn't even worth thinking about.
Point being: if your performance ceiling is low, you're gonna hit it sooner.
IO-bound tasks are almost by definition outside of your Python application's control. You yield control to the system to execute the actual task, and from that point on - you're no longer in control of how long the task will take to complete.
In other words, Python "being fast" by waiting on a socket to complete receiving data isn't a particularly impressive feat.
But as demonstrated, Nim is fast to write and fast to compile, so Python has little edge - just its huge ecosystem.
E.g. random example:
Sprinkle some cdefs in your Python and suddenly you're faster than C++
https://github.com/luizsol/PrimesResult
https://github.com/PlummersSoftwareLLC/Primes/blob/drag-race...
25.8 seconds down to 1.5
Still, getting Java level performance out of python is a huge improvement and should be enough for most cases.
Some may consider Jax and its XLA compiler, but unless you require gradients, numba will be significantly faster; an instance of this is available at [1].
XLA operates at a higher level than LLVM and therefore can't achieve the same optimizations as numba does using the latter. IIRC numba also has a Python-to-CUDA compiler, which is also very impressive.
[1] https://github.com/scikit-hep/iminuit/blob/develop/doc/tutor...
CPython's slowness doesn't boggle my mind at all. It's a bytecode interpreter for an incredibly dynamic language that states simplicity of implementation as a goal. I would say performance is actually pretty impressive considering all that. What _does_ boggle my mind is the performance of cutting-edge optimizing compilers like LLVM and V8!
At least there is a benefit to a simple implementation: Someone like me can dive into CPython's source and find out how things work.
No, Nim is truly among the top fastest languages when writing idiomatic code as shown in many benchmarks.
> But Nim is only one of a whole suite of languages that easily cruise to a 10x performance win over Python
...while also being very friendly to Python programmers, intuitive and expressive. Unlike many other languages.
Granted, but inside its optimised numerical science ecosystem, Python is, in fact, fast enough. If most of your program is calls into numpy, Python will get you where you need to go. In my experience, one scalar Python math operation takes about the same amount of time as the equivalent numpy operation on a million-element array. Linked against a recent libblas, numpy will even distribute work across multiple cores. So much for the GIL.
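To make the gap concrete, here's a rough sketch (the exact ratio depends on the machine and the BLAS numpy is linked against, but both functions compute the same thing):

```python
import numpy as np

xs = np.random.rand(1_000_000)

def py_sum_squares(values):
    # One interpreted bytecode dispatch per element: slow.
    total = 0.0
    for v in values:
        total += v * v
    return total

def np_sum_squares(values):
    # One Python call; the million multiply-adds run in compiled BLAS code.
    return float(np.dot(values, values))

# Same answer, wildly different runtime.
assert abs(py_sum_squares(xs) - np_sum_squares(xs)) < 1e-4
```

Time the two with `timeit` and you'll see why "is Python fast?" depends entirely on how much work each Python-level operation dispatches.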
Also, "awful" is too harsh. Probably 90% of Python code just doesn't need to be faster than it is.
Also I don't know how anyone could design a language in the 21st century and make basic mistakes like this:
> Nim treats identifiers as equal if they are the same after removing capitalization (except for the first letter) and underscore, which means that you can use whichever style you want.
If that's any indication of the sanity of the rest of Nim then I'd say steer well clear!
Nim's underlying, perhaps understated philosophy is that it lets you write code the way you want to write code. If you like snake case, use it. If you want camel case, sure. Write your code base how you want to write it, keep it internally consistent if you want, or don't. Nim doesn't really care.
(That philosophy extends far beyond naming conventions.)
What this avoids is being stuck with antiquated standard libraries that continue to do things contrary to the language's standards for the sake of backward compatibility (argh, Python!) and third-party libraries where someone chose a different standard because that's their preference (argh, Python! JavaScript! Literally every language!). Now you're stuck with screaming linters or random `# noqa` lines stuffed in your code, and that one variable that you're using from a library sticks out like a sore thumb.
Your code is inconsistent because someone else's code was inconsistent - that's simply not a problem in Nim.
Could Nim have forced everyone to snake_case naming structures for everything from the start? Well, sure, but then the people that have never actually written code in Nim would be whining about that convention instead and we'd be in the same place. After having actually used Nim, my opinion, and I would venture to say the opinion of most, is that its identity rules were a good decision for the developers who actually write Nim code.
Not entirely. Nim's benefit here is that it's superficially similar enough to Python that it's easy for people from that world to pick up and start using Nim.
> Also I don't know how anyone could design a language in the 21st century and make basic mistakes like this:

> If that's any indication of the sanity of the rest of Nim then I'd say steer well clear!
It may seem like a design mistake at first glance, but it's surprisingly useful. Its intent is to allow a given codebase to maintain a consistent style (e.g. camel vs snake case) even when making use of upstream libraries that use different styles. Not including the first letter avoids most of the annoyance of wantonly mixing all-caps constants with lower case, and linters keep teams from mismatching internal styles. Though mostly I forget it's there, as most idiomatic Nim code sticks with camel case. I'd say don't knock it until you've tried it.
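For illustration, the comparison rule can be sketched in Python (a sketch of the documented behavior, not Nim's actual implementation):

```python
def nim_ident_eq(a: str, b: str) -> bool:
    # Nim compares the first character case-sensitively; the rest is
    # compared case-insensitively with underscores removed.
    norm = lambda s: s[0] + s[1:].replace("_", "").lower()
    return norm(a) == norm(b)

assert nim_ident_eq("toLowerAscii", "to_lower_ascii")  # same identifier to Nim
assert not nim_ident_eq("Foo", "foo")                  # first letter still matters
```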
The rest of Nim’s design avoids many issues I consider actual blunders in a modern language such as Python’s treatment of if/else as statements rather than as expressions, and then adding things like the walrus operator etc to compensate.
With respect to the identifier resolution in Nim, it strikes me as more of a matter of preference. Especially given the universal function call syntax in Nim, at least it's consistent. For example, Nim treats "ATGCA".lowerCase() the same as lowercase("ATGCA"). I do appreciate the fact that you can use a chaining syntax instead of a nesting one when doing multiple function calls but this is also a matter of style more than substance.
[1] https://github.com/Benjamin-Lee/viroiddb/blob/main/scripts/c...
One of the big, big things for improving performance on DNA analysis of ANY kind is converting these large text files into binary (4 letters easily convert to a 2-bit encoding), which massively improves basically any analysis you're trying to do.
Not only does it compress your dataset (2 bits vs 16 bits), it allows absurdly faster numerical libraries to be used in lieu of string methods.
There’s no real point in showing off that a compiled language is faster at doing something the slow way…
[1] https://github.com/biocore/scikit-bio/blob/b470a55a8dfd054ae...
[2] https://en.wikipedia.org/wiki/Nucleic_acid_notation
[3] https://bioinformatics.stackexchange.com/questions/225/upper...
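A minimal sketch of that packing in Python (illustrative only - real tools do this in optimized C, and this handles plain ACGT, not the ambiguity codes from [2]):

```python
def pack_2bit(seq: str) -> bytes:
    """Pack an ACGT string into 2 bits per base, 4 bases per byte."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= code[base] << (2 * (i % 4))
    return bytes(out)

def unpack_2bit(data: bytes, n: int) -> str:
    """Inverse of pack_2bit; n is the original sequence length."""
    bases = "ACGT"
    return "".join(bases[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))

seq = "GATTACA"
packed = pack_2bit(seq)
assert len(packed) == 2                      # 7 bases in 2 bytes
assert unpack_2bit(packed, len(seq)) == seq  # round-trips losslessly
```

Once the data is packed like this, counting or comparing bases becomes integer bit-twiddling instead of string scanning.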
I’m surprised you need the full 4 bits to deal with ambiguous bases, but it probably makes sense at some lower level I don’t understand.
(As in GATTACA might be read as is, but might be read as GAT?ACA.)
Still, that's a minimum of 3 bits versus a much longer text encoding.
[Edit: I see another commenter with the same observation, more thoroughly explained!]
Because we use it as a nice syntactic frontend to numpy, a large and highly optimized library written in C++ and Fortran (sic). That is, we actually don't use "Python-native" code much, and numpy is essentially APL-like array-oriented thing where e.g. you don't normally need loops.
For native-language data processing, Python is slow; Nim or Julia would easily outperform it, while being comparably ergonomic.
The funny thing is that Nim and Julia libraries are still wrapping Fortran numerical libraries, while D beat the old and trusted Fortran library on its home turf five years back:
http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...
You say that, but Julia is rapidly acquiring native numerical libraries that outperform OpenBLAS:
https://discourse.julialang.org/t/realistically-how-close-is...
For Nim, there’s also NimTorch which is interesting in that it builds on Nim’s C++ target to generate native PyTorch code. Even Python is technically a second class citizen for the C++ code. Most ML libraries are C++ all the way down.
https://github.com/YingboMa/RecursiveFactorization.jl/pull/2...
So a stiff ODE solve is pure Julia, LU-factorizations and all. This is what allows it to outperform the common C and Fortran libraries very consistently. See https://benchmarks.sciml.ai/html/MultiLanguage/wrapper_packa... and https://benchmarks.sciml.ai/html/Bio/BCR.html
https://news.ycombinator.com/item?id=28506531 - project allows creating pythonic bindings for your nim libraries pretty easily, which can be useful if you still want to write most of your toplevel code in python, but leverage nim's speed when it matters.
If you want to make your Nim code even more "pythonic" there is https://github.com/Yardanico/nimpylib, and for calling some Python code from Nim there is https://github.com/yglukhov/nimpy
However, in any case I would never replace Python with Nim, as it is too niche a language and you would struggle with recruiting. I could consider Julia if its popularity keeps growing.
That is the ultimate challenge of a language. It either needs a large backer (Go and Google) or has to be so good that it gets natural market adoption (Julia). As a manager I am reluctant to adopt yet another language unless there is a healthy job market for it.
Not all technologies require the full cycle and the normal risk management.
with open("orthocoronavirinae.fasta") as f:
    text = ''.join(line.rstrip() for line in f.readlines() if not line.startswith('>'))
gc = text.count('G') + text.count('C')
total = len(text)
Or if you want to be explicit, this is just as fast (and might scale better for particularly long genomes):

gc = 0
total = 0
with open("orthocoronavirinae.fasta") as f:
    for line in f.readlines():
        if not line.startswith('>'):
            line = line.rstrip()
            gc += line.count('C') + line.count('G')
            total += len(line)
I didn't test Nim but the author reports Nim is 30x faster than his Python implementation, so mine would be about 3x slower than his Nim.

Yes, you can implement a faster Python version, but notice also:
* This faster version reads the whole file into memory (except comment lines). The article mentions the data being 150MB, which should fit in memory, but for larger datasets this approach would be infeasible
* The faster version is actually delegating a lot of work to Python's C internals by using text.count('G'). All the internal looping and comparison is done in C, while in the original version it goes through Python
So yes, you can definitely write faster Python by delegating most of the work to C.
The point of the article is not about how to optimize Python, but about how given almost identical implementations in Python and Nim, Nim can outperform Python by 1 or 2 orders of magnitude without resorting to use C internals for basic things like looping or comparing characters.
To make it streaming, take the second version and remove the readlines (directly iterate over f).
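A sketch of that streaming variant, shown here against a small inline sample instead of the real 150MB file (iterating over the file object directly keeps only one line in memory at a time):

```python
import io

# Stand-in for open("orthocoronavirinae.fasta"); a real file iterates the same way.
f = io.StringIO(">header\nGATTACA\nGGCC\n")

gc = 0
total = 0
for line in f:  # no readlines(): lines are consumed lazily
    if not line.startswith(">"):
        line = line.rstrip()
        gc += line.count("C") + line.count("G")
        total += len(line)

print(gc / total)
```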
Delegating work to Python's C internals is fine IMO because "batteries included" is a key feature of Python. "Nim outperforms unidiomatic Python that deliberately ignores key language features" is perhaps true, but less flashy of a headline.
And to be honest, I mainly wrote this because the other top level Python implementations for this one were terrible at the time of the post.
import io
f = io.StringIO(
"""
AB
CD
EF
GH
"""
)
total = sum(map(lambda s: 0 if s[0]==">" else s.count('G') + s.count('C'), f.readlines()))
print(total)

Your first example takes 3.1 seconds, my previous comment takes 2.3 seconds, this one takes 1.4 seconds.
import time

start = time.perf_counter()
with open("orthocoronavirinae.fasta", "rb") as f:
    total = sum(map(lambda s: 0 if s[0] == ord(">") else s.count(b"G") + s.count(b"C"), f.readlines()))
end = time.perf_counter()
print(total, " total")
print(end - start, " seconds")

In my use case, I don't really see how Nim would make my life easier right now.
The main places you find it the other way are spreadsheets and shells.
Is there an explanation from the Nim authors as to why they made such an odd choice?
The answer for the latter is programmer time, and some things can be scaled easily using `joblib` or `dask`. It isn't as trivial as importing parallel iterators in Rust and changing `.into_iter` to `.into_par_iter`, but it still takes less time, and once it is done, I don't need to think about it again.
I don't write code only for myself.
How would I convince my employer to let me use Nim instead of a better known language?
And even if I could convince my employer: if we want to start a new project, how could we find programmers well-versed in Nim?
And even if we can find those people, it would mean we would have to write many things ourselves which in other languages we can take for granted, as they have libraries for almost anything.
So having a nice, performant and good language is just a small part of achieving your goals. You also need the people and the ecosystem.
Go, Rust, Kotlin, Swift and even Julia have the luck of having some industry heavyweights behind them, pushing the ecosystem and contributing with money and developers. Nim has only a bunch of passionate people behind it.
If a programmer can't pick up a language like Nim in a few weekends (from what I gather, it's similar to Python and not much different from most common languages, i.e. not something relatively exotic like Haskell) then I don't know. Our mainly PHP shop transitioned to Go quite effortlessly. Today we hire PHP juniors without any Go experience (easier to find), we teach them, and then they work on Go codebases already after a month of internship. So lack of "professional Nim programmers" doesn't look like a problem to me.
Lack of libraries is a good point, but from what I read, Nim compiles to C, so I understand they have access to tens (hundreds?) of thousands of C libraries without writing everything from scratch.
However, indeed, if you are to choose between, for example, Nim and Go for a new project, then I am not sure why would anyone prefer Nim. I'm really interested to know.
Same here, curious to know what HN crowd recommends between Nim vs Go for new projects.
Hiring for Nim skills can be a signal that a company has people who learn languages beyond the run-of-the-mill ones. A bunch of passionate people you might say. That would make the company promising to work for.
Why the phrase "only a bunch of passionate people"? This is how software gets written, parasitical corporations and their unproductive developers who are installed in existing OSS projects come later and mainly associate themselves with the result (speaking of Python again).
Nim's easy to learn if you have experience with any compiled language and can understand anything along the lines of C#, Kotlin or Python syntax. Also, because it compiles to C and JS, it's easy to add to a project incrementally in many cases.
This is a rephrasing of "nobody ever got fired for buying IBM".
Some organization prioritize innovation and technical acumen.
> So having a nice, performant and good language is just a small part of achieving your goals. You also need the people and the ecosystem.
Many applications don't need a large ecosystem. People can learn.
> Go, Rust, Kotlin, Swift and even Julia have the luck of having some industry heavyweights behind them
Python was never corporate-driven, thankfully, and it is successful.
That's horrifyingly slow for a compiler. The author mentions "modern languages look like Python but run as fast as C", which is a common promise those languages make that never really materializes except for a few happy-path cases the language was heavily optimised for. Julia, for example, makes this promise too, but compiles even slower than that and takes ridiculous amounts of RAM even for hello world.
Did the author post the data set they used for the examples? Would be nice to try it out on a few languages to see how fast that can compile and run on a mature language like Common Lisp (which is just as easy to write) or even node.js.
Nim's advantage is that it uses a good old C compiler for the backend (which has been hyperoptimized for decades), but the frontend (transpiler) is also pretty fast. Nim's compilation speed should improve a bit when incremental compilation support is added (which would probably solve a lot of other current issues for Nim, for example better IDE tooling)
[1] https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType...
Here's a comparison with Common Lisp:
~/fasta-dna $ time python3 run.py
0.3797277865097147
21.828 secs
~/fasta-dna $ time sbcl --script run.lisp
0.37972778
2.415 secs
~/fasta-dna $ ls -al nc_045512.2.fasta
-rw-r--r-- 1 156095639 2021-09-25 11:15 nc_045512.2.fasta
So, almost as fast as Nim (the time includes compilation time)?
Here's the Common Lisp code:
(with-open-file (in "nc_045512.2.fasta")
(loop for line = (read-line in nil)
while line
with gc = 0 with total = 0 do
(unless (eql (aref line 0) #\>)
(loop for i from 0 below (length line)
for ch = (char line i) do
(setf total (1+ total))
(when (or (eql ch #\C) (eql ch #\G))
(setf gc (1+ gc)))))
finally (format t "~f~%" (/ gc total))))
With a top-level function and some type declarations it could run even faster, I think.

EDIT: compiling the Lisp code to FASL and annotating the types brings the total runtime to 2.0 seconds. Running it from source increases the time very slightly, to 2.08 seconds, showing how the SBCL compiler is incredibly fast. Taking 0.7 seconds to compile a few lines of code is crazy, imagine when your project grows to many thousands of lines.
The Lisp code still can't really match Nim, which is really C at runtime, in speed when excluding compile time, but if you need a scripting language, CL is great (especially when used with the REPL and SLIME).
Last time I used it, I liked it but didn't use it long enough to have a strong opinion.
It's a compromise, but I always prioritise _my_ time over my computer's time, so if I can write something quickly and just go get a coffee while it runs - I will do that. I won't spend twice as long writing a single-run script just because it'll finish before the kettle has boiled.
Static types help with basic data munging when you haven't used a script for months and need to get back up to speed and make tweaks.
It’s a shame because I think Nim has some neat features that allow it to present as a serious competitor to Rust but it will ultimately have to compete against Python instead to secure its niche.
So is Golang.
My point, which apparently wasn't evident enough, is that you can get most of the benefits by doing nothing - just trying a different Python implementation - without the hassle of learning a niche language, as easy as it might be.
BTW, if you take compilation times into account the difference is even more meager, and in all fairness the PyPy warmup period should have been discounted.
This.
The general guideline has always been that Python is ideal for glue code and non-performance-critical code, and when performance became an issue, Python would simply be used as glue code to invoke specialized libraries. Perhaps the most popular example of this approach is numpy, which uses BLAS and LAPACK internally to handle linear algebra.
This Nim advertisement sounds awfully desperate with the way it resorts to what feels like a poorly assembled strawman, while giving absolutely nothing in return.
Python has never been one of my favorite languages, but easy support in Google Colab, AWS SageMaker, etc., as well as most of my professional deep learning work using TensorFlow + Keras, makes Python a go-to language for me. If you want a Lisp syntax on top of Python, you can try Hy (and get a free copy of my Hy book at https://leanpub.com/hy-lisp-python by setting the price to $0.00).
That said, for unpaid experiments I like Julia + Flux, which also solves the author's preference to avoid slow programming languages. Julia is really a nice language but no one has ever paid me to use it.
"Benchmarking programming languages/implementations for common tasks in Bioinformatics"
https://github.com/lh3/biofast#fqcnt
https://lh3.github.io/2020/05/17/fast-high-level-programming...
When you write C++, you kind of cheat because even code with high computational complexity is pretty fast. Whereas the equivalent code in Python will be awfully slow.
So, while it's true that Python requires less development time, this statement can't be used generally. I have spent hours optimizing Python code when in C++ I would have just moved on to my next task.
If Nim had cloud SDKs I would use it as my default language for pretty much everything.
in cases where what you want to do doesn't exactly fit standard operations, cython can be pretty nice. e.g. 200x -- 1000x speedups for translating C-oriented number crunching code from python to cython. but if you do want performance, you have to think about it while writing the code (avoid needlessly allocating memory in tight loops, data-oriented programming with simple arrays, statically type all of your variables, ...).
If I were writing something from scratch that dealt with data, I would probably use Nim though. It's super easy to write something fast in and is more pleasant than pretty much any other compiled language.
lines = (line for line in lines("orthocoronavirinae.fasta") if not line.startswith(">"))
gc_lines = (1 if ('G' in line or 'C' in line) else 0 for line in lines)
gc = sum(gc_lines)
total = len(list(gc_lines))
# Alternatively, a more "memory efficient" total would be:
total = sum(1 for _ in lines)
Edit: my code is not perfect (I'm typing from my phone; I'm surprised I could even match parentheses).

My point is: this is a highly I/O-bound program. The implementation matters. With the correct implementation there shouldn't be much difference between the languages.
That won't work properly; you've already exhausted the gc_lines generator in the previous line.
From what I gather, the author is a researcher in a bioinformatics-related field. This may indicate that they tend to work either alone or in a relatively small group. The domain is small-scope data processing/manipulation and research/exploratory code, likely short-lived or even one-off.
The progress in this context will possibly be governed by sheer processing speed (e.g. it’s unlikely anyone will delve deep into the code, a lot of iterations to ‘just get it done’ instead of testing etc.).
If this is more or less correct, the point that Nim might be more useful than Python for the author sounds very sensible to me. It’s a nice spot between command line tools and more functionality-loaded languages.
cat test.py | py2many --nim=1 -
http://dpaste.com//5ALVT7MK4

Yes, that is the Achilles heel of Python.
I am always torn between Python and PHP for new projects because of this.
The Python syntax plus its import system are huge advantages over PHP. On the other hand, you suffer a 6x slowdown if you go with Python. Decisions, decisions. I so dearly wish I could have the good parts of both worlds.
And for data processing of all things ...
PHP is also very slow, on top of being unpleasant and broken in many other ways.
I think that is pip.
Python is good where speed of development matters, where you write throw-away code testing some ideas and you want to do it fast, where you write glue code, for prototypes, for small code bases.
Once you are getting outside of that area, you better should use a language more suited for the task.
As for myself, even if I can use Python in some cases, I can churn out C# code almost as fast, so I prefer doing it that way in case I want to grow the code later or use it somewhere else. Being lazy, I dislike rewriting code.