In [7]: iso_regex = re.compile('(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}(?:\\.?\\d+))')
In [8]: %timeit iso_regex.match('2014-01-09T21:48:00.921000')
1000000 loops, best of 3: 1.05 µs per loop
But hey, once it's written in C, why go back?

I'm missing the timezone, but the OP left that out, so I did too. For comparison, dateutil's parse takes ~76µs for me. Kinda makes me wonder why aniso8601 is so slow. (It's also missing a few other things, depending on whether you count all the non-time forms as valid input.)
That said, cool! I might use this. One of the things that makes dateutil's parse slower is that it parses more than just ISO-8601: it accepts many things that look like dates, including some very non-intuitive ones that have caused "bugs"¹. Usually in APIs it's "dates are always ISO-8601", and all I really need is an ISO-8601 parser. While I appreciate the theory behind "be liberal in what you accept", sometimes I'd rather error out than build the expectation that sending garbage — er, stuff that requires a complicated parse algorithm I don't really understand — is okay.
¹dateutil.parser.parse('') is midnight of the current date. Why, I don't know. Also, dateutil.parser.parse('noon') is "TypeError: 'NoneType' object is not iterable".
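A minimal sketch of the strict alternative (my own glue, adapted from the regex in the earlier comment): anchor it with fullmatch so garbage like '' or 'noon' raises instead of being "helpfully" interpreted.

```python
import re

# ISO-8601 timestamp with fractional seconds made optional;
# fullmatch rejects anything with leading/trailing junk.
ISO = re.compile(r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2}(?:\.\d+)?)')

def require_iso(s):
    """Return the matched groups, or raise ValueError on non-ISO input."""
    m = ISO.fullmatch(s)
    if m is None:
        raise ValueError('expected ISO-8601, got %r' % s)
    return m.groups()
```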
* Every part from month onwards is optional
* Separator characters are optional
* Date/time separator can be a space as well as T
* Timezone information
* Parsing the strings into numbers
* Actually creating a datetime object
I expect adding all of those will bump up the time a bit.
iso_regex = re.compile('([0-9]{4})-?([0-9]{1,2})(?:-?([0-9]{1,2})(?:[T ]([0-9]{1,2})(?::?([0-9]{1,2})(?::?([0-9]{1,2}(?:\\.?[0-9]+)?))?(?:(Z)|([+-][0-9]{1,2}):?([0-9]{1,2}))?)?)?)?')
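For the last two bullets, a hedged sketch of the remaining glue (my own code, not the library's): turning the regex groups into an actual datetime, assuming the group order (year, month, day, hour, minute, second, 'Z', tz hour, tz minute) from the pattern above.

```python
import re
from datetime import datetime, timedelta, timezone

iso_regex = re.compile(
    r'([0-9]{4})-?([0-9]{1,2})'
    r'(?:-?([0-9]{1,2})'
    r'(?:[T ]([0-9]{1,2})'
    r'(?::?([0-9]{1,2})'
    r'(?::?([0-9]{1,2}(?:\.[0-9]+)?))?'
    r'(?:(Z)|([+-][0-9]{1,2}):?([0-9]{1,2}))?'
    r')?)?)?'
)

def regex_parse_datetime(s):
    m = iso_regex.match(s)
    if m is None:
        raise ValueError('not ISO-8601: %r' % s)
    y, mo, d, h, mi, sec, z, tzh, tzm = m.groups()
    frac = 0
    if sec and '.' in sec:
        sec, _, f = sec.partition('.')
        frac = int((f + '000000')[:6])  # pad/truncate to microseconds
    tz = None
    if z:
        tz = timezone.utc
    elif tzh:
        sign = -1 if tzh.startswith('-') else 1
        tz = timezone(sign * timedelta(hours=abs(int(tzh)), minutes=int(tzm)))
    return datetime(int(y), int(mo), int(d or 1), int(h or 0),
                    int(mi or 0), int(sec or 0), frac, tzinfo=tz)
```

The string-to-int conversions and the datetime constructor are exactly the extra work the match-only benchmark below doesn't include.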
It seems like it performs quite a bit worse than the library, which creates the full object.

In [82]: %timeit ciso8601.parse_datetime('2014-01-09T21:48:00.921000')
1000000 loops, best of 3: 368 ns per loop
In [83]: %timeit iso_regex.match('2014-01-09T21:48:00.921000')
100000 loops, best of 3: 9.72 µs per loop
In the interest of intellectual pursuit, is there anything that can be done to the regex to speed it up?

They have their own C function which parses ISO-8601 datetime strings: https://github.com/pydata/pandas/blob/2f1a6c412c3d1cbdf56610...
They have a version of strptime written in cython: https://github.com/pydata/pandas/blob/master/pandas/tslib.py...
I'm not saying these are better or worse than your solution (I haven't done any benchmarks, and the pandas functions sometimes cut a few corners), but perhaps there is something useful there for reference anyway. They also don't deal directly in datetime.datetime objects; they use pandas-specific intermediate objects, but those should be simple enough to grok.
Having done some work with dateutil, I will tell you that dateutil.parser.parse is slow, but its main use case shouldn't be converting strings to datetimes if you already know the format. If you know the format already you should use datetime.strptime or some faster variant (like the one above). There is a nice feature of pandas where given a list of datetime-y strings of an arbitrary format, it will attempt to guess the format using dateutil's lexer (https://github.com/pydata/pandas/blob/master/pandas/tseries/...) combined with trial/error, and then try to use a faster parser instead of dateutil.parser.parse to convert the array if possible. In the general case this resulted in about a 10x speedup over dateutil.parser.parse if the format was guessable.
>>> ds = u'2014-01-09T21:48:00.921000+05:30'
>>> %timeit ciso8601.parse_datetime(ds)
100000 loops, best of 3: 3.73 µs per loop
>>> %timeit dateutil.parser.parse(ds)
10000 loops, best of 3: 157 µs per loop
A regex[1] can be fast, but the parsing is just a small part of the time spent.

>>> %timeit regex_parse_datetime(ds)
100000 loops, best of 3: 13 µs per loop
>>> %timeit match = iso_regex.match(s)
100000 loops, best of 3: 2.18 µs per loop
Pandas is also slow. However, it is the fastest for a list of dates, just 0.43µs per date!!

>>> %timeit pd.to_datetime(ds)
10000 loops, best of 3: 47.9 µs per loop
>>> l = [u'2014-01-09T21:{}:{}.921000+05:30'.format(
("0"+str(i%60))[-2:], ("0"+str(int(i/60)))[-2:])
for i in xrange(1000)] #1000 differents dates
>>> len(set(l)), len(l)
(1000, 1000)
>>> %timeit pd.to_datetime(l)
1000 loops, best of 3: 437 µs per loop
NB: pandas is, however, very slow on ill-formed dates like u'2014-01-09T21:00:0.921000+05:30' (just one digit for the seconds): 230 µs, with no speedup from vectorization.

So if you care about speed and your dates are well formatted, make a vector of dates and use pandas. If you can't use it, go for ciso8601. For thomas-st: it may be possible to speed up parsing of lists of dates like pandas does. Another nice feature would be caching.
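The caching idea can be sketched with functools.lru_cache: many real-world feeds repeat timestamps, so memoizing the parser helps. Shown here around a stdlib parser for a fixed format; ciso8601.parse_datetime would slot in the same way.

```python
from datetime import datetime
from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_parse(s):
    """Parse one fixed ISO-8601 shape; repeated strings hit the cache."""
    return datetime.strptime(s, '%Y-%m-%dT%H:%M:%S.%f')
```

This is only safe because datetime objects are immutable and the input strings are hashable; callers must not mutate what they get back (they can't, here).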
I'm not sure this would be any better than just manually writing out both trivial iterations of the loop:
for (i = 0; i < 2; i++)

Of course there is always potential for optimization, but at this point it's fast enough for our purposes. If you can make it significantly faster, please don't hesitate to submit a PR though :)
EDIT: Wouldn't most C compilers unroll such simple "for" loops?

Direct link to the C code: https://github.com/elasticsales/ciso8601/blob/master/module....
It may also be better not to cover everything if that keeps up performance and simplicity; I just like to understand the trade-offs.
I've never really spent much time looking at pandas' to_datetime, but I believe it has to handle a lot of variety in what you pass to it (lists, arrays, Series), which probably causes a bit of a perf hit.
http://dl.dropboxusercontent.com/u/14988785/ciso8601_compari...
Not quite related: is there any Python library that can handle timezone parsing like Java's SimpleDateFormat (http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDat...)? The timezone could be in UTC-offset or short-name format (EST, EDT, ...). I am surprised that I couldn't find one.
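A hedged partial answer: the stdlib can build fixed-offset tzinfo objects, so a small abbreviation table gets you part of what SimpleDateFormat's 'z' does. The table and function names here are made up for illustration, and abbreviations are ambiguous in general (CST alone means three different offsets); the US meanings are assumed below.

```python
from datetime import datetime, timedelta, timezone

# Assumed US meanings; real code needs a proper tz database.
TZ_ABBREVS = {
    'UTC': timezone.utc,
    'EST': timezone(timedelta(hours=-5)),
    'EDT': timezone(timedelta(hours=-4)),
}

def parse_with_abbrev(s):
    """Parse 'YYYY-mm-dd HH:MM:SS ABBR' using the table above."""
    stamp, _, abbr = s.rpartition(' ')
    naive = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    return naive.replace(tzinfo=TZ_ABBREVS[abbr])
```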
Even the standard PHP string parser does 0.017ms on my 3 year old netbook.
<?php
$st = microtime(true);
$cnt = 10000;
for ($i=0; $i<$cnt; $i++)
strtotime('2014-01-09T21:48:00.921000');
echo 1000 * (microtime(true) - $st) / $cnt;
Seems like this solves a non-existent issue.

Lots of people deal with data rates that make web-scale throughput look pretty pathetic; you are just less likely to know about it, as it will be proprietary tech.
How do these all compare to each other?
> good to ffi out of you're using it a lot
What does that mean?