If you know precisely what the data is used for - great, go ahead - the type system is your friend.
If you don't know how the data should be used, it's often a different story. Wrapping data in hand-typed classes is a terrible idea in typical data engineering scenarios, where there might be hundreds of these API endpoints, which may also change as the upstream sees fit. A perfect way to piss off your downstream users is to keep telling them "sorry, the data is not available because I overspecified the data type and now it failed on TypeError again". Usually the downstream users are the domain experts: they know which fields should be used, but often not until they start using the data.

Typically the best approach is to pass ALL the upstream data down, materialize extra fields, and NOT modify any existing field names, even when you think you're super smart and know better than the domain experts. Too often a "smart" engineer thought he knew better and included only some fields, only for everyone to realize later that the data source contained many more gold nuggets, and it was never documented that these were cleverly dropped.
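A minimal Python sketch of that pass-everything-through pattern (the field names here are hypothetical): derived fields get materialized, but upstream keys are never renamed or dropped.

```python
def enrich(record: dict) -> dict:
    """Return a new record with derived fields added.

    Every upstream key is passed through unmodified, so downstream
    users can still reach any field the source provided.
    """
    out = dict(record)  # copy; never mutate or rename upstream data
    # Materialize an extra field only when its inputs are present.
    if "first_name" in out and "last_name" in out:
        out["full_name"] = f"{out['first_name']} {out['last_name']}"
    return out
```

The point is that `enrich` is additive only: a field the engineer didn't anticipate (some `obscure_field`) still flows through untouched.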
Also great for property testing / fuzzing, and other fun meta-datamodel stuff like, e.g., inferring a schema from example data.
In general, programming-language type systems are pretty weak in comparison because they're not very programmable. (In most languages, for most people, etc. There are fancy type systems approaching formal proof toolkits, but they're hard to use.)
Ideally everyone would be using a single type definition. Admittedly that's more common with protobufs, though, where you can't send any data that's not in the definition.
Come to think of that, it's true of plain old structs too.
From Zen of Python:
> There should be one--and preferably only one--obvious way to do it
TFA accidentally even brings up the reason why dicts are so powerful: they enable easy interoperability between libraries (like a wire format). Using two libraries together that each insist on their own bespoke class hierarchy is an exercise in data-conversion pain. Further, if I want a point to be an object containing fields for "x" and "y", I'd much rather just use a dict than construct an object in some incompatible inheritance nightmare.
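A tiny illustration of the dict-as-wire-format point; the two functions below stand in for hypothetical libraries that know nothing about each other, yet compose with no adapter classes:

```python
# "Library A": moves a point, accepting any mapping with "x" and "y".
def translate(point: dict, dx: float, dy: float) -> dict:
    return {"x": point["x"] + dx, "y": point["y"] + dy}

# "Library B": measures a point, with the same minimal expectation.
def magnitude(point: dict) -> float:
    return (point["x"] ** 2 + point["y"] ** 2) ** 0.5

p = {"x": 3.0, "y": 0.0}
# The same plain dict flows through both libraries unchanged.
assert magnitude(translate(p, 0.0, 4.0)) == 5.0
```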
99% of the data I deal with on a day-to-day basis is lists and mappings.
Very conceptually simple, but with a million different implementations. Particularly in python where we have dicts, namedtuples, dataclasses, regular objects, etc etc etc, then you deal with databases (which are really just mappings of keys to rows), where the interaction works completely differently again (with annoying differences for each database of course). Then hundreds of different encodings again to send things across a network or save them to files.
None of this complexity is inherent to the problems being solved - it's all accumulated cruft and bullshit.
You would probably love Clojure. Perhaps you tried it already?
Interestingly I’ve actually been using _more_ duck-typing style programming in Nim as it’s become my daily driver.
It’s kinda funny: since Nim is a statically typed language you’d think it’d be hard, yet it’s so seamless to use compile-time checks that it’s easy to think of it as runtime duck typing. You can add overloaded types to a function like `proc foo(arg1: int | string, arg2: float)` and then use a `when` expression to change behavior in the parts of the function that handle the specifics for each type. It’s a really powerful way to handle polymorphism and things like visitor patterns without a bunch of OO infrastructure. I take it the Python type annotations aren’t embracing that overloaded-type setup?
You can even trivially use duck typing with type declarations: https://nim-lang.org/docs/manual.html#generics-is-operator There’s another pattern I’ve taken to of just declaring getters/setters for things like “X” and “Y”, except from a generic array. I mean, “X” is just a convention for arr[0], right? https://github.com/treeform/vmath/blob/5d7c5e411598cd5cf9071...
Really I hope “duck typing” becomes more the norm rather than the OO stuff. I’m curious what the story in Swift on this topic is nowadays.
E.g. an example in D of a function that doesn't care too much about the type you pass in:

T doublify(T)(T v) {
    return v * 2;
}

These are all fine:

writeln(doublify(3));
writeln(doublify(3.0));
writeln(doublify(3u));

But this still throws a compile error like you'd expect:

writeln(doublify("3"));

That’s what Protocols are for.
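For readers unfamiliar with them, `typing.Protocol` gives Python a similar structural check. A minimal sketch (the `HasArea`/`Square` names are made up): any object with a matching `area()` method qualifies, no shared base class required, and a checker like mypy flags arguments whose type lacks it.

```python
from typing import Protocol

class HasArea(Protocol):
    def area(self) -> float: ...

def total_area(shapes: list[HasArea]) -> float:
    # Structural typing: anything with a compatible area() is accepted.
    return sum(s.area() for s in shapes)

class Square:
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        return self.side * self.side
```

`Square` never declares that it implements `HasArea`; the relationship is inferred from its shape, which is the statically checked cousin of duck typing.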
See this for dataclasses: https://github.com/python/mypy/issues/5374#issuecomment-8841....
Except for protocols.
https://chasemerick.files.wordpress.com/2011/07/choosingtype...
I don’t have anything against classes in theory, but I’m of the opinion that 99.9% of classes out there just shouldn’t exist.
People would take a full API response, and pass bits of it around with mutations. Understanding what the object looked like 5 functions deep was really hard. If the API changed... Oh boy.
I found many bugs just tracing the code like this. It made me a big proponent of strong typing, or at least strong type hinting.
It even has additional advantages, such as generating OpenAPI files automatically from the types and validating payloads between microservices.
Pydantic and Typeguard are two very useful libraries in this context.
Untyped languages are excellent for smaller code bases because they are more comfortable to program in, faster to write, and more general. Some kinds of polymorphism possible in these languages are simply impossible, or much harder, in typed languages. Also, as others have said, the problem domain may not be well explored yet.
Typed languages really start to shine as a code base gets huge. At that scale, even well-maintained untyped code bases start collapsing under the weight of their own unit tests, while moderately or poorly maintained ones become a mess. Mostly this is due to difficulties in communication: the code base gets worked on by so many people that it's hard for them all to stay in sync. In these cases a typed language keeps everyone on the same page to some extent.
Both camps will hate me for saying this I think, but it's what I've observed over the years.
It also may sound like I prefer typed languages, but in fact my favorite languages to work in are Clojure and Python. My code bases as a DevOps engineer rarely pass the 10,000 line mark and never pass 100,000 line mark. It's much more comfortable for me in these untyped languages.
Untyped languages also really shine in microservices for the same reason.
Maybe that was implied?
Anyways, a lot of languages take another stance. E.g. Elixir, where using dicts together with pattern matching makes for quite powerful abstractions.
As long as the dicts are kept shallow, and the amount of indirection in the code in general is kept low, they remain alright to navigate and use.
At the end of the day you really can’t escape typing. It just makes life easier. We should stop letting languages try to remove it.
If you have a generic collection, you know it's generic. Adding types does remove a class of errors, but as a tradeoff it also makes changes harder. Now I have to make a PR that contains both the change I want AND the type modification, which comes with explaining/understanding whether there is a reason to use two different types, or what the consequences would be of creating a second generic collection from the first and modifying THAT instead (e.g. lists with different types: how big are they?).
Never was a big problem using generic collections over the last 30 years and plenty of languages are fine without the training wheels of defining every data structure as a type, so I'm not sure what this ranting is all about.
but if you're passing in/out some monstrosity which has a structure that you can only really find out by reading the code, often from top to bottom if different parts of the dictionary are referenced in different parts of the code, you are really setting yourself up for trouble down the line.
I still walk that road sometimes, but not for very long.
There are places where you just dont need the overhead of a class. Yes slotted classes make this much cheaper but so do named tuples.
If the behavior of a thing is to map values then it should stay a dict.
If the behavior is a bag of attributes then yes pick something better.
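A small sketch of that split, with hypothetical names; the deciding question is whether the keys are data (keep a dict) or part of the code's fixed vocabulary (name the shape):

```python
from dataclasses import dataclass

# Behavior is "map arbitrary values to values": it should stay a dict.
word_counts = {"the": 12, "dict": 5}

# Behavior is "a fixed bag of named attributes": give the shape a name.
@dataclass
class Point:
    x: float
    y: float
```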
If I update an object model, it's because I want to expose new fields. Otherwise I actively have to tell it not to.
It feels like a normal use case.
I haven’t decided if I like it better than just breaking up objects into arguments in a more simple functional style.
On one hand it's more predictable, but on the other, most complex apps start passing around objects for everything. TypeScript of course helps with that, as does neatly modularized code (i.e. not passing full typed objects outside of the parent module which owns/generates them, unless they uniquely operate on the full object).
These are the small decisions you end up making a hundred times.
Common Lisp: https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node252.html
Clojure: http://blog.jayfields.com/2010/07/clojure-destructuring.html
JS: https://simonsmith.io/destructuring-objects-as-function-para...
The other thing that always burns me is lists, specifically lists of lists and lists of strings. Since Python allows you to index into strings the same way as lists, for some reason I always lose track of where I am in the unpacking stack. This is when I switch to type hints.
As someone who does program professionally, my experience is that if you find yourself needing to do this (and the expectations aren't runtime-dynamic) you've almost certainly gone astray, and you'll eventually find yourself implementing a much poorer version of a type system anyway. Dict contents should always be treated as optional. If you have a case where you have 2 required keys and then a bunch of optional ones, define a struct/class that has those 2 fields and then a dict/map for extra values.
The only reason one might want a "schema" for maps is when you're dealing with something config-driven; for instance, in implementing a SQL engine, or assembling inputs to an ML model. Even then, your code shouldn't have expectations on specific contents, other than that the keys must be the same as some other map/list.
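A sketch of that two-required-keys-plus-extras shape in Python (the names are hypothetical): the fields the code actually relies on become real attributes, while everything else stays a plain, optional dict.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Event:
    # The two keys the code depends on become required fields...
    kind: str
    timestamp: float
    # ...everything else stays squishy and optional.
    extra: dict[str, Any] = field(default_factory=dict)
```

Code downstream can rely on `event.kind` existing while still treating `event.extra` contents as optional.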
A few other similar rules I follow with weak/primitive types:
* Strings are opaque blobs. The only valid operation on strings is to test two strings for equality. No parsing, no checking "does it have a prefix", no concatenation - if you need comparisons that take multiple elements into account, write a richer type. The exceptions are in implementing wire protocols, and rendering data for human consumption.
* Booleans are not allowed as function arguments or member fields. Define a custom enum instead. You'll almost always end up wanting at least a third possible value. Even if not, it's useful to make the default state some form of invalid, so that you don't have to guess at whether a field is false because you meant false, or false because you forgot to set it.
* If there's a value property that you rely on (e.g., lists being sorted, strings being capitalized, integers being in some range), and that property needs to be preserved across a function-call boundary, wrap it in a type. It doesn't force correctness (unless you're using a language with dependent types), but it's at least stickier than a comment.
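As a sketch of the booleans rule above, here is a hypothetical `Visibility` enum whose default state is deliberately invalid, so "never set" can't be confused with "set to false":

```python
from enum import Enum, auto

class Visibility(Enum):
    UNSET = auto()    # deliberately invalid default
    HIDDEN = auto()
    VISIBLE = auto()

def render(v: Visibility) -> str:
    if v is Visibility.UNSET:
        raise ValueError("visibility was never set")
    return "shown" if v is Visibility.VISIBLE else "hidden"
```

If a third state like `VISIBLE_TO_ADMINS` shows up later, it slots in without changing any call-site types, which a bool could never do.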
COMPLETE_STATES = 'done', 'cancelled'
if state in COMPLETE_STATES:
    …

Then you decide to handle cancelled tasks separately:

COMPLETE_STATES = 'done'

Boom! Now `state in COMPLETE_STATES` is a substring check against the string 'done'. Can't remember a less contrived example right away, but I have broken real code by inadvertently invoking string iteration and spent some time scratching my head. Granted, I don't think I've seen something like this in a PR, only in local development.
P.S. I think last time I stumbled on this, it actually involved Django ORM and changing filter(state__in=COMPLETE_STATES) to filter(state__in=DONE).
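One way to make that particular failure impossible is to keep the collection a real set, so that membership can never silently degrade into substring matching even when it shrinks to a single element; a sketch:

```python
# A frozenset keeps `in` element-wise regardless of size; with a bare
# string, `'do' in 'done'` would have been True.
COMPLETE_STATES = frozenset({"done"})

assert "done" in COMPLETE_STATES
assert "do" not in COMPLETE_STATES
```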
So if getting a field from a simple structure is mycol[key], it should look exactly the same when mycol is no longer a flexible dict containing ad-hoc objects but a complex, strongly typed, immutable trie or btree-indexed array, because at some point in the evolution of your code it became apparent that this is exactly what you need.

The only language that I know of that has a consistent interface between simple and complex (also custom) collections is Scala.
C# or C++ for example.
Or in fact, you can just define an interface with Get/Set methods. Any language with interfaces supports that and you can swap them out as you please.
Doesn't seem like the language is really the restricting factor for implementing this if you really wanted it.
The thing is that language and standard library designers don't do that.
And unless you are hellbent on creating your own collection library and using it everywhere (even when interfacing with system functions that don't understand it), you pretty much have no way of swapping out your data structures without carefully replacing all the lines that access them.
Scala standard collection library designers are the only ones I know of that actually went through the trouble of making this for us.
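That said, in Python the indexing syntax itself is swappable: anything implementing `__getitem__`/`__setitem__` keeps `mycol[key]` call sites unchanged when a plain dict is replaced. A hypothetical example:

```python
class LoweredKeys:
    """A made-up custom mapping: wraps a dict, normalizing keys.

    Because it implements __getitem__/__setitem__, code written
    against a plain dict keeps working unmodified.
    """
    def __init__(self, data=None):
        self._data = {k.lower(): v for k, v in (data or {}).items()}

    def __getitem__(self, key):
        return self._data[key.lower()]

    def __setitem__(self, key, value):
        self._data[key.lower()] = value
```

This covers the access syntax but not the rest of the collection API (iteration, slicing, views), which is where Scala's uniform collection library goes much further.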
* Apply this only when parsing I/O. Do not substitute primitives with classes inside your code base for no good reason. Unless validation is needed, prefer a NamedTuple.
For example, you may have a function:

group_by_age() -> Dict[int, List[str]]

which might be perfectly good for your use case, but I can see why one might instead prefer:

group_by_age() -> Dict[Age, List[CustomerId]]

for self-documentation and expressiveness. Your test assertions may also become easier to read:
assert group_by_age() == {
Age(23): [
CustomerId("0471"),
CustomerId("3390"),
],
Age(42): [
CustomerId("2334"),
],
}

Compared to what? I see the article's point about dicts being, like everything else in programming, a tradeoff with benefits and limitations. But the article's needless dramatization of a pretty mundane point (and the button-pushing title) are, to these jaded eyes, a definite turn-off.
Meanwhile I'll keep using dicts when the use case calls for them, thank you. As a sibling commenter put it:
If you don't know how the data should be used, it's often a different story.
Exactly. The whole point (and benefit) of dicts is that they're squishy. Sometimes you need squishy.
Once your app involves a certain amount of business logic, those dicts just beg for you to shove in just that one more temporary field that you’ll be needing later in the calculation.
Of course you can abuse classes just the same. But I feel it happens less often. There’s more social control. The type has a name now. And you have to visit it at home whenever you’re trying to add a field to it. My impression is that some people are underestimating the psychological power of those nudging factors.
There are better ways to structure data for otherwise reusable functions.
Besides, you could argue it’s not exactly the author’s choice because `dict` is the actual name of Python’s dictionary type.