If you know precisely what the data is used for - great, go ahead - the type system is your friend.
If you don't know how the data should be used, it's often a different story. Wrapping data in hand-typed classes is a terrible idea in typical data engineering scenarios, where there might be hundreds of these API endpoints, which may also change as the upstream sees fit. A perfect way to piss off your downstream users is to keep telling them "sorry, the data is not available because I overspecified the data type and now it failed on TypeError again". Usually the downstream users are the domain experts: they know which fields should be used, but often not until they start using the data.

Typically the best approach is to pass ALL the upstream data down, materialize extra fields, and NOT modify any existing field names, even when you think you're super smart and know better than the domain experts. Too often a "smart" engineer thought he knew better and included only some fields, only for everyone to realize later that the data source contained many more gold nuggets, and it was never documented that these were cleverly dropped.
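A minimal Python sketch of that pass-everything-through pattern (the field names here are hypothetical): derived fields get materialized, but upstream keys are never renamed or dropped.

```python
def enrich(record: dict) -> dict:
    """Return a new record with derived fields added.

    Every upstream key is passed through unmodified, so downstream
    users can still reach any field the source provided.
    """
    out = dict(record)  # copy; never mutate or rename upstream data
    # Materialize an extra field only when its inputs are present.
    if "first_name" in out and "last_name" in out:
        out["full_name"] = f"{out['first_name']} {out['last_name']}"
    return out
```

The point is that `enrich` is additive only: a field the engineer didn't anticipate (some `obscure_field`) still flows through untouched.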
Also great for property testing / fuzzing, and other fun meta-datamodel stuff like, e.g., inferring a schema from example data.
In general, programming-language type systems are pretty weak in comparison because they're not very programmable. (In most languages, for most people, etc. There are fancy type systems approaching formal proof toolkits, but they're hard to use.)
Ideally everyone would be using a single type definition. Admittedly that's more common with protobufs, though, where you can't send any data that's not in the definition.
Come to think of that, it's true of plain old structs too.
From Zen of Python:
> There should be one--and preferably only one--obvious way to do it
TFA accidentally even brings up the reason why dicts are so powerful: they enable easy interoperability between libraries (like a wire format). Using two libraries together that each insist on their own bespoke class hierarchy is an exercise in data-conversion pain. Further, if I want a point to be an object containing fields for "x" and "y", I'd much rather just use a dict than construct an object in some incompatible inheritance nightmare.
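A tiny illustration of the dict-as-wire-format point; the two functions below stand in for hypothetical libraries that know nothing about each other, yet compose with no adapter classes:

```python
# "Library A": moves a point, accepting any mapping with "x" and "y".
def translate(point: dict, dx: float, dy: float) -> dict:
    return {"x": point["x"] + dx, "y": point["y"] + dy}

# "Library B": measures a point, with the same minimal expectation.
def magnitude(point: dict) -> float:
    return (point["x"] ** 2 + point["y"] ** 2) ** 0.5

p = {"x": 3.0, "y": 0.0}
# The same plain dict flows through both libraries unchanged.
assert magnitude(translate(p, 0.0, 4.0)) == 5.0
```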
99% of the data I deal with on a day-to-day basis is lists and mappings.
Very conceptually simple, but with a million different implementations. Particularly in python where we have dicts, namedtuples, dataclasses, regular objects, etc etc etc, then you deal with databases (which are really just mappings of keys to rows), where the interaction works completely differently again (with annoying differences for each database of course). Then hundreds of different encodings again to send things across a network or save them to files.
None of this complexity is inherent to the problems being solved - it's all accumulated cruft and bullshit.
You would probably love Clojure. Perhaps you tried it already?
Interestingly I’ve actually been using _more_ duck-typing style programming in Nim as it’s become my daily driver.
It’s kinda funny: since Nim is a statically typed language you’d think it’d be hard, yet it’s so seamless to use compile-time checks that it’s easy to think of it as runtime duck typing. You can add overloaded types to a function like `proc foo(arg1: int | string, arg2: float)` and then use a `when` expression to change behavior in the parts of the function that handle the specifics for each type. It’s a really powerful way to handle polymorphism and things like visitor patterns without a bunch of OO infrastructure. I take it the Python type annotations aren’t embracing that overloaded-type setup?
You can even trivially use duck typing with type declarations: https://nim-lang.org/docs/manual.html#generics-is-operator There’s another pattern I’ve taken to of just declaring getters/setters for things like “X” and “Y”, except from a generic array. I mean, “X” is just a convention for arr[0], right? https://github.com/treeform/vmath/blob/5d7c5e411598cd5cf9071...
Really I hope “duck typing” becomes more the norm rather than the OO stuff. I’m curious what the story in Swift on this topic is nowadays.
E.g. an example in D of a function that doesn't care too much about the type you pass in:

T doublify(T)(T v) {
    return v * 2;
}

These are all fine:

writeln(doublify(3));
writeln(doublify(3.0));
writeln(doublify(3u));

But this still throws a compile error like you'd expect:

writeln(doublify("3"));

That’s what Protocols are for.
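For readers unfamiliar with them, `typing.Protocol` gives Python a similar structural check. A minimal sketch (the `HasArea`/`Square` names are made up): any object with a matching `area()` method qualifies, no shared base class required, and a checker like mypy flags arguments whose type lacks it.

```python
from typing import Protocol

class HasArea(Protocol):
    def area(self) -> float: ...

def total_area(shapes: list[HasArea]) -> float:
    # Structural typing: anything with a compatible area() is accepted.
    return sum(s.area() for s in shapes)

class Square:
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        return self.side * self.side
```

`Square` never declares that it implements `HasArea`; the relationship is inferred from its shape, which is the statically checked cousin of duck typing.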
See this for dataclasses: https://github.com/python/mypy/issues/5374#issuecomment-8841....
Except for protocols.
https://chasemerick.files.wordpress.com/2011/07/choosingtype...
I don’t have anything against classes in theory, but I’m of the opinion that 99.9% of classes out there just shouldn’t exist.
People would take a full API response, and pass bits of it around with mutations. Understanding what the object looked like 5 functions deep was really hard. If the API changed... Oh boy.
I found many bugs just tracing the code like this. It made me a big proponent of strong typing, or at least strong type hinting.
It even has additional advantages, such as generating OpenAPI files automatically from the types and validating payloads between microservices.
Pydantic and Typeguard are two very useful libraries in this context.
Untyped languages are excellent for smaller code bases because they are more comfortable to program in, faster to write, and more general. Some kinds of polymorphism possible in these languages are simply impossible, or much harder, in typed languages. Also, as others have said, the problem domain may not be well explored yet.
Typed languages really start to shine as a code base gets huge. At that scale, even well-maintained untyped code bases start collapsing under the weight of their own unit tests, while moderately or poorly maintained ones become a mess. Mostly this is due to difficulties in communication: the code base gets worked on by so many people that it's hard for them all to stay in sync. In these cases a typed language keeps everyone on the same page to some extent.
Both camps will hate me for saying this I think, but it's what I've observed over the years.
It also may sound like I prefer typed languages, but in fact my favorite languages to work in are Clojure and Python. My code bases as a DevOps engineer rarely pass the 10,000 line mark and never pass 100,000 line mark. It's much more comfortable for me in these untyped languages.
Untyped languages also really shine in microservices for the same reason.
Maybe that was implied?
Anyways, a lot of languages take another stance. E.g. Elixir, where using dicts together with pattern matching makes for quite powerful abstractions.
As long as the dicts are kept shallow, and the amount of indirection in the code in general is kept low, they remain alright to navigate and use.
At the end of the day you really can’t escape typing. It just makes life easier. We should stop letting languages try to remove it.
If you have a generic collection, you know it's generic. Adding types does remove a class of errors, but as a tradeoff it also makes changes harder. Now I have to make a PR that contains both the change I want AND the type modification, which comes with explaining/understanding whether there is a reason to use two different types, or what the consequences would be of creating a second generic collection from the first and modifying THAT instead (e.g. lists with different types: how big are they?).
Never was a big problem using generic collections over the last 30 years and plenty of languages are fine without the training wheels of defining every data structure as a type, so I'm not sure what this ranting is all about.
but if you're passing in/out some monstrosity which has a structure that you can only really find out by reading the code, often from top to bottom if different parts of the dictionary are referenced in different parts of the code, you are really setting yourself up for trouble down the line.
I still walk that road sometimes, but not for very long.
There are places where you just dont need the overhead of a class. Yes slotted classes make this much cheaper but so do named tuples.
If the behavior of a thing is to map values then it should stay a dict.
If the behavior is a bag of attributes then yes pick something better.
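A small sketch of that split, with hypothetical names; the deciding question is whether the keys are data (keep a dict) or part of the code's fixed vocabulary (name the shape):

```python
from dataclasses import dataclass

# Behavior is "map arbitrary values to values": it should stay a dict.
word_counts = {"the": 12, "dict": 5}

# Behavior is "a fixed bag of named attributes": give the shape a name.
@dataclass
class Point:
    x: float
    y: float
```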
If I update an object model, it's because I want to expose new fields. Otherwise I actively have to tell it not to.
It feels like a normal use case.
I haven’t decided if I like it better than just breaking up objects into arguments in a more simple functional style.
On one hand it's more predictable, but on the other, most complex apps start passing around objects for everything. TypeScript of course helps with that, as does neatly modularized code (i.e. not passing full typed objects outside of the parent module which owns/generates them, unless they uniquely operate on the full object).
These are the small decisions you end up making a hundred times.
Common Lisp: https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node252.html
Clojure: http://blog.jayfields.com/2010/07/clojure-destructuring.html
JS: https://simonsmith.io/destructuring-objects-as-function-para...
The other thing that always burns me is lists, specifically lists of lists and lists of strings. Since Python allows you to index into strings the same way as lists, for some reason I always lose track of where I am in the unpacking stack. This is when I switch to type hints.
As someone who does program professionally, my experience is that if you find yourself needing to do this (and the expectations aren't runtime-dynamic) you've almost certainly gone astray, and you'll eventually find yourself implementing a much poorer version of a type system anyway. Dict contents should always be treated as optional. If you have a case where you have 2 required keys and then a bunch of optional ones, define a struct/class that has those 2 fields and then a dict/map for extra values.
The only reason one might want a "schema" for maps is when you're dealing with something config-driven; for instance, in implementing a SQL engine, or assembling inputs to an ML model. Even then, your code shouldn't have expectations on specific contents, other than that the keys must be the same as some other map/list.
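A sketch of that two-required-keys-plus-extras shape in Python (the names are hypothetical): the fields the code actually relies on become real attributes, while everything else stays a plain, optional dict.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Event:
    # The two keys the code depends on become required fields...
    kind: str
    timestamp: float
    # ...everything else stays squishy and optional.
    extra: dict[str, Any] = field(default_factory=dict)
```

Code downstream can rely on `event.kind` existing while still treating `event.extra` contents as optional.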
A few other similar rules I follow with weak/primitive types:
* Strings are opaque blobs. The only valid operation on strings is to test two strings for equality. No parsing, no checking "does it have a prefix", no concatenation - if you need comparisons that take multiple elements into account, write a richer type. The exceptions are in implementing wire protocols, and rendering data for human consumption.
* Booleans are not allowed as function arguments or member fields. Define a custom enum instead. You'll almost always end up wanting at least a third possible value. Even if not, it's useful to make the default state some form of invalid, so that you don't have to guess at whether a field is false because you meant false, or false because you forgot to set it.
* If there's a value property that you rely on (e.g., lists being sorted, strings being capitalized, integers being in some range), and that property needs to be preserved across a function-call boundary, wrap it in a type. It doesn't force correctness (unless you're using a language with dependent types), but it's at least stickier than a comment.
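As a sketch of the booleans rule above, here is a hypothetical `Visibility` enum whose default state is deliberately invalid, so "never set" can't be confused with "set to false":

```python
from enum import Enum, auto

class Visibility(Enum):
    UNSET = auto()    # deliberately invalid default
    HIDDEN = auto()
    VISIBLE = auto()

def render(v: Visibility) -> str:
    if v is Visibility.UNSET:
        raise ValueError("visibility was never set")
    return "shown" if v is Visibility.VISIBLE else "hidden"
```

If a third state like `VISIBLE_TO_ADMINS` shows up later, it slots in without changing any call-site types, which a bool could never do.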
COMPLETE_STATES = 'done', 'cancelled'
if state in COMPLETE_STATES:
    …

Then you decide to handle cancelled tasks separately:

COMPLETE_STATES = 'done'

Boom! Now `state in COMPLETE_STATES` is a substring check against the string 'done'. Can't remember a less contrived example right away, but I have broken real code by inadvertently invoking string iteration and spent some time scratching my head. Granted, I don't think I've seen something like this in a PR, only in local development.
P.S. I think last time I stumbled on this, it actually involved Django ORM and changing filter(state__in=COMPLETE_STATES) to filter(state__in=DONE).
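One way to make that particular failure impossible is to keep the collection a real set, so that membership can never silently degrade into substring matching even when it shrinks to a single element; a sketch:

```python
# A frozenset keeps `in` element-wise regardless of size; with a bare
# string, `'do' in 'done'` would have been True.
COMPLETE_STATES = frozenset({"done"})

assert "done" in COMPLETE_STATES
assert "do" not in COMPLETE_STATES
```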
So if getting a field from a simple structure is mycol[key], it should look exactly the same when mycol is no longer a flexible dict containing ad-hoc objects but a complex, strongly typed, immutable trie or btree-indexed array, because at some point in the evolution of your code it became apparent that this is exactly what you need.

The only language that I know of that has a consistent interface between simple and complex (also custom) collections is Scala.
C# or C++ for example.
Or in fact, you can just define an interface with Get/Set methods. Any language with interfaces supports that and you can swap them out as you please.
Doesn't seem like the language is really the restricting factor for implementing this if you really wanted it.
The thing is that language and standard library designers don't do that.
And unless you are hellbent on creating your own collection library and using it everywhere (even when interfacing with system functions that don't understand it), you pretty much have no way of swapping out your data structures without carefully replacing all the lines that access them.
Scala standard collection library designers are the only ones I know of that actually went through the trouble of making this for us.
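That said, in Python the indexing syntax itself is swappable: anything implementing `__getitem__`/`__setitem__` keeps `mycol[key]` call sites unchanged when a plain dict is replaced. A hypothetical example:

```python
class LoweredKeys:
    """A made-up custom mapping: wraps a dict, normalizing keys.

    Because it implements __getitem__/__setitem__, code written
    against a plain dict keeps working unmodified.
    """
    def __init__(self, data=None):
        self._data = {k.lower(): v for k, v in (data or {}).items()}

    def __getitem__(self, key):
        return self._data[key.lower()]

    def __setitem__(self, key, value):
        self._data[key.lower()] = value
```

This covers the access syntax but not the rest of the collection API (iteration, slicing, views), which is where Scala's uniform collection library goes much further.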
* Apply this only when parsing I/O. Do not substitute primitives with classes inside your code base for no good reason. Unless validation is needed, prefer a NamedTuple.
For example, you may have a function:

group_by_age() -> Dict[int, List[str]]

which might be perfectly good for your use case, but I can see why one might instead prefer:

group_by_age() -> Dict[Age, List[CustomerId]]

for self-documentation and expressiveness. Your test assertions may also become easier to read:
assert group_by_age() == {
Age(23): [
CustomerId("0471"),
CustomerId("3390"),
],
Age(42): [
CustomerId("2334"),
],
}

Compared to what? I see the article's point about dicts being, like everything else in programming, a tradeoff with benefits and limitations. But the article's needless dramatization of a pretty mundane point (and the button-pushing title) are, to these jaded eyes, a definite turn-off.
Meanwhile I'll keep using dicts when the use case calls for them, thank you. As a sibling commenter put it:
If you don't know how the data should be used, it's often a different story.
Exactly. The whole point (and benefit) of dicts is that they're squishy. Sometimes you need squishy.
Once your app involves a certain amount of business logic, those dicts just beg for you to shove in just that one more temporary field that you’ll be needing later in the calculation.
Of course you can abuse classes just the same. But I feel it happens less often. There’s more social control. The type has a name now. And you have to visit it at home whenever you’re trying to add a field to it. My impression is that some people are underestimating the psychological power of those nudging factors.
There are better ways to structure data for otherwise reusable functions.
Besides, you could argue it’s not exactly the author’s choice because `dict` is the actual name of Python’s dictionary type.