The development process is totally different when you write structured types first and then write your logic. 10/10 would recommend.
Usual caveat: this is what makes sense to me and my brain. Your experience may be different based on neurotype.
I’d argue that the more experience you get the more you write code for other people which involves adding lots of tooling, tests, etc. Even if the code works the first time, a more senior dev will make sure others have a “pit of success” they can fall into. This involves a lot more than just some “unit tests as an afterthought to keep the coverage up.”
I hated working with those coders because they weren't really very good and their code was always the worst to maintain. They are the equivalent of a carpenter who brags about how quickly they can bang nails but can't build a stable structure to save their life.
The best argument I've heard for doing type annotation is for documentation purposes to help future devs. But I don't completely buy this either. I touch new codebases all the time and I rarely spend much time thinking about what types will be passed. I can only assume it comes with experience.
Type annotation actually ends up taking a hell of a long time to do, and is of questionable benefit if some of the codebase is not annotated. People sometimes spend hours just trying to get the type checker to say OK for code that actually works just fine!
I'm currently using marshmallow in a project, specifically using the functionality that builds parsers from dataclasses.
I was curious what the differences were.
I haven't used pydantic's ORM integration, but I don't hesitate to use pydantic models everywhere as business logic classes unless I need ludicrous speed.
That's all opinion, but I'd definitely give pydantic a swing.
My cons are:
- Uses some DSL to define types
- Doesn't marshal to model objects by default, but from dict to dict

My pro is:
- Much more configurable and powerful
However, it does have a strongly opinionated approach to casting that can sometimes yield non-obvious results. This behavior is documented, and I would suggest that new adopters of the library explore this casting/coercion feature in the context of their product/app requirements.
For the most part, it's not a huge issue, but I've run into a few surprising cases. For example, sys.maxint, 0, '-7', 'inf', and float('-inf') are all valid datetime formats.
- https://pydantic-docs.helpmanual.io/usage/models/#data-conve... - https://gist.github.com/mpkocher/30569c53dc3552bc5ad73e09b48...
That's absolutely a valid and useful annotation. It tells me, and autocomplete, that "x" is probably a str, more likely than not, but I need to be aware that it might not be.
There's beautiful clarity in the articulation, and the essence is easy to grasp yet powerful. It reminds me a bit of Scott Wlaschin's Railway Oriented Programming (ROP) [0]. As a technique, ROP nicely complements "parse don't validate". As an explanation, it's similarly simple yet wonderfully effective.
I've a real admiration for people who can explain and present things so clearly. With ROP, for example, the reader learns the basics of monads without even realising it.
I feel that we don't put enough value, these days, on the ability to write clear, articulate exposition. Also, I believe that many people are not willing to read articles, books, or papers, of any meaningful length.
Everything needs to be boiled down to <10 min. read time, or <18 min. TED talks.
Anyways, I definitely agree.
The essential point of this blog post is to avoid "shotgun parsing", where parsing/validating is scattered procedurally through the code, so that it matters when exactly it happens. The paper "Out of the Tar Pit" asserts that this leads to "accidental complexity" (AKA "pain and anxiety"), which is something every programmer has experienced before, possibly many times.
I've become a fan of declarative schemas (json-schema/OpenAPI, clojure spec, etc.) to express this kind of thing. Usually this is used at the boundaries of an application (configuration, web requests, etc.), but there are many more applications for it within the flow of data transformations. If you apply the "parse, don't validate" principle, you turn schema-validated (sic!) data into a new thing. Whether that is a "ValidatedData" type or metadata, a pre-condition or a runtime check, says more about the environment you program in than about the principle under discussion. The benefit, however, is clear: your code asserts that it requires parsed/validated data where it is needed, instead of when it should happen.
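A minimal Python sketch of that boundary idea (the Config type and the hand-rolled schema check are made up for illustration; a real project would use json-schema or similar):

```python
from dataclasses import dataclass

# Hypothetical example: check the "schema" once at the boundary,
# then hand the rest of the program a parsed Config, not a raw dict.
@dataclass(frozen=True)
class Config:
    host: str
    port: int

def parse_config(raw: dict) -> Config:
    # validation and construction happen in one step
    if not isinstance(raw.get("host"), str):
        raise ValueError("host must be a string")
    if not isinstance(raw.get("port"), int):
        raise ValueError("port must be an int")
    return Config(host=raw["host"], port=raw["port"])
```

Downstream code that accepts a Config no longer needs to re-check anything; requiring the type is the assertion.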
There's a follow-up article by the same author (that I unfortunately can't find), in which she explains this point.
As an example, returning a NonZero newtype over Int is not as type safe as using an ADT that lacks a zero value altogether. Using a NonEmpty newtype over List is not as type safe as using the NonEmpty ADT that has an element as part of its structure.
Basically newtype still has use, but it is not as airtight as a well-designed ADT.
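A rough Python translation of the NonEmpty point (sketch only; NewType gives the checker a label to trust, while the dataclass makes emptiness structurally unrepresentable):

```python
from dataclasses import dataclass
from typing import NewType, Optional

# Newtype-style: the type checker trusts the name, but nothing
# structurally prevents someone from wrapping an empty list.
NonEmptyWrapped = NewType("NonEmptyWrapped", list)

# ADT-style: the first element is part of the structure, so an
# "empty" NonEmpty cannot even be constructed.
@dataclass
class NonEmpty:
    head: object
    tail: list

def parse_non_empty(xs: list) -> Optional[NonEmpty]:
    return NonEmpty(xs[0], xs[1:]) if xs else None
```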
I think this is it: https://lexi-lambda.github.io/blog/2020/11/01/names-are-not-...
Every time I read something like this my mind translates it to "after building an ad-hoc compiler you can do all the things a compiler can do. Just not as well, but you can do it." -- Same with "I don't need a compiler, my tests stop all this kind of bugs"
I’ve run across DSLs that have three or more layers of parsing and validation. Embedding different languages within each other (eg: JSON snippets within your own DSL) definitely leads to the issues the article talks about.
Also, growing your own parser without understanding standard lexer/parser basics seems far more common than it ought to be. I’m not talking brilliant design, rather the extremely naive one-character-at-a-time-in-a-really-complex-loop variety of design.
The better level of bad is, “I know what lexers/parsers are, now I’ll write something to basically implement a type-checking parser with the lexed+parsed tree as input.”
This article is basically stating, “Why not just get your parser to do it all for you in one swell foop?” When I have refactored code to follow this kind of design, I have never regretted the outcome.
import * as $ from '@appliedblockchain/assert-combinators'

const validateFooBar = (
  $.object({
    foo: $.string,
    bar: $.boolean
  })
)

// probably roughly equivalent to
/*
const validateFooBar = (x) => {
  console.assert(
    typeof x === 'object' &&
    typeof x.foo === 'string' &&
    typeof x.bar === 'boolean'
  )
  return x
}
*/

const test1 = { foo: "abc", bar: false }
const test2 = { foo: 0, quux: true }

const { foo, bar } = validateFooBar(test1) // ok
const oops = validateFooBar(test2)         // throws error
The source is pretty readable too, if you want to get an idea of how it works: https://github.com/appliedblockchain/assert-combinators/blob...
Mathematicians: Parsing is validation
This paper is an April Fools' joke. I didn't think people could take that one seriously. I guess it's a good April Fools' joke, then. :)
So is the joke on Computer Scientists or Mathematicians? You decide ;)
Beware of bugs in the above code; I have only proved it correct, not tried it --Donald Knuth
This paper.... uh.... what exactly is it good for?
I suppose it could be kind of nice as some kind of undergraduate paper writing project kind of thing but it looks too professional for that.... I am kind of at a loss why this was written. Maybe it is some strange kind of satire....
[1]: http://www.sigbovik.org/ [2]: http://www.sigbovik.org/2021/proceedings.pdf
Source: I attended SIGBOVIK a few times in grad school.
Code is data after all.
General case: Validating random data as input into some program.
Particular case: Validating random source code (data) as input into some compiler (program).
Do compilers parse or validate?
> "the converse of ‘parsing is validation’ is not true."
If that were the case then you should be able to give an example of a compiler validating random source code (data) but not parsing it.
What determines the validity of random input is precisely a compiler's ability to parse it.
If you see it differently you are implicitly assuming a non-formalist perspective on what "validation" means. Tell us about it.
Basically, it's just functions that take a value of one type and return another one.
I personally use myzod, as it's fast at parsing, has zero dependencies, and you can infer types from your schemas.
Maat was created before dataclasses existed. For validation Maat offers the same. But it also allows for some really neat features such as validation on encrypted data. https://github.com/Attumm/Maat/blob/main/tests/test_validati...
Since validation is written as dictionaries, it's possible to store the validations in a caching db such as Redis.
And since it's simple, it's easy to extend for anyone's use case. And there are no other dependencies.
Benchmarks against Pydantic have Maat at around twice Pydantic's speed.
And it's getting wide adoption; FastAPI, for instance, uses it for request validation.
Some points really elude me because Haskell uses many symbols and is very dense.
> IME, people in dynamic languages almost never program this way, though—they prefer to use validation and some form of shotgun parsing. My guess as to why? Writing that kind of code in dynamically-typed languages is often a lot more boilerplate than it is in statically-typed ones!
I feel that once you've got experience working in (usually functional) programming languages with strong static type checking, flaky dynamic code that relies on runtime checks and on just being careful to avoid runtime errors makes your skin crawl, and you'll intuitively gravitate towards designs that take advantage of strong static type checks.
When all you know is dynamic languages, the design guidance you get from strong static type checking is lost so there's more bad design paths you can go down. Patching up flakey code with ad-hoc runtime checks and debugging runtime errors becomes the norm because you just don't know any better and the type system isn't going to teach you.
More general advice would be "prefer strong static type checking over runtime checks" as it makes a lot of design and robustness problems go away.
Even if you can't use e.g. Haskell or OCaml in your daily work, a few weeks, or even just a few days, of trying to learn them will open your eyes and make you a better coder elsewhere. Map/filter/reduce, immutable data structures, non-nullable types, etc. had been in other languages for over 30 years before these ideas became mainstream best practices (I'm still waiting for pattern matching + algebraic data types).
It's weird how long it's taking for people to rediscover why strong static types were a good idea.
One trick I've found very useful is to realise that Maybe (AKA Option) can be thought of as "a list with at most one element". Dynamic languages usually have some notion of list/array, which we can use as if it were a Maybe/Option type; e.g. we can follow a 'parse, don't validate' approach by wrapping a "parsed" result in a list, and returning an empty list otherwise. This allows us to use their existing 'map', 'filter', etc. too ;)
(This is explored in more detail, including links to logic programming, in https://link.springer.com/chapter/10.1007%2F3-540-15975-4_33 )
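The trick looks something like this in Python (a small sketch with a hypothetical parse_int helper):

```python
# 'Parse, don't validate' with a list standing in for Maybe/Option:
# a successful parse returns [value], a failure returns [].
def parse_int(s):
    try:
        return [int(s)]
    except ValueError:
        return []

# The usual list machinery now works as map/filter over the "Maybe":
doubled = [x * 2 for x in parse_int("21")]   # [42]
nothing = [x * 2 for x in parse_int("oops")] # []
```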
If we want to keep track of useful error messages, I've found Scala's "Try" type to be useful ('Try[T]' is isomorphic to 'Either Throwable T'). Annoyingly, most dynamic languages have no built-in sum type; the closest thing is usually a tagged pair like '[true, myFoo]'/'[false, myException]', which is pretty naff.
I've found Scala or even LINQ to really hammer home this point, even to those who aren't into FP very much. Doing that map/flatmap makes it click for just about anyone.
Not being a big fan of method chaining, a null-savvy foreach would probably eliminate most of my null checks and my need for Optional.
Static types are awesome for local reasoning, but they are not that helpful in the context of the larger system (this already starts at the database, see idempotency mismatch).
Code with static types is sometimes larger and more complex than the problem it's trying to solve.
They tightly couple data to a type system, which (can) introduce incidental complexity.

> (I'm still waiting for pattern matching + algebraic data types)

This is a good example: if you pattern match on a specific structure (e.g. the position of fields in your algebraic data type), you tightly couple your program to that particular structure. If the structure changes, you may have to change all the code that pattern matches on it.
match event.get():
    case Click((x, y)):
        handle_click_at(x, y)
(Example from PEP 636[1].)

In both Python and statically typed languages you can avoid this by matching against field names rather than positions, or using some other interface to access data. This is an important design aspect to consider when writing code, but does not have anything to do with dynamic programming. The only difference static typing makes is that when you do change the type in a way that breaks existing patterns, you can know statically rather than needing failing tests or runtime errors.
The same is true for the rest of the things you've mentioned: none are specific to static typing! My experience with a lot of Haskell, Python, JavaScript and other languages is that Haskell code for the same task tends to be shorter and simpler, albeit by relying on a set of higher-level abstractions you have to learn. I don't think much of that would change for a hypothetical dynamically typed variant of Haskell either!
[1]: https://www.python.org/dev/peps/pep-0636/#matching-sequences
When using a data structure, I know what set of fields I expect it to have. In TypeScript, I can ask the compiler to check that my function's callers always provide data that meets my expectations. In JavaScript, I can check for these expectations at runtime or just let my function have undefined behavior.
Either way, if my function's assumptions about the data's shape don't turn out to be correct, it will break, whether or not I use a dynamic language.
It seems that most of the people who make this argument against static typing are actually arguing against violations of the Robustness Principle[0]: "be conservative in what you send, be liberal in what you accept".
A statically typed function that is as generous as possible should be no more brittle against outside change than an equally-generous dynamically typed function. The main difference is that the statically typed function is explicit about what inputs it has well-defined behavior for.
Recently, I spent 3 years on Scala then switched jobs and spent 3 years in Ruby.
It was a shock to go back to dynamic languages, but after 3 months, I honestly couldn't tell which felt more productive or led to more stable high quality product.
In Ruby, we had all the issues people point out about dynamic languages, but the product didn't lean heavily on complex data structures or algorithms. We embraced complexity and failure and built good processes, designs, and practices to deal with this.
In Scala, we had more rigour, but I also know I spent a lot of time on type design. Once things were sorted there was a lot of confidence in it, but generally, it took a lot longer to get there.
For certain systems that is absolutely worth it, for others (and in my case) it did feel like the evolution of the product meant this effort never really paid off.
I do believe that, for long lasting, larger projects, static typing tends to make the code easier to maintain as time goes on. But not every project is like that. In fact, not every project uses a single language. Some use statically typed languages for some parts, and dynamically typed for others (this is common in web dev).
Static typing really appeals to me on a personal level. I enjoy the process of analysis it requires. I love the notion of eliminating whole classes of bugs. It feels way more tidy. I took Odersky's Scala class for fun and loved it.
But in practice, they're just a bad match for projects where the defining characteristic is unstable ground. They force artificial clarity when the reality is murky. And they impose costs that pay off in the long run, which only matters if the project has a long run. If I'm building something where we don't know where we're going, I'll reach for something like Python or Ruby to start.
This has been brought home to me by doing interviews recently. I have a sample problem that we pair on for an hour or so; there are 4 user stories. It involves a back end and a web front end. People can use any tools they want. My goal isn't to get particular things done; it's to see them at their best.
After doing a couple dozen, I'm seeing a pattern: developers using static tooling (e.g., Java, TypeScript) get circa half as much done as people using dynamic tooling (Python, plain JS). In the time when people in static contexts are still defining interfaces and types, people using dynamic tools are putting useful things on the page. Making a change in the static code often requires multiple tweaks in situations where it's one change in the dynamic code. It makes the extra costs of static tooling really obvious.
That doesn't harm the static-language interviewees, I should underline. The goal is to see how they work. But it was interesting to see that it wasn't just me feeling the extra costs. And those costs are only worth paying when they create payoffs down the road.
This was more aimed at people who are new to the idea of parsing over validating. In a strong statically typed language, the type system would naturally guide you to use this approach so if this isn't natural to you then time in other languages would probably be worthwhile.
I don't think it's weird. Most of those languages were not popular in industry for various reasons, and the ones that were (especially in say, the 90s) did not have particularly capable static type systems. The boilerplate/benefit ratio was all off.
The way I describe this dichotomy personally is, I would rather use Ruby than Java 1.5. I would rather use Rust than Ruby. (Java 1.5 is the last version of Java I have significant development experience in, and they've made the type system much more capable since those days.)
I never heard about the former.
Now, even in the rare case where I write some Python, JS or PHP, I write it in a very statically typed style, immediately parsing input into well-thought-out domain classes. And for backend services, I almost always go with 3 layers of models:
1) Data Transfer Objects. Map directly to the wire format, e.g. JSON or Protobuf. Generally auto-generated from API specs, e.g. using Open API Generator or protoc. A good API spec + code gen handles most input validation well
2) Domain Objects. Hand written, purely internal to the backend service, faithfully represent the domain. The domain layer of my code works exclusively with these. Sometimes there’s a little more validation when transforming a DTO into a domain model
3) Data Access Objects. Basically a representation of DB tables. Generally auto-generated from DB schemas, e.g. using libs like Prisma for TS or SQLBoiler for Go
Can’t imagine going back to the “everything is a dictionary” style for any decent sized project, it becomes such a mess so quickly. This style is a little more work up front, when you first write the code, but WAYYYYYY easier to maintain over time, fewer bugs and easier to modify quickly and confidently, with no nasty coupling of your domain models to either DB or wire format concerns. And code gen for the DTO and DAO layers makes it barely more up-front work.
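A rough sketch of layers 1 and 2 in Python (the names and fields are made up; in practice the DTO would come from a code generator):

```python
from dataclasses import dataclass

@dataclass
class OrderDto:
    # wire format: mirrors the JSON, everything loosely typed
    id: str
    amount_cents: str  # some APIs ship numbers as strings

@dataclass
class Order:
    # domain object: parsed once at the boundary, trusted afterwards
    id: str
    amount_cents: int

def to_domain(dto: OrderDto) -> Order:
    amount = int(dto.amount_cents)  # a little extra validation lives here
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return Order(id=dto.id, amount_cents=amount)
```

The domain layer only ever sees Order, so wire-format concerns stay at the edge.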
It's weird how long it's taken for languages with static typing, and with type systems designed for correctness (and designed well for that end) rather than principally for convenience of compilation, to become generally usable (considering licensing model, features, ecosystem, etc.).
I wonder how many people the author met.
For example, one good reason why strong static types are a bad idea... they prevent you from implementing dynamic dispatch.
Routers. You can't have routers.
https://www.geeksforgeeks.org/dynamic-method-dispatch-runtim...
And using it doesn't give up any of Java's type safety guarantees. The arguments and return type of the method you call (which will be invoked with dynamic dispatch) are type checked.
Does my thing have a different name? Where can I read up on how to do that best?
I see they’ve raised a lot of money. Does anyone know what their revenue model is?
Here's validating a CSV in Python (which I'm using because it's a language that's, well, less excited about types than the author's choice of Haskell, to show that the principle still applies):
def validate_data(filename):
    errors = []
    reader = csv.DictReader(open(filename))
    for row in reader:
        try:
            date = datetime.datetime.fromisoformat(row["date"])
        except ValueError:
            errors.append(("Invalid date", row))
            continue
        if date < datetime.datetime(2021, 1, 1):
            errors.append(("Last year's data", row))
        # etc.
    return errors
def actually_work_with_data(filename):
    reader = csv.DictReader(open(filename))
    for row in reader:
        try:
            date = datetime.datetime.fromisoformat(row["date"])
        except ValueError:
            raise Exception("Wait, didn't you validate this already???")
        # etc.
Yes, it's a kind of silly example, but the validation routine is already doing the work of getting the data into the form you want, and now you have some DRY problems. What happens if you start accepting additional time formats in validate_data but you forget to teach actually_work_with_data to do the same thing?

The insight is that the work of reporting errors in the data is exactly the same as the work of getting non-erroneous data into a usable form. If a row of data doesn't have an error, that means it's usable; if you can't turn it into a directly usable format, that necessarily means it has some sort of error.
So what you want is a function that takes the data and does both of these at the same time, because it's actually just a single task.
In a language like Haskell or Rust, there's a built-in type for "either a result or an error", and the convention is to pass errors back as data. In a language like Python, there isn't a similar concept and the convention is to pass errors as exceptions. Since you want to accumulate all the errors, I'd probably just put them into a separate list:
@attr.s  # or @dataclasses.dataclass, whichever
class Order:
    name: str
    date: datetime.datetime
    ...

def parse(filename):
    data = []
    errors = []
    reader = csv.DictReader(open(filename))
    for row in reader:
        try:
            date = datetime.datetime.fromisoformat(row["date"])
        except ValueError:
            errors.append(("Invalid date", row))
            continue
        if date < datetime.datetime(2021, 1, 1):
            errors.append(("Last year's data", row))
            continue
        # etc.
        data.append(Order(name=row["name"], date=date, ...))
    return data, errors
And then all the logic of working with the data, whether to actually use it or to report errors, is in one place. Both your report of bad data and your actually_work_with_data function call the same routine. Your actual code doesn't have to parse fields in the CSV itself; that's already been done by what used to be the validation code. It gets a list of Order objects, and unlike a dictionary from DictReader, you know that an Order object is usable without further checks. (The author talks about "Use a data structure that makes illegal states unrepresentable" - this isn't quite doable in Python, where you can generally put whatever you want in an object, but if you follow the discipline that only the parse() function generates new Order objects, then it's effectively true in practice.)

And if your file format changes, you make the change in one spot; you've kept the code DRY.
The argument is that if you need to interact with or operate on some data you shouldn't be designing functions to validate the data but rather to render it into a useful output with well defined behaviour.
Say I have a string that’s supposed to represent an integer. To me, “Validate” means using a regex to ensure it contains only digits (raising an error if it doesn’t) but then continuing to work with it as a string. “Parse” means using “atoi” to obtain an integer value (but what if the string’s malformed?) and then working with that.
I first thought this article was recommending doing the latter instead of the former, but the actual recommendation (and I believe best practice) is to do both.
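In code, the two halves might look like this (a small sketch; the function names are made up):

```python
import re

# "Validate": check the shape, then keep working with the raw string.
def validate_int_string(s: str) -> str:
    if not re.fullmatch(r"-?\d+", s):
        raise ValueError(f"not an integer: {s!r}")
    return s

# "Parse": validate and convert in one step, then work with the int,
# so malformed input can no longer reach the rest of the program.
def parse_int_string(s: str) -> int:
    return int(validate_int_string(s))
```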
- don't drop the info gathered from checks while validating, but keep track of it
- if you do this, you'll effectively be parsing
- parsing is more powerful than validating
"Extra steps" would be keeping track of info gathered from checks.
At the end of parsing, you have a structure with a type. After validation, you may or may not have a structure with that type, depending on how you chose to validate.
But I think the big win is, parsers are usually much easier to compose (since they themselves are structured functions) and so if you start with the type first, you often get the "validation" behavior aspect of parsing for "free" (usually from a library). Maybe title should have been "Parse, don't manually validate."
But if your type doesn't catch all your invariants, yeah it does feel kinda just like validation.
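One way to see the composition point, sketched with plain functions (each "parser" returns its result or raises; the helpers are invented for illustration):

```python
def compose(*parsers):
    # Feed each parser the previous one's output; composing parsers
    # is just ordinary function composition.
    def parse(value):
        for p in parsers:
            value = p(value)
        return value
    return parse

def non_empty(s):
    if not s:
        raise ValueError("empty input")
    return s

def positive(n):
    if n <= 0:
        raise ValueError("not positive")
    return n

# str -> non-empty str -> int -> positive int
parse_positive = compose(non_empty, int, positive)
```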
I found myself replacing the configuration parsing code in a C++ project that was littered with exactly the validation issues described, and converted it to that which the author advocates. The result was a vastly more readable and maintainable codebase, and it was faster and less buggy to boot.
Another nice advantage is that the types are providing free/self- documentation, too.
Gotta go and program more Elixir...