This is why I never design web APIs to use the HTTP status code to indicate the application response. Always embed the application response within the HTTP payload. It should be independent of the transport mechanism. I’m ok with it not being a proper REST/RESTful service.
{ "status" : 1000, "message" : "Item not found" }
And intentionally don't use the same status numbers as HTTP (i.e. don't use 404 for "not found", because someone will mix them up!)
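A minimal sketch of the "application status in the body" pattern described above. The 1000-range status numbers and field names are hypothetical, not from any real API:

```python
# Envelope pattern: the HTTP layer returns 200 for application-level
# results; the application status lives entirely in the JSON body.
import json

APP_OK = 0
APP_NOT_FOUND = 1000  # deliberately NOT 404, to avoid mixing with HTTP codes

def make_response(status, message, data=None):
    """Build the response envelope, independent of the transport."""
    body = {"status": status, "message": message}
    if data is not None:
        body["data"] = data
    return json.dumps(body)

resp = make_response(APP_NOT_FOUND, "Item not found")
print(resp)
```

The point of the non-HTTP numbering is that a 1000 in the body can never be confused with a 404 injected by a proxy or load balancer along the way.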
I am inclined to agree that for this particular usage, an "in-body" response makes sense. 404 should be reserved for when the actual HTTP endpoint is unavailable. But in REST semantics, you would only return 404 for an endpoint like /users/12345 when user 12345 doesn't exist. So the two usages line up. Returning 200 with a body that says "user 12345 does not exist" makes a lot less sense to me.
A good example of overdoing it is when GraphQL servers return a 200 HTTP response that contains nothing but an error message, instead of returning a suitable HTTP status like 400.
The real problem with REST and HTTP is that it’s too easy to put middleware in between that doesn’t understand REST, just HTTP. As a software engineer or architect you can design the API to be perfect to your needs, but, when deployed you often lose control of how the client and server actually connect to each other. Proxies, caches, IDS, WAF, all can get in the way and don’t respect REST semantics.
So long as you have the source you are fine. And if you (or the third party/maintainer) don't have the source, then the approach doesn't matter, because there will be bugs you can't fix throughout the service.
With 200 codes being cached you will still have problems like stale data. Wouldn't want the Target registers using yesterday's prices for today (especially if yesterday was a super sale day like Black Friday, etc.).
A 403 in the API had a very specific meaning, and when the proxy layer started returning 403s everyone had a really bad time.
(That was a long day)
And that meaning wasn't "you are authenticated as a user that can not access this resource"?
Auto/silent fallbacks seem like clever ways of avoiding beeps and remaining resilient against failures in supporting systems, but in practice they always tend to just cover up real issues until it's too late.
I think the ideal is a nice, easily included retry library that includes reporting/alerting, configurable backoff schemes, and logging, which can be used everywhere on things that can have transient failures.
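A sketch of the kind of retry helper that comment describes: configurable exponential backoff plus built-in logging, so transient failures are retried *and* stay visible. The names and defaults are illustrative, not from a real library:

```python
# Retry with exponential backoff; every failed attempt is logged so
# dashboards/alerting can see the background failure rate instead of
# it being silently absorbed by the fallback.
import logging
import time

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("retry")

def with_retries(fn, attempts=3, base_delay=0.01, backoff=2.0):
    """Call fn(), retrying on exception with exponential backoff."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of retries: surface the failure loudly
            time.sleep(delay)
            delay *= backoff

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(with_retries(flaky))  # succeeds on the third attempt
```

The key design choice is that the retry path emits a log/metric on every attempt, which is exactly what the silent fallbacks in the incident lacked.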
I'd call it a 4 hour outage because the initial "recovery" was a result of cashiers manually typing in prices for items. Then when load decreased and they discovered that scanning items worked again the problem came right back.
Maybe returning 404 for both a cache miss and a "there's no endpoint at this path" error is an issue too. For other status codes there's a distinction between temporary and permanent failure; e.g. 301 versus 302. It would've been good to use HTTP 400 Bad Request for the misconfigured URL and 404 for a cache miss.
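The distinction that comment draws can be sketched in a few lines. This is a hypothetical routing layer, not the actual proxy from the incident: an unknown route gets a 400 (the URL is misconfigured), while a cache miss on a known route gets a different status so callers can tell the cases apart:

```python
# Distinguish "no such endpoint" (client/config error) from
# "known endpoint, item not cached" (fall through to origin).
KNOWN_ROUTES = {"/items"}
CACHE = {"/items?sku=42": "Widget"}

def handle(path):
    route = path.split("?")[0]
    if route not in KNOWN_ROUTES:
        return 400, "misconfigured URL: no such endpoint"
    if path in CACHE:
        return 200, CACHE[path]
    return 404, "cache miss: retry against origin"

print(handle("/itms?sku=42"))   # typo'd route  -> 400
print(handle("/items?sku=42"))  # cache hit     -> 200
print(handle("/items?sku=7"))   # cache miss    -> 404
```

With that split, a sudden flood of 400s after a config rollout is unambiguous, instead of blending into the normal background rate of cache-miss 404s.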
In the 10% of stores with the early rollout of the config change, the cache hit rate went to 0 right away, and that started 12 days before the outage. Alerts on cache hit rates and per-store alerts would've caught that.
Then there were 4 days where traffic to the main inventory microservice in the data center jumped 3x, taking it to what appears to be 80% of capacity. Load testing to know your capacity limits, and alerts when you near those limits, would've called out the danger.
Then during the outage when services slowed down due to too many requests they were taken out of rotation for failing health checks. Applying back pressure/load shedding could have kept those servers in active use so that the system could keep up.
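The load-shedding idea in that last point can be sketched simply. This is an illustrative concurrency limiter, not the actual service: instead of queueing excess requests until everything slows down and health checks fail, the server rejects work beyond a fixed in-flight limit with a fast 503, staying healthy for the traffic it can actually serve:

```python
# Load shedding via a non-blocking concurrency limit: over-capacity
# requests fail fast instead of dragging down the whole server.
import threading

class LoadShedder:
    def __init__(self, max_in_flight):
        self._sem = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work):
        # Non-blocking acquire: at capacity we shed immediately rather
        # than queueing the request and eventually timing out.
        if not self._sem.acquire(blocking=False):
            return 503, "shed: over capacity, retry later"
        try:
            return 200, work()
        finally:
            self._sem.release()

shedder = LoadShedder(max_in_flight=2)
print(shedder.handle(lambda: "ok"))
```

A server doing this keeps passing health checks under overload, so the load balancer leaves it in rotation and total throughput stays near capacity instead of collapsing.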
Seems reasonable.
> high profile processes (such as POS) implement their own fallback processes to handle the possibility of issues with the SDM system in store. In the case of item data, the POS software on each register is capable of bypassing the SDM Proxy and retrying its request directly to the ILS API in the data centers.
... the system as a whole became much more complex and difficult to observe. The system was running in a degraded, abnormal, less-tested, fallback mode for days without anyone caring.
This is also a point about the normalization of deviance. When there is a background rate of the POS using the fallback path, who is to say how important an increase in that rate might be?
grok say complexity bad
Fallback is not always necessary (sometimes it is, you can't just say "whelp the engines on this plane went out, time to die") but when you have a fallback system you should think about why you have it and how bad it is to fail, and if it could be worse to succeed.
A few years ago, the folks who built Chick-fil-A's POS fog were on HN talking about their fault tolerance and transaction queueing. It was quite interesting.
There's a lot that you can learn from high-volume POS system design that applies to just bog-standard every day programming.
Those 3 have almost become the industry standard for observability. Everywhere I have worked has used the same, and it is almost a no-brainer.
My air conditioner in my house has a secondary drain pan under it. The outlet for that drain pan is right above a main window outside. If the primary condensate drain gets plugged/fails and the water overflows into the backup pan there would be a stream of water in front of a window that shouldn't otherwise be there. They want you to be able to readily notice it as you are now at risk for significant water damage if that secondary drain manages to plug up too.
Always something worth considering when designing any system - how to make it fail in a way that is noticeable!
All of the effort that went into collecting the information was for nought, because the outcome was always the same. That was collection with a flowchart, but without two or more distinct outcomes.
A second example would be search engine logs. Nobody wanted to make decisions on them, but "we could always trawl them for data later." A decade on, this had never occurred. That was collection with no flow chart. Offloading the logs, parsing them out, making the data available, week after week, month after month, year after year. Wasted effort.
So part of it is "don't waste effort," but the other part is, if there is decent information to collect, you should be doing something with it.
For example, if the shop loses power, do they have the ability to sell goods still?
One approach is to let staff members estimate the value of goods - for example at the register, the staff member looks at the cart contents, estimates that it's about $120 worth of goods, charges the customer $120, and hand writes a receipt saying "$120 of goods sold, Date, store name, signature". The staff member then uses a phone to photograph the cart and the receipt.
At the end of the shift, the staff member drops all the photos into a big store-wide Dropbox account that the accounts department can use to pay taxes.
You'd probably want to practice this process ahead of time with every staff member.
I imagine it might actually be a good process to use on very busy days too - it is probably quicker than scanning every item at the register.
To some degree yes, we can check people out with a handheld (which has swappable batteries) and the self check registers are on the emergency power circuit.
Couple years ago, when the system went down nationwide, we just told people to put their name on their cart and we gave them 10% off if they came back the next day.
I ofc don't know what I don't know, but I'm super curious if anyone has insight into why such a complex system is required.
Also, if this microservice is used for brick and mortar, I can't imagine more than a couple hundred requests per second (2000 stores, 5 registers a store, and humans manually scanning items). Why did that overload the microservice (guessing it wasn't an endless exponential backoff)?
Because it's much more efficient, which allows them to use simpler tech that doesn't need to scale as well.
You are also underestimating the throughput the system needs to handle: 2000 stores * 10 registers per store * 1000 scans per register per hour ≈ 5,500 scans per second.
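A back-of-the-envelope check of that arithmetic (the inputs are the comment's assumptions, not measured figures):

```python
# Rough peak-throughput estimate from the comment's assumed inputs.
stores = 2000
registers_per_store = 10
scans_per_register_per_hour = 1000

scans_per_hour = stores * registers_per_store * scans_per_register_per_hour
scans_per_second = scans_per_hour / 3600
print(round(scans_per_second))  # ≈ 5,556
```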
Also, I get the simpler tech, but complexity breeds failure. If you have a hybrid on-prem/cloud model, especially with only 250k SKUs, at that point doesn't it make sense to keep it exclusively in the cloud?
At its core, it's a system that scans a barcode and returns an item. That is still well under the limits of an off-the-shelf system like Redis behind an endpoint.
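The "barcode in, item out" core, sketched with an in-memory dict standing in for Redis (in production this would be a Redis `GET` against a keyspace like `item:<sku>`). The SKU, fields, and key layout here are made up for illustration:

```python
# Core lookup of a POS item service: key = barcode, value = item record.
import json

store = {
    # hypothetical UPC -> serialized item record
    "item:036000291452": json.dumps({"name": "Example Widget", "price_cents": 499}),
}

def lookup(barcode):
    """Return the item record for a barcode, or None if unknown."""
    raw = store.get(f"item:{barcode}")
    return json.loads(raw) if raw is not None else None

print(lookup("036000291452"))
```

250k SKUs at a few hundred bytes each is well under 100 MB, so the whole catalog fits comfortably in memory on a single node.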
If you refer to your team members or employees as "associates" you're much more likely to treat them as equals.
Similarly, if you refer to your customers as "guests", you are much more likely to treat them as such rather than simply treating them as people in your store looking to spend money. It gets to the whole sense of trying to create an experience. As a store that sells a significant amount of home goods and goods for the home, referring to customers as guests instills the sense that employees are creating a home like experience for the customer.
Neurolinguistic programming isn't just for hippies. It's a very popular pseudoscience in corporate America.
On one level I suppose it's just silly terminology, but it grates with me. I guess it's supposed to imply some kind of familiar relationship, free of the gauche trappings of economics. To me a customer demands more attention than a "guest".
It shocks me how many people don't recognize that their employer wouldn't exist if not for customers. That should be front-and-center in the minds of anyone working for a for-profit entity. I don't think there's anything gauche about economics.