https://tsdr.uspto.gov/#caseNumber=98324800&caseSearchType=U...
My company is a Fivetran client, and they named that company after a (bad) joke, but it's worth a fortune.
Having to rebrand your product after launching is a lot more painful than doing it before launching.
Amazon just builds the same thing, calls it S3 Streams, and doesn’t care about S2.
Maybe they make a buyout offer.
I highly doubt they would sue.
That’s the kind of David vs. Goliath publicity one could only dream of …
Most people would simply say "Amazon is right." Because Amazon is right. This is an intentional attempt to leverage their product branding to promote a new product. There is very little good here.
If this were open-source, academic, non-profit, or something like that, perhaps. A small venture trying to commercialize on some digital equivalent of Amazon's trade dress? I can't imagine anyone would care....
Even when someone is 100% right, there is usually zero publicity. Right or wrong, in most cases I've seen, the small guy settles with the big guy who has the deep legal pockets and moves on, because litigating is too expensive.
In a situation like this one, your marketing spend / press coverage on the existing name is shot, links to your domain are shot, and perhaps you have egg on your face, depending on how things play out.
2) the problem here is that they're in the same business segment, and explicitly reference S3.
E.g. creating a product called “Gooogle”
If this had been released instead as a Papertrail-like end-user product with dashboards, etc. instead of a "cloud primitive" API so closely tied to AWS, it would make a lot more sense. Add the ability to bring my own S3-Compatible backend (such as Digital Ocean Spaces), and boom, you have a fantastic, durable, cloud-agnostic product.
There's no end to startups which can be described as existing-open-source-software as a service, marketed as a cheaper alternative to AWS offerings... who run on AWS.
Blob stores will also not let you do tailing reads, like you can with S2.
In AWS, S2's Express storage class takes care of writing to a quorum of 3 zonal buckets for regional durability.
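The zonal-quorum idea can be sketched as follows. This is a toy model of quorum writes, not S2's or AWS's actual implementation — the `Zone` class and `put_quorum` function are hypothetical stand-ins for zonal buckets and the write path:

```python
# Toy sketch: acknowledge a write once a quorum (2 of 3) of zonal
# "buckets" has accepted it. Illustrative only, not S2's real code.
from concurrent.futures import ThreadPoolExecutor, as_completed


class Zone:
    """Stand-in for a zonal bucket; a real one would be an S3 Express
    One Zone endpoint in a specific availability zone."""
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.objects = []

    def put(self, data):
        if self.healthy:
            self.objects.append(data)
        return self.healthy


def put_quorum(zones, data, quorum=2):
    """Write `data` to every zone in parallel; return True as soon as
    `quorum` puts have succeeded (regional durability reached)."""
    successes = 0
    with ThreadPoolExecutor(max_workers=len(zones)) as pool:
        futures = [pool.submit(zone.put, data) for zone in zones]
        for fut in as_completed(futures):
            if fut.result():
                successes += 1
                if successes >= quorum:
                    return True  # durable in a quorum of zones
    return False
```

The point of the pattern is that one zone being slow or down doesn't block acknowledgment, while the data still survives a single-zone loss.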
I doubt object stores will go from operating at the level of blobs and byte ranges, to records and sequence numbers. But I could be wrong.
If anything, they normalise an expectation with a budget-aware base.
An S3-level primitive API for streaming seems really valuable in the long term, if adopted.
I also see that on your pricing page -
"We are building the S3 experience for streaming data, and that includes pricing transparency"
Love the simple and earnest copy. One can imagine what an LLM would cook up instead, I find the brevity way preferable.
If I could put in one request...a video which describes what it is and how to use it would make it easier for me to understand.
I think of that as the "Chinese wall" of shipping SDKs: can someone not familiar with your product use it effectively from a language you don't know?
Gazette started as an internal tool in my previous startup (AdTech related). When forming our current business, we very briefly considered offering it as a raw service [1] before moving on to a holistic data movement platform that uses Gazette as an internal detail [2].
My feedback is: the market positioning for a service like this is extremely narrow. You basically have to make it API compatible with a thing that your target customer is already using so that trying it is zero friction (WarpStream nailed this), or you have to move further up to the application stack and more-directly address the problems your target customers are trying to solve (as we have). Good luck!
[0]: https://gazette.readthedocs.io/en/latest/
[1]: https://news.ycombinator.com/item?id=21464300
[2]: https://estuary.dev
ED: I appreciate where you are coming from, and understand the challenges ahead. Thank you for the advice.
In my opinion, the key is to find a value prop and positioning which lets prospects try your service while spending a minimum of their own risk capital / reputation points within their own org.
That makes it hard to go after core storage, because it's such a widely used, fundamental, and reliable part of most every company's infrastructure. You and I may agree that conventions of incremental files in S3 are a less-than-ideal primitive for representing streams, but plenty of companies are doing it this way just fine and don't feel that it's broken.
WarpStream, on the other hand, leaned in to the perceived complexity of running Kafka and the share of users who wanted a Kafka solution with the operational profile of using S3. Internal champions can sell trying their service because the prospect's existing thing is already understood to be a pain in the butt.
For what it's worth, if I were entering the space anew today I'd be thinking carefully about the Iceberg standard and what I might be able to do with it.
I suppose the writers could batch a group of records before writing them out as a larger blob, with background processes performing compaction, but it's still an object-backed streaming service, right?
AWS has shown their willingness to implement mostly-protocol compatible services (RDS -> Aurora), and I could see them doing the same with a Kafka reimplementation.
> I suppose the writers could batch a group of records before writing them out as a larger blob, with background processes performing compaction, but it's still an object-backed streaming service, right?
This is how it works essentially, yes. Architecting the system so that chunks that are written to object storage (before we acknowledge a write) are multi-tenant, and contain records from different streams, lets us write frequently while still targeting ideal (w/r/t price and performance) blob sizes for S3 standard and express puts respectively.
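The batching described above can be sketched roughly like this. All names (`ChunkWriter`, `append`, `flush`) are mine, not S2's API — it just shows the shape of the idea: records from many streams accumulate in one chunk, which is flushed when it hits a target size or a deadline, and only then are the pending writes acknowledged:

```python
import time


class ChunkWriter:
    """Toy sketch of multi-tenant chunk batching (hypothetical names,
    not S2's actual API): records from different streams share one
    chunk, flushed on a size or time threshold, then acked together."""

    def __init__(self, target_bytes=1024, deadline_s=0.05, put=lambda chunk: None):
        self.target_bytes = target_bytes
        self.deadline_s = deadline_s
        self.put = put            # e.g. an object-storage PUT
        self.pending = []         # (stream_id, record) awaiting durability
        self.size = 0
        self.started = None       # when the current chunk opened

    def append(self, stream_id, record):
        """Buffer a record; returns the list of newly-acked records
        (empty if the chunk isn't full yet)."""
        if self.started is None:
            self.started = time.monotonic()
        self.pending.append((stream_id, record))
        self.size += len(record)
        if (self.size >= self.target_bytes
                or time.monotonic() - self.started >= self.deadline_s):
            return self.flush()
        return []

    def flush(self):
        """PUT one chunk containing records from all streams, then ack."""
        acked, self.pending, self.size, self.started = self.pending, [], 0, None
        if acked:
            self.put(acked)
        return acked
```

The size threshold is what lets you target ideal blob sizes for S3 standard vs. express puts; the deadline bounds the ack latency for low-throughput streams.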
If you had a (paid for) API that sat on top of an S3 API for on-prem, that would be fantastic as well.
Kafka is great, but the whole Java ecosystem, the lack of control over what is in the topics, and coordinating the cluster in ZooKeeper make management a PITA.
I'd suggest a persistent emulator, using something like SQLite (one row per record). Even for local development, many applications need persistence. And it'd even be enough to run a single-node, low-throughput production server which doesn't need robust durability and availability. But it still has enough overhead and limitations not to compete with your cloud offering.
What's important, however, is staying as close as possible to your production system, behavior-wise. So I'd try to share as much of the frontend code (e.g. the gRPC and REST handlers) as possible between the two.
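A minimal sketch of the one-row-per-record SQLite emulator suggested above (class and method names are mine, purely illustrative):

```python
import sqlite3


class LocalStream:
    """Toy persistent stream emulator: one SQLite row per record,
    with a per-stream monotonically increasing sequence number."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS records ("
            " stream TEXT, seq INTEGER, body BLOB,"
            " PRIMARY KEY (stream, seq))")

    def append(self, stream, body):
        """Append one record; returns its assigned sequence number."""
        (last,) = self.db.execute(
            "SELECT COALESCE(MAX(seq), -1) FROM records WHERE stream = ?",
            (stream,)).fetchone()
        seq = last + 1
        self.db.execute("INSERT INTO records VALUES (?, ?, ?)",
                        (stream, seq, body))
        self.db.commit()  # durable on disk for a file-backed database
        return seq

    def read(self, stream, start_seq=0):
        """Read records in order from a given sequence number."""
        return self.db.execute(
            "SELECT seq, body FROM records WHERE stream = ? AND seq >= ?"
            " ORDER BY seq", (stream, start_seq)).fetchall()
```

Pointing `path` at a real file gives the "single node, modest durability" production mode; `:memory:` covers tests.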
But if I care about ingress and egress costs, which many stream-heavy infrastructure providers do... this doesn't add up.
I wish them luck, but I feel they would have had a much better chance from the start by getting some funding and having a loss leader start, then organising and passing on wholesale rates from cloud providers once they’d reached critical mass.
Instead they’re going in at retail which is very spicy. I feel like someone will clone the tech and let you self host, before big players copy it natively.
It’s a commodity space and they’re starting with a moat of a very busy 2 weeks from some Staff engineers at AWS.
I can sort of grasp what the S2 team is aiming to achieve, but it feels like I’m forced to perform unnecessary mental gymnastics to connect their platform with the specific problems it can solve for a business or product team.
I consider myself fairly technical and familiar with many of the underlying concepts, but I still couldn’t work out the practical utility without significant effort.
It’s worth noting that much of technology adoption is driven by technical product managers and similar stakeholders. However, I feel this critical audience is often overlooked in the messaging and positioning of developer tools like this.
For what it's worth, I am already familiar with this design space well enough that I don't need this kind of example in order to understand it. I've worked with Kinesis and other streaming systems before. But for people who haven't, an example might help.
What kind of business problem would someone have that causes them to turn to your service? What are the alternative solutions they might consider and how do those compare to yours? That's the kind of info they're asking for. You might benefit from pitching this such that people will understand it who have never considered streaming solutions before and don't understand the benefits. Pitch it to people who don't even realize they need this.
Errors, OTOH, I need a week or two of. But I consider these 2 different things. Logs are kind of a last resort when you really can't figure out what's going on in prod.
A distributed, but still consistent and durable log is a great building block for higher level abstractions.
2. Does the choice of storage class only affect chunks or also segments?
To me the best solution seems like storing writes on EBS (or even NVMe) initially to minimize the time until writes can be acknowledged, and creating a chunk on S3 standard every second or so. But I assume that would require significant engineering effort for applications that require data to be replicated to several AZs before acknowledging it. Though some applications might be willing to sacrifice 1s of writes on node failure in exchange for cheap and fast writes.
3. You could be clearer about what "latency" means. I see at least three different latencies that could be important to different applications:
a) time until a write is durably stored and acknowledged
b) time until a tailing reader sees a write
c) time to first byte after a read request for old data
4. How do you handle streams which are rarely written to? Will newly appended records to those streams remain in chunks indefinitely? Or do you create tiny segments? Or replace an existing segment with the concatenated data?
1) Storage is priced on uncompressed data. We don't currently compress segments.
2) It only affects chunk storage. We do have a 'Native' chunk store in mind; the sketch involves introducing NVMe disks (as a separate service the core depends on) so we can offer end-to-end tail latencies under 5 milliseconds.
3) The append ack latency and end-to-end latency with a tailing reader is largely equivalent for us since latest writes are in memory for a brief period after acknowledgment. If you try the CLI ping command (see GIF on landing page) from the same cloud region as us (AWS us-east-1 only currently), you'll see end-to-end and append ack latency as basically the same. TTFB for older data is ~ TTFB to get a segment data range from object storage, so it can be a few hundred milliseconds.
4) We have a deadline to free chunks, so we PUT a tiny segment if we have to.
Yep, this is approximately Gazette's architecture (https://github.com/gazette/core). It buys the latency profile of flash storage, with the unbounded storage and durability of S3.
An addendum is there's no need to flush to S3 quite that frequently, if readers instead tail ACK'd content from local disk. Another neat thing you can do is hand bulk historical readers pre-signed URLs to files in cloud storage, so those bytes don't need to proxy through brokers.
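The pre-signed URL trick can be sketched conceptually like this. This is not AWS SigV4 — just the shape of the idea, with a hypothetical signing scheme and domain: the broker signs the bucket/key/expiry tuple so a bulk reader can fetch historical bytes directly from storage, and storage verifies the signature without any broker round trip:

```python
import hmac
import hashlib
import time
from urllib.parse import urlencode

SECRET = b"broker-signing-key"   # hypothetical secret shared broker <-> storage


def presign(bucket, key, expires_in=3600, now=None):
    """Conceptual pre-signed URL (not AWS SigV4): sign method, bucket,
    key, and expiry so the bearer can GET the object until it expires."""
    expires = int(now if now is not None else time.time()) + expires_in
    msg = f"GET\n{bucket}\n{key}\n{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": sig})
    return f"https://{bucket}.example-storage.dev/{key}?{query}"


def verify(bucket, key, expires, signature, now=None):
    """Storage-side check: signature matches and the link hasn't expired."""
    msg = f"GET\n{bucket}\n{key}\n{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    not_expired = int(now if now is not None else time.time()) < int(expires)
    return not_expired and hmac.compare_digest(expected, signature)
```

Because the signature covers the expiry, a leaked URL goes stale on its own, and the broker never proxies the bulk bytes.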
https://www.sunlu.com/products/new-version-sunlu-filadryer-s...
o1, o3, s2, M4, r2, ...
- Unlimited streams. Current cloud systems limit you to a few thousand; with dedicated clusters, a few hundred K? If you want a stream per user, you are now dealing with multiple clusters.
- Elastic throughput per stream (i.e. a partition in Kafka) to 125 MiBps append / 500 MiBps realtime read / unlimited in aggregate for catching up. Current systems will have you at tens. And we may grow that limit yet. We are able to live migrate streams in milliseconds while keeping pipelined writes flowing, which gives us a lot of flexibility.
- Concurrency control mechanisms (https://s2.dev/docs/stream#concurrency-control)
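The linked doc describes fencing tokens and sequence-number matching. A toy model of how such checks typically behave (names and semantics here are illustrative, not S2's actual API):

```python
class Stream:
    """Toy model of two common append-time concurrency checks:
    a fencing token (reject stale writers) and an expected sequence
    number (detect concurrent appends). Hypothetical, not S2's API."""

    def __init__(self):
        self.records = []
        self.fencing_token = None

    def set_fencing_token(self, token):
        self.fencing_token = token   # e.g. set by the current leader

    def append(self, record, fencing_token=None, match_seq_num=None):
        if self.fencing_token is not None and fencing_token != self.fencing_token:
            raise PermissionError("fenced: stale writer")
        if match_seq_num is not None and match_seq_num != len(self.records):
            raise ValueError("sequence mismatch: concurrent append detected")
        self.records.append(record)
        return len(self.records) - 1   # assigned sequence number
```

The fencing token gives you leader election semantics on a plain log; the sequence check gives optimistic concurrency control for compare-and-append patterns.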
kudos for sitting down and making it happen!
Seems like there are a lot of more lightweight self-hosted S3-compatible stores around nowadays. Why even use S3?
I'd really love this extending more into the event sourcing space not just the log/event streaming space.
Dealing with problems like replay and log compaction etc.
Plus things like dealing with old events. Under GDPR, removing personal information/isolating it from the data/events themselves in an event sourced system are a PITA.
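One common pattern for the GDPR problem (not something the thread proposes, and the toy cipher below is for illustration only — use a real AEAD in practice) is crypto-shredding: encrypt each user's payloads with a per-user key kept outside the immutable log, and "erase" the user by deleting the key:

```python
import hashlib
import secrets


class KeyVault:
    """Toy crypto-shredding sketch: per-user keys live separately from
    the append-only event log; shredding a key makes that user's
    payloads unrecoverable without rewriting the log.
    (XOR keystream below is a demo cipher only, NOT secure.)"""

    def __init__(self):
        self.keys = {}

    def key_for(self, user):
        return self.keys.setdefault(user, secrets.token_bytes(32))

    def shred(self, user):
        self.keys.pop(user, None)   # GDPR erasure: events stay, key goes

    def _keystream(self, key, n):
        out = b""
        counter = 0
        while len(out) < n:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:n]

    def encrypt(self, user, plaintext):
        ks = self._keystream(self.key_for(user), len(plaintext))
        return bytes(a ^ b for a, b in zip(plaintext, ks))

    def decrypt(self, user, ciphertext):
        key = self.keys.get(user)
        if key is None:
            return None   # key shredded: payload gone for good
        ks = self._keystream(key, len(ciphertext))
        return bytes(a ^ b for a, b in zip(ciphertext, ks))
```

This keeps the event log itself append-only and replayable, while erasure becomes a key-management operation instead of a log rewrite.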
> I also kind of strongly dislike HtDP.
I'm researching programming pedagogy and I'm curious about your thoughts on this.
Wow man are you still stuck on S3?
This is an alternative to systems like Kafka which don't do great at giving a serverless experience.
Or more generally, when is it better to choose S2 vs services like SQS or Kinesis?
S2 sounds like an ordered queue to me, but those exist?
https://chatgpt.com/c/676703d4-7bc8-8003-9e5d-d6a402050439
Edit: Keep downvoting, only 5.6k to go!