>I can send an IP packet to Europe faster than I can send a pixel to the screen. How f’d up is that?
and to relate to the other post about landlines: https://twitter.com/ID_AA_Carmack/status/992778768417722368
>I made a long internal post yesterday about audio latency, and it included “Many people reading this are too young to remember analog local phone calls, and how the lag from cell phones changed conversations.”
Is there somewhere to read about the changes in question?
I'm old enough to remember extensive use of analog landlines, and can't really think of any difference to a cellphone other than audio quality.
Either Cisco needed to bring the cost down massively to expand access, or someone needed to build it in major cities and bill by the hour to compete against flying. Neither of those happened, so it stayed a niche. Compared to those experiences more than a decade ago, common VC is still very slowly catching up. Part of it is setup, like installing VC rooms with 2 smaller TVs side by side instead of one large one, so you can see the document and the other people at decent sizes. But part of it is still the technology. Those "telepresences" were almost surely on a dedicated link running on the telecom core network that guaranteed quality, instead of routing through the internet and randomly failing. I suspect getting really low latency will require that kind of telecom-level QoS; otherwise you'll be increasing buffer sizes to avoid freezes.
Something to consider is that there are alternatives to interframe compression. Intraframe compression (e.g. JPEG) can bring your per-frame encoding latency down to 0~10ms at the cost of a dramatic increase in bandwidth. Another benefit is the ability to draw any frame the moment you receive it, because every single JPEG contains 100% of the data. With almost all video codecs, you often need some number of prior frames to reconstitute a complete frame.
For certain applications on modern networks, intraframe compression may not be as unbearable an idea as it once was. I've thrown together a prototype using LibJpegTurbo and I am able to get a C#/AspNetCore websocket to push a framebuffer drawn in safe C# to my browser window in ~5-10 milliseconds @ 1080p. Testing this approach at 60fps redraw with event feedback has proven that ideal localhost roundtrip latency is nearly indistinguishable from native desktop applications.
The ultimate point here is that you can build something that runs with better latency than any streaming offering on earth right now - if you are willing to make sacrifices on bandwidth efficiency. My 3 weekend project arguably already runs much better than Google Stadia regarding both latency and quality, but the market for streaming game & video conference services which require 50~100 Mbps (depending on resolution & refresh rate) constant throughput is probably very limited for now. That said, it is also not entirely non-existent - think about corporate networks, e-sports events, very serious PC gamers on LAN, etc. Keep in mind that it is virtually impossible to cheat at video games delivered through these types of streaming platforms. I would very much like to keep the streaming gaming dream alive, even if it can't be fully realized until 10gbps+ LAN/internet is default everywhere.
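To put that bandwidth figure in perspective, here's a quick sketch (using the comment's own 50~100 Mbps and 60fps numbers) of the per-frame byte budget an intraframe stream gets at constant throughput:

```python
def per_frame_budget_kb(throughput_mbps: float, fps: int) -> float:
    """Bytes available per frame at a constant throughput, in KB."""
    bits_per_frame = throughput_mbps * 1_000_000 / fps
    return bits_per_frame / 8 / 1000

# At 100 Mbps / 60 fps, each standalone JPEG can be roughly 208 KB;
# at 50 Mbps the budget halves to roughly 104 KB per frame.
print(round(per_frame_budget_kb(100, 60)))  # 208
print(round(per_frame_budget_kb(50, 60)))   # 104
```

So each frame must fit in a couple hundred KB at most, which is why intraframe-only streaming trades so much bandwidth for its latency win.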
I was able to get latency down to 50ms, streaming to a browser using MPEG1[1]. The latency is mostly the result of a 1-frame (16ms) delay for screen capture on the sender, plus 2-3 frames of latency to get through the OS stack to the screen at the receiving end. Encoding and decoding took about 5ms. Plus of course the network latency, but I only tested this on local wifi, so it didn't add much.
[1] https://phoboslab.org/log/2015/07/play-gta-v-in-your-browser...
All the benefits of efficient codecs, more manageable handling of the latency downsides.
The challenges you'll run into immediately with JPEG are that the file size increase and encoding/decoding time at large resolutions outstrip any benefits you see in limited tests. For video game applications you have to figure out how to pipeline your streaming more efficiently than transferring a small ~10 KB image; otherwise you're transferring each full uncompressed frame to the CPU, which is expensive. Doing JPEG compression on the GPU is probably tricky. Decode is the other side of the problem: hardware video decoders are embarrassingly fast and super common, while your JPEG decode is going to be significantly slower.
* EDIT: For your weekend project, are you testing it with cloud servers or locally? I would be surprised if you're outperforming Stadia under equivalent network conditions, so be careful that you're not benchmarking local network performance against Stadia's performance on public networks.
(Be gentle on your coworkers and use cabled headphones.)
AptX low latency codec adds only 40ms max.
Just buy headphones with good low latency support. They aren't even expensive anymore.
Why can't I have both? Wifi doesn't seem to have this latency problem.
The thing is that when we talk in a room, sound takes <10ms to reach my ears from your mouth. This is what "enables" all of the human turn-taking cues in conversation (eye contact, picking up whether a sentence is about to end, whether it's a good time to chime in, etc.). I've been looking for work from people who've tried to pin down at what point things start feeling really bad (is it 10ms, or 50ms?), but haven't found much so far. Whatever the threshold is, it's likely that long-distance digital communication just cannot match it.
See also this interesting comment about the feeling of "closeness" from phone copper wires:
https://news.ycombinator.com/item?id=22931809
Landlines were so fast and so "direct" in their latency (distance correlated very directly with time, due to a lack of "hops") that local phone calls were faster than the speed of sound across a table. For a while after they came out--before people generally got used to seemingly random latency--local calls felt "intimate", as if you were talking to someone in bed with their head right next to yours. I've also heard stories of negotiators who had become really attuned to analyzing people's wait times while thinking, and who found long-distance calls confusing; the latency threw them off their game.
It seems normal phones are able to do it, though. At least they seem to suffer less from latency problems.
In a way, simplicity in technology often means better performance.
Digital communication could cheat, though!
There's a lot of latency hiding you can do, if you can predict well enough what's coming next. Humans are fairly predictable most of the time.
I don’t have many other comments to make other than I am surprised rust-analyzer was only mentioned in passing.
Beyond that, I wish the article had explained a bit better why it chose these "better-than-std" crates. I'm actually using all the std variants in my projects, and I'm curious to know whether I'm missing out or just happen not to hit their limitations.
At least for parking_lot, its README has a long list with its advantages over std: https://github.com/Amanieu/parking_lot/blob/master/README.md
To clarify, we're targeting "transparent" sounding audio, not "FLACs or bust" audio. Right now we send stereo 48kHz 96kb/s Opus (CELT, not SILK) that we found hit the voice transparency sweet-spot compared to the lossless audio source. We had used higher bitrates in the past, and could easily go back to them, but quality plateaued at around 96k in our experimentation.
More than choosing sane transparent-sounding encoding parameters, the biggest difference in fidelity by far was choosing the correct microphones and speakers for accurate reproduction of voices.
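For scale, a back-of-the-envelope comparison against the raw PCM rate being compressed down to 96 kb/s (assuming a 16-bit source, which the comment doesn't state):

```python
sample_rate = 48_000  # Hz, as stated above
channels = 2          # stereo
bit_depth = 16        # assumption: 16-bit PCM source

raw_kbps = sample_rate * channels * bit_depth / 1000
print(raw_kbps)       # 1536.0 kb/s of raw PCM
print(raw_kbps / 96)  # 16.0 -- a 16x reduction at the 96 kb/s sweet spot
```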
Are you using 48khz for a specific reason?
Audiophile grade at least has roots in high fidelity.
Does it though? Audiophiles generally seem to eschew fidelity in favour of something that sounds subjectively nice, including the psychoacoustic effects of spending a lot of money.
Eg. they seem very fond of "warmth". If you asked me to make something sound "warm", I'd be applying some soft clipping and dampening the top end, not eliminating sources of distortion.
Edit: If you actually wanted high fidelity, you'd use studio headphones / monitors, which are designed to be "unflattering", so you can be confident you'll hear any issues when mixing / mastering. People don't normally listen for pleasure with those, because they become fatiguing after a few hours.
Choosing equipment because you like the sound is a very reasonable thing to do, but it's not the same as pursuing fidelity.
It's too bad they didn't explain it. I expected they meant allowance for "full bandwidth" audio (possibly including music you can listen to).
Video conferencing systems generally use voice-only codecs compressed to shit, full of artifacts in the voice range and utterly dead outside of it.
That's just one of their dependencies. It's possible to know every line without rewriting. And it's possible to rewrite and still not know every line.
They seem to strike a reasonable balance.
I bet IBM didn't expect that using off-the-shelf components would make the IBM PC the standard for the next 30 years, or that the standard wouldn't be theirs.
-----
As someone who works at a company that relies heavily on video conferencing (half the devs are offshore): every single major solution absolutely sucks. They are flaky and unreliable, sound quality is poor, video rate is poor (and this is with fat pipes at both ends), and worst of all is the latency; latency when trying to have a round-table conversation with remote people is horrific. It is good to see someone pushing the limits. Skype et al. haven't gotten much better in the last decade, yet my internet connection at home/work is 50 times faster and even mid-range business laptops have much improved graphics grunt.
As for sound, I don’t think audiophile quality is necessary...
Given you'll need about 10Mbps upstream for 60fps 3K video, it seems a little unreasonable not to add a 320Kbps (or more) audio stream.
It could make this useful for things like streaming music concerts.
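The arithmetic behind that: at the figures above, the audio stream is a rounding error next to the video.

```python
video_kbps = 10_000  # ~10 Mbps upstream for 60fps 3K video, per the comment
audio_kbps = 320     # proposed high-quality audio stream

share = audio_kbps / (video_kbps + audio_kbps)
print(f"{share:.1%}")  # audio would be ~3.1% of the combined stream
```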
JackTrip is the resulting software -- not end-user friendly, but apparently it works.
https://ccrma.stanford.edu/groups/soundwire/software/jacktri...
(Some basic numbers: sound takes about 1 ms to travel a foot, so every ms of latency is a foot of separation between musicians; 30ms of latency = 30 ft of separation = the max for jamming. So 130ms is not low enough.)
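The rule of thumb in that parenthetical, sketched out:

```python
MS_PER_FOOT = 1.0  # rule of thumb from above: sound covers ~1 ft per ms

def equivalent_separation_ft(latency_ms: float) -> float:
    """Physical distance between musicians that a one-way latency 'feels like'."""
    return latency_ms / MS_PER_FOOT

print(equivalent_separation_ft(30))   # 30.0 ft -- the stated max for jamming
print(equivalent_separation_ft(130))  # 130.0 ft -- far beyond jamming range
```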
Zoom calls do not sound like the other person is there in the room with you. Microphones are terrible; there's compression artifacts, latency, packet loss, background noise, and tiny speakers. No one could possibly close their eyes and forget that the other person is not in the room with them, on any POTS or VoIP technology that exists. But what if you could create an audio communications system with an actual illusion of auditory presence? That sounds amazing!
And given that this company is trying to create wall-screen, life-size ultra-HD video conferences, I'm pretty sure that "audiophile" is exactly what they're going for. Personally, as a remote worker, I would absolutely swoon for this.
Regarding the premise of high latency in webrtc: Google Stadia has ~160ms round trip latency at 4k from my Macbook to a data center, so it’s not like that’s unachievable.
Is it live streaming or is it the transport?
Are they doing video encoding (the audio encoding seems to be done by that webrtc-audio thing)?
Have they chosen a progressive encoding format that compresses frames and pumps them out to the wire as soon as they're done?
Is TCP or UDP involved or a new Layer 3 protocol entirely?
Have I just missed all of those parts or were they really missing amid all the Rust celebration?
tonari is the entire stack, similar in "feature scope" to WebRTC but with different goals and target environments.
> Are they doing video encoding (the audio encoding seems to be done by that webrtc-audio thing)?
Yep, this includes video encoding and transport. We don't use the WebRTC audio library for encoding or transport, just for echo cancellation and other helpful acoustic processing.
> Have they chosen a progressive encoding format that compresses frames and pumps them out to the wire as soon as they're done?
Yep, basically, if by that you mean we don't use B-frames or other codec features that would require buffering multiple video frames before a compressed frame can be emitted, so we're able to send out each frame as soon as it's encoded.
> Is TCP or UDP involved or a new Layer 3 protocol entirely?
We encapsulate our protocol in UDP since we operate on normal internet - a new protocol is out of the question without a huge lobbying force and 15 years of patience on your side.
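A minimal sketch of that kind of UDP encapsulation (the 4-byte sequence header and names here are illustrative, not tonari's actual wire format): each encoded frame chunk travels as its own datagram, with just enough header for the receiver to order and reassemble.

```python
import socket
import struct

def send_chunk(sock: socket.socket, addr, seq: int, payload: bytes) -> None:
    # Illustrative header: 4-byte big-endian sequence number, then the encoded bytes.
    sock.sendto(struct.pack("!I", seq) + payload, addr)

# Loopback demo: one datagram, sent and received.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # OS picks a free port
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

send_chunk(sender, receiver.getsockname(), 42, b"encoded-frame-bytes")
datagram, _ = receiver.recvfrom(65535)
seq = struct.unpack("!I", datagram[:4])[0]
print(seq, datagram[4:])  # 42 b'encoded-frame-bytes'
```

Running over UDP means the application owns retransmission, ordering, and pacing decisions, which is exactly the control a latency-sensitive protocol wants.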
> Have I just missed all of those parts or were they really missing amid all the Rust celebration?
We intentionally didn't get into the protocol details because we are saving that for a dedicated post (and code to back it up).
Looking forward to the technical post. If you're planning on releasing all of this royalty-free and opensource, you'd be quite a boon to the free and open internet. Getting this picked up by the likes of Mozilla and getting it into a browser would be amazing.
It seems that it can only connect to publicly visible hosts? Overall it looks like somebody should develop an application on top of this.
like Brian's 1970s-era MacBook Pro
That's a writer who knows what it's like to read long (aka thorough) technical articles and not bore the readers to death. Great article!
After interaction with both rustfmt and go fmt, I have concluded that .editorconfig is solving a problem that really shouldn't be solved. We went through the ordeal of defining our C# coding standards where I work and, let me tell you, people (myself included) care very deeply about their way of structuring code. And it's a bloody waste of their time.
Having the language designers say, "here is how our language should be structured" is a breath of fresh air.
From a product point of view, I find it interesting that the illustrations/concept videos for these things always show people interacting very closely to the wall - e.g. playing chess, sitting around a table, etc.
https://tonari.no/static/media/family.48218197.svg
But in practice, people tend to keep their distance from it. E.g. the pictures of this setup tend to show people clustered in their own group on each side of the wall, with a solid 2-3 meters from the wall.
https://blog.tonari.no/images/ea56c74d-a55d-4183-9a7b-d69795...
It makes sense, it's awkward to be close to a large solid (emissive) surface, and humans instinctively get closer to their in group when faced with an out group. I wonder how the system could be designed to encourage participants being closer, if there is an advantage to that.
Even over wifi, speedtest shows 4ms ping and 100Mbps down/up on my internet connection, but Zoom, FaceTime, and others never use more than about 0.8Mbit/s for a video stream, and the resulting quality of audio and video is... understandably poor.
Latency, too, totally feels like a software problem, perhaps with too many layers of abstraction (60fps → 16ms for the camera, ~10ms for encoding with NVENC or equivalents, 35ms measured one-way network latency from my laptop to my parents 4000km away, ~10ms decode, 16ms frame delay = 87ms one way). Maybe I'm asking too much of non-realtime systems (I'm used to RTOS, extensive use of DMA, zero-copy network drivers, etc.), but it seems there is a lot of room to improve.
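Summing that one-way latency budget:

```python
# One-way latency components from the comment above (all in ms).
budget = {
    "camera (60fps frame time)": 16,
    "encode (NVENC or equivalent)": 10,
    "network (laptop -> 4000km away, measured)": 35,
    "decode": 10,
    "display frame delay": 16,
}
total_ms = sum(budget.values())
print(total_ms)  # 87 ms one way
```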
I have been itching to convert a small headshot video stream (think under 100x100px) to audio, stream it over Mumble, and then convert it back to video, just to see what the latency is like. It would obviously be a big undertaking, but not as big as this, methinks.
This rings very true for every high-performance thing I've ever worked on, from games to trading systems.
I totally feel you. It's impressive what the WebRTC implementation has achieved, but it's just not pleasant at all to work with it.
Latency happens throughout the whole stack; unfortunately, much of it would need to be fixed outside this project to achieve any further significant improvement.
Operating System, firmware, blackbox hardware are some other non-negligible sources of latency. Everything adds up.
They actually work on reducing latency and pushing high res video if your connection supports it.
for crate in $(ls */Cargo.toml | xargs dirname); do
    (cd "$crate" && cargo build)
done
Why do this instead of `cargo build --workspace`? Is it so you can time the individual crates?

But as long as we're nitpicking, nobody should pipe `ls` into `xargs` like this, since it fails if anything has spaces in it.
Instead, do:
for cargo_toml in */Cargo.toml; do
    crate="$(dirname "${cargo_toml}")"
    pushd "${crate}" > /dev/null
    # ...
    popd > /dev/null
done
Don't be that person who writes a script which won't tolerate spaces in filenames!

Oh please. This is just Rust sensationalism. People don't truly believe Rust is faster than C, do they?