>I can send an IP packet to Europe faster than I can send a pixel to the screen. How f’d up is that?
and to relate to the other post about landlines: https://twitter.com/ID_AA_Carmack/status/992778768417722368
>I made a long internal post yesterday about audio latency, and it included “Many people reading this are too young to remember analog local phone calls, and how the lag from cell phones changed conversations.”
Is there somewhere to read about the changes in question?
I'm old enough to remember extensive use of analog landlines, and can't really think of any difference to a cellphone other than audio quality.
Either Cisco needed to bring the cost down massively to expand access, or someone needed to build it in major cities and bill by the hour to compete against flying. Neither of those happened, so it stayed a niche. Compared to those experiences more than a decade ago, common VC is still very slowly catching up. Part of it is setup, like installing VC rooms with 2 smaller TVs side by side instead of one large one, so you can see the document and the other people at decent sizes. But part of it is still the technology. Those "telepresences" were almost surely on a dedicated link running on the telecom core network that guaranteed quality, instead of routing through the internet and randomly failing. I suspect getting really low latency will require that kind of telecom-level QoS; otherwise you'll be increasing buffer sizes to avoid freezes.
Something to consider is that there are alternatives to interframe compression. Intraframe compression (e.g. JPEG) can bring your per-frame encoding latency down to 0~10ms at the cost of a dramatic increase in bandwidth. Another benefit is the ability to draw any frame the moment you receive it, because every single JPEG contains 100% of the data. With almost all video codecs, you often need some number of prior frames to reconstitute a complete frame.
For certain applications on modern networks, intraframe compression may not be as unbearable an idea as it once was. I've thrown together a prototype using LibJpegTurbo and I am able to get a C#/AspNetCore websocket to push a framebuffer drawn in safe C# to my browser window in ~5-10 milliseconds @ 1080p. Testing this approach at 60fps redraw with event feedback has proven that ideal localhost roundtrip latency is nearly indistinguishable from native desktop applications.
The ultimate point here is that you can build something that runs with better latency than any streaming offering on earth right now - if you are willing to make sacrifices on bandwidth efficiency. My 3 weekend project arguably already runs much better than Google Stadia regarding both latency and quality, but the market for streaming game & video conference services which require 50~100 Mbps (depending on resolution & refresh rate) constant throughput is probably very limited for now. That said, it is also not entirely non-existent - think about corporate networks, e-sports events, very serious PC gamers on LAN, etc. Keep in mind that it is virtually impossible to cheat at video games delivered through these types of streaming platforms. I would very much like to keep the streaming gaming dream alive, even if it can't be fully realized until 10gbps+ LAN/internet is default everywhere.
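To put that bandwidth figure in perspective, here's a quick sketch (using the comment's own 50~100 Mbps and 60fps numbers) of the per-frame byte budget an intraframe stream gets at constant throughput:

```python
def per_frame_budget_kb(throughput_mbps: float, fps: int) -> float:
    """Bytes available per frame at a constant throughput, in KB."""
    bits_per_frame = throughput_mbps * 1_000_000 / fps
    return bits_per_frame / 8 / 1000

# At 100 Mbps / 60 fps, each standalone JPEG can be roughly 208 KB;
# at 50 Mbps the budget halves to roughly 104 KB per frame.
print(round(per_frame_budget_kb(100, 60)))  # 208
print(round(per_frame_budget_kb(50, 60)))   # 104
```

So each frame must fit in a couple hundred KB at most, which is why intraframe-only streaming trades so much bandwidth for its latency win.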
I was able to get latency down to 50ms, streaming to a browser using MPEG1[1]. The latency is mostly the result of a 1-frame (16ms) delay for screen capture on the sender, plus 2-3 frames of latency to get through the OS stack to the screen at the receiving end. Encoding and decoding took about 5ms. Plus of course the network latency, but I only tested this on local wifi, so it didn't add much.
[1] https://phoboslab.org/log/2015/07/play-gta-v-in-your-browser...
All the benefits of efficient codecs, more manageable handling of the latency downsides.
The challenges you'll run into immediately with JPEG are that the file size increase and encoding/decoding time at large resolutions outstrip any benefits you see in limited tests. For video game applications you have to figure out how to pipeline your streaming more efficiently than transferring a small ~10 KB image; otherwise you're transferring each full uncompressed frame to the CPU, which is expensive. Doing JPEG compression on the GPU is probably tricky. Decode is the other side of the problem: hardware video decoders are embarrassingly fast and super common, while your JPEG decode is going to be significantly slower.
* EDIT: For your weekend project, are you testing it with cloud servers or locally? I would be surprised if you're outperforming Stadia under equivalent network conditions, so be careful that you're not benchmarking local network performance against Stadia's performance on public networks.
(Be gentle on your coworkers and use cabled headphones.)
AptX low latency codec adds only 40ms max.
Just buy headphones with good low latency support. They aren't even expensive anymore.
Why can't I have both? Wifi doesn't seem to have this latency problem.
The thing is that when we talk in a room, sound takes <10ms to reach my ears from your mouth. This is what "enables" all of the human turn-taking cues in conversation (eye contact, picking up whether a sentence is about to end, whether it's a good time to chime in, etc.). I've been looking for work from people who've tried to pin down at what point things start feeling really bad (is it 10ms, or 50ms?), but haven't found much so far. Whatever the threshold is, it's likely that long-distance digital communication just cannot match it.
See also this interesting comment about the feeling of "closeness" from phone copper wires:
https://news.ycombinator.com/item?id=22931809
Landlines were so fast and so "direct" in their latency (distance correlated very directly with time, due to a lack of "hops") that local phone calls were faster than the speed of sound across a table. For a while after they came out--before people generally got used to seemingly random latency--local calls felt "intimate", as if you were talking to someone in bed with their head right next to yours. I've also heard stories of negotiators who had become really attuned to analyzing people's wait times while thinking, and who found long-distance calls confusing; the latency threw them off their game.
It seems normal phones are able to do it, though. At least they seem to suffer less from latency problems.
In a way, simplicity in technology often means better performance.
Digital communication could cheat, though!
There's a lot of latency hiding you can do, if you can predict well enough what's coming next. Humans are fairly predictable most of the time.
I don’t have many other comments to make other than I am surprised rust-analyzer was only mentioned in passing.
Beyond that, I wish the article had explained a bit better why it chose these "better-than-std" crates. I'm actually using all the std variants in my projects, and I'm curious to know whether I'm missing out or just happen not to hit their limitations.
At least for parking_lot, its README has a long list with its advantages over std: https://github.com/Amanieu/parking_lot/blob/master/README.md
To clarify, we're targeting "transparent" sounding audio, not "FLACs or bust" audio. Right now we send stereo 48kHz 96kb/s Opus (CELT, not SILK) that we found hit the voice transparency sweet-spot compared to the lossless audio source. We had used higher bitrates in the past, and could easily go back to them, but quality plateaued at around 96k in our experimentation.
More than choosing sane transparent-sounding encoding parameters, the biggest difference in fidelity by far was choosing the correct microphones and speakers for accurate reproduction of voices.
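For scale, a back-of-the-envelope comparison against the raw PCM rate being compressed down to 96 kb/s (assuming a 16-bit source, which the comment doesn't state):

```python
sample_rate = 48_000  # Hz, as stated above
channels = 2          # stereo
bit_depth = 16        # assumption: 16-bit PCM source

raw_kbps = sample_rate * channels * bit_depth / 1000
print(raw_kbps)       # 1536.0 kb/s of raw PCM
print(raw_kbps / 96)  # 16.0 -- a 16x reduction at the 96 kb/s sweet spot
```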
Are you using 48khz for a specific reason?
Audiophile grade at least has roots in high fidelity.
Does it though? Audiophiles generally seem to eschew fidelity in favour of something that sounds subjectively nice, including the psychoacoustic effects of spending a lot of money.
Eg. they seem very fond of "warmth". If you asked me to make something sound "warm", I'd be applying some soft clipping and dampening the top end, not eliminating sources of distortion.
Edit: If you actually wanted high fidelity, you'd use studio headphones / monitors, which are designed to be "unflattering", so you can be confident you'll hear any issues when mixing / mastering. People don't normally listen for pleasure with those, because they become fatiguing after a few hours.
Choosing equipment because you like the sound is a very reasonable thing to do, but it's not the same as pursuing fidelity.
It's too bad they didn't explain it. I expected they meant allowance for "full bandwidth" audio (possibly including music you can listen to).
Video conferencing systems generally use voice-only codecs compressed to shit, full of artifacts in the voice range and utterly dead outside of it.
That's just one of their dependencies. It's possible to know every line without rewriting. And it's possible to rewrite and still not know every line.
They seem to strike a reasonable balance.
I bet IBM didn't expect that using off-the-shelf components would make the IBM PC the standard for the next 30 years, or that the standard wouldn't be theirs.
-----
As someone who works at a company that relies heavily on video conferencing (half the devs are offshore): every single major solution absolutely sucks. They are flaky and unreliable, sound quality is poor, video rate is poor (and this is with fat pipes at both ends), and worst of all is the latency; latency when trying to have a round-table conversation with remote people is horrific. It is good to see someone pushing the limits. Skype et al. haven't gotten much better in the last decade, yet my internet connection at home/work is 50 times faster and even mid-range business laptops have much improved graphics grunt.
As for sound, I don’t think audiophile quality is necessary...
Given you'll need about 10Mbps upstream for 60fps 3K video, it seems a little unreasonable not to add a 320Kbps (or more) audio stream.
It could make this useful for things like streaming music concerts.
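The arithmetic behind that: at the figures above, the audio stream is a rounding error next to the video.

```python
video_kbps = 10_000  # ~10 Mbps upstream for 60fps 3K video, per the comment
audio_kbps = 320     # proposed high-quality audio stream

share = audio_kbps / (video_kbps + audio_kbps)
print(f"{share:.1%}")  # audio would be ~3.1% of the combined stream
```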
JackTrip is the resulting software -- not end-user friendly, but apparently it works.
https://ccrma.stanford.edu/groups/soundwire/software/jacktri...
(Some basic numbers: sound takes about 1 ms to travel a foot, so every ms of latency is a foot of separation between musicians; 30ms of latency = 30 ft of separation = the max for jamming. So 130ms is not low enough.)
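The rule of thumb in that parenthetical, sketched out:

```python
MS_PER_FOOT = 1.0  # rule of thumb from above: sound covers ~1 ft per ms

def equivalent_separation_ft(latency_ms: float) -> float:
    """Physical distance between musicians that a one-way latency 'feels like'."""
    return latency_ms / MS_PER_FOOT

print(equivalent_separation_ft(30))   # 30.0 ft -- the stated max for jamming
print(equivalent_separation_ft(130))  # 130.0 ft -- far beyond jamming range
```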
Zoom calls do not sound like the other person is there in the room with you. Microphones are terrible; there's compression artifacts, latency, packet loss, background noise, and tiny speakers. No one could possibly close their eyes and forget that the other person is not in the room with them, on any POTS or VoIP technology that exists. But what if you could create an audio communications system with an actual illusion of auditory presence? That sounds amazing!
And given that this company is trying to create wall-screen, life-size ultra-HD video conferences, I'm pretty sure that "audiophile" is exactly what they're going for. Personally, as a remote worker, I would absolutely swoon for this.
Regarding the premise of high latency in webrtc: Google Stadia has ~160ms round trip latency at 4k from my Macbook to a data center, so it’s not like that’s unachievable.
Is it live streaming or is it the transport?
Are they doing video encoding (the audio encoding seems to be done by that webrtc-audio thing)?
Have they chosen a progressive encoding format that compresses frames and pumps them out to the wire as soon as they're done?
Is TCP or UDP involved or a new Layer 3 protocol entirely?
Have I just missed all of those parts or were they really missing amid all the Rust celebration?
tonari is the entire stack, similar in "feature scope" to WebRTC but with different goals and target environments.
> Are they doing video encoding (the audio encoding seems to be done by that webrtc-audio thing)?
Yep, this includes video encoding and transport. We don't use the WebRTC audio library for encoding or transport, just for echo cancellation and other helpful acoustic processing.
> Have they chosen a progressive encoding format that compresses frames and pumps them out to the wire as soon as they're done?
Yep, basically, if by that you mean we don't use B-frames or other codec features that would require buffering multiple video frames before a compressed frame can be emitted, so we're able to send out each frame as soon as it's encoded.
> Is TCP or UDP involved or a new Layer 3 protocol entirely?
We encapsulate our protocol in UDP since we operate on normal internet - a new protocol is out of the question without a huge lobbying force and 15 years of patience on your side.
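A minimal sketch of that kind of UDP encapsulation (the 4-byte sequence header and names here are illustrative, not tonari's actual wire format): each encoded frame chunk travels as its own datagram, with just enough header for the receiver to order and reassemble.

```python
import socket
import struct

def send_chunk(sock: socket.socket, addr, seq: int, payload: bytes) -> None:
    # Illustrative header: 4-byte big-endian sequence number, then the encoded bytes.
    sock.sendto(struct.pack("!I", seq) + payload, addr)

# Loopback demo: one datagram, sent and received.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # OS picks a free port
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

send_chunk(sender, receiver.getsockname(), 42, b"encoded-frame-bytes")
datagram, _ = receiver.recvfrom(65535)
seq = struct.unpack("!I", datagram[:4])[0]
print(seq, datagram[4:])  # 42 b'encoded-frame-bytes'
```

Running over UDP means the application owns retransmission, ordering, and pacing decisions, which is exactly the control a latency-sensitive protocol wants.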
> Have I just missed all of those parts or were they really missing amid all the Rust celebration?
We intentionally didn't get into the protocol details because we are saving that for a dedicated post (and code to back it up).
Looking forward to the technical post. If you're planning on releasing all of this royalty-free and opensource, you'd be quite a boon to the free and open internet. Getting this picked up by the likes of Mozilla and getting it into a browser would be amazing.
It seems that it can only connect to publicly visible hosts? Overall it looks like somebody should develop an application on top of this.
like Brian's 1970s-era MacBook Pro
That's a writer who knows what it's like to read long (aka thorough) technical articles and not bore the readers to death. Great article!
After interaction with both rustfmt and go fmt, I have concluded that .editorconfig is solving a problem that really shouldn't be solved. We went through the ordeal of defining our C# coding standards where I work and, let me tell you, people (myself included) care very deeply about their way of structuring code. And it's a bloody waste of their time.
Having the language designers say, "here is how our language should be structured" is a breath of fresh air.
From a product point of view, I find it interesting that the illustrations/concept videos for these things always show people interacting very closely to the wall - e.g. playing chess, sitting around a table, etc.
https://tonari.no/static/media/family.48218197.svg
But in practice, people tend to keep their distance from it. E.g. the pictures of this setup tend to show people clustered in their own group on each side of the wall, with a solid 2-3 meters from the wall.
https://blog.tonari.no/images/ea56c74d-a55d-4183-9a7b-d69795...
It makes sense, it's awkward to be close to a large solid (emissive) surface, and humans instinctively get closer to their in group when faced with an out group. I wonder how the system could be designed to encourage participants being closer, if there is an advantage to that.
Even over wifi, speedtest shows 4ms ping and 100Mbps down/up on my internet connection, but Zoom, FaceTime, and others never use more than about 0.8Mbit/s for a video stream, and the resulting quality of audio and video is... understandably poor.
Latency, too, totally feels like a software problem, perhaps with too many layers of abstraction (60fps → 16ms for the camera, ~10ms for encoding with NVENC or equivalents, 35ms measured one-way network latency from my laptop to my parents 4000km away, ~10ms decode, 16ms frame delay = 87ms one way). Maybe I'm asking too much of non-realtime systems (I'm used to RTOS, extensive use of DMA, zero-copy network drivers, etc.), but it seems there is a lot of room to improve.
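Summing that one-way latency budget:

```python
# One-way latency components from the comment above (all in ms).
budget = {
    "camera (60fps frame time)": 16,
    "encode (NVENC or equivalent)": 10,
    "network (laptop -> 4000km away, measured)": 35,
    "decode": 10,
    "display frame delay": 16,
}
total_ms = sum(budget.values())
print(total_ms)  # 87 ms one way
```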
I have been itching to convert a small headshot video stream (think under 100x100px) to audio, stream it over Mumble, and then convert it back to video, just to see what the latency is like. It would obviously be a big undertaking, but not as big as this, methinks.
This rings very true for every high-performance thing I've ever worked on, from games to trading systems.
I totally feel you. It's impressive what the WebRTC implementation has achieved, but it's just not pleasant at all to work with it.
Latency happens throughout the whole stack; unfortunately, much of it would need to be fixed outside this project to achieve any further significant improvement.
Operating System, firmware, blackbox hardware are some other non-negligible sources of latency. Everything adds up.
They actually work on reducing latency and pushing high res video if your connection supports it.
for crate in $(ls */Cargo.toml | xargs dirname); do
    (cd "$crate" && cargo build)
done
Why do this instead of `cargo build --workspace`? Is it so you can time the individual crates?

But as long as we're nitpicking, nobody should pipe `ls` into `xargs` like this, since it fails if anything has spaces in it.
Instead, do:
for cargo_toml in */Cargo.toml; do
    crate="$(dirname "${cargo_toml}")"
    pushd "${crate}" > /dev/null
    # ...
    popd > /dev/null
done
Don't be that person who writes a script which won't tolerate spaces in filenames!

Oh please. This is just Rust sensationalism. People don't truly believe Rust is faster than C, do they?