Debugging audio artifacts caused by... a serial port? (opens in new tab)

(recall.ai)

56 pointsdavidgu1y ago29 comments

29 comments

Bottlenecking on logging is a common problem. Classically, Linux assumed it has a serial console device. It can still be enabled at kernel compile time.[1] Apparently, this mass of AWS instances, VMs, and Docker images works that way.

I liked the QNX approach, where the kernel sends log messages to another process, loaded on boot. When you build a boot image, you provide a logger process to read those messages. There's no expectation in embedded that there's a console available. If your system is a pump or an auto dashboard, you need to send the messages somewhere other than a "console". There might not even be a file system. In QNX, messaging is a primitive on top of which file systems and networking are built.

When I had to do logging in real time, each process used a library which did "logprintf" calls. Those went into a circular buffer which was written to a log file from another thread. If the circular buffer filled, "..." would appear in the log, and log messages would be lost, but the real-time thread would not block.

Interaction between real-time and non-real-time is always tough. It comes up a lot in networked game development.

[1] https://www.kernel.org/doc/html/latest/admin-guide/serial-co...

saagarjha1y ago

How do you debug kernel boot before that process is started?

Animats1y ago

In QNX? The boot loader loads the kernel and whatever other processes you put in the boot image. They all start as soon as the kernel starts. The kernel itself is just memory allocation, message passing, CPU dispatching, and timers. It has no I/O at all, no persistent state, and it's passive - any activity initiation has to come from user space. It doesn't even have strings.

Actual kernel debugging is rarely needed except on strange or broken hardware. If you have to do that, you use a JTAG debugger.

Further startup is handled by a startup process in the boot image. It's running in user space, and loads more drivers, file systems, networking, etc. Anything it needs to log it sends to a user space logger process that was also part of the boot image. What that does depends on the target hardware. Often, there's no console or display in embedded. It might send on a network. Or write to a circular buffer that can be read out later.

kazinator1y ago

In terms of print debugging, one way is by using lower level printing routines, like something specific to the target system you're using that sends bytes directly to a serial port.

PaulDavisThe1st1y ago

> There are a few processes in the bot that are especially latency sensitive, which we have tuned the nice value for.

This immediately signifies some level of "does not understand how this stuff works". Latency sensitive (audio or other) tasks need to be in the SCHED_RR or SCHED_FIFO scheduling class, which nice(1) has no effect on.

Conversely, using nice(1) on a SCHED_OTHER task is also unlikely to work, given that nice only impacts scheduling decisions and cannot provide RT-like behavior.

rcxdude1y ago

I think in this case, unless they were running an RT kernel, it wouldn't have helped, since the interrupt hogging the CPU was non-preemptable. But it's good advice in general.

PaulDavisThe1st1y ago

RT is in the mainline kernel now. Run with threaded_irqs, all interrupt handlers become preemptable (by default).

1 more reply

theamk1y ago

I am surprised.. all this complex discovery and the syslogs were containing the culprit line from the first moment:

    serial8250: too much work for irq4

do people not look at syslog anymore? It's one of the first things I do on unexplainable problems - check for OOM's, thermal throttling, BUGs, etc... Sure, it's not the most common problem, but the check is fast and easy.

snozolli1y ago

I'm curious why this is even an issue. I don't understand why an actual interrupt would get tripped for virtual serial port writes, and I don't understand why a virtual serial port (i.e. logging) gets swamped by what seems like a moderate stream of data.

The first result for "8250 too much work for irq4" is this: https://unix.stackexchange.com/questions/387600/understandin... (2017)

The problem is that the UART hardware that is emulated by various brands of virtual machine behaves impossibly, sending characters at an impossibly fast line speed. To the kernel, this is indistinguishable from faulty real UART hardware that is continually raising an interrupt for an empty output buffer/full input buffer. (Such faulty real hardwares exist, and you will find embedded Linux people also discussing this problem here and there.) The kernel pushes the data out/pulls the data in, and the UART is immediately raising an interrupt saying that it is ready for more.

So, is this seven year old problem still a problem, or is the virtual serial port driver actually unable to keep up with the stream of text?

toast01y ago

> I don't understand why an actual interrupt would get tripped for virtual serial port writes, and I don't understand why a virtual serial port (i.e. logging) gets swamped by what seems like a moderate stream of data.

If you're running in a VM, how do you know the difference between a virtual interrupt and a actual interrupt? Either way, you're handling an interrupt.

If I understand the issue, it's not that the virtual serial port is swamped really. It seems like the issue is that the virtual serial port is firing its empty/ready interrupt so often (probably after every byte written to the port?) that the driver thinks the interrupt is broken, so then it falls back to polling for readiness. The driver polling rate seems like it's not high enough to keep up with the logging, so then logging blocks.

Probably, it'd be better for the virtual serial port to limit the number of irqs it will send. Limiting the number of interrupts is a key benefit of the 16550 over earlier UART chips; it does this by having a 16 byte buffer and sending interrupts only when the buffer is empty/below a threshold (for outgoing) or full/above a threshold (for incoming). Getting an interrupt for each outbound byte is too many interrupts if you're logging a lot of junk on the serial console (which I'm guessing happens; modern software is full of junky logs IMHO).

Probably the right thing to do would be for virtual machine hosts to over virtio consoles as well as virtual serial ports, and for virtual machine guests to prefer to use a virtualio console rather than a serial port for logging. But it's probably possible to adjust the serial driver options as well. Virtio console could be much more efficient, as it allows transfers larger than a single byte at a time.

toast01y ago

Too late for edit... But rereading the blog and references... I suspect if this was on real hardware, with a real 16550 UART, you'd have the same issue, but without the message about too many interrupts.

You'd just have your logs backing up because you're writing more than 115kbaud, and then things that write logs become blocking. You'd need to figure that out by seeing what processes are blocked on what, rather than getting a hint because the irq behavior is weird.

1 more reply

rcxdude1y ago

Curiously, I can't find this message in recent kernels, it was patched out in 2018: https://github.com/torvalds/linux/commit/9d7c249a1ef9bf0d569... . Is it being re-added by distros or something?

In terms of how it causes the issue, I think it's more that the serial driver will monopolise the (virtual) CPU for as long as it is receiving or sending text. If there's a large buffer being dumped all at once (a common default for non-interactive streams), it could stall for long enough to cause an underrun, which doesn't need to be very long, even if the average data rate isn't very high. Consider that the virtual 8250 serial port interface isn't exactly the most efficient interface to begin with, since it's at most copying 16 bytes from host to guest at a time

kiririn1y ago

The more we abstract things, treat servers like cattle, and lose low level knowledge, the more things like this will happen

You shouldn’t have to try to reproduce this in a test environment - your infrastructure should allow profiling in live for cases like this. And it should be solved with profiling, not guesswork and bisecting

theamk1y ago

Crazy talk! Next thing you are going to say is that realtime-focused infrastructure should have native counters to detect missing deadlines, so those pops could be preemptively detected via dashboards, instead of via customer complaints?

That's not how you live in AI age. Move fast and break things; YOLO; K8S all the things; etc..

(/s in case it's not clear)

myself2481y ago

What I don't understand, is why the upstream audio doesn't just buffer while the downstream thing processing it is blocked. Why should that result in audible artifacts, can't it just catch up with the rest of the buffer later?

Buffer overruns feels very 1996-cd-burner-ish. Ope, burned a coaster, let's try this hellaciously real-time-bound thing again with inadequate buffering and I/O devices that have unpredictable latency.

What am I missing?

PaulDavisThe1st1y ago

If you (a) need low latency (b) have hardware with unpredictable/unreliable latency behavior, you're screwed by definition.

If you can manage with large latency, then sure, buffer the hell out of everything and your chances of losing data will be close to zero.

If you need low latency, you cannot buffer the hell out of everything, and anything that interferes with the data flow runs the risk of causing an underrun.

The solution in the CD burner case was easy: massive latency is fine, so buffer the hell out of it, and it just works.

The solution for low-latency audio anything is not so easy: you can't buffer the hell out of it, which means you're susceptible to hardware and software issues that disrupt timing and end up causing underruns/overruns.

anyfoo1y ago

> What I don't understand, is why the upstream audio doesn't just buffer while the downstream thing processing it is blocked. Why should that result in audible artifacts, can't it just catch up with the rest of the buffer later?

So you make your buffer larger. But the thing consuming the data is still slower than the thing producing the data, so that buffer will eventually overrun, too. The graceful way to fail is not to block, but to quickly drop the excess data. However, you're now losing data, and doing so very unexpectedly overall, because apparently this transcription service was supposed to be well able to keep up with the audio stream. (If I understood this whole thing right, it's a bit vague.)

So really, unless the root cause is taken care of, you either block or lose data, without good reason in this case.

> Buffer overruns feels very 1996-cd-burner-ish.

That was a buffer underrun, the opposite. The CD burner, moving at a fixed speed, ran out of data because the producer (the rest of the PC, effectively) was not able to replenish the buffer quickly enough. The fix, besides faster PCs, was to have very large buffers, that were able to pick up the slack for longer than the duration of a "dropout" on the PC side. But between those "dropouts", the PC was probably well able to produce the data to the CD burner at a higher rate than necessary, so the large buffer stayed filled on average.

Both buffer overruns and underruns are still very much a concern. They exist anywhere where producers and consumers operate at a mismatched rate for more than just a short time, in a real-time setting.

theamk1y ago

it's remote meeting infrastructure, so the latency is critical. When burning CDs, or playing music, it's OK to have a second or two of buffer. When doing conference call, a second of buffer means a second of latency, which means you ask a question and get a response 2 seconds back, which is pretty bad experience. And that's why conference software tries to keep latency as low as possible.

(Now, why does it produce a pop as opposed to silence/hiccup/stretched sound? probably because it was easiest to code)

anyfoo1y ago

Even if the buffer is large enough, at some point (i.e. a long enough meeting in this case), it will fill up.

> (Now, why does it produce a pop as opposed to silence/hiccup/stretched sound? probably because it was easiest to code)

Sudden "silence" pretty much is a pop, and so is the silence suddenly ending. The sharp transition at least theoretically contains energy in all frequencies (or rather the full bandwidth of this bandwidth-limited signal), which we perceive as a pop.

Bang a steel bar against a hard table, and you get a whole range of frequencies as well, also very pop-like. Do the same with a tuning fork, and after the initial bang you get a nice, clean, single tone, because the tuning fork effectively filters out all the other frequencies through its impulse response.

1 more reply

snvzz1y ago

AIUI the RT patchset (now mainlined) has specific code changes to solve this issue.

If running PREEMPT_RT, which anything handling realtime audio should really do, this should be handled.

davidguOP1y ago

At Recall.ai, we built a 10,000-node cluster, processing over 1TB/sec of raw video in real-time.

After a major migration, we faced a strange audio issue that led us on a deep dive through our infrastructure.

The culprit? Not the audio code—but a hidden interaction with AWS’s virtual serial ports.

We wrote about our journey discovering the artifacts and finding a clean fix!

1 more reply

jakekw1y ago

Wild af

1 more reply

j / k navigate · click thread line to collapse

29 comments

Animats1y ago

Interaction between real-time and non-real-time is always tough. It comes up a lot in networked game development.

[1] https://www.kernel.org/doc/html/latest/admin-guide/serial-co...

saagarjha1y ago

How do you debug kernel boot before that process is started?

Animats1y ago

Actual kernel debugging is rarely needed except on strange or broken hardware. If you have to do that, you use a JTAG debugger.

kazinator1y ago

In terms of print debugging, one way is by using lower level printing routines, like something specific to the target system you're using that sends bytes directly to a serial port.

PaulDavisThe1st1y ago

> There are a few processes in the bot that are especially latency sensitive, which we have tuned the nice value for.

Conversely, using nice(1) on a SCHED_OTHER task is also unlikely to work, given that nice only impacts scheduling decisions and cannot provide RT-like behavior.

rcxdude1y ago

I think in this case, unless they were running an RT kernel, it wouldn't have helped, since the interrupt hogging the CPU was non-preemptable. But it's good advice in general.

PaulDavisThe1st1y ago

RT is in the mainline kernel now. Run with threaded_irqs, all interrupt handlers become preemptable (by default).

1 more reply

theamk1y ago

I am surprised.. all this complex discovery and the syslogs were containing the culprit line from the first moment:

    serial8250: too much work for irq4

snozolli1y ago

The first result for "8250 too much work for irq4" is this: https://unix.stackexchange.com/questions/387600/understandin... (2017)

So, is this seven year old problem still a problem, or is the virtual serial port driver actually unable to keep up with the stream of text?

toast01y ago

If you're running in a VM, how do you know the difference between a virtual interrupt and a actual interrupt? Either way, you're handling an interrupt.

toast01y ago

1 more reply

rcxdude1y ago

Curiously, I can't find this message in recent kernels, it was patched out in 2018: https://github.com/torvalds/linux/commit/9d7c249a1ef9bf0d569... . Is it being re-added by distros or something?

kiririn1y ago

The more we abstract things, treat servers like cattle, and lose low level knowledge, the more things like this will happen

theamk1y ago

That's not how you live in AI age. Move fast and break things; YOLO; K8S all the things; etc..

(/s in case it's not clear)

myself2481y ago

What am I missing?

PaulDavisThe1st1y ago

If you (a) need low latency (b) have hardware with unpredictable/unreliable latency behavior, you're screwed by definition.

If you can manage with large latency, then sure, buffer the hell out of everything and your chances of losing data will be close to zero.

If you need low latency, you cannot buffer the hell out of everything, and anything that interferes with the data flow runs the risk of causing an underrun.

The solution in the CD burner case was easy: massive latency is fine, so buffer the hell out of it, and it just works.

anyfoo1y ago

So really, unless the root cause is taken care of, you either block or lose data, without good reason in this case.

> Buffer overruns feels very 1996-cd-burner-ish.

theamk1y ago

(Now, why does it produce a pop as opposed to silence/hiccup/stretched sound? probably because it was easiest to code)

anyfoo1y ago

Even if the buffer is large enough, at some point (i.e. a long enough meeting in this case), it will fill up.

> (Now, why does it produce a pop as opposed to silence/hiccup/stretched sound? probably because it was easiest to code)

1 more reply

snvzz1y ago

AIUI the RT patchset (now mainlined) has specific code changes to solve this issue.

If running PREEMPT_RT, which anything handling realtime audio should really do, this should be handled.

davidguOP1y ago

At Recall.ai, we built a 10,000-node cluster, processing over 1TB/sec of raw video in real-time.

After a major migration, we faced a strange audio issue that led us on a deep dive through our infrastructure.

The culprit? Not the audio code—but a hidden interaction with AWS’s virtual serial ports.

We wrote about our journey discovering the artifacts and finding a clean fix!

1 more reply

jakekw1y ago

Wild af

1 more reply

j / k navigate · click thread line to collapse