Low Latency and XRUN optimizations on Linux

Are you guys using the 3.1 betas in a low latency jack setup and if so with what level of success?
I’m currently running 64 samples * 3 periods @ 96K with a Scarlett 2i2 (that gives around 7 ms round trip latency) on a 64-bit Arch Linux, reasonably well tuned RT system.
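As a rough illustration of the arithmetic behind these settings, here is a minimal Python sketch of the nominal buffer latency (frames * periods / sample rate). The function name is made up for this example, and the result is only the buffer figure that jack reports, not the measured round trip, which also includes converter and USB delays.

```python
def buffer_latency_ms(frames_per_period, periods, sample_rate_hz):
    """Nominal latency of an audio buffer in milliseconds."""
    return 1000.0 * frames_per_period * periods / sample_rate_hz


# Settings quoted in this thread (hypothetical helper, real numbers)
print(buffer_latency_ms(64, 3, 96000))  # 64 frames * 3 periods @ 96 kHz
print(buffer_latency_ms(32, 3, 96000))  # 32 frames * 3 periods @ 96 kHz
```

With the values above this prints 2.0 and 1.0, which matches the ~2 ms and ~1 ms figures mentioned in the thread.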

Still, I cannot avoid the occasional xrun - let’s say around 10-15 per 2 hour session.

Acceptable, but could be better. 2.8 and 3.0 were slightly smoother in this regard.

My current projects don’t involve any plugin or heavy stuff, the only other thing running is Guitarix (hence my low latency requirement).
This is roughly in line with what I would get with Ardour. Lighter stuff like guitarix and hydrogen run absolutely fine.

This is just to see if my expectations are realistic here: is it worth measuring latency with the betas, or is some further performance optimization to be expected when 3.1 goes gold?

Cheers,

LX

We’re constantly doing small performance tweaks during the betas (and before), but it’s hard to improve such things in general. It has to be done by looking at each “problem” in detail instead.

So if you are aware of something, some sound, some song that used to perform better or should be more efficient, share it with us here and we’ll try to have a look at this…

Thanks Taktik.

I’ll run more tests against 3.0 see if I can get any reproducible behavior.

So far nothing that would suggest a software defect, it just seems a tad heavier - could be as well my system performance or tuning

I have the impression that the xruns happen mostly when doing I/O operations:

  • on file loading, sample saving, rendering.

  • but also when playing, usually just after changing pattern, when 2-3 long samples are getting activated.

Update: Still trying to achieve a relatively xrun free performance at very low latency for my guitar setup (~2ms in + 2ms out)

No particular change with b6. Out of curiosity I tried the new Redux instruments, and while they sound rather impressive, they tend to kill my system pretty fast due to the DSP load. Throwing several notes at once seems to be particularly effective at producing instant glitchcore :wink: Not a particular issue, just sayin’.

I noticed on the way that the Linux FAQ could benefit from a few small updates:

http://tutorials.renoise.com/wiki/Linux_FAQ#Realtime_Threads

1 - /etc/security/limits.conf file is getting phased out on most distros in favor of /etc/security/limits.d/audio.conf

2 - The suggested settings:

YOURUSERNAME - rtprio 99

YOURUSERNAME - nice -10

While Renoise does raise the rtprio to the max (always at 96 on my system), I haven’t seen it make any use of the “nice -10”.

Manually renicing all Renoise threads doesn’t seem to bring any benefit either. Is the nice setting still of any use nowadays?
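To check what limits a login session actually received from limits.conf or limits.d, something like the following Python sketch could be used. It relies on the standard `resource` module on Linux; the helper names are invented for this example. Per getrlimit(2), the nice ceiling is 20 minus the RLIMIT_NICE value, so a "nice -10" entry shows up as a limit of 30. This only reports the limits and says nothing about whether Renoise makes use of them.

```python
import resource


def nice_ceiling(rlimit_nice):
    """Translate an RLIMIT_NICE value into the lowest reachable nice value.

    getrlimit(2) defines the ceiling as 20 - rlim_cur; the useful range
    corresponds to nice values between -20 and 19, so clamp to that.
    """
    return max(-20, min(19, 20 - rlimit_nice))


def report_limits():
    """Return (rtprio_soft_limit, lowest_nice) for this process (Linux only)."""
    rtprio_soft, _ = resource.getrlimit(resource.RLIMIT_RTPRIO)
    nice_soft, _ = resource.getrlimit(resource.RLIMIT_NICE)
    if nice_soft == resource.RLIM_INFINITY:
        lowest_nice = -20  # unlimited: any nice value is allowed
    else:
        lowest_nice = nice_ceiling(nice_soft)
    return rtprio_soft, lowest_nice
```

Calling `report_limits()` from a shell started after logging in again should reflect the rtprio and nice entries that apply to that user.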

3 - That link is now gone:

http://tapas.affenbande.org/wordpress/?page_id=73.

All in all, I think it would be more useful to direct users to this more thorough tuning guide:

http://wiki.linuxaudio.org/wiki/system_configuration

Cheers,

LX

Ok, I found one place where the performance could maybe be improved:

Reordering instruments by dragging with the mouse while a song is playing seems to cause more xruns than it should, even at higher latencies.

(reproduced on 3 different songs that otherwise run completely xrun free at ~8ms latency - no problem found when doing other GUI operations).

I assume there might be some heavier calculation needed to update the patterns, but it should ideally not interrupt audio.

Inserting patterns, for example, seems to behave better: the GUI occasionally freezes, but it doesn’t particularly cause xruns.

Have you applied all proper tunings? And have you tested your system with cyclictest?

I kept wondering, too. Renoise seems a bit hard to predict and not really realtime-friendly, but cyclictest on my idle system reveals regular 2-3 ms latency spikes, which limits the lowest truly dropout-free latency to above 10 ms. Then again, it isn’t really tuned: just a normal Ubuntu with rtprio, rtirq, the CPU governor set to performance and cpu_dma_latency set to zero when jack is started, with all bells and whistles activated all the time. I hope to have some luxurious hardware I can tune to below 50-100 µs one day, on a system where only music software/hardware is running, or one where I can deactivate all kinds of stuff on demand, but then again I might just have bad luck. Not every system is suited to the very best ultra-low-latency realtime performance, it seems.

Proper tunings: kinda. A light arch linux system (no DE) on a still ok’ish i5 ivy bridge class notebook + focusrite scarlett 2i2 USB audio interface.

RT kernel is built with localmodconfig. System checks ok with the realTimeConfigQuickScan (rtprio, scheduler & co)

I would not say that my system is perfectly tuned though. I am at the moment puzzled by rtirq and the exact dependencies for setting proper irq threading priority.

But I can run light audio apps like hydrogen or guitarix at 96khz / 2ms completely xrun free (2ms being the output latency as reported by jack - the actual round trip latency is more around 7ms). You can’t go that low with a standard system, so I assume my config is not too bad.

I was actually surprised that Renoise managed to run more or less ok with those settings as long as I don’t use any plugins or fat instruments.

This is what triggered my initial question: does anyone actually use Renoise successfully in such a low latency setup?

Regarding Cyclictest: I managed to avoid that part until now, but I guess that’s the only way to do more than speculate…

I’ll try to give it a shot when time permits and report back…

Well, 2 ms is rather insane in my view (Win and Mac people can only have wet dreams of this…), and only really interesting for stuff like live guitar cab simulation for actual player monitoring, or recording via mics with (software) monitoring with effects. I’m pretty baffled that Renoise will run on your system at such a low latency at all without booming through xruns like mad. Remember also: the lower the latency, the less cool DSP stuff you can stress your CPU with.

And yes, cyclictest is the dreadful minion that tells you the naked truth about your system. Only perf sched is worse. Launch jack, apply some general system load, and run cyclictest at the priority Renoise will give its worker threads (don’t launch Renoise, or it will fight cyclictest). The rightmost “Max” values are the highest latency spikes encountered by the program, in microseconds. So values like 20-60 really rock, but a value of 1500 means 1.5 ms stolen from an audio DSP worker thread, which gives you that xrun. Leave it running for quite a while. It will basically tell you how much time your system spends running other crap instead of Renoise, leading to dropouts that Renoise is not to blame for. Not exactly what crap, just how much. Lol, with an isolated (isolcpus) core it stayed below 15 µs on a standard kernel…
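To put those microsecond figures in perspective, here is a small Python sketch that compares a scheduling spike with the duration of one audio period. The function names are hypothetical, and the comparison is a simplification: whether a given spike really produces an xrun also depends on how much of the period the DSP work already uses and on how many periods are buffered.

```python
def period_us(frames_per_period, sample_rate_hz):
    """Duration of one audio period in microseconds."""
    return 1_000_000.0 * frames_per_period / sample_rate_hz


def spike_in_periods(spike_us, frames_per_period, sample_rate_hz):
    """How many audio periods a scheduling spike of spike_us would cover."""
    return spike_us / period_us(frames_per_period, sample_rate_hz)


# 64 frames at 96 kHz, the setting discussed in this thread
print(round(period_us(64, 96000), 1))               # length of one period in µs
print(round(spike_in_periods(60, 64, 96000), 2))    # a "good" 60 µs spike
print(round(spike_in_periods(1500, 64, 96000), 2))  # a 1500 µs spike
```

At 64 frames and 96 kHz one period lasts roughly 667 µs, so a 1500 µs spike spans more than two whole periods while a 60 µs spike uses under a tenth of one.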

That’s pretty much my use case: I’m using Renoise both as a DAW and for live accompaniment. So I need to play along and record live guitar and bass parts coming from Guitarix (no direct monitoring possible in that case). Round trip latency ideally has to stay under 10 ms (more than 10 ms starts to feel unpleasantly sluggish and timing suffers). In practice this means about 2-3 ms max for playback/capture.

Exact round trip latency can, by the way, be measured quite precisely by looping the output of the soundcard back to the input with a patch cable and using http://apps.linuxaudio.org/apps/all/jack_delay (or the utility included in Jack). A far better explanation can be found here: http://apps.linuxaudio.org/wiki/jack_latency_tests

Otherwise I wouldn’t give a damn about latency. Later on I can edit at higher settings.

I was also thinking about adding some internal track latency to Renoise itself, but it remains to be seen if this would actually help with xruns.

I can feel your pain, because you can’t always have everything in one box. I remember the olden days, when I was swapping programs and setups on a single core machine, between low ms latency for mic recording and guitar “cab” simulation, and 15-20 ms latency for some software synth action mangling the recorded samples. Having to separate things that strictly limits your workflow options. Then again, switching between recording with proper monitoring and renoising would probably just mean rendering the backing once, switching jack to low latency and recording the guitar against pure wav playback at 2 ms latency, then switching back to 10-15 ms, letting the sweat on the axe dry a bit, and using the samples in Renoise.

Now I’ve delved a bit deeper into this tuning shit with cyclictest, and dropped my latency significantly by some steps, the most important being using a more recent kernel because the nouveau driver was fucking up latency that bad.

Hm, it seems that Renoise really is not a rock-solid player in the sub-10 ms game. I’ve tuned the system a bit now, with cyclictest showing peaks around 200-300 µs when doing lots of stuff with the PC in realtime configuration (not finished tuning yet, the ftrace capabilities of cyclictest are my current shit!), and most of the time around 20-50 µs max with few outliers. Now if the Renoise threads run at round robin 95 and I run cyclictest at the same time at, say, 96, cyclictest will display neat low values but Renoise will sporadically have dropouts and even audible clicks. These happen in the same pattern when cyclictest is not running. So I think it is unlikely that some system shit running in the background is disturbing Renoise if it doesn’t disturb cyclictest. There is something fishy.

One thing that eases the dropouts is to raise the latency, of course. From around 12-15 ms in jack there are hardly any problems left for me.

Another is, on a multicore machine, use 1 or 2 cores less for dsp in renoise than the system offers. This seems to help against the peaks that cyclic would also register, on a not-perfectly-tuned system.

Maybe also certain plugins/vsts can make it more likely to bust realtiming.

I might keep this up and do more research once I’m more confident with my system tuning and tracing stuff, and also compare to 3.0. In 3.1 there seems to be better multicore handling, with way lower CPU usage shown for some songs when using more than one core, but it glitches out more easily, sometimes even spontaneously. Until then -> “large” latencies are necessary for glitch-free use.

OosplFly: thanks for your tips.

I reviewed again the lists and found cpu_dma_latency - I hadn’t done that one.

Investigated a bit and made a post about it (please shoot if I told some BS):

https://linuxmusicians.com/viewforum.php?f=27

Regarding the kernel: Latest RT patch https://aur.archlinux.org/packages/linux-rt/

Overall I concur with your findings: ghost & goblins start to show up below 10 ms, and there’s only so much that can be done via system tuning.

But it’s already working good enough to be usable and I find that all in one Guitar -> Guitarix -> Renoise set up very inspirational (2 tracks done this way - yay!).

So I feel like pushing it further. At least we may be able to update a bit the Linux tuning checklist, and possibly find some hidden Renoise issues on the way :wink:

Another is, on a multicore machine, use 1 or 2 cores less for dsp in renoise than the system offers. This seems to help against the peaks that cyclic would also register, on a not-perfectly-tuned system.

Tried that, doesn’t help with my set up. It probably depends also on the song structure (number of tracks, effect routing etc…).

on the HW side I have a dual core cpu with hyperthreading enabled (4 logical cores - one of the few notebooks out there that doesn’t offer any setting to disable HT, so things are gonna stay this way).

I empirically found the best performance with 4 cpu cores enabled.

2 is slightly worse, and interestingly selecting 3 makes things become very bad, very fast (no reasonable explanation for that).

I’m thinking about playing now with cpu affinity and ksoftirqd priority, see what else I can wreck :wink:

Cool I could bring something new to the game. It’s always worth looking into the corporate & redhat world for tuning tips, where people are serious about controlling machines and doing hf stock exchanges where every microsecond could mean $$$ or a broken product.

Hm, well the boot options are not like cpu_dma_latency. The latter is a device file that you can open, write a number to, and hold open to keep your CPU in the desired latency state for as long as it is held; it is all acquired and released at runtime. I’m using a tiny C program that does so: I daemonize it when I start jack and kill it when I stop jack, so my system can cool down again while watching animated cat gifs and stuff, but also do realtime audio on demand without rebooting. This way you can also tune the dma_latency to a specific number of microseconds you are willing to tolerate, and check the actual C-state exit latencies to see which states would then be allowed. Governor is for frequency, dma_latency is for power saving states.

As we’re at it: another thing whose impact I’m not really sure about is “x86_energy_perf_policy”, a command line tool usually found in the linux-tools packages. But after all, these tunings only really change the game when the kernel itself is realtime capable & performing well enough.
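The poster's C program is not shown in the thread. As an illustration of the same idea only, here is a minimal Python sketch: open /dev/cpu_dma_latency, write the tolerated latency as a 32-bit integer, and keep the file descriptor open for as long as the request should stay active. The function name and the `path` parameter are assumptions made for this example (the parameter exists so the code can be tried against an ordinary file), and writing to the real device normally requires root.

```python
import contextlib
import os
import struct


@contextlib.contextmanager
def hold_cpu_dma_latency(target_us=0, path="/dev/cpu_dma_latency"):
    """Request a maximum CPU wakeup latency for as long as the block runs.

    The kernel keeps the request only while the file descriptor stays open,
    so closing it (or the process exiting) restores normal power saving.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        # The device accepts the target as a native signed 32-bit integer.
        os.write(fd, struct.pack("i", target_us))
        yield fd
    finally:
        os.close(fd)
```

A wrapper script could enter `with hold_cpu_dma_latency(0):` before starting jack and leave the block once jack exits, which mirrors the start/stop behaviour described above.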

For sure your warning to apply it in laptops or sub-optimal cooled systems is relevant, haven’t thought about it yet so much. But cpu can run pretty hot when kept in performance state. Idle=poll/c0 state seems to be one of the worst devils for this, driving heat even higher. Maybe some experimenting to keep the cpus in lower frequencies or allowing some c-states can provide better action for badly cooled systems, the parameters being subject to individual experimentation.

Ah, and too bad about your HT. I’ve disabled mine, but haven’t found it extremely detrimental to leave it on. I just imagine it leads to random/fluctuating performance when the load is high, as HT cores more or less get the (unpredictable) idle remainder of the real cores’ power. My experiments with using one or two cores fewer in Renoise were on a system with 6 real cores; setting it to 5 or 4 in Renoise made everything a bit more stable.

I’ve also already dreamt of isolating the whole system (including interrupts) to one core, and put all music software to the rest of cores, via cpusets or whatever. I’ve already tried a bit, but it was very quirky, jack didn’t seem to like this and went to a sluggish x-run hell, so I gave up for the moment. And I guess it isn’t really worth it, and a proper realtime kernel would make better action.
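For anyone who wants to experiment with this kind of core split, here is a small Python sketch of the bookkeeping. It is not the cpuset setup the poster tried: it only divides a set of core numbers into a "system" part and an "audio" part and pins one thread with `os.sched_setaffinity` (Linux only). Both function names are invented for this example, and interrupts are not moved by this at all.

```python
import os


def split_cores(cores, system_count=1):
    """Split core ids into (system_cores, audio_cores), lowest ids to the system."""
    ordered = sorted(cores)
    if not 0 < system_count < len(ordered):
        raise ValueError("need at least one core on each side of the split")
    return set(ordered[:system_count]), set(ordered[system_count:])


def pin_thread(tid, cores):
    """Restrict one thread to the given cores (0 means the calling thread).

    A multi-threaded program such as jackd has one id per thread, so each
    entry under /proc/<pid>/task would need its own call.
    """
    os.sched_setaffinity(tid, cores)


system_cores, audio_cores = split_cores({0, 1, 2, 3})
print(system_cores, audio_cores)
```

Given the xrun trouble reported above, this is best treated as an experiment to measure with cyclictest rather than a recommended setting.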

Ah and sorry, didn’t mean to take this thread into a tuning thread, just initially thought it would be better to ensure & measure a system is configed real tight before reporting suboptimal realtime performance of renoise.

Focusrite isn’t known for its wonderful drivers; they choke even with ASIO on Windows. On Linux, their implementation is just some hacks around the standard USB audio driver… I actually own a 6i6 and I had all possible sorts of problems getting it to work reliably, and I can hardly go lower than 6 ms; anything lower chokes…

I read also a few stories about the bigger models, but never had any issue with the 2i2 on Linux: it is class compliant and fully plug and play (no fancy software mixer).

It can go really low latency-wise for a USB audio interface: I just made a test this afternoon with Guitarix only, and it runs just fine at 96K / 32 frames / 3 periods (that gives roughly 1 ms in/out, approx. 4.5 ms round trip latency reported by jack_iodelay). I wouldn’t even try to go that low with Renoise though. My target is to keep round trip latency <10 ms @ 96K so that I can play guitar comfortably.

The only limitation I found with the 2i2 is that the internal channels are not perfectly matched: there’s always a slightly bigger latency on channel 2 than channel 1, and it gets worse if I cross from output 1 to input 2 with the patch cable, let’s say half a ms - not a big deal with a 2 channel card, but I assume this can only get worse with more channels…

The Scarletts are also said to be pretty bad regarding crosstalk (a fellow Linux user actually measured it on his 2i4), but I digress :wink:

Governor is for frequency, dma_latency is for power saving states.

There’s some link between both: since I disabled C-States with a hammer, the frequency is really locked to the max (1.70ghz on my Ivy Bridge cpu). With just the governor set to performance, I could still see the frequency oscillating in Conky, even under high load.
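One way to watch this without Conky is to read the cpufreq files in sysfs directly. The Python sketch below is a minimal example that assumes the usual layout under /sys/devices/system/cpu, with `scaling_governor` and `scaling_cur_freq` (in kHz) per core; the function name and the `base` parameter are made up for this example.

```python
from pathlib import Path


def read_cpufreq(base="/sys/devices/system/cpu"):
    """Return {cpu_name: (governor, current_frequency_in_MHz)} per core."""
    result = {}
    for cpu_dir in sorted(Path(base).glob("cpu[0-9]*")):
        governor = cpu_dir / "cpufreq" / "scaling_governor"
        cur_freq = cpu_dir / "cpufreq" / "scaling_cur_freq"
        if governor.is_file() and cur_freq.is_file():
            mhz = int(cur_freq.read_text()) / 1000.0  # sysfs reports kHz
            result[cpu_dir.name] = (governor.read_text().strip(), mhz)
    return result
```

Calling `read_cpufreq()` repeatedly while a song plays would show whether the frequency still moves even though the governor reads "performance".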

Whether it really helps with xruns is another question. Renoise’s behavior seems to be exactly the same, but the system is noticeably more snappy overall. No bad surprises yet regarding cooling, so I’ll leave it this way for now…

Some results with cyclictest running at the same time as a test song is looping.

First at the highest prio, then in the same ballpark as renoise (85) and jackd (89):

T: 0 ( 1304) P:99 I:10000 C: 10000 Min: 3 Act: 7 Avg: 6 Max: 31
[root@pill-mobile4 gimmeapill]# cyclictest -p 89 -n -i 10000 -l 10000 -m
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.97 0.60 0.33 2/244 1327           

T: 0 ( 1326) P:89 I:10000 C: 10000 Min: 4 Act: 5 Avg: 6 Max: 95
[root@pill-mobile4 gimmeapill]# cyclictest -p 80 -n -i 10000 -l 10000 -m
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.74 0.73 0.44 3/246 1333          

T: 0 ( 1337) P:80 I:10000 C: 10000 Min: 4 Act: 5 Avg: 6 Max: 101
[root@pill-mobile4 gimmeapill]# cyclictest -p 85 -n -i 10000 -l 10000 -m
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.92 0.68 0.53 1/245 1339           

T: 0 ( 1339) P:85 I:10000 C: 10000 Min: 4 Act: 6 Avg: 7 Max: 111

I’m not sure I understand all the settings, but the results look pretty healthy to me: no peak higher than 111 µs.
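For longer runs it can be handy to pull the numbers out of the cyclictest summary lines automatically. The following Python sketch parses lines in the format shown above and reports the worst "Max" value. The regular expression is written against the output pasted in this thread, so other cyclictest versions or options may need adjustments; the function names are invented for this example.

```python
import re

# Matches summary lines such as:
# T: 0 ( 1304) P:99 I:10000 C: 10000 Min: 3 Act: 7 Avg: 6 Max: 31
LINE_RE = re.compile(
    r"T:\s*(?P<thread>\d+)\s*\(\s*(?P<tid>\d+)\)\s*P:\s*(?P<prio>\d+)"
    r".*?Min:\s*(?P<min>-?\d+)\s*Act:\s*(?P<act>-?\d+)"
    r"\s*Avg:\s*(?P<avg>-?\d+)\s*Max:\s*(?P<max>-?\d+)"
)


def parse_cyclictest(text):
    """Return one dict of integers per cyclictest summary line found in text."""
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in LINE_RE.finditer(text)
    ]


def worst_max_us(text):
    """Highest Max latency (µs) over all summary lines, or None if none found."""
    rows = parse_cyclictest(text)
    return max(row["max"] for row in rows) if rows else None


SAMPLE = """\
T: 0 ( 1304) P:99 I:10000 C: 10000 Min: 3 Act: 7 Avg: 6 Max: 31
T: 0 ( 1326) P:89 I:10000 C: 10000 Min: 4 Act: 5 Avg: 6 Max: 95
T: 0 ( 1337) P:80 I:10000 C: 10000 Min: 4 Act: 5 Avg: 6 Max: 101
T: 0 ( 1339) P:85 I:10000 C: 10000 Min: 4 Act: 6 Avg: 7 Max: 111
"""
print(worst_max_us(SAMPLE))
```

Run on the four result lines from this post, it prints 111, the peak mentioned above.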

I use Ubuntu Studio 14 on an IBM X30 and after working for hours in Renoise 3.0.1 I don’t experience any glitches (in projects that stay under 90% on the Renoise CPU indicator). I use the internal sound card so this does not count in terms of latency, but maybe just try to raise your latency a bit to “avoid the occasional xrun” (what does that sound like?)

kopias: you will always have some latency setting, either in ALSA directly in Renoise, or in the jack configuration. Regardless of the soundcard or sound system you use, you’ll always have some buffer settings, be they hidden or configured directly. I never suffer any xruns in Renoise if I don’t overload the DSP and use big enough buffers (above 15 ms seems very stable). I do if I set them below 10 ms. “The occasional xrun” will sound like a very small dropout/plop/click, or not be perceived at all but just registered by the software (qjackctl will show you xrun logs and numbers). I can sometimes hear my Renoise dropouts click audibly, most often not.

Gimmeapill: I usually bench with the smp/all cores settings “sudo cyclictest --smp -p 96 -m”, you could also try using it at rr scheduling (like “-y rr” I think). How do you make jack be higher in prio than renoise? At my sys renoise will have some prio below jack, but also fire up threads at the highest prio it can which will be above jack and all drivers, and seem to be the dsp worker threads. I don’t know if this config is any good and whether I should put work to getting the prios right. I think renoise’s can’t really be configed, just by setting the hard limit for renoise which has to be another group than jack would be started in to keep jack above?

Now 111 µs worst case sounds ok to me, but I don’t know how well this fares for a realtime kernel and a single thread. Try with the “smp” setting, which is closer to what Renoise will try to do when all cores are activated. Cyclictest should have a manpage. Also, cyclictest is mainly useful for checking whether a similar task would be kept from interruption, not so much for checking interruption of a task that is already running. Note that cyclictest should run for some time (i.e. long enough that you would surely register some xruns), and under similar conditions, i.e. with graphics activity, disk writes and such. If very desperate (high peaks you can’t explain) you can make cyclictest trace via the kernel (I found this way that the 14.04 kernel’s nouveau was borked, and the 4.1 kernel’s nouveau did a loooot better) or, if in very big doubt, use hwlatdetect to find delays caused by your BIOS/hardware.

OosplFly: The results I posted are from looping a test track that triggers a few xruns (actual xruns were occurring while the measurement was done).

But you’re right, it won’t mean much unless run in smp mode. I have to study those options properly and run better measurements.

How do you make jack be higher in prio than renoise? At my sys renoise will have some prio below jack, but also fire up threads at the highest prio it can which will be above jack and all drivers, and seem to be the dsp worker threads.

It was not, it behaves exactly as you describe. I was referring to the priority of the first Renoise thread. The threads we assume being for dsp are higher.

On my system with max RTPrio at 99, Renoise climbs up to 96.

Here’s how it looks in Htop sorted by Priority (I’m not on my system right now so this is from memory):

96 Renoise

96 Renoise

96 Renoise

96 Renoise

90 Jackd

85 Renoise
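Since the listing above is from memory, here is a hedged Python sketch of how the same information could be read straight from /proc on Linux. It parses the per-thread `stat` files described in proc(5), where field 40 is the realtime priority and field 41 the scheduling policy. The function names are invented for this example, and the process id has to be looked up separately (for instance with pidof).

```python
from pathlib import Path

# Scheduling policy numbers as used by the Linux kernel
POLICIES = {0: "SCHED_OTHER", 1: "SCHED_FIFO", 2: "SCHED_RR"}


def parse_stat(line):
    """Extract thread id, name, rt priority and policy from a stat line."""
    # The name sits in parentheses and may itself contain spaces or ")",
    # so split on the last closing parenthesis.
    head, _, rest = line.rpartition(")")
    tid, _, comm = head.partition("(")
    fields = rest.split()  # fields[0] is field 3 (state) in proc(5) numbering
    policy = int(fields[38])  # field 41
    return {
        "tid": int(tid),
        "comm": comm,
        "rt_priority": int(fields[37]),  # field 40
        "policy": POLICIES.get(policy, str(policy)),
    }


def thread_priorities(pid):
    """List all threads of a process, highest realtime priority first."""
    rows = [
        parse_stat(stat_file.read_text())
        for stat_file in Path(f"/proc/{pid}/task").glob("*/stat")
    ]
    return sorted(rows, key=lambda row: row["rt_priority"], reverse=True)
```

Running `thread_priorities()` on the Renoise and jackd process ids would give an ordering comparable to the htop view, without relying on memory.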

I tried to run cyclictest both higher and lower to see what was competing with what. I think the results already reflect that higher priorities get served first.

In order to have jackd running higher than all Renoise threads: just set the prio of jackd in qjackctl to the max rtprio allowed by the system (that would be 99 in my case), and Renoise threads will all go underneath. The problem is that this doesn’t actually help with performance - quite the opposite.

I never managed to gain anything by raising jackd too high, or trying to outsmart the kernel’s scheduler :wink:

For the same reason, I am not using rtirq to bump the priority of the irq thread where my soundcard is (there is no sharing/conflict on that bus btw).

The best results I could get were by enabling irq threading and then…not touching them.

Also, when xruns occurs, the jack messages clearly point to Renoise (“client Renoise could not finish” or sth like that).

So, as I understand, this is more about finding the sweet spot where Renoise threads will be served soon enough, but not steal too much from jackd. Maybe I could try to lower a bit jack’s priority but I don’t expect any more dramatic improvements.

Is there any logging/debug option to launch Renoise that I should know of? Something that would print out where the ball gets dropped when an xrun caused by Renoise occurs?

I find it a bit silly to use a USB audio interface and then expect such low and stable latencies. Use an internal or PCIe card instead. Even FireWire almost doubles performance. You can buy a Texas Instruments PCIe-to-FireWire adapter for about 10€ and a used but really good FireWire audio interface for little money. Sell that Scarlett device.

Notebook, no choice (for sure I wouldn’t bother with USB if I had a PCIe slot), but I don’t think that’s the issue here.

I do have stable audio at a much lower latency than what I’m trying to achieve with Renoise. The Scarlett 2i2 works ok.

Investing in a higher end device like an RME babyface wouldn’t help much if the perf bottleneck is the system or Renoise itself…