Hello,
Out of curiosity, I decided to do a performance/efficiency comparison between the Linux and Windows platforms.
The result was disappointing for Linux.
First, system and platform information:
Intel Core 2 Duo E6850 3 GHz
DDR2-800 4GB Dual Channel RAM
Radeon HD 5770
Integrated Intel HDA audio and a Creative Audigy 2.
Linux: Linux Mint 11 x86_64 (based on Ubuntu 11.04) and Fedora 14 i386
Windows: Windows 7 SP1 x86
For my ‘empirical’ benchmark I used the demo song ‘Soon soon’ and monitored the CPU utilization stats in Renoise (upper right corner).
I have read that CPU frequency scaling can affect the accuracy of the CPU utilization monitor, so I disabled frequency scaling in the BIOS and set the CPU to its lowest constant frequency of 2.0 GHz, which also makes performance differences more apparent.
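In case anyone wants to double-check this on their own machine, here is a quick sketch (assuming the standard Linux cpufreq sysfs paths) that prints the current clock so you can verify it stays pinned during the test:

```c
/* freqcheck.c -- quick sanity check (standard Linux cpufreq sysfs path)
 * that the clock really stays pinned while benchmarking. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    char buf[64];
    if (!f) {
        perror("cpufreq");
        return 1;
    }
    if (fgets(buf, sizeof buf, f))
        printf("cpu0 current frequency: %s", buf); /* value is in kHz */
    fclose(f);
    return 0;
}
```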
On the Linux side, I also ensured that 3D window managers/desktop effects were all disabled and/or uninstalled.
On both platforms I set the sampling rate to 48 kHz and used equivalent buffer settings (64 ms latency on both Windows and Linux; 1024 frames with 3 periods on Linux). Both were set to use 2 CPU cores.
Drivers: ALSA on Linux, DirectSound on Windows.
All the other Renoise settings are the same on both platforms.
Now to my findings:
There is practically NO performance difference between Fedora and Mint.
There is a very tiny difference, maybe 1-2%, between running the 32-bit and 64-bit versions of Renoise on Mint x86_64, with the 64-bit version being 1-2% better in terms of peak CPU usage.
No difference when I set it to use the Audigy card instead of the onboard HDA.
Windows:
6-7% in intro
10-11% when vocals begin
14-15% in chorus part
18-19% in short instrumental part after the second chorus.
Linux:
10-11% in intro
16-17% when vocals begin
21-22% in chorus part
28-29% in short instrumental part after the second chorus.
So, on Windows it never goes above 20%, whereas on Linux it reaches a peak CPU utilization of almost 30%.
So, there you have it.
Now I wonder: is the Linux platform inherently less efficient than Windows for such workloads? Or is the Linux version of Renoise missing some optimizations that the Windows version has?
It’s old news. Ubuntu got quite bloated somewhere around version 8.0, with the introduction of PulseAudio as the default audio layer on top of ALSA and fancy extended window managers enabled by default.
You can get more performance out of it by simply stripping Linux down to its essential core functionality, or by picking a low-profile distribution that doesn’t have much fancy stuff running by default.
I doubt Renoise is poorly optimized for Linux. It could perhaps use some better tweaks, but that is usually the case with all programs regardless of platform; you can be assured that Renoise is optimized for maximum performance.
On the other hand, we supply notes on how to optimize Linux for high audio performance so that you can get the best out of it, e.g. prioritizing the audio thread and using an RT-kernel Linux version.
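For illustration only (this is not Renoise’s actual code), raising a thread to realtime priority on Linux boils down to something like this; the priority value 70 is just a typical choice:

```c
/* rtprio.c -- minimal sketch of the "prioritize the audio thread" tweak.
 * Needs rtprio permissions (e.g. via /etc/security/limits.conf) or root.
 * Build: gcc rtprio.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = 70; /* 70 is just a typical audio priority choice */

    int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (rc != 0)
        fprintf(stderr, "SCHED_FIFO failed: %s\n", strerror(rc));
    else
        printf("running SCHED_FIFO at priority %d\n", sp.sched_priority);
    return 0;
}
```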
Still, Linux remains mainly for the more experienced user when it comes to performance-critical audio apps, so if you want to get the best out of it, get experienced quickly.
I think you are missing the point a bit here.
In the past I set up a realtime-audio-enabled configuration for Rosegarden, jackd, etc. on Gentoo.
That included patching the kernel with Ingo Molnar’s realtime patches, configuring PAM, and so on.
So I don’t think this is about ‘getting experienced’.
Besides, what I measured, and the whole point of my post, is that Renoise by itself uses more CPU cycles to do the same work. That is what the CPU meter in Renoise measures anyway, isn’t it?
What you are talking about is the GENERAL system performance, the other factors that can eat CPU cycles. Besides, Renoise initializes ALSA on hw:0,0, which means it talks directly to the soundcard, not via PulseAudio, doesn’t it?
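For reference, opening the device directly in ALSA looks roughly like this (a minimal sketch with error handling trimmed): “hw:0,0” addresses card 0, device 0 with no plugin layer, whereas “default” may route through PulseAudio:

```c
/* alsadirect.c -- sketch of opening the soundcard directly via ALSA.
 * "hw:0,0" is card 0, device 0 with no plugin layer in between, while
 * "default" may be routed through PulseAudio on desktop distros.
 * Build: gcc alsadirect.c -lasound */
#include <stdio.h>
#include <alsa/asoundlib.h>

int main(void)
{
    snd_pcm_t *pcm;
    int err = snd_pcm_open(&pcm, "hw:0,0", SND_PCM_STREAM_PLAYBACK, 0);
    if (err < 0) {
        fprintf(stderr, "hw:0,0: %s\n", snd_strerror(err));
        return 1;
    }
    printf("opened hw:0,0 directly\n");
    snd_pcm_close(pcm);
    return 0;
}
```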
If my system were maxed out in this test, say 80% peak on Windows, then even if the Linux and Windows versions of Renoise were exactly equal in efficiency, on Linux I would get xruns etc. because of the extra load from Compiz, PulseAudio and so on…
But this is not the case, and it is not my point. The point is that Renoise in isolation seems to be significantly slower on Linux.
Renoise only measures the CPU usage of the audio thread; it doesn’t measure the complete CPU utilisation of the whole system.
I’m not sure if it can do so on Linux as easily as it can on Windows and Mac. If the CPU measurement on Linux is polluted with other resource usage, it wouldn’t necessarily mean that Renoise on Linux is less efficient, only that the measurement is.
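For what it’s worth, per-thread CPU time can be read on Linux via clock_gettime; this little sketch (not our actual meter code) shows the kind of per-thread figure meant here:

```c
/* threadcpu.c -- reads the CPU time consumed by the calling thread only,
 * i.e. the kind of per-thread figure an audio-thread meter is based on.
 * Build: gcc threadcpu.c (add -lrt on older glibc) */
#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 50000000L; ++i) /* some busy work to measure */
        x += 1e-9;

    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0) {
        perror("clock_gettime");
        return 1;
    }
    printf("this thread used %ld.%09ld s of CPU time\n",
           (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}
```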
Also, the handling of multicore CPUs can differ between platforms.
When comparing different OSs, make sure you are using the same buffer size instead of the same latency. Latency has no direct impact on performance; it only helps avoid xruns and crackles, whereas CPU usage consistently drops with higher buffer sizes. Using your example above, you should compare:
ALSA with a buffer size of 1024 with any number of periods (any number of buffers) against ASIO on Windows with a single buffer size of 1024. DirectSound does not really use a fixed buffer size, so it’s hard to compare.
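To make the arithmetic concrete, here is the buffer math in a small sketch; note how CPU load is tied to the per-callback frame count, while total latency also multiplies in the period count:

```c
/* latency.c -- the buffer arithmetic behind this comparison. CPU load
 * tracks the per-callback buffer (period) size; total latency additionally
 * multiplies in the number of periods/buffers. */
#include <stdio.h>

int main(void)
{
    const int rate = 48000;
    const int frames = 1024;

    /* ALSA/JACK style: period size x number of periods */
    int periods = 3;
    printf("ALSA: %d frames x %d periods = %.1f ms total latency\n",
           frames, periods, 1000.0 * frames * periods / rate);

    /* ASIO style: one buffer of the same size */
    printf("ASIO: %d frames, single buffer = %.1f ms latency\n",
           frames, 1000.0 * frames / rate);
    return 0;
}
```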
Could you post new results with this? Curious to see how much of a difference is still left on your machine.
Another possible cause can be the latency of thread/context switches to and from highly prioritized realtime threads in the OS. To find out if this really is a problem, avoid testing with multiple CPUs enabled. But this also should not cause THAT much of a difference. Realtime (low-latency) kernels can help a bit here as well, but again won’t do wonders; they mainly help avoid xruns.
Else there’s no real “missing” optimization in the Linux builds, but we are using different compilers for the platforms (GCC on Linux/Mac, VisualCPP on Windows). This of course has an impact on performance as well, but again should be hardly noticeable. My guess is that our GCC builds are slightly slower, up to 10%, compared to VisualCPP. Overall it should be far less.
GCC builds could be a bit faster if we enabled more aggressive optimizations for the DSPs, but we have had a lot of trouble with those aggressive optimizations in the past (especially with -ffast-math, which can also ruin sound processing quality), so we ended up with somewhat more conservative settings.
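As an illustration of the kind of breakage meant here (not our actual DSP code): -ffast-math lets the compiler reassociate floating-point operations, which can silently defeat compensated summation like this:

```c
/* fastmath.c -- not Renoise's DSP code, just an illustration: under
 * -ffast-math GCC may reassociate floating-point math, which turns the
 * compensated (Kahan) sum below back into a plain, less precise sum. */
#include <stdio.h>

float kahan_sum(const float *x, int n)
{
    float sum = 0.0f, c = 0.0f; /* c carries the lost low-order bits */
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;
        float t = sum + y;
        c = (t - sum) - y; /* algebraically 0 -- -ffast-math may "optimize" it away */
        sum = t;
    }
    return sum;
}

int main(void)
{
    static float x[1000000];
    for (int i = 0; i < 1000000; ++i)
        x[i] = 0.1f;
    /* Compare: gcc -O2 fastmath.c  vs  gcc -O2 -ffast-math fastmath.c */
    printf("sum = %f (should be ~100000)\n", kahan_sum(x, 1000000));
    return 0;
}
```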
All the internal “hand-made” optimizations that we have are already implemented for all platforms and architectures. (Well, some got dropped for PowerPCs, but PowerPCs are nearly dead now anyway.)
Hi,
Thank you for your detailed input on this.
I repeated the tests with the following new setup.
Windows: ASIO, 48 kHz, 512 samples, 10.7 ms
Linux: JACK, 48 kHz, 512 samples, 3 periods, 32 ms
Results:
Both platforms (especially Linux) benefit from ASIO and JACK respectively. Overall CPU utilization dropped slightly (~2%) on Windows and noticeably (~7%) on Linux.
Now the peak CPU utilization, around the 2:50 mark in the song, is:
17.0% in Windows
22.5% in Linux
The difference now seems to be about 30-35% empirically. This also holds for the whole course of the song, i.e. Linux CPU utilization stays at roughly 130-135% of the Windows figure throughout playback.
Regarding the compiler flags with GCC:
Are you using mtune=generic? I have read that it might generate ‘slow’ code. Someone disassembled the binary and found some redundant ops being generated compared to other mtune switches.
Maybe you could try mtune=atom? Fedora, for instance, has been using mtune=atom for compiling its binaries for quite some time now.
Since Atom CPUs lack instruction reordering, this optimization becomes more crucial for them given their slow speed and in-order execution core.
Maybe that would also make a difference for people who would like to use Renoise on the go on a netbook/nettop? I, for one, am one of those people. I just got an Atom D525 nettop mini-PC and would like to run Renoise on it. That is why I did these tests in the first place: I would prefer to use Linux instead of Windows.
For other, faster processors, mtune=atom would not necessarily be bad, since they are ‘fast’ anyway and have instruction reordering. Just a thought.
Anyway, maybe mfpmath=sse might also help the i386 builds without breaking anything (on x86_64 it is the default). Also, -fomit-frame-pointer is definitely safe and should be especially beneficial for i386, since it frees up a register and i386 is short on CPU registers. Likewise, -O3 should be safe. Maybe Renoise for Linux is already compiled with these flags?
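To see what mfpmath=sse changes, one can compile a trivial kernel both ways and diff the assembly; the build lines below are just for this experiment, not Renoise’s actual flags:

```c
/* gain.c -- trivial DSP kernel for comparing code generation. These build
 * lines are just for this experiment, not Renoise's actual flags:
 *   gcc -O3 -m32 -mtune=i686 -S gain.c -o x87.s
 *   gcc -O3 -m32 -msse2 -mfpmath=sse -mtune=atom -S gain.c -o sse.s
 * The first variant emits x87 fld/fmul/fstp sequences; the second uses SSE
 * multiplies instead, avoiding x87 stack juggling. On x86_64, SSE math is
 * already the default. */
void apply_gain(float *buf, int n, float gain)
{
    int i;
    for (i = 0; i < n; ++i)
        buf[i] *= gain;
}
```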
This already looks more realistic. Would you mind also testing the same setting with one CPU on that machine?
We are already using -O3 for the DSP stuff, else -Os. Wasn’t aware of “mfpmath=sse”; actually thought -msse enables that. Quickly tested, and this indeed seems to speed things up slightly (x1.05). It should also be safe, because our VisualCPP builds already use scalar SSE floating-point math.
We don’t want “-fomit-frame-pointer” because of the crash logs, but we already use “-momit-leaf-frame-pointer”.
Will look up and test “mtune=atom”; we currently use mtune=i686. Will also try out a few more things for the next releases.