Performance

I have three implementations of the B&W image processing core now. One uses GEGL, another is written in C#, and the third in C. On my Core 2 Duo, using only one core, GEGL can process 1.2 megapixels per second, and that’s without dodging and burning. The C# implementation does 6 MP/s, including dodging and burning (8.3 MP/s without), while the C implementation does 36 MP/s, including dodging and burning (59 MP/s without). I haven’t put any effort into optimizing either the C or the C# implementation. Given that my C# experience is still rather limited, I guess there’s considerable room for optimization there. Still, I expected more. If anybody knows how to make this C# program faster, I’d be thankful for any hints.

So I guess I’ll be using my C implementation for now and later add the C# one as a fall-back for better portability.

34 thoughts on “Performance”

  1. C# does array bounds checking for _every_ array access. Use “unsafe” code for the body of the most critical inner loops. Bounds checking is turned off in unsafe methods/code blocks.

  2. I guess I don’t understand something here, but how can the GEGL version be so slow? I would have guessed it would win out over the C-based one hands down.

  3. GEGL doesn’t have any special optimizations, does all the image processing in floating point and has to do the processing in stages – first the mixing, then the contrast, then the tinting. My C version uses 16 bit arithmetic and does everything in one function. I knew it would be faster, but I hadn’t thought that it would be by that much.

    Another advantage I have in my code is that I can compute a scaled-down version of the image by only calculating the pixels that I actually need. GEGL can’t do that yet, so what I’d have to do there is keep versions of the input image at several sizes so that I could compute scaled-down versions of the result “efficiently”.

  4. I took a quick look at your C# implementation and the thing that jumps out at me is passing arrays as arguments to your methods.

    In C# all arguments are pass-by-value unless overridden in the method signature with the “ref” or “out” keyword. I expect that for methods that are being called repeatedly with arrays as arguments, there is a lot of memory allocation and garbage collection going on.

    Regards,
    Warren Gallagher

  5. Arrays are reference types, so they are definitely not copied when passed to a function. What is copied is the reference to the array, but that has to be done in any case, and it’s just a pointer, anyway.
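    A tiny sketch (hypothetical names, not from the actual code) of what gets copied when an array is passed:

```csharp
using System;

class RefDemo
{
    // The parameter receives a copy of the *reference*, not of the array
    // itself, so writes through it are visible to the caller.
    public static void MarkFirst(ushort[] buf)
    {
        buf[0] = 42;
    }

    static void Main()
    {
        var samples = new ushort[4];    // all zeros
        MarkFirst(samples);
        Console.WriteLine(samples[0]);  // prints 42: same array, nothing was copied
    }
}
```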

  6. If you rewrote the code to directly use the IntPtr rather than copying to an array, you’d probably get a substantial increase in performance. No guarantees on that though ;)

  7. No, that’s not the issue. I’m just copying the array once, but I’m calling the Process() function a hundred times. That’s how it would work in the application, too: You load a source image once, but every time you change a setting it has to be processed again, using the Process() function.

  8. As they said, wrap it in an unsafe block and use pointer arithmetic instead of array access. You have to use pointers if you want to see any real benefit. I wrote some audio processing code a few months back, and the difference between array access and unsafe+pointer access was night and day.
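    A hedged sketch of that pattern (a hypothetical method and buffer, not the actual Process() code; must be compiled with unsafe code enabled):

```csharp
using System;

class UnsafeDemo
{
    // Hypothetical stand-in for an inner loop: invert a 16-bit buffer via a
    // pinned pointer instead of indexed access, so no per-element bounds
    // check is performed.
    public static unsafe void Invert(ushort[] data)
    {
        fixed (ushort* p = data)        // pin the array so the GC can't move it
        {
            ushort* cur = p;
            ushort* end = p + data.Length;
            while (cur < end)
            {
                *cur = (ushort)(65535 - *cur);
                cur++;
            }
        }
    }

    static void Main()
    {
        var buf = new ushort[] { 0, 65535, 1000 };
        Invert(buf);
        Console.WriteLine(buf[0]);      // prints 65535
    }
}
```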

  9. Not sure if this makes any difference or not, but in the following line of code:

    return (ushort)(f * 65535.0);

    …you are causing the float (f) to be cast to a double (since 65535.0 is a double, not a float) and only then to a ushort. I am not sure if the compiler will automatically just multiply by a float instead of doing what you told it to do and first cast to a double. So this may or may not make any difference.

    Use 65535.0f instead.
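    For illustration, both variants side by side; only the multiplication expression comes from the actual code, the hypothetical helpers around it are assumed:

```csharp
using System;

class SampleConversion
{
    // Hypothetical conversion helpers around the quoted expression.
    public static ushort ToSampleDouble(float f)
    {
        return (ushort)(f * 65535.0);   // f is widened to double first, then narrowed
    }

    public static ushort ToSampleFloat(float f)
    {
        return (ushort)(f * 65535.0f);  // stays in single precision throughout
    }

    static void Main()
    {
        Console.WriteLine(ToSampleFloat(1.0f));   // prints 65535
        Console.WriteLine(ToSampleDouble(1.0f));  // prints 65535: same result, extra conversion
    }
}
```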

  10. Thanks, I thought I had caught all of those. In any case, it won’t make a difference because FloatToSample isn’t called from the inner loop.

  11. I ran your C# code from a VS 2005 build, and I got about 118 million pixels/sec. Then I pinned the buffers (using the ‘fixed’ keyword) and it jumped to 140 million pixels/sec.

    For some reason I am getting an InvalidParameter exception when I try to run it on Mono to compare performance. Something with loading the bitmap. Maybe the path is incorrect. I will keep trying and post a result once I can get it to run on Mono.

  12. Hmm, it just fails silently when pasting code in the comments :((

    A major problem with your code may be that the main processing function is WAY too long. In fact it is so long that it likely doesn’t get optimized *at all*.
    And Mono doesn’t handle a lot of local variables well, and you have a HUGE number of them.

  13. OK, I had a bug in my timing code, so after fixing that, my results are as follows:

    .Net > 32 million pixels/sec
    Mono > 14 million pixels/sec

    After some minor but obvious optimizations…

    .Net > 34 million pixels/sec
    Mono > 16 million pixels/sec

    Personally I think the code can be optimized a lot. For instance, since the ‘i’ value does not change inside the loop, I think something can be done to get rid of the switch statement. Maybe rethink what it is supposed to do and find a way that does not require the switch statement.
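    One generic way to do that, as a hypothetical sketch (the real switch in the source dispatches on its own values): resolve the loop-invariant choice once, before the loop, for instance into a delegate.

```csharp
using System;

class HoistDemo
{
    // Hypothetical sketch: instead of re-evaluating a loop-invariant switch
    // for every pixel, resolve the choice into a delegate once, before the
    // per-pixel loop starts.
    public static int SumProcessed(int[] pixels, int mode)
    {
        Func<int, int> op;
        switch (mode)                 // evaluated once, not per pixel
        {
            case 0: op = v => v; break;
            case 1: op = v => 255 - v; break;
            default: op = v => 0; break;
        }

        int sum = 0;
        foreach (int v in pixels)
            sum += op(v);             // no branch on mode inside the loop
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumProcessed(new[] { 1, 2, 3 }, 1)); // prints 759
    }
}
```

    Note that a delegate call per pixel has its own overhead; when the cases only differ in constants, precomputing those constants before the loop is usually cheaper still.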

  14. I also did some testing (all on .Net):

    Your sample code without layer stuff: 5,376,000 pixels/sec
    My library using simple grayscale: 76,690,442 pixels/sec
    My library using multiplying grayscale (should be comparable to your sample): 50,597,647 pixels/sec

    Found another reason why your implementation is slow: you are copying data around way too much. For each single pixel you do
    11 reads
    15 writes
    without counting any variable read/write. Just pure data-access.
    It is possible to do with 1 read and 1 write if you use unsafe code.

  15. Yes, the code can be really easily optimized.

    if (x == width || y == height)
    r = g = b = a = 0; // Change this to return;

    Also store many of your sub-computed values, it makes code more readable and faster.

  16. thefoxisonline: I’m not sure what you count as “reads” and “writes”. If it’s array accesses that you’re referring to, then I’m not sure how you could reduce their number to one, given that you have to read and write at least the RGB values of the pixel. Yes, I’m aware now that you can do that without bounds checking if you use unsafe code, but it’s still a read…

  17. What I counted as operations were array reads, writes, and creates.
    You can read/write the values for a single pixel in a single 4-byte operation.
    Obviously you need to process them individually, but you will not get data-locality problems anymore…
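    A sketch of that idea (a hypothetical operation, assuming a 32bpp BGRA pixel layout): each pixel is read as one uint, unpacked, processed, repacked, and written back once.

```csharp
using System;

class PackedPixelDemo
{
    // Hypothetical per-pixel operation on a 32bpp BGRA buffer: one 4-byte
    // read and one 4-byte write per pixel instead of per-channel accesses.
    public static void Brighten(uint[] pixels, int delta)
    {
        for (int i = 0; i < pixels.Length; i++)
        {
            uint px = pixels[i];                 // single 4-byte read
            int b = (int)(px & 0xFF);
            int g = (int)((px >> 8) & 0xFF);
            int r = (int)((px >> 16) & 0xFF);

            b = Math.Min(255, b + delta);
            g = Math.Min(255, g + delta);
            r = Math.Min(255, r + delta);

            pixels[i] = (px & 0xFF000000u)       // keep the alpha byte
                      | (uint)(r << 16) | (uint)(g << 8) | (uint)b;  // single write
        }
    }

    static void Main()
    {
        var px = new uint[] { 0xFF102030u };
        Brighten(px, 16);
        Console.WriteLine(px[0].ToString("X8")); // prints FF203040
    }
}
```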

  18. I don’t know if this will help, but I think it would be better to declare the variables (index, maskIndex, …) outside the body of the for loops.

  19. schani: bmp.LockBits(rect, ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb) seems like 3×8 bit to me. And you should also be able to specify 32bpp.
    And Bitmaps should guarantee that data is aligned (exactly what the stride is good for ;)

  20. I’m reading the data in as 3×8 bit but I’m converting it to 3×16 bit because I want the engine to do 16 bit image processing.

  21. I’ve seen that. But it seems to me that you only do per-pixel processing so there is no need to keep the entire dataset but only the data for a single pixel. And in fact you hold each pixel multiple (6?) times.
    The library I mentioned does most calculations in double precision; however, as soon as the data is saved out it’s still only a 32bpp image (8 bits per channel).

  22. Well, the point is to do 16 bit processing from start to finish, i.e. to be able to process 16 bit source images and save the result as 16 bit images, so I don’t want to have anything to do with 8 bit images in my processing core. I’m loading the 8 bit image simply to have some image to work on.

    And I don’t think that I hold each pixel 6 times.

  23. OK, that is understandable; however, you then still only need 2 copies (8 bit source and 16 bit working copy).

    You have:
    1 copy (new Bitmap)
    1 copy (values array)
    1 copy (inData array * 2)
    1 copy (outData array * 2)
    1 copy (newValues array * 2)
    and you would have
    1 copy more if you didn’t overwrite your source image in the end
    and
    2 copies (masks, did not count these anyway, they are per pixel, not per color)

    If you load a single 16 megapixel photo your app needs about 400MB of RAM just for data.

    You could make the app about 10 times faster than it is now. There are more than enough suggestions in your comments. If you don’t want to, nobody is going to force you ;)

  24. It seemed I got some increase in speed when I didn’t perform arithmetic to compute the index position in the array. Instead I accessed inData directly with index, putting ib first, then performed ++index while accessing inData for ig, then again for ir, and the opposite order for outData. Optimizations were on, unsafe code was allowed and used (wrapping the inData and outData access parts). My system is really old and slow (P4 1.7 GHz, 512MB RAM).

    Hope it helps a bit.
    Itai.

  25. I managed to reduce the time by (1) eliminating the helper counters rowIndex and rowMaskIndex: instead of recalculating index and maskIndex on each loop iteration, I just add 3 to index and 1 to maskIndex at the end of the innermost loop. Works great with what I suggested in my previous post. In addition, I eliminated the ir, ig and ib variables and placed the inData access directly in the sgray computation line.
    In the outData lines, I put the last part of each line (which is the same in all lines) into a variable and used that instead. In addition, this line uses (65535 - inverted), which is actually *contrasted*, so I just used that instead (resulted in a noticeable gain).
    I converted the layers loop to a foreach loop. It didn’t help, didn’t hurt anything, and looks nicer.
    I placed the declaration of sgray outside the loops (only the ‘int sgray;’). It helped a bit.

    Hope this helps.
    Itai.
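    The incremental indexing described above, as a minimal sketch (hypothetical buffer layout: interleaved 3-channel rows with no padding):

```csharp
using System;

class IndexDemo
{
    // The pixel index is advanced by 3 per pixel instead of being recomputed
    // as (y * width + x) * 3 on every iteration of the innermost loop.
    public static long SumChannels(ushort[] inData, int width, int height)
    {
        long sum = 0;
        int index = 0;                // advanced incrementally, never recomputed
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                sum += inData[index] + inData[index + 1] + inData[index + 2];
                index += 3;           // next pixel: one add instead of two multiplies
            }
        }
        return sum;
    }

    static void Main()
    {
        var data = new ushort[2 * 2 * 3];
        for (int i = 0; i < data.Length; i++)
            data[i] = (ushort)i;
        Console.WriteLine(SumChannels(data, 2, 2)); // prints 66
    }
}
```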

  26. Hi again, sorry for the separate post, just wanted to share my results.
    The overall gain on my computer for a 4MP photo (2304×1728) is about 13% (from 64.8125 seconds to 56.53125 seconds), using the techniques I described in my two posts above.

    Itai.

  27. Thanks for all the helpful comments Itai and everybody else! As soon as I have time again I’ll try to improve my C# version of the image processing and put it into the application.

Comments are closed.