Tuesday, April 15, 2008

OSSpinLock :: Lock Showdown (POSIX Locks vs OSSpinLock vs NSLock vs @synchronized)

Reader Simon chimed in on my last article about threading in Leopard:
Worth mentioning as well is the OSSpinLock{Lock, Unlock} combination; these busy-wait, but are by far the cheapest to acquire/release. In the case of many resources and lightweight access (very few threads, and/or little work being done per access), they may pay off significantly. Consider the set/get methods for 10000 objects. Creating an OSSpinLock for each object is cheap (it's just an int), and if the likelihood of access to a given object is small (i.e., there's not 1 object constantly being accessed and 9999 very rarely), the OSSpinLock approach (implemented by memory barriers, IIRC) can really pay off, because the cross-section of thread conflict is minute.
Thanks for adding to the discussion, Simon; I did forget to include OSSpinLock. He is right: buried inside the libkern/OSAtomic.h header is the OSSpinLock API. OSSpinLock can be a particularly fast lock, however the Apple man page makes it clear that it is best used when you expect little lock contention. I started to develop a test to gauge this performance until I discovered that someone had already developed one for this purpose on CocoaDev. However, it left out @synchronized(), so I took that test and extended it to add @synchronized, to see both how fast OSSpinLock is and how much we really pay for using @synchronized(). The test sets and gets 10,000,000 objects while using locks in the form of POSIX mutex locks, POSIX read-write locks, OSSpinLock, NSLock (which uses POSIX locking) and the @synchronized() directive. I ran the test several times over and grabbed a snapshot of the 3rd run, which reflected roughly the average results I got before and after it. The graph shows the time per request, derived by dividing the total time it took to set or get all 10,000,000 objects by the number of requests. This was run on my 2.33 GHz Core 2 Duo MacBook Pro with 2 GB RAM.
[Image: lock_showdown.png - graph of time per request for each locking method]
[Image: lock_showdown_results.png - table of raw results]
Results are in microseconds (usec). As you can see, OSSpinLock is the fastest in this instance, but not by much compared to the POSIX mutex lock. We can also see the pain that Google described with regard to performance when using the @synchronized directive. When you run the test it's painfully obvious, even from a user perspective, that @synchronized costs you in performance relative to the other locking methods: where the other locks take about 2 (or at most 3) seconds to complete the test, @synchronized takes 4 or 5 seconds to accomplish the same task. Here are the total runtimes for the tests:

POSIX Mutex Lock (Set): 2.382080 seconds
POSIX RW Lock (Set): 2.881769 seconds
OSSpinLock (Set): 2.278029 seconds
NSLock (Set): 2.948313 seconds
@synchronized (Set): 4.310732 seconds

POSIX Mutex Lock (Get): 2.953534 seconds
POSIX RW Lock (Get): 3.390998 seconds
OSSpinLock (Get): 2.880452 seconds
NSLock (Get): 3.493390 seconds
@synchronized (Get): 5.098508 seconds

Ouch! Things aren't looking good for @synchronized at this point. However, as a compromise between speed and safety I like the POSIX mutex lock best, because (as stated on CocoaDev) it first tries a spin lock, which doesn't call into the kernel (this is why OSSpinLock is so fast here), and only goes to the kernel when the lock is already taken by another thread. That means in the best case it behaves essentially like OSSpinLock, without OSSpinLock's disadvantages: a spinning thread has to keep polling to check whether its resource is free, which (as stated earlier) doesn't make it an attractive option when you expect some lock contention. Still, Simon is right that, depending on how your application works, OSSpinLock can be an attractive option to employ in your app and is something worth looking into. This isn't a "this one is the best!" type of thing; as I've said before, it's something you have to test and see if it's right for you.
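In plain C terms, the guarded set/get pair that each locking variant in the test wraps looks roughly like this (a sketch with hypothetical names, shown with the POSIX mutex variant; the real test uses Objective-C accessors):

```c
#include <pthread.h>

/* One lock per object, exactly as Simon suggests: the lock lives
 * alongside the value it protects. */
typedef struct {
    pthread_mutex_t lock;
    int value;
} guarded_int;

void guarded_init(guarded_int *g)
{
    pthread_mutex_init(&g->lock, NULL);
    g->value = 0;
}

void guarded_set(guarded_int *g, int v)
{
    pthread_mutex_lock(&g->lock);
    g->value = v;
    pthread_mutex_unlock(&g->lock);
}

int guarded_get(guarded_int *g)
{
    pthread_mutex_lock(&g->lock);
    int v = g->value;
    pthread_mutex_unlock(&g->lock);
    return v;
}
```

Swapping in OSSpinLock (or NSLock, or @synchronized) changes only the lock/unlock calls; the benchmark measures exactly that difference.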

8 comments:

  1. Strange results.
    When working on a tool to count retain/release in a Cocoa app, OSSpinLock() proved so much faster than pthread_mutex_lock().

  2. Have you seen the OSAtomic... functions?

    For the retain count example, you can do: (please paste into a text editor, I can't use the code or pre tags here!)

    #import <libkern/OSAtomic.h>

    // Assuming an instance variable of: volatile int32_t _retainCount;
    // (the OSAtomicIncrement32Barrier family takes a volatile int32_t *.)
    // In Leopard, the value returned by -retainCount should be an NSUInteger. However, the value of UINT_MAX signifies a special case (an unreleasable object), and so is the maximum retain count value. This means that you can keep a 32-bit count and cast it in -retainCount.
    // Note: a 32-bit count also keeps this compatible with pre-Leopard systems.

    - (id)retain
    {
        // Increments using a memory barrier, which enforces the completion of any cached writes - a synchronisation point.
        OSAtomicIncrement32Barrier(&_retainCount);
        return self;
    }

    - (void)release
    {
        // The OSAtomic... functions return the new value.
        // Once the retain count reaches zero, the receiver has no owners and so can be freed.
        if (OSAtomicDecrement32Barrier(&_retainCount) == 0)
            [self dealloc];
    }

    - (unsigned int)retainCount
    {
        // Synchronise any cached retains or releases before reading.
        OSMemoryBarrier();
        return (unsigned int)_retainCount;
    }


    However, looking at what you said, maybe all you wanted was -retainCount all along?
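For readers on other platforms, the same pattern can be written with C11 stdatomic, a portable analogue of the Apple-specific OSAtomic... calls (hypothetical names; a sketch of the counting scheme, not a drop-in NSObject replacement):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* An object with a lock-free reference count. */
typedef struct {
    atomic_int refcount;
} object_t;

void obj_retain(object_t *o)
{
    /* atomic_fetch_add is sequentially consistent by default,
     * comparable to the ...Barrier variants of OSAtomic. */
    atomic_fetch_add(&o->refcount, 1);
}

/* Returns true when the count hits zero and the object can be freed. */
bool obj_release(object_t *o)
{
    /* fetch_sub returns the PREVIOUS value, so 1 means we just
     * dropped the last reference. */
    return atomic_fetch_sub(&o->refcount, 1) == 1;
}

int obj_retain_count(object_t *o)
{
    return atomic_load(&o->refcount);
}
```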

  3. Regarding the times between using @synchronized and the others, surely the difference is negligible in the example you give.
    As a user, I don't want to wait 5 seconds or 2.8 seconds. Ideally, it shouldn't happen. So the task should be taken away from the main thread.

    But looking at the figures, these things are taking fractions of microseconds each! Surely the tasks that they are supporting take much longer than this - writing a file is in the realm of milliseconds, a unit a thousand times that of a microsecond.

    However, if you are able to perform millions of tasks every second (e.g. calculations), one method of making this faster would be to complete tasks in bulk (a thousand or so, a millisecond's worth), and then perform the locking (synchronising).
    For user interfaces, this synchronisation should only need to occur every 1/60 of a second at most - the display rate. Any faster and the user won't see it.
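The batching idea this comment describes can be sketched in plain C (hypothetical names; BATCH would be tuned to roughly a millisecond's worth of work):

```c
#include <pthread.h>

static pthread_mutex_t results_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_total = 0;

#define BATCH 1000

/* Accumulate results locally and publish them under one lock
 * acquisition per BATCH items, instead of one per item. */
void produce(long n_items)
{
    long local = 0, pending = 0;
    for (long i = 0; i < n_items; i++) {
        local += 1;              /* stand-in for real per-item work */
        if (++pending == BATCH) {
            pthread_mutex_lock(&results_lock);
            shared_total += local;
            pthread_mutex_unlock(&results_lock);
            local = 0;
            pending = 0;
        }
    }
    if (pending) {               /* flush the remainder */
        pthread_mutex_lock(&results_lock);
        shared_total += local;
        pthread_mutex_unlock(&results_lock);
    }
}
```

The lock cost is amortised across BATCH items, so even a relatively expensive lock contributes a negligible fraction of the total runtime.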

  4. Sorry, I was not that clear.
    I worked on a debugging tool similar to ObjectAlloc, and when collecting information (for example the stack trace), I had to lock during retain and release.

  5. Anonymous 8:52 AM

    Any way to get a hold of this test app so we can compare the same code on various architecture?

  6. @m-g lavoie the link to the app's source code is in the article; if you click the CocoaDev link you're brought straight to the page with the source code. The link is also here: ( http://www.cocoadev.com/index.pl?LockingAPIs )

  7. Anonymous 12:54 PM

    Quick reply :-)

    Thanks. I stumbled on that page through Google and didn't have a reference point linking to the article.

    Well, I looked at the code and I thought some of the numbers seemed perhaps inflated by the actual code running around the locks. From this blog post, one is left with the impression that the execution time is spent mostly on the locks, and those numbers felt high to me. Anyhow, I took the liberty of slightly altering the code (removed the pool realloc at every 100 iterations) and added a "stubb" function to measure the actual overhead of the code without doing any locking at all.

    I also changed usage of gettimeofday to clock instead. For some reason, gettimeofday always returned me the same numbers!

    Also, for the heck of it, bumped up iterations by 50 times :-p

    Here are my results:

    Testing 50000000 iterations with setIvarMutex:...done
    total time: 11.648519 seconds

    Testing 50000000 iterations with setIvarRwlock:...done
    total time: 15.288417 seconds

    Testing 50000000 iterations with setIvarSpinlock:...done
    total time: 11.058454 seconds

    Testing 50000000 iterations with setIvarNSLock:...done
    total time: 15.434852 seconds

    Testing 50000000 iterations with setIvarStubb:...done
    total time: 9.509116 seconds

    Testing 50000000 iterations with getIvarMutex...done
    total time: 49.091204 seconds

    Testing 50000000 iterations with getIvarRwlock...done
    total time: 53.367624 seconds

    Testing 50000000 iterations with getIvarSpinlock...done
    total time: 50.074592 seconds

    Testing 50000000 iterations with getIvarNSLock...done
    total time: 53.201216 seconds

    Testing 50000000 iterations with getIvarStubb...done
    total time: 48.164448 seconds

    So, while the actual lock mechanisms vary greatly in execution time, it's clear that their overall impact is rather negligible. OSSpinLock is still the champ in all of these, and while this iteration of the code doesn't take the @synchronized case into account, I'm still surprised that a language feature would produce worse results.
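As a back-of-the-envelope check on those figures (not part of the original test), subtracting the stub baseline isolates the per-operation locking overhead:

```c
/* Per-op locking overhead in microseconds, given the total runtime of a
 * locked run, the stub (no-lock) baseline, and the iteration count. */
double overhead_usec(double locked_secs, double stub_secs, double iters)
{
    return (locked_secs - stub_secs) / iters * 1e6;
}

/* With the set-path figures above:
 *   overhead_usec(11.648519, 9.509116, 50000000.0)  ~0.043 usec (mutex)
 *   overhead_usec(11.058454, 9.509116, 50000000.0)  ~0.031 usec (spinlock)
 */
```

A few hundredths of a microsecond per uncontended acquire/release is why the locks barely register against the surrounding work.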

    I clearly had expected @synchronized to have resulted in better performance. How wrong would I have been had I not stumbled on this post.

    Thanks

  8. @m-g lavoie that's because @synchronized does more than just provide thread serialization for functions/data structures; it also sets up exception handling for your thread ( http://developer.apple.com/documentation/Cocoa/Conceptual/ObjectiveC/Articles/chapter_12_section_1.html#//apple_ref/doc/uid/TP30001163-CH19-SW1 )
