Matt Gallagher at Playgrounds Conference 2017
High performance Swift
Hi. My name is Matt Gallagher. Jesse mentioned I write a blog called Cocoa with Love. I also do mostly video server stuff. There is an app on the app store, StreamToMe, that is mine. That doesn't get as much love as it should, because I'm mostly working on the back end stuff. Anyway, today I'm going to be talking about high performance Swift. I'm going to be talking about the sort of code that might be run 10 million, or even 100 million times a second. It's not higher level abstractions, it's the very low level stuff that might be right in the middle of one of your core loops.
I'm going to look at how you analyze code at that level, and how you can get the best performance out of that. Also, I'm going to look at some of the things that impact code at that level. I won't be talking about a lot of general optimization topics. Performance optimization is a huge topic, and there just isn't time for that today. There's also lots about Swift that impact performance that I won't be covering. Strings, copy on write, memory allocations. These are all important, but I'm going to try and keep it very narrow, at a very low level. There's the question of whether or not Swift is a high performance language. Of course, Apple went and named it Swift which implies speed.
“There's the question of whether or not Swift is a high performance language.”
Even though they use Swift as a bird in the icon, they know that they want you to think of it as a fast language. Indeed, when Craig Federighi first introduced Swift at WWDC in 2014, the first adjective he used to describe it was fast. The only comparison, the only justification he gave to that, was to compare it to Python and to Objective-C which is a spurious comparison. When you're trying to get things to work fast, even in Python, you don't write things in Python. When you're trying to get things to run fast in Objective-C, you don't write them in Objective-C. In both cases, you write them in C.
How does Swift compare to C?
It's a more interesting comparison to say, how does Swift compare to C? Indeed, why is even C considered a baseline for performance? You often see multi-language comparisons between Java, Go, C++ and at the bottom, C. C is always the 1.0 baseline on these comparison tests. It really comes down to abstractions. There's machine code, assembly language. We don't want to write in assembly language, but when you write in C, you're really writing with the same sorts of things that have almost a one-to-one mapping between assembly language and C, in a lot of cases. You have C ABI function calls.
By that, I mean it pushes a couple things to the stack and it preserves a few registers. Otherwise, it's just a jump. It has pointers and function pointers, and it has arithmetic and logic operators. That's really all there is in C. There's no grand level of abstraction that gets in the way between you and your code. The question is, can we get this in Swift? I'm going to start by looking at this C code:
…This is the Mersenne Twister. If you're a avid reader of my blog, you may have seen I actually talked about this particular piece of code in a comparative piece on random number generators.
I like this code because it's fully self-contained, there's no calls out to other functions. It's just a few bit shifts, a few ads, and a few accesses to an array. I'm going to compare that to, basically the same implementation in Swift:
It might not be too clear from that how close they are, but there they are, side by side. You can see that they have the same if condition. They have the same for loops, obviously using C for loop syntax and Swift for loop syntax. There's the same ands, and additions and bit shifts. Down the bottom, it ends with the same xors and bit shifts and ands.
This is the performance, running at 100 million times in C takes about .62 seconds, and in Swift takes 1.04 seconds.
That's writing them syntactically as similarly as possible, between the two languages. Swift is about 60 to 70% slower than C which, if you look at a number of other languages that consider themselves to be fast, languages like Java and Go, 60 to 70% is not too bad. Most of them are happy to be two times slower. If we look at the results, we can see why this performance difference exists, and we can do better.
I want to look at exactly what I'm doing here. What I've highlighted here is a test case, running the C version of the random number generator, and the Swift version:
If you're not familiar with profiling, you can go to Product > Perform Action > Profile, then select "profile your tests". What I normally end up doing is right-clicking on these little guys. You can select profile your tests, or you can do the same in your test cases over here. I'm going to jump ahead to where I've already done that. You select time profiling. You hit record, and you get numbers for your thing.
What I'm actually going to have a look at, is jump straight ahead into the actual, this is the C code. The C code is here on the left. You can see performance numbers, then you can see the assembly I've got up on the right, there.
The performance numbers aren't necessarily as interesting for what I'm showing you here. Actually, it's more interesting to look at the assembly. You can see there's one red jump here. This corresponds to the if condition. Then you can see two indented sections, those are the two loops. Each of those has another jump at the bottom of them. Other than that, you'll notice there's nothing but basic assembly instruction. That's the C code.
If we switch instead, to the Swift code, you can see a very different situation. There's immediately a lot more jump instructions:
There's code like this “DYLD-STUB$$swift_once”, which is used for initializing global variables. There's isUniquelyReferencedOrPinned, which is used to check if an array is shared. Even more alarmingly, if I get past all of the loop bits, is code like this which is jumping to 0x35a. Which, if we scroll down, we're looking for 35a in this column here. 35a is there. It's a ud2 instruction, which is a bad instruction. Which is deliberately invoked when Swift hits a precondition failure.
A number of these things that I'm looking at, and a lot of the reason for the performance difference here, is that Swift is checking each of its array indices. If the array index is out of bounds, it's jumping to the ud2 instruction. There's also a couple other things in this code where every time there's an addition, the addition is checked for overflow. All we need to do then, is really analyze why these things are coming up, and how we can get rid of them. With arrays in Swift, we have a few different options for simplifying them down. If I open up step two here:
One of the options we have is to, instead of using array indices directly on the array, we can use withUnsafeMutablePointer to get and unsafePointer into the array.
Then the array indices on the unsafeMutableBufferPointer are unchecked. We can get past the performance implications that are causing this slowdown. I'll step ahead to doing that. The no array indices, if I show you this code:
You’ll notice that it's flagged timing information on these steps. It has not flagged timing information on everything inside that withUnsafeMutableBufferPointer. Problem here is that, we're writing in Swift, but Swift still has some issues in its optimizer. The contents of that section haven't been inlined. That's a problem for performance, because when you can't inline you've got the abstraction of the function call.
All of the pushing to the stack and going into there. We're wasting time doing the inlining, but even worse than that, this is their top level call, it's still wasting, if we have a look at this, two seconds in outlined make unique buffer.
Arrays in Swift a copy on the right. If the array isn't unique, the array is going to be copied. This is a 312 pointer sized array. That's getting copied every time it needs to go into this middle section of the random number generator. Arrays here, they should work. It should be optimized, but Swift doesn't choose to optimize it. We end up having to bypass arrays. There's a trick for bypassing an array in Swift, which I'll put under here, under no arrays.
If you have a look at this, this is a tuple:
It contains, as you might expect, 312 uInt64s, all initialized to zero. It is laugh worthy, but it is also how C libraries are imported into Swift. If you have a look at what the importer does, if you have a static array in C and you import it into Swift, this is how it gets imported. If we take an unsafe pointer directly to that tuple, and we recast it to, I'll point it to uInt64, then we get essentially, a statically allocated array. Also, to get around the problem that Swift wasn't inlining the contents of these functions, you notice all I've got in the contents of this is a dollar zero.
I am returning that pointer to outside the call with UnsafeMutablePointer, which is kind of naughty but, because that's owned by self and self isn't going to change during the contents of this particular function, it is actually, just about memory safe. What happens when we do that, let's go back here. Let's have a look at the timing data. The C one was .62 seconds for 100 million iterations. The first Swift Pass was 1.04. No array index, because it failed to do that inlining. It ended up doing the copying of the array, because the optimizer erroneously thought the array wasn't unique. Went up to 1.49 seconds. Minus no arrays, we get down to .66 which isn't quite what C did.
Like I said, there were still a few things that I didn't fix. Namely, additions. I didn't fix up additions, because all additions in Swift are checked. We can fix our additions by not doing that, good. Where did it go? I've lost it. Down here. Here we go. Instead of writing I plus one, we can write I and plus one which is an unchecked addition.
Our fixed code: 0.52 seconds. A good 20% faster than C which is, of course, scurrilous and cheating. What I didn't mention when I showed you the corrected code is, what I'd done is, I'd merged those two loops that were done back-to-back into a single loop. Which, I didn't do with any intent at the time. It turns out that the effect of that is to let the contents of the loop. If you have a look at the assembly instructions here:
These are all p ands, and p x ors, and p subs, which are the vectorized versions of those calls.
What's happening is, you put the two halves of the loop together and it vectorizes. It's literally doing both halves at the same time. There's a little bit of juggling so you don't step on your own values as you're stepping through. If you merge the two loops together, you get the same call. You can fix the C to do the same thing. I actually have a fixed version of the C code somewhere. So you fix the C code in the same way. Merge the two loops, and the C code. Also uses the same vectorized instructions. Now I'm going to switch to the Swift code. Do you see the difference? Almost nothing.
It's changed the name of the function, it's changed a couple offsets. This shouldn't be surprising. C and Swift, both use the same LLVM backend optimizer. If you're prepared to limit yourself in Swift to UnsafeMutablePointers, to unchecked additions, and to other basic operations like function calls to non-generic, non-class functions, then Swift and C are the same. Any time you think you might be able to get better performance in C or Swift, it's probably because you're not doing the same thing in each language. If you actually do the same thing, and you can do the same thing, the two have identical performance because they generate identical code.
Like C, Swift has Cabi with pass by value semantics. It has pointers. It has basic arithmetic and logic operations. If you use it to write C, it performs ... I said this already. C doesn't have many performance abstractions. That's why it's considered high performance. Even if you wanted to, you can't introduce a whole lot between you and what the compiler is generating. C has the minor problem of trying to ensure that functions are inlined, but actually that's a kind of big problem in Swift, too. The code that I've showed you so far hasn't had any function calls.
I actually had to set up the compilation environment very carefully, so that there was one non-inlined function for Swift and one non-inlined function for C. When you start dealing with more functions in Swift, there's a lot more to deal with. Here's an example of some kinds of function you can invoke in Swift:
There's an empty function. Takes about two nanoseconds. There's a function passing a single parameter, 2.86 nanoseconds. Not so strange. If you include a single reference parameter, so that basic class which is passed by reference. The way I invoked it, the reference is held onto in the loop, over each one, suddenly imposes an extra, what is it? 13 nanoseconds on the call.
If it's an objective C class, it's going to be a lot more, but if you flag it in out, because you're retaining it in the surrounding loop, it's just passed as a pointer. It's just 2 nanoseconds again. These are the sorts of things you need to think about in Swift if you're really trying to get decent performance. It's, "how are all of these function calls that I'm using being interpreted by the compiler?" If you have a generic in Swift, but the declaration of that generic and the usage of that generic are in the same module, and the compiler will resolve the generic so it's not actually compiled the generic code.
If it can't resolve a generic, there's a 1 parameter, 2.86 seconds, just passing exactly the same parameter as a generic takes 12.3 seconds. The compiler opens it up, has a look at the metadata and checks what it has. If you pass something generic over a sub-parameter, so it unwraps the result and then looks at the generic, it doubles the time taken again. It has to do two inspections of metadata. On the previous slide, when we passed a class reference by in out, it improved performance, because the class reference only needed to be passed by in out. But when you pass a generic, this is an enum. This result type is an enum. Swift needs to rebuild an enum to pass back out again.
You can see, it's gone up to 84.3 nanoseconds. It gets worse, because that result t-type that I'm using is a single int, and then a one byte to say which branch of the enum it's using. If I use three ints, then it gets bigger than Swift's existential type. Which is how it uses to communicate information about generics that haven't been specialized. The existential type has room for three ints. This is now three ints plus a byte. It needs to be malloc'd on the stack. What would be an inlinable function in some cases, or merely just 2.8 nanoseconds for a call, is now 359 nanoseconds.
On the plus side, I showed you some profiling information from C code and the Swift code. When I was looking at that, I didn't really pay a whole lot of attention to the timing information, because it's all over the place and it doesn't tell you anything useful. When we get to these sorts of function calls, here's all of those functions that are on that slide, you can see that actually, function calls give very useful information in the profiler. If you're bringing up this information, you can see that their relative performance. You can see the great big red marker on this one.
At least the profiler tells you where to hone in on these sorts of things so you can say, "You know what? I need to work harder to try and specialize this, to make sure these two things are compiled in the same module. So the Swift optimizer can optimize this away and this bottom case might reduce down to this top case." I have been given my five-minute warning, so I might just finish off pretty quickly. There's other things too, to remember. I'm going to briefly show this slide.
These are the different ways you can invoke a function on a class.
A function, by default, isn't final and has to look up a v-table reference. That's what you see in that second row, the non-final function. If you flag your functions final, 2.1 nanoseconds, it's basically just passing the self parameter in out. If you have to do it dynamic, you've got a lot more overhead. Actually, the timing information here isn't the most important thing. You shouldn't really be worried about if you need to use sub-classing or other types of abstraction. Do keep in mind, particularly for your internal functions, things that aren't shared externally on your interface.
Final is the only one that can be inlined. That final isn't necessarily 2.13 nanoseconds. It might in fact, be zero. There's quite a few orders of magnitude improvements between zero and 2. There's some other things to think about. We're used to arrays in Swift being fairly quick. You saw the slide I did on the random number generator where I was originally using Swift arrays. They were only about 60 to 70% slower than C arrays, but if they're generic, then there's a lot of levels of abstraction there that suddenly aren't inlined anymore.
You do need to keep in mind that generic things when they can't be specialized, not only is there a little bit of performance problem at each level, but if there's six or seven levels it really adds up. Normally, you might be used to those six or seven levels optimizing down to zero. This isn't a malloc overhead like I showed in the other one. This is looking at the metadata on that generic parameter at each step. If you're trying to get things to run at a 100 million per second scale in Swift, it can be done but you need to understand things like pass by ownership in Swift.
With luck, this might get better. If you read the ownership manifesto that came out about a week ago, it talked about actually declaring variables owned, shared or even whole types as move only. These things are going to help a lot. Know the rules of in out. Sometimes it will help a lot. It will help a lot on class references. It can make generic enums a lot slower, so you have to understand what's happening there. Make sure your generics are specialized by keeping them in the same module, and hope things will get better in Swift 4. All right. Thanks very much.
If you enjoyed this talk, you can find more info: