Working with Binary Data in Swift
Parsing and dealing with binary data in Swift
Alright, so today I wanna talk to you a little bit about parsing and working with binary data in Swift. I kinda just stumbled into this, it's not really something that I needed to do for work or anything like that. But while I was at it I also tried to pick the scariest most Halloween-like theme that I could find in Deckset. Just to give it a bit of pizzazz.
So my name is JP Simard, I'm working on the Cocoa team at Realm and we had... Nice shout out by Javier earlier, thanks for that. We built a mobile database for iOS and so that has to do with data, but not really in the sense that I wanna talk about today. I wanna talk a little bit more about structured binary data and if you just have a raw set of bytes that you have to deal with in Swift, what are some of the good ways to do that? And it's kinda tricky because really there's no equivalent to Foundation's NSData in the Swift Standard Library, right? The Swift Standard Library team just kinda left us hanging with that.
So when we're working with a raw set of bytes, we have a few options. Number one is just to embrace Cocoa and that is really generally the best solution here. NSData has a bunch of smarts associated with it that will generally try to avoid making unnecessary copies and when you're dealing with data, this can matter a lot, right? You may have kilobytes, megabytes of data and making these copies in the wrong places can really hurt you, especially if you're building your APIs, and you don't really wanna have to worry about that, you kinda want the system to make the best decision, in ways that say the Standard Library's Array class, or Array struct rather, doesn't always guarantee. There's nowhere in the documentation that absolutely guarantees that, say, an array of UInt8s, right, so an array of 8-bit bytes, is necessarily going to be contiguous. Or if you try to take a sub-slice of this array here and make a slight modification, well, what's gonna happen? Are you gonna fork the underlying data structure? Are you going to completely copy it over? There are times when it's not exactly obvious from the API which of those operations you're explicitly doing.
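To make the slicing question concrete, here is a minimal sketch in current Swift syntax (not code from the slides). It only shows what is observable from the outside: a slice keeps its parent's indices, and mutating the slice never changes the parent. Whether and when the underlying storage is actually copied is left to the implementation, which is the uncertainty described above. The variable names are made up for illustration.

```swift
let bytes: [UInt8] = [0x10, 0x20, 0x30, 0x40]

// A sub-slice is an ArraySlice that reuses the parent's indices (1 and 2 here).
var slice = bytes[1...2]

// Mutating the slice leaves the original array untouched (value semantics).
// Any copy of the storage happens behind the scenes at this point.
slice[2] = 0xFF

// For the duration of the closure, the elements are exposed as one contiguous buffer.
let sum = bytes.withUnsafeBufferPointer { buffer in
    buffer.reduce(0) { $0 + Int($1) }
}
```

The closure-based accessor is the Standard Library's way of handing out contiguous memory without promising anything about how the array is stored the rest of the time.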
And so NSData is generally the way to go, really. And then there's other things that you can do. You can access the raw memory contents of whatever you're working with at a point or location, so using UnsafePointer. And then the three question marks that I have at the bottom here is really room for opportunity, right? Who knows? When Swift goes open source, we really don't know if there's gonna be a strong reliance on Objective-C or Core Foundation. How much of this is actually going to be part of that release? What's Apple going to do, as Andy Matuschak said a little earlier, when we try to move a little bit beyond and evolve past this Objective-C reliance? So that's a bit what I wanted to talk about.
One of the things that's really the best way, I think, to deal with binary data is if you can have some sort of layout aligned struct. So if you have a struct in Swift that has the exact memory layout of the bytes that you want to be parsing out of that structured binary data, it's really ideal. It can create some code like this, if you... Here we're using some unsafe pointers, and we're directly just reading in the contents of the memory directly inside our templated struct. So these are some generic methods that will really be the most efficient way to populate the struct because it's really just doing this raw memory copy. We could also do this while retaining the ownership of the data as part of NSData if we wanted to, but that's the beauty, we have the choice.
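The slide code is not part of this transcript, so what follows is only a rough sketch of the idea in current Swift syntax: a generic function that copies raw bytes straight into a value of some type T. The `Header` struct and the `decode` name are hypothetical, and the approach assumes T is a trivial type (fixed-width fields, no references) whose in-memory layout matches the bytes.

```swift
// Hypothetical layout-aligned struct: two 4-byte fields, no padding between them.
struct Header: Equatable {
    var kind: UInt32
    var length: UInt32
}

// Copies the bytes into freshly allocated memory for one T and returns the value.
// Assumption: T is a trivial type, so a raw memory copy produces a valid value.
func decode<T>(_ bytes: [UInt8], as type: T.Type) -> T? {
    let size = MemoryLayout<T>.size
    guard size > 0, bytes.count == size else { return nil }
    let pointer = UnsafeMutablePointer<T>.allocate(capacity: 1)
    defer { pointer.deallocate() }
    bytes.withUnsafeBytes { source in
        UnsafeMutableRawPointer(pointer)
            .copyMemory(from: source.baseAddress!, byteCount: size)
    }
    return pointer.pointee
}

// Each field is filled with a repeated byte so the result does not depend on
// the byte order of the machine running the example.
let raw: [UInt8] = [1, 1, 1, 1, 2, 2, 2, 2]
let header = decode(raw, as: Header.self)
```

Copying into separately allocated memory sidesteps any alignment requirement on the source bytes, at the cost of one small allocation per call.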
And then for encoding, decoding, you could do the same thing, alright? So these are some functions that would allow you to go from any sort of contiguous memory layout that is represented by types that are stored directly inline in the structure, so say a struct with no boxed values or indirect values. And so this can actually get you really far, so here is how you might be able to use this. In this case, we have an enum. This can be any enum, I just have an Either up here, but it can really just be whatever you like, any struct, anything that would have a contiguous memory location. And in this case, Either with the specialized variant here, in this case, we have a string. In this case, we have a concrete layout and a concrete size for this structure, and so it's easy for the code that we had just before to know exactly how many bytes it should be looking at. And so in this case, we can actually just go back and forth with this custom type. We didn't do anything like implement some sort of key value coding or NSKeyedArchiver or any sort of other de-serialization format, we're just really doing a raw binary dump.
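As a hedged illustration of that round trip (again in current Swift syntax, not the slide code), the sketch below dumps a value to bytes and reads it back. The `encode` and `decode` names are assumptions. The `Either` here is specialized with two integer types instead of a string, because in today's Swift a `String` can hold a reference to heap storage and is not safe to copy byte for byte.

```swift
// Hypothetical Either, specialized below with trivial (integer) payloads only.
enum Either<A: Equatable, B: Equatable>: Equatable {
    case left(A)
    case right(B)
}

// Raw dump: copy the in-memory representation of the value into a byte array.
func encode<T>(_ value: T) -> [UInt8] {
    withUnsafeBytes(of: value) { Array($0) }
}

// Reverse direction: copy the bytes back into memory typed as T.
// Assumption: T is trivial and the bytes came from the same build and platform.
func decode<T>(_ bytes: [UInt8], as type: T.Type) -> T? {
    let size = MemoryLayout<T>.size
    guard size > 0, bytes.count == size else { return nil }
    let pointer = UnsafeMutablePointer<T>.allocate(capacity: 1)
    defer { pointer.deallocate() }
    bytes.withUnsafeBytes { source in
        UnsafeMutableRawPointer(pointer)
            .copyMemory(from: source.baseAddress!, byteCount: size)
    }
    return pointer.pointee
}

let original: Either<UInt32, UInt64> = .right(42)
let dumped = encode(original)
let restored = decode(dumped, as: Either<UInt32, UInt64>.self)
```

The byte count is whatever `MemoryLayout` reports for the specialized enum, which is how the code knows exactly how many bytes to look at.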
Most of the time, if you're doing your entire app's data model like this, it'd generally be a bad idea for a number of reasons, but mostly because this is not optimized in any way. Usually, what's useful as an in-memory representation isn't exactly the most compact way to represent things. Sometimes, you have to byte align things for the CPU registers. It's not exactly ideal. But in this case, you can actually go pretty far and you don't need any really complex de-serialization. So that's one way that you can be parsing structured binary data, it's just doing a raw dump into some sort of memory aligned or byte aligned struct. And that's really ideal because, if we look at the functions from before again, it's really just doing this copy.
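The gap between the in-memory representation and the most compact one can be seen with `MemoryLayout`. This is a small sketch with two made-up structs. The numbers in the comments are what current Swift toolchains report on common 64-bit platforms, and the language does not promise this layout.

```swift
// Hypothetical struct: a 1-byte field followed by a 4-byte field.
struct Padded {
    var flag: UInt8
    var value: UInt32
}

// Same fields, larger one first.
struct Reordered {
    var value: UInt32
    var flag: UInt8
}

// Five bytes of actual data, but `value` must start on a 4-byte boundary,
// so three padding bytes follow `flag`.
let paddedSize = MemoryLayout<Padded>.size            // 8
let paddedAlignment = MemoryLayout<Padded>.alignment  // 4

// Here the padding moves to the end, so it shows up in the stride
// (the distance between consecutive elements) and not in the size.
let reorderedSize = MemoryLayout<Reordered>.size      // 5
let reorderedStride = MemoryLayout<Reordered>.stride  // 8
```

A raw dump of `Padded` therefore writes three bytes whose contents are unspecified, which is one reason this is a poor storage format.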
Now, there's situations in which you won't exactly have a direct mapping between the bytes that you have and the structure in which you want to represent it. And so, in that case, you have to get a little bit more creative. And so, just to give a little bit of background information, I've been hacking on this tool called Jazzy, which is a documentation generator for Swift because Apple surely isn't going to be providing us something like that, even though they've done a great job with Playground documentation. But to generate HTML docs is something that we were kind of left high and dry in the community. And so, I started working on this tool and to do this, it's not like libclang exposes an API for you to be parsing Swift.
And so, I had to use everyone's favourite service, SourceKit. And this is the scariest slide of the talk, fitting for a Halloween themed talk. I actually considered just wearing a big flashing strobe of "SourceKit Terminated" on me as a costume, but figured that might be pushing it a bit. And so, SourceKit tends to be the thing that people really just loathe, because it's the thing that crashes, but the beauty of it, really, is that when it's working well, you just don't know it's there. That's the service doing things like syntax highlighting, code completion, symbol resolving in Xcode as part of the IDE. So, that's actually kind of nice.
I was toying around with kind of redoing the same calls as Xcode does, to extract the syntax highlighting information via XPC calls to SourceKit, and what I saw is that, say, you had a Swift program like this, extremely kinda simple, but you'll see why in a minute. This is really kind of a minimal case of some Swift code. I wanted to get the same information that Xcode would have to generate the syntax highlighting information. And so, you make the XPC call to SourceKit, and then you get back a bunch of bits. And you look at this, and you know it's not exactly obvious what you should be doing, but the more you kind of look at it and squint, and that's really all that reverse engineering is, is you just look at a thing for long enough, and you try to poke at it and try to see if these values mean anything, you end up realizing a few things, right.
This is structured data. You have 16 bytes for every syntax highlighting token. And in there, you have a unique identifier for what kind of token it is, if it's a string, if it's an object literal, if it's an operator, whatever. And then, after that, you have the location. And so, this was really just found out via trial and error and stuff, but the point here is not to be too scared of this, and just to see if there is a way for us to represent this in a model. So, here's a basic plain old Swift struct called SyntaxToken, and it has the information that's stored in this binary format. Swift lends itself really, really well to parsing this kind of structured data, especially if you can use Strideable. Strideable is a protocol available in Swift that allows you to do something like this.
So, here we're striding through. We see that we have blocks of 16 bytes, starting at the first 16 byte offset, and then from there, we know that every time that we iterate through this, or we map through these elements, we'll have the data that we want. And so, you kind of start off like this, and it tends to be extremely legible when you're doing stuff like this. Here, we can use NSData. So, assuming that we have some data that we're starting with, here you can copy out the bytes at the specific locations, and using Strideable, this ends up really communicating the intent of what we wanna be doing. And then, from there, you can just initialize your struct with this. So, in this case, we didn't have an exact one-to-one mapping with the memory layout of the struct, but we could still very succinctly kind of show how to populate the structured data, go from a raw bag of bits to the constructs that we know and love in Swift.
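Here is a minimal sketch of that striding approach in current Swift syntax, using a plain byte array where the talk uses NSData. The transcript only says that each token is 16 bytes, holding a kind identifier followed by a location, and that tokens start at byte offset 16. The exact field split below (an 8-byte kind, a 4-byte offset, a 4-byte length, all little-endian) and the helper names are assumptions made for illustration, not SourceKit's documented format.

```swift
// Field widths are an assumption; see the note above.
struct SyntaxToken: Equatable {
    let kind: UInt64
    let offset: UInt32
    let length: UInt32
}

// Hypothetical helper: assembles an unsigned integer from little-endian bytes.
func readLittleEndian<T: FixedWidthInteger & UnsignedInteger>(
    _ bytes: [UInt8], at index: Int, as type: T.Type
) -> T {
    var value: T = 0
    for i in 0..<MemoryLayout<T>.size {
        value |= T(bytes[index + i]) << (i * 8)
    }
    return value
}

// Walks the buffer in 16-byte steps, skipping the first 16 bytes,
// and builds one token per block. A trailing partial block is ignored.
func parseTokens(_ bytes: [UInt8]) -> [SyntaxToken] {
    let tokenSize = 16
    return stride(from: tokenSize, through: bytes.count - tokenSize, by: tokenSize).map { start in
        SyntaxToken(
            kind: readLittleEndian(bytes, at: start, as: UInt64.self),
            offset: readLittleEndian(bytes, at: start + 8, as: UInt32.self),
            length: readLittleEndian(bytes, at: start + 12, as: UInt32.self)
        )
    }
}

// Made-up input: a 16-byte header of zeros followed by two tokens.
let sample: [UInt8] = [UInt8](repeating: 0, count: 16)
    + [1, 0, 0, 0, 0, 0, 0, 0,  5, 0, 0, 0,  3, 0, 0, 0]
    + [2, 0, 0, 0, 0, 0, 0, 0,  9, 0, 0, 0,  4, 0, 0, 0]
let tokens = parseTokens(sample)
```

Reading each field byte by byte is slower than a raw memory copy, but it works even when the struct's layout does not match the wire format.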
So, how could you improve on this? How would we make our own kinda NSData for Swift? Say, when Swift is open source, you wanna be running a Swift server on Linux, if that's even going to be possible, who knows? What should you do? Well, something that makes a whole lot of sense to me would be to have it as a collection type, specifically an ExtensibleCollectionType, so that you can perform some mutation on it. And under the hood, this would probably just end up being a flat array of UInt8 with probably a little bit of additional logic about when to do the appropriate copying, how to enforce this contiguous buffer of information, and probably have a bunch of helper methods, like converting endianness and doing other things like that. So that's all I had. I've got a few links up here that you can't click on when you're just watching this talk, but you should be able to once I post it up. And we can open up the floor for questions. Thanks.
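A rough sketch of what such a type could look like in current Swift, where the protocol for collections that can grow is named `RangeReplaceableCollection`. The `ByteBuffer` name and the `append(bigEndian:)` helper are hypothetical, and this version simply leans on `Array` for storage and copying behaviour without enforcing anything extra.

```swift
// Hypothetical byte container backed by a flat array of UInt8.
struct ByteBuffer: RangeReplaceableCollection {
    private var storage: [UInt8] = []

    init() {}

    var startIndex: Int { storage.startIndex }
    var endIndex: Int { storage.endIndex }
    func index(after i: Int) -> Int { i + 1 }
    subscript(position: Int) -> UInt8 { storage[position] }

    // The single mutation primitive; append, insert and remove are
    // provided by the protocol on top of it.
    mutating func replaceSubrange<C: Collection>(
        _ subrange: Range<Int>, with newElements: C
    ) where C.Element == UInt8 {
        storage.replaceSubrange(subrange, with: newElements)
    }

    // Endianness helper: appends an integer most significant byte first,
    // regardless of the host's byte order.
    mutating func append<T: FixedWidthInteger>(bigEndian value: T) {
        let bytes = Swift.withUnsafeBytes(of: value.bigEndian) { Array($0) }
        storage.append(contentsOf: bytes)
    }
}

var buffer = ByteBuffer([0xCA, 0xFE])
buffer.append(bigEndian: UInt16(0x0102))
buffer.append(0xFF)
let contents = Array(buffer)   // [0xCA, 0xFE, 0x01, 0x02, 0xFF]
```

Implementing one requirement, `replaceSubrange`, is enough to get the rest of the mutating API from the protocol's default implementations.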
Q1: Hi, great talk. I really enjoyed it. I had a question about the decode and encode routines you had up in the beginning. I was wondering how much do those depend on guarantees that are made in the Swift documentation about memory layout of structs, or what kind of API or promises does the Swift language ever give you about memory layout?
JP: None at all. No, this is really to the letter an implementation detail that just so happens to work. Now one of the things that arguably you could say for Swift, it really depends on what you're building here, but since you do have to ship with the Standard Library, what works when you deploy your app, if there's an OS update, it's unlikely that that'll break, because you're shipping with the Standard Library and all those frameworks. But I wouldn't rely on it for any sort of production tool. I kinda came across building this as building a developer tool, and that kinda has slightly lower stakes, and I'm not doing anything like this. So.
Q2: Hello, if you go back to the slide with your example struct, with the string and the ints and so forth, is it possible... So let's say that type was not a string, and type was actually an int or something, and you knew the exact size of each of these, is it possible in Swift like it is in C to just read memory into that struct?
JP: Yeah, that's what we get here. So, looking at the decode, you're getting those bytes and just reading them in and saying, "Well, this isn't just a random collection of bytes anymore. Just pretend like this is actually our struct." And so this might look a little strange because I've templatized this and kind of made it generic, but really that's all this is doing. So here we're going from struct to bytes to struct again, but if you just started with the bytes you could do the same thing.
Q3: As a quick follow up, have you explored the performance differences between doing that in C and doing it in Swift?
JP: No, not compared to C, no, but I have compared kinda doing this moving of the memory versus doing the other approach that I mentioned with Strideable where you're actually reconstructing, and these are two very naïve approaches here that I demonstrated, for the purposes of slides and stuff. But this approach is definitely a lot faster because all you're doing is reassigning that memory.
Q4: Hi JP. Thanks for the talk, I love Jazzy, I'm also looking forward to the day when a Swift server will be able to run on Linux. My question is what are your thoughts about when you start to change one of your data types, one of your structs, and like say in your next version you wanna add a new property, a new variable to your struct, and you wanna version that binary data, but it might be incompatible when you decode it in the future?
JP: Yeah, so this... That's really a great question, and really I'd wanna go back to one of the first points that I made, which is if you're using this as a data modelling approach I'd encourage you to consider alternatives, would be the most PC way that I could say that, I guess. Yeah, you probably should not, especially for the very case that you provided, which is it really does not play nicely with, say, you adapting your code. This is a very unforgiving approach where all the bytes have to align exactly for this to work at all. And so if you're using tools like a database, like Realm, or SQLite, or Core Data, then the way that those are represented in the binary format is a lot more robust and does allow for things like schema changes while still being able to read the file. And the way that's done is through kind of a consistent data format that you can use. In Realm's case and in SQLite's case for the most part, it's generally a giant B+ tree that then has associated information to those nodes. That's a lot easier to read because the format doesn't really change, you kind of have this schema for your schema, you know? So if you're using this as a data modelling approach, don't.