Understanding Why Strings are Evil
Samuel Giddins at Playgrounds Conference 2017
Why Strings are Evil. This is meant to be a controversial talk in the physics world, refuting string theory. No, not at all.
Strings. Just a quick roadmap of the many, many things we're going to cover today.
1. What is a string?
2. Why I decided to give a talk and talk about strings for 25 minutes instead of arrays, or integers, or floating point numbers, or monads. Although, I think many of you are probably sick of the monad talks at this point.
3. Strings, the way that we use them and we see them in our programming languages.
4. Character encoding, so how do we actually represent strings on a computer.
6. And how we put all of the different facets of strings together, and what makes them interesting in trying to ship apps.
What is a String?
Okay. What is a string? Some text. It's a bunch of letters, and weird characters, and upside down question marks, and interrobangs, and octothorpes, or everyone's favorite, emoji. It's control characters like new lines, tabs, spaces, backspaces, the bell character. A string is something like that: my Twitter handle, or your Swift code, that's represented as a string in a file. This is part of an executable, a compiled executable. I think this was "cat," but hey, that's a string, right? My email is a string. This is not just me trying to subliminally message you to get in touch with me and engage with my personal brand. Don't worry.
Yeah. Strings, first and foremost, they represent unformatted data. They don't generally say how to represent things with color, or bold, or they don't say what font you use to display a string in a label in your app. They're unformatted. They often contain a lot of extra meaning. Like in email, yeah, it's a bunch of characters, but you can use those characters to get in touch with people across the world. In some ways, strings are the universal data type. When we don't know how else to represent something, we stick it in a string, because hey, strings can represent words, and we like to talk about lots of things, and who hasn't at some point said, "Well, there's probably a better way to represent this, but let's stick it in a string. It's all just JSON under the hood, right?"
Yeah, some of the motivation for why I wanted to talk about strings for a really long time. I think I'm a pretty good developer. I have written a lot of code. It doesn't seem to crash too often, because I know when it does. You all don't hesitate to tell me. I find that working with strings can be really, really hard. I do a lot of really challenging things, and this is one that I keep coming back to. I keep on having trouble with it. I like writing string-intensive applications, because maybe I'm a masochist. I wouldn't rule that out.
Things like displaying rich text. So, you have an iOS app. Maybe, let's say you're building an email client for fun. It has to display rich text, so you get a string from the server, and you want to display it, and format it correctly, and look nice, make sure that all of your company's branding comes through when you send out ad campaigns.
You want to write a JSON parser. JSON parsers are still the thing in our community, and they're a string-intensive application.
We want to detect data in forms. We collect our user's information. Maybe we want to ship them something. We get their address, and we want to be able to take that address and determine, "How much do we need to charge them for shipping? What shipping company do we use to send them their stuff?" Things like that.
Literally, anything that involves code, because code is a string.
Communication. We communicate with each other over text. Text is represented in strings. I rest my case.
Yeah. I've spent a lot of time writing string code, a lot of late nights, a lot of time spent banging my head against the wall, and I want to work on projects that require understanding strings. All of those fun examples, things like building compilers. Basically, anything that could possibly soak up my nonexistent free time usually involves strings. I don't think I'm alone in this. By a show of hands, who here has had to use strings more than they want to in developing something? Yeah. That's a fair number of us. Strings are useful. They're hard. They're not getting any less hard. This isn't a solved problem. I want to rationalize all the pain that I've had, and try and convince myself that it's because ... It's not because I'm bad at this, but it's because they're hard.
Strings (in our languages)
Strings, as we see them in our languages.
Remember when we used to write stuff like that? I actually had to look up how to do string literals in C again, because it has been so long, and I'm not complaining about that. Yeah, in C, we just have sequences of characters, and a character is just an unsigned integer. They're null-terminated. That's it. Like, C knows nothing about strings, kind of like me.
Objective-C, we get a bit more fancy.
We have documentation for it. They're inspired by Pascal strings, which means that we have a class that points to a buffer. It stores a length, which means you can do things like put null bytes inside your strings, and your program probably won't crash because of that, and it probably won't truncate text because of that. Improvement. Canonically, they're UTF-16.
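To make the length-versus-terminator point concrete, here is a minimal sketch. It uses a Swift String rather than NSString (an assumption made purely for illustration), and compares the stored length with what a C-style, null-terminated reading of the same bytes would see.

```swift
import Foundation

// A string with an embedded NUL. Because the length is stored,
// the NUL is just another character.
let text = "ab\0cd"
let storedLength = text.count

// A C-style view of the same bytes stops at the first NUL.
let cLength = text.withCString { strlen($0) }

print(storedLength, cLength)
```

The C view silently loses everything after the NUL, which is the truncation problem a stored length avoids.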
Then in Swift, everyone's favorite.
We have our emoji. We have our emoji variables. That's a violin, which has strings on it. I had to think really hard about how to put one good emoji pun in here. Yeah, in Swift, one of the big things, when Apple announced Swift 1.0, was Swift has really good, really powerful Unicode-aware strings.
That hasn't changed. In Swift, string is a value type.
We've gone away from our mess of a class cluster in Objective-C. We are no longer talking in terms of UTF-16 code units. We work in characters. If you want to be fancy and impress all your friends, you can call them extended grapheme clusters.
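A small sketch of what "working in characters" means in practice. The two sample strings are illustrative choices, not from the talk: one is an "e" plus a combining accent, the other is a flag emoji built from two regional-indicator scalars.

```swift
// "e" followed by a combining acute accent: two scalars, one visible character.
let accented = "e\u{301}"
// Two regional-indicator scalars that render as a single flag.
let flag = "\u{1F1E8}\u{1F1E6}"

// Same string, four different ways of counting it.
print(accented.count, accented.unicodeScalars.count, accented.utf16.count, accented.utf8.count)
print(flag.count, flag.unicodeScalars.count, flag.utf16.count, flag.utf8.count)
```

In both cases `count` reports one Character, while the scalar, UTF-16, and UTF-8 views each report something larger.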
In Swift, a string has:
15 associated types,
and 107 properties. Oh my.
Yeah. Strings have gotten complicated. I went to write a small compiler about a year ago, and trying to even just figure out where to start in Swift dealing with strings was hard, because there's so much there. Okay. That's strings in our languages.
We have basically a bunch of bytes. That's what our computer calls a string. How do we get characters out of them, something that we as human beings can understand as something more than just a number? That's a character encoding.
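As a rough illustration of bytes becoming characters, this sketch takes three raw bytes and interprets them as UTF-8. The byte values are just an example chosen here; they happen to spell "cat".

```swift
// Three numbers, as the machine sees them.
let bytes: [UInt8] = [0x63, 0x61, 0x74]

// Interpreting those numbers through an encoding gives us text.
let decoded = String(decoding: bytes, as: UTF8.self)

// Going back the other way recovers the same numbers.
let roundTrip = Array(decoded.utf8)

print(decoded, roundTrip)
```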
An annoying thing is sometimes, you can represent multiple characters the same way in the machine. There are many choices of character encodings, and lots of legacy cruft that our programming languages still have to deal with, and old character encodings that should have died 50 years ago.
Some common ones:
Morse code, it's a character encoding.
The original EBCDIC.
ASCII, everyone's favorite for making art.
Old Windows character encodings. If you ever remember getting emails that have lots and lots of question mark boxes in them, those were character encoding bugs.
A bunch of Unicode ones ...
This is the entirety of ASCII:
Wow, that's even legible. If I tried to put Unicode up here, I don't think there are actually enough pixels in this display for each character defined in Unicode. This is ASCII. This is sort of what we think of as plain text. It's great for English. It has pretty much all of the English alphabet that we use, it has numbers, and it's missing everything else.
What's missing? What is that everything else? Accents. If you work in Spanish or French, you need accents. Non-Latin characters. For those of you who want to communicate in Ancient Greek with your friends, and quote Homer in the original, can't do that using ASCII.
There are no emoji, which is, I know, this is the real tragedy of ASCII, is the lack of emoji. If you can't communicate with an upside down face, what are you even trying to do with your life? My favorite, math symbols. There's no way to represent a square root or an integral. Literally everything else is missing from ASCII. There are 128 characters, nothing beyond those. How do we fix that?
Unicode has a unique number for every character that works on every platform in every programming language. It's a pretty ambitious project, and a very lengthy specification that I have never read and have no intention of reading.
Why is it important? Well, we put every character in a single encoding. Back in the day, if you wanted to communicate with someone, or write a piece of research that maybe was written in English but referenced a Japanese quotation, you couldn't do it. You physically could not put those two languages in the same file. It has this idea of "canonical equivalence". Characters that are the same but are represented in different ways are the same under Unicode. You have a lot of different underlying representations, so the different encodings, and you can pick the one which works best for you, UTF-8, UTF-16, UTF-32, depending on the specifics of how your application needs to perform.
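To show what "different underlying representations" looks like, here is a sketch that takes one code point, ñ (U+00F1), and views it through Swift's UTF-8, UTF-16, and Unicode scalar views. The scalar values stand in for UTF-32 here, which is an illustrative simplification.

```swift
// One character, the precomposed ñ.
let enye = "\u{F1}"

// The same character as code units in three encodings.
let utf8Units = Array(enye.utf8)                       // two 8-bit units
let utf16Units = Array(enye.utf16)                     // one 16-bit unit
let utf32Units = enye.unicodeScalars.map { $0.value }  // one 32-bit value

print(utf8Units, utf16Units, utf32Units)
```

Which one works best depends on the text: UTF-8 is compact for mostly-ASCII data, while wider units trade space for simpler stepping.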
The great thing about Swift is that its strings are Unicode-aware. You can do things like compose N with tilde (ñ) from an N followed by a combining tilde, versus using the single code point "N with tilde." They're the same. This is a pretty revolutionary thing to happen. I know it seems really trivial, "Two things are equal. Big deal," but this really meant that we could refer to text and start processing it correctly.
One of the lessons Unicode gave us was that byte equality didn't mean string equality. Just because things were represented the same way in the machine's memory didn't mean they were the same string, and two strings with different bytes could still be equal. String equality is actually a really hard thing to do, because you have to compare every character, and you have to canonicalize each string and see if all the code points put together represent the same list of characters, regardless of how they're represented. Just basically, good luck with comparison and sorting, because defining how those are supposed to work is really complicated.
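Here is a minimal sketch of string equality versus byte equality, using the two spellings of ñ: the single code point U+00F1, and "n" followed by the combining tilde U+0303.

```swift
let precomposed = "\u{F1}"   // ñ as a single code point
let combined = "n\u{303}"    // n followed by a combining tilde

// Swift compares strings by canonical equivalence.
let sameString = (precomposed == combined)

// The underlying UTF-8 bytes are not the same.
let sameBytes = (Array(precomposed.utf8) == Array(combined.utf8))

print(sameString, sameBytes)
```

Anything that compares raw bytes, such as hashing the bytes yourself or comparing buffers with memcmp, would treat these two as different strings.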
Yeah. Unicode is complicated. There's an entire spec for it. It's well-defined. I don't know anyone who has actually read the entire thing. Unicode is complicated because the languages in the text that we represent in our strings are complicated. They have complicated rules. They interact in different ways. They weren't necessarily built to be used together. I don't think anyone who designed the Latin character set did so so it would work together well with smiley faces and with Korean characters. Just all these things were never intended to be put together. That makes a spec that combines them all complicated by nature.
Correctness, Performance & Shipping
Yeah. We have this spec that's supposed to fix all of our problems, so let's break down what that means for us when we're trying to ship an app. Yeah. Correctness, performance, and shipping on time. Just like every other field of programming, choose two. You can't have it all.
Correctness means things like:
Handling control characters. Have you ever tested your application if you put a bell character in your labels? I have once filed an April fools radar for Apple to actually make UILabel print bell characters and make a sound. It was closed pretty quickly.
Detecting invalid byte sequences. If someone says, "Hey, here's ASCII." Is what they're giving you well-formed? Does it make sense in the encoding they say they're giving you?
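One way to check a claimed encoding, sketched under some assumptions: the helper names `isASCII` and `decodeUTF8` are hypothetical, and the UTF-8 check leans on Foundation's `String(bytes:encoding:)`, which returns nil when the bytes can't be decoded.

```swift
import Foundation

// Hypothetical helper: ASCII only uses values below 0x80.
func isASCII(_ bytes: [UInt8]) -> Bool {
    return bytes.allSatisfy { $0 < 0x80 }
}

// Hypothetical helper: nil means the bytes are not well-formed UTF-8.
func decodeUTF8(_ bytes: [UInt8]) -> String? {
    return String(bytes: bytes, encoding: .utf8)
}

let plain: [UInt8] = [0x63, 0x61, 0x74]   // "cat"
let enyeBytes: [UInt8] = [0xC3, 0xB1]     // ñ in UTF-8, not ASCII
let broken: [UInt8] = [0xC3, 0x28]        // lead byte with no valid continuation

print(isASCII(plain), isASCII(enyeBytes), decodeUTF8(broken) == nil)
```

The point of returning an optional is that the caller has to decide what to do with bad input, rather than silently displaying garbage.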
Finally, strings oftentimes represent something more than just a string. If you're given an address, are you recognizing how an address is decomposed into a house number, a street, an apartment number, wait, a city, a postal code, a country, and I think I've left some of those fields out, too? Yeah. Things like telephone numbers, email addresses, JSON. JSON is everywhere, I'm telling you. Programming languages, they all use strings, at the end of the day, but they use specific formats to represent more than just the bytes, more than just the text that you see if you print them out. They have semantic meaning beyond just a string.
Indexing strings can be slow. Thank you, Unicode.
Getting the length of a string can be slow. Thank you, Unicode.
Looping through strings and doing things can be really slow, as evidenced by the time I tried to write a JSON parser and parse a one meg JSON file, and it didn't finish, and my laptop died. That was both sad and fun at the same time. I then proceeded to delete the project so no one would ever use it.
Checking equality of strings is, as I was talking about earlier, a whole bag of hurt. What does equality of strings even mean? You have to think about, "Is shouting case equal to polite, quiet talking case?"
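A tiny sketch of the shouting-case question. Lowercasing both sides is used here as a crude stand-in for real case-insensitive comparison, which is an assumption for illustration; proper case folding has more rules than this.

```swift
let shouting = "HELLO"
let polite = "hello"

// The default comparison treats case as significant.
let strictlyEqual = (shouting == polite)

// Normalizing case first is one (simplistic) definition of "equal".
let looselyEqual = (shouting.lowercased() == polite.lowercased())

print(strictlyEqual, looselyEqual)
```

Which answer is right depends on what the string means in your app: a password wants the first, a search box probably wants the second.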
Simple-looking code can be slow. If you enumerate through a string, and just have a range of integers and go through them, and then subscript into the string, well, if there are Unicode characters, and if the string is encoded in a variable-width encoding, that can be O(N²).
Unicode adds performance wrinkles versus using fixed-width encodings like ASCII. You can't just assume that "Hey, each byte is a character," because bytes aren't big enough to fit all the characters we need.
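The integer-offset loop described above can be sketched like this. The sample string is an illustrative choice; the slow version asks for the i-th character from the start on every pass, while the fast version just iterates.

```swift
let word = "cafe\u{301} \u{1F1E8}\u{1F1E6}"

// Quadratic: every index(_:offsetBy:) call walks from the start of the string.
var slow: [Character] = []
for i in 0..<word.count {
    let position = word.index(word.startIndex, offsetBy: i)
    slow.append(word[position])
}

// Linear: let the string hand out its characters in order.
var fast: [Character] = []
for character in word {
    fast.append(character)
}

print(slow == fast, fast.count)
```

Both loops produce the same characters, but the first does a fresh walk from the start for each offset, so the total work grows with the square of the length.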
String slicing can lead to memory leaks. If you have a very large string and keep a small part of it, let's say, in Swift, the string slice will keep the backing buffer alive for as long as the slice lives, which can be the lifetime of the app. String slicing can just be slow if you have to copy a substring of a large buffer.
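A sketch of that trade-off, assuming a Swift version where slicing produces a Substring that shares the original string's storage. The sizes and the "needle" text are made up for illustration.

```swift
// A large string with a small interesting part at the end.
let big = String(repeating: "x", count: 100_000) + "needle"

// The slice is cheap to make, but it shares big's storage,
// so holding on to it keeps the whole buffer around.
let slice = big.suffix(6)

// Copying into a new String pays for six characters once
// and no longer depends on the large buffer.
let owned = String(slice)

print(owned, slice.count)
```

So the choice is between keeping the big buffer alive and paying for a copy; for a slice you intend to store long-term, the copy is usually the safer bet.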
Changing encodings is slow.
Detecting things like line breaks, paragraph breaks, page breaks, whatever that even means in your application can also be hard. The fact that a backspace character exists was the bane of my existence for one afternoon. Eventually, I just, I wrote some tests to try and make sure I handled control characters properly. Then I deleted them, because I seriously couldn't figure out what even handling them properly meant.
Text layout is a whole other bag of worms that I'm not getting into today, that we could have an entire conference on laying out text.
Super-large strings exist.
That's a string. That's not an image. That's Unicode. When I talk about strings being evil, if you need something to visualize, it's that.
That's what we all do, at the end of the day, right? All of these considerations that I've been talking about mean that designing fast and correct string handling is a massive undertaking. Fortunately, our programming languages try and take care of all of that for us, but we still need to, at the end of the day, think about what we're doing with our strings.
Sometimes, giving up on one of those things is okay. I recommend not giving up on the shipping thing, because in my experience, if you don't ship, you don't get paid, and that's no fun. Usually, you give up either on performance or correctness. That's fine. It's sometimes okay if you just say, "Hey, you can't put emoji in your Twitter handle." It's sad, but I'm sure it made some Twitter engineers' lives a lot easier.
Really, they're not evil. They're just complicated. I went for shock value with the talk title, trying to get to the top page of Hacker News, so I don't really mean it. Yeah. We need strings, but they're complicated. They do a lot of stuff, and they mean a lot of different things. Sometimes, we mean strings just to be a sequence of characters. Sometimes, they're unformatted representations of our data. Sometimes, they're something we just take in, and return back to a user, and show it on screen. We have to recognize that String, while it does all of those things, it doesn't do all of them at once.
Just like with those, what was it, 209 function methods on String, there are lots of overloads. Just like that, you have to pick the one that is relevant to what you're doing. Stop using strings whenever you can. Move to structured data that has more meaning than just a string. That means maybe not passing around a string of JSON everywhere in your app. Maybe parse it and turn it into a dictionary or some custom objects, literally, anything else that has more semantic meaning associated with it than just a string.
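As a sketch of moving from a string to structured data, this parses a small JSON string once at the edge and hands back a typed value. The `Profile` type, its fields, and the sample JSON are all hypothetical, made up for this example.

```swift
import Foundation

// Hypothetical model: the meaning lives in the type, not in the text.
struct Profile: Codable, Equatable {
    let name: String
    let handle: String
}

// The raw string, as it might arrive from a server.
let json = "{\"name\": \"Sam\", \"handle\": \"@example\"}"

// Parse once; nil means the string was not a valid Profile.
let profile = try? JSONDecoder().decode(Profile.self, from: Data(json.utf8))

print(profile?.name ?? "invalid JSON")
```

After this point the rest of the app can pass a `Profile` around, and nothing downstream has to wonder whether the string it was given is really JSON.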
Strings, they're not so simple, in the end. That doesn't mean we don't need them to build our apps, to build programs, to be able to delight our users, to let them set names and add bios, give us addresses, phone numbers, communicate with each other. We can't just decide to not use strings because they're hard. I guess that's why I struggled with them so much. There's so much they can do, and we can't avoid them, even if we want to. Yeah. There's one last string that I thought I'd put on screen:
If you enjoyed this talk, you can find more info: