Talking to Swift

Alexis Gallagher at Swift Summit San Francisco, 2016



Transcript:

Alexis: Hello everybody.

So my talk is Talking to Swift and it's all about talking to computers. Whenever I think about talking to computers, the first thing I think about is Star Trek IV. I don't know if any of you remember this. It's often known as the one with the whales where the crew of the Enterprise, they're like 300 years in the future. There's aliens attacking Earth and for some reason they need to come back to San Francisco, actually just a few miles north of here, where the Golden Gate bridge is. They need to, like they always do, they need to save the earth. One of the ways they figure out that they can do this is Scotty, the engineer of the Enterprise, needs to have a conversation with an engineer in 1986 where he's gonna explain to him how to make transparent aluminium because it's necessary to move the whales around. So it's very complicated. How do you communicate the formula for transparent aluminium and show that it's gonna work?

Then they have this interaction:

Video is playing on screen.

I think what I love about this is that the joke works two ways now. Because back then when this movie came out, the joke was silly Scotty, don't you know? Of course, you can't talk to computers, that's crazy. But when we watch it now, I feel like the joke still works because now it's sort of silly Scotty, of course you can try to talk to computers. But you're gonna just want to use the keyboard anyway.

So this joke has sort of moved through time but remained funny in different ways. I think it's an interesting question. Why is that the case? 'Cause if you think about it, Captain Picard has a 13 inch iPad Pro like I do. Quark in Deep Space Nine clearly takes Apple Pay with Touch ID.

In all these respects, I feel like we should be there with talking to computers. If you were listening to a lot of the news this year, especially in the earlier part of this year, there was a lot of talk about bots, and artificial intelligence, and messaging. These sorts of words were getting thrown around in the media a lot. I thought well this is very exciting. I work with computers. Bots are here. We're gonna be talking to our computers soon. I work in tech, I'm making apps. If bots are the next new thing, I want to start making bots.

I want to find out how to do that. So that's what got me thinking about conversational UI and trying to figure out exactly what that would mean. So what I want to do in this talk is first of all just offer my opinion on what it is, like what was everyone talking about when they talk about conversational UI? Why was all this stuff in the news? What's real? What's not real? I guess that's sort of a survey or background. Then say a bit about; okay, given the bits that are real, how do we design it? How do you design a conversational UI? I know something now about how to design apps, visual interfaces. But how do I design a conversational UI? Then last, how do we build it with Swift? What can we do with the tools in front of us right now?

So what is it? I think there are actually a few different things all sort of swirling around, getting lumped together, when people were talking about bots, and messaging, and chatbots earlier in the year. One of those is just the rise of messaging. One simple measure of the rise of messaging is to think how much more complex just garden-variety messaging applications have gotten over the last few years. iPhone OS didn't have iMessage. It actually had an app that had SMS on it, because all you could do was send SMS messages, just sort of plain text. But now you've got, I need to check my notes here, emojis, images, videos, sounds, URLs, handwriting, stickers, tapbacks, apps, interactive apps that you can sort of send someone in messaging. You can even get floating balloons, and lasers, and confetti. Which would have sounded like a joke if, a few years ago, I'd said, "You're gonna be able to get confetti in your messages." You wouldn't think that was real. But it's totally real.

Less prosaically, I'd say there's been a kind of rise in messaging, not just in that we have richer media for messaging between people. But the messaging UI, the message thread, is being used for things beyond that. One example here, there's a great app called Lifeline. Lifeline's a game and you play this game just by communicating with an astronaut. You communicate with the astronaut through text messages. So you can see that UI on the right there. It looks like the threaded messaging UI that you'd expect from Tweetbot or from iMessages. It's sort of restricting the responses you can give. But the essential interaction model is the interaction model of a messaging app.

Now more ambitiously, there is this app here. This app is Luka and it provides a messaging-like interface to different bots to do different kinds of service discovery. So if you're looking for a restaurant, you can ask for restaurants. Then it'll give you an answer and you can ask a little bit more. It's not just restaurants. The tabs along the bottom indicate different sorts of queries you can do. You can ask it for simple math calculations. I think this app's quite interesting. Usually it does offer this guidance in the bottom about what you can input. But it does also offer you some capacity for freeform text communication to solve a problem.

Theoretically, this app is exactly on trend when people talk about the rise of chatbots, and messaging, and messages as a delivery service. But actually I think apps like this are quite rare. There's this one and maybe one other. I don't think it's taking off or working quite as well as one would ideally wish it to. But I would say that this idea is one of the trends in the idea of what's going on with conversational UI. The rise of the messaging thread as an evolving interaction model that's being used for more things: for commerce, for communicating not just with people but with software.

Second trend is actually chatbots, a bit like Luka was. By chatbot, I mean extended natural language chat with a software agent. So not just one message and you're done. But back and forth a little bit. I don't know maybe talk about life. Just sort of a wandering chat.

Now, the dream of that is not new at all. This is an idea we've had for a while. You can have long chats with HAL in 2001. It may not open the pod bay doors but you can go back and forth with him, and he follows the discussion. Also, famously there's this guy. He talks a little bit, not very much. More in the friendly vein. I don't know, how many of you here recognize KITT? That's the car from Knight Rider with David Hasselhoff. KITT would help him fight crime. But then they'd also have little chats and KITT would make fun of him for hitting on women or whatever. They had a bit of a rapport with each other. So the dream's not new.

I'd say also, when you start digging into it, you find the reality is not new either. People have been making chatbots, or trying to make chatbots, as long as there have been keyboards. In fact, there's a chatbot pre-installed on all of your Macs right now. There's a chatbot that provides a simulation of a psychotherapist. If you all open your terminal and invoke emacs with emacs -Q -f doctor, you get ELIZA, a psychotherapist. You can talk back and forth with ELIZA. Maybe you get frustrated trying to debug your elisp code. You can get in there and just have a little break with ELIZA. You can say things, ELIZA replies, and it goes back and forth. It looks at the language you've used and replays bits of your language. The first time this sort of thing came out, it was very impressive for people. Now we're more likely to recognize it as a bit of a toy.

But the point is, if you go into your emacs and look at the source code there, it's copyrighted 1985. In the comment header, they're talking already about this being classic pseudo AI. Classic because the original ELIZA code is from 1966. So this is quite old stuff.

What's new now that we'd suddenly be talking about chatbots? I think one thing that's new is that we're actually using them. We're actually having these interactions with functional products. Siri, obviously, on all the platforms in different ways. Google's Assistant. The Amazon Alexa assistant, which you can access through the Echo and Echo Dot devices, and also through their API. There are also directory services for television like the Xfinity X1. If you think about it, even things like the Slackbot are little bits of chat-like interface. When you sign up for a new Slack team, they always ask you for your name again and it's always the Slackbot chatting away at you to get this stuff.

So one thing, maybe the change is that now we're really using them. But actually this stuff isn't that new. The other take on "we really use them" is that we've been using chatbot-like things for a while and kind of hating them. So think about all your interactions with a phone tree. That's sort of like a chatbot and it's sort of terrible. Or if you've ever tried interacting with customer service just through typing to a text bot before it hands you to an agent. That's never been a good experience.

We've been using these things for a while. They're not so new, so I don't quite see what the excitement's about. I know there's a lot of interesting work happening in this area. But I looked around and looked around, and I couldn't find one chatbot with which I could have a kind of long, extended conversation that seemed natural. I don't think there's any magic thing that's there yet as near as I can tell.

It's worth saying a word about SiriKit, which Apple released at the last WWDC. So when Apple released SiriKit, I was very excited 'cause I had wanted something like SiriKit. Then I looked at SiriKit and I was like, oh, they didn't really give us the real thing. I think part of me had in mind that when they released SiriKit, SiriKit would be like HAL. They'd have some kind of magic thing that I could configure and it would understand all sorts of interactions. But SiriKit does not have HAL inside.

SiriKit has an API inside with a predefined set of commands. Don't call them commands, call them intents. Then some rules that Apple sort of manages about how utterances, which just means what people say, get mapped to intents. You can configure it for your own service if it's one of those services. You can specify certain slots in there. But it just looks like a computer program when I look at it. There was no magic in there. I came to realize, thinking about it, that of course that's what it had to be. Nobody knows how to make HAL. It was silly of me to expect otherwise, and SiriKit is quite good. But this is what there is in terms of chatbots.
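
To make that concrete, here is a rough sketch of what handling one of those predefined intents looks like, using the messaging domain as an example. The types come from Apple's Intents framework, but the class name and the resolution logic are illustrative, not anything shown in the talk.

```swift
import Intents

// A rough sketch of the SiriKit shape described above: Siri maps the user's
// utterance to a predefined intent, and your extension only resolves the
// slots and handles the intent. Class name and logic here are illustrative.
final class SendMessageHandler: NSObject, INSendMessageIntentHandling {

    // Resolve one "slot" of the intent: the message text.
    func resolveContent(for intent: INSendMessageIntent,
                        with completion: @escaping (INStringResolutionResult) -> Void) {
        if let text = intent.content, !text.isEmpty {
            completion(.success(with: text))
        } else {
            completion(.needsValue())
        }
    }

    // Handle the fully resolved intent by sending the message via your own service.
    func handle(intent: INSendMessageIntent,
                completion: @escaping (INSendMessageIntentResponse) -> Void) {
        completion(INSendMessageIntentResponse(code: .success, userActivity: nil))
    }
}
```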

I suggest a new name to understand chatbots and understand this trend. Every time you see a chatbot, you should just think of it as a restricted domain bot. Because they're good within a restricted domain. But I don't think there's a generalized chatbot.

Last trend, before I wrap up my trend survey and get into the nuts and bolts here, is voice. One of the things a lot of the new services have in common is that they're all actually using voice in some way. Obviously there's Siri. I use Siri multiple times every day, usually when I'm in the car, and it works great. I use it for text transcription, for the basic set of commands where I know it can respond, and it's quite good.

Also there's the Echo device. That works really well probably 'cause it has dedicated hardware. It's got seven microphones on the top that are all about being able to detect voice, and know where it's coming from, and focus on that.

Last, I think yesterday, or on Friday, Google released their own home assistant, Google Home, which is gonna be like the Amazon Echo: a microphone always listening to you, ready to help in some sort of a way.

So why is this happening now? In this area, I think there is actually a reason for it.

We're living through a moment where there's significant progress in speech recognition. For instance, just within the last month, Microsoft announced that they beat a record that had previously only been set by humans on the ability to accurately recognize speech in a large corpus of recorded telephone conversation data. That's impressive, that's a lot better than it used to be. Also just within the last month, Google separately announced having made significant improvements in speech synthesis. Not matching human speech, but doing a lot better than past efforts have.

If you look at both of these works, these are the PDFs, what you see that they have in common is that they're all using neural networks. They're using new kinds of recurrent neural networks, convolutional neural networks, deep learning neural networks. So I would say that if you look at all these trends, there's actually something happening around voice, and it's being driven by research progress in the use of neural networks.

So just a quick wrap up on what is it really, what is conversational UI. Messaging, evolving interaction model, kind of interesting but there's not magic AI there. Chatbots, still quite limited as near as I can tell. There's no magic AI there as far as I can tell. I know some people are trying to use neural networks for that. But it hasn't got that far yet. Voice, yeah. There's definitely something happening there and it's happening due to neural networks. So voice is what you have to work with. But if you want to do a chatbot, you need to be very restricted about the domain. That's my take away on the reality.

My theory on how this stuff happens is that somewhere in a room, people are doing work with neural networks, and machine learning, and they're making real breakthroughs. Then they tell people down the hall who don't know exactly what they're working on, "Hey, we're making breakthroughs on machine learning." Those people go, "Hey, they're making breakthroughs on AI related to voice and image recognition." Then they go down the hall and eventually you're getting to very non-technical people in the wider media, or in the investor community, or whatever. By the time this game of telephone goes all the way out there, it gets turned into, "AI's happening right now." That's when you get all the hype around the idea that you can talk to computers. When really the progress is more circumscribed.

So that was my take away on what's actually there right now. Improvements in voice, nothing magical around chatbots, messaging is kind of interesting. How do you design for it? How do you think about designing an interface for voice?

I think the first step is developing a character, like a character in a play. You're not writing the operating system so you don't need to be completely neutral and boring the way Siri or Google needs to be. You can be more defined in your design, more specific, just in the way you would be when you're designing an app. You don't need to look plain vanilla. If you're developing an assistant, think about three kinds of assistants. These are all assistants but they're very different takes. Bruce Wayne's butler, Alfred. Reserved British accent, worried about Bruce. C-3PO, eager, verbose, like panicky, a bit feckless. Elle Woods from Legally Blonde. She's outgoing, she has this Valley girl dialect, super optimistic.

You can imagine making three different interfaces that were like these characters. I think this exercise of design is a bit like the exercise of writing. One way to think about it would be if you were writing a novel and you wanted to develop more specifically your idea of what your character was like. There are books full of exercises that give advice on how to think about that. I think these exercises are quite appropriate for thinking about the design of a voice UI. It's a domain where design is a kind of writing.

Then the other aspect of it, because of the limitations of the chatbot tech, is you need to restrict the domain. You need to say what this bot can actually do and make it something so people aren't frustrated when they hit the limits. You can't promise to be a general purpose assistant when you're just a thing that controls light switches, and timers, and can tell you what the date is.

I think that's especially true if you're offering a pure voice interface. Because in a way, a pure voice UI is the exact opposite of the message thread. Although they're all kind of lumped together as conversational UI, they're actually the opposite of each other. The message thread is kind of magical because it's even better than a normal UI 'cause it gives you history, which a normal UI doesn't. It tells you about identity, it tells you who said what when. Was this me? Was this the software agent? Was this another person? It supports rich media, not just text. You can have images, interactive things. It guides input 'cause you can have controls at the bottom that say what can be entered here so it's not just open ended and free form.

Voice is exactly the opposite. There's no history, there's no identity. You don't know who said what 'cause you can't really tell the voices apart. It only carries words, no rich media, and it's completely free form. It's hard to give guidance about what can be said. I think that's what leads to a lot of the frustrations with the voice-based interfaces of the assistants we have now. So you want to restrict, especially when you're working on voice.

So let's get into an example. I chose a character that already exists: François, the Duc de La Rochefoucauld. His domain is the Sorrows of Life. He's really easy to write 'cause he's a real guy. François VI, Duc de La Rochefoucauld, Prince de Marcillac, born 1613, died 1680. He was a wealthy, noble, good looking fellow. He wrote in the 17th century. He was also, despite being born very fortunate, unlucky. He had a marriage that he wasn't very happy with. He was exiled from Paris at one point. He was libeled, he was shot through the eye. Looked a bit like this. If you had to characterize his temperament, it's probably affected by those facts. This was his ancestral seat, which I think the Rochefoucaulds still live in, so he didn't have it so bad.

But he did have these sorrows and he wrote these great books called the Maxims. Maxims are just books of epigrams, of aphorisms. Short, little, witty statements about life and they cover everything. Some examples. They cover love, ambition, self delusion, and here I just started going through the Es in the index. Ennui, envy, esteem, evils, the exchange of secrets. If you had to characterize what he's like, as a character, you'd say he's cynical, he's worldly wise, he's witty. His domain is Life's Sorrows, things that might make you unhappy, things that make you reflect on the soul.

Just to give you a taste for this guy 'cause I like him so much, here's one of his more famous epigrams: We all have strength enough to bear the misfortunes of others. So it's always the end that has that bit of a sting. Like yeah, other people's problems, they're always pretty easy to bear. Or another one: To say that one never flirts is in itself a form of flirtation. Not many modern philosophers are giving you advice on flirtation, so I think it's good to go back a few centuries when you need that sort of thing. I think he's the kind of guy who'd be very popular on Twitter right now until he said something that was a little too out there. Then he'd be very unpopular. Maybe he'd get shot through the eye.

How do we build it? Let's get to brass tacks here. First step, we take the Project Gutenberg copyright-unencumbered text of Rochefoucauld's epigrams and we load it into a Swift Playground. So we have all the epigrams in here. Then we also have here the index, which nicely gives you the theme associated with every one of these. We're gonna need it to talk. Talking is actually the easy part on iOS. We can do that by importing AVFoundation, 'cause AVFoundation provides a speech synthesis API. There's really not much to say about it. It's much more flexible than what I show here with this slide. You can have it speak in French, speak a little faster than normal, speak at a higher pitch than normal. But basically you create a synthesizer and then you create an utterance object which contains text. Then you tell the synthesizer to speak the utterance object and it does. When it's done, it'll give you a callback on your delegate.
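
In code, the flow being described looks roughly like this. It's a minimal sketch in current Swift, not the playground code from the talk; the Speaker wrapper and its completion handler are my own framing.

```swift
import AVFoundation

// A minimal wrapper around AVSpeechSynthesizer: speak a string, then get a
// callback via the delegate when the utterance finishes.
final class Speaker: NSObject, AVSpeechSynthesizerDelegate {
    private let synthesizer = AVSpeechSynthesizer()
    private var completion: (() -> Void)?

    override init() {
        super.init()
        synthesizer.delegate = self
    }

    func speak(_ text: String, completion: @escaping () -> Void) {
        self.completion = completion
        let utterance = AVSpeechUtterance(string: text)
        // The optional tweaks mentioned in the talk: voice, rate, pitch.
        utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
        utterance.rate = AVSpeechUtteranceDefaultSpeechRate
        utterance.pitchMultiplier = 1.0
        synthesizer.speak(utterance)
    }

    // Delegate callback fired when the synthesizer finishes speaking.
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        completion?()
        completion = nil
    }
}
```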

There are a lot of different voices available to you. Men, women, different accents, different languages. If you just do it by default like this, you get the default locale, which in this case is English. So that's what we're gonna use to get Rochefoucauld talking. Now we also need to understand what Rochefoucauld's saying. For that, we need to import Speech.

This is the speech recognition API that was introduced in iOS 10. There wasn't actually an on-stage session about this, but there was sort of a secret online-only session, session 509, which is 15 minutes long and is very clear. They have pretty good sample code. This is the essence of the interaction. There's some armature that you only need to create once, that's gonna be around for the lifetime of your app: the AVAudioEngine, the AVAudioSession, the SFSpeechRecognizer, you need to configure those.

Then for every single utterance that you want to recognize, you create a recognition request object, which you'll use to create the recognition task, which is what the speech recognition system is going to use to keep track of that utterance's recognition while it's in progress. You also need to set up some wiring to connect the AV system, so the audio buffers that are coming in from the AV system get sent to the speech recognizer. During the utterance, you get these callbacks. Every callback delivers an SFSpeechRecognitionResult.
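
A rough sketch of that per-utterance flow, based on the Speech framework API rather than the talk's own code; the Listener name is made up, and it assumes speech recognition and microphone authorization have already been granted.

```swift
import AVFoundation
import Speech

// The engine and recognizer are the long-lived "armature"; a new request/task
// pair is created for each utterance. Assumes SFSpeechRecognizer authorization
// and microphone permission are in place and the AVAudioSession is configured.
final class Listener {
    private let audioEngine = AVAudioEngine()
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    private var task: SFSpeechRecognitionTask?

    /// Starts recognizing one utterance, calling back as partial results arrive.
    func startListening(onPartialResult: @escaping (String) -> Void) throws {
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true

        // Wire the microphone's audio buffers into the recognition request.
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }
        audioEngine.prepare()
        try audioEngine.start()

        // Each callback delivers an SFSpeechRecognitionResult with the best
        // transcription so far.
        task = recognizer.recognitionTask(with: request) { result, error in
            if let result = result {
                onPartialResult(result.bestTranscription.formattedString)
            }
            if error != nil {
                inputNode.removeTap(onBus: 0)
            }
        }
    }

    func stopListening() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        task?.cancel()
        task = nil
    }
}
```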

But the one sort of fly in the ointment here, so this I'll say a little bit about, is that the speech recognition result's isFinal property is never actually true. It never says it's done. So you have this problem of end-pointing, of recognizing when an utterance is finished. It's not that hard though. The solution is to use NSOperations, or now just Operations. Specifically you can define an asynchronous operation. Then with this asynchronous operation, you can wrap up all of the work of recognizing a single utterance. That operation can be defined so that every time one of these callbacks comes back saying, "I recognized a bit more, I recognized a bit more," it restarts a timer. Then the moment the timer's been able to go for like three seconds without a recognition, then you know you've reached the end of that utterance, and you can shut down that operation, and mark it done.
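
That end-pointing idea can be sketched on its own like this, leaving out the isExecuting/isFinished bookkeeping a real asynchronous Operation subclass would need; the UtteranceEndpointer name and the exact interval are just for illustration.

```swift
import Foundation

// The end-pointing idea in isolation: every partial recognition result restarts
// a timer, and once no new result has arrived for a few seconds the utterance
// is treated as finished. A real version would live inside an asynchronous
// Operation subclass with the usual isExecuting/isFinished bookkeeping.
final class UtteranceEndpointer {
    private var silenceTimer: Timer?
    private let silenceInterval: TimeInterval = 3.0
    private var latestTranscription = ""

    /// Called when the utterance is judged complete, with the final transcription.
    var onUtteranceFinished: ((String) -> Void)?

    /// Call this from every speech recognition callback.
    func receivedPartialResult(_ text: String) {
        latestTranscription = text
        silenceTimer?.invalidate()
        silenceTimer = Timer.scheduledTimer(withTimeInterval: silenceInterval,
                                            repeats: false) { [weak self] _ in
            guard let self = self else { return }
            // A few seconds with no new recognition: the utterance is over.
            self.onUtteranceFinished?(self.latestTranscription)
        }
    }
}
```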

So then the back and forth of a chat is just one operation queue and these three operations in sequence over and over again: recognize, generate the response, speak the response, recognize again. Once you have that, then you can create this very general infrastructure. So at the top, I have a protocol called Interlocutor. That's just any object that takes something that someone said and then responds by saying something else. Then you can feed anything that implements Interlocutor to VoiceChatter. VoiceChatter just wraps it in a voice UI.
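
The names Interlocutor and VoiceChatter come from the talk, but the signatures below are guesses, reusing the hypothetical Speaker, Listener, and UtteranceEndpointer types from the sketches above.

```swift
// Any object that takes something someone said and responds with something to say.
protocol Interlocutor {
    func respond(to utterance: String) -> String
}

// Wraps any Interlocutor in a voice UI: recognize an utterance, ask the
// interlocutor for a reply, speak it, then listen again.
final class VoiceChatter {
    private let interlocutor: Interlocutor
    private let listener = Listener()              // recognition sketch above
    private let speaker = Speaker()                // synthesis sketch above
    private let endpointer = UtteranceEndpointer() // end-pointing sketch above

    init(interlocutor: Interlocutor) {
        self.interlocutor = interlocutor
    }

    func startChatting() throws {
        endpointer.onUtteranceFinished = { [weak self] text in
            guard let self = self else { return }
            self.listener.stopListening()
            let reply = self.interlocutor.respond(to: text)
            self.speaker.speak(reply) {
                // Loop: listen for the next utterance once we've finished speaking.
                try? self.startChatting()
            }
        }
        try listener.startListening { [weak self] partial in
            self?.endpointer.receivedPartialResult(partial)
        }
    }
}
```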

Now for my case, I'm gonna simulate Rochefoucauld. So I'm gonna have a few questions and answers that the bot asks. Then afterwards, it'll deliver a randomly chosen quotation from Rochefoucauld appropriate for those questions and answers. That's how we're gonna do our Rochefoucauld simulation. So the simple base of it, the kind of tree of possible conversations, is just represented by this one tree object using a type called ConversationNode, where every conversation node is either a question-and-answer pair or a set of statements.
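
A guess at what that type might look like, with a tiny tree whose wording is lifted from the demo that follows; the real ConversationNode type in the talk may well be shaped differently.

```swift
// Each node is either a question with labelled follow-up branches, or a pool
// of closing statements (maxims) to choose from at random.
indirect enum ConversationNode {
    case question(prompt: String, branches: [(keyword: String, next: ConversationNode)])
    case statements([String])
}

// A tiny Rochefoucauld-flavoured tree, with wording taken from the demo below.
let rochefoucauldTree = ConversationNode.question(
    prompt: "Is your problem with your own feelings or with other people?",
    branches: [
        (keyword: "feelings", next: .question(
            prompt: "And do you suffer from love, or from ambition?",
            branches: [
                (keyword: "love", next: .statements([
                    "Envy is destroyed by true friendship, flirtation by true love."
                ])),
                (keyword: "ambition", next: .statements([
                    "Usually we only praise to be praised."
                ]))
            ])),
        (keyword: "people", next: .statements([
            "We all have strength enough to bear the misfortunes of others."
        ]))
    ]
)
```
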

Then how does that end up? You get something like this:

A video starts playing showing Alexis talking to the Rochefoucauld chatbot.

Alexis: Rochefoucauld was a cynical Frenchman. He's full of interesting wisdom. You can talk to him on your phone. Hello.

Rochefoucauld: Is your problem with your own feelings or with other people?

Alexis: My problem is with my own feelings.

Rochefoucauld: And do you suffer from love, or from ambition?

Alexis: I suffer from love.

Rochefoucauld: Envy is destroyed by true friendship, flirtation by true love.

Alexis: Thank you for that.


Isn't that great? We're talking to Rochefoucauld on iOS 10 on our phone. Because we have this general protocol oriented design, you can plug other things in. Like you can take the old ELIZA code which someone worked out in Objective-C very happily. You can do that as well.

A video starts playing showing Alexis talking to the ELIZA chatbot.

Alexis: So who's ELIZA? ELIZA is one of the first chatbots. She's a Rogerian psychotherapist who is a product of 1960s AI culture. With iOS 10 stuff you can get her working in your phone.

Alexis: Hello.

ELIZA: Hi, I'm ELIZA. What do you want to talk about?

Alexis: I want to talk about Swift.

ELIZA: Why do you want to talk about Swift?

Alexis: Swift makes me happy.

ELIZA: That's interesting.


So there we go. We also have ELIZA. All you do is create an object that takes text in, gives you text out, and now you can talk to it. You can even, if you wanted to, wrap it up in message bubbles, maybe using Jesse Squires's awesome message bubble view controller. Then it would look like a message UI.

But we want to do more than this 'cause now Swift runs on Linux so Swift can go anywhere. I'm gonna go just a little bit over my time for a few minutes but I'll try to race through this bit quickly. 'Cause I love it and I think it's so neat. But I also wanted to get Swift working on the Amazon Echo.

Now you can't just import Alexa, unfortunately. Amazon is not that helpful yet. The Alexa service, which is what you use when you want to program for the Amazon Echo, doesn't work that way. An Alexa Custom Skill is anything you can install on the Echo, and there are thousands of them you can install. A lot of them are like little trivia games. There's a really nice one that's a simulation of you being a detective trying to solve the murder of Batman's parents. There's a lot of cool stuff.

The easiest way to deploy an Amazon Alexa Custom Skill is using the Amazon Lambda service. Amazon Lambda is a serverless hosting service. Basically instead of defining a whole app and then running it there, you just define a function. You upload that to Amazon and they take care of like calling the function for you, and scaling it, and doing everything for you. Unfortunately Amazon Lambda only supports Node, and Python, and Java. Doesn't support Swift. So how am I gonna run Swift on the Amazon Echo? Very sad.

But then I thought this is for Swift Summit. It's a challenge but it's worth doing because the summit is there. I want to put Swift at the top of the Swift summit. It can be done.

It can be done. There's a loophole in the system, which is that the Amazon Lambda function you define, although it's written in, say, node.js, is allowed to include an executable binary. So the secret here is you just define a Lambda function, which is a little bit of JavaScript code. That JavaScript code just passes its input into your Swift executable. The Swift executable passes its output out, and that goes out through the JavaScript again, into Lambda, through Alexa, and onto the Echo. I won't talk too much about the bits and pieces of how it's put together. There's a project on GitHub if you're curious. But basically it's exactly what it looks like. You have a directory, a little bit of JavaScript, you put your Swift in there, you add some libraries you need. It helps to have Docker to make it a bit easier to build. There you go. Then you can get something like this.
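
The Swift side of that loophole might look something like this: a command-line executable that reads the request JSON the Node shim pipes in on standard input and writes an Alexa-style response to standard output. The JSON handling is simplified and the intent name is hypothetical; this is a sketch of the stdin/stdout handoff, not the GitHub project's actual code.

```swift
import Foundation

// Read the Alexa request JSON that the Node.js shim pipes to us on stdin.
let inputData = FileHandle.standardInput.readDataToEndOfFile()

// Pull out the intent name, if any. (Shape simplified: a real skill would also
// look at slots and session attributes.)
var intentName = "unknown"
if let object = try? JSONSerialization.jsonObject(with: inputData),
   let json = object as? [String: Any],
   let request = json["request"] as? [String: Any],
   let intent = request["intent"] as? [String: Any],
   let name = intent["name"] as? String {
    intentName = name
}

// Choose a reply. The intent name "SootheMySoul" is hypothetical, just to show
// where a real skill would branch on the intent it received.
let maxim = (intentName == "SootheMySoul")
    ? "Usually we only praise to be praised."
    : "It is easier to govern others than to prevent being governed."

// Write a minimal Alexa-style response JSON to stdout for the shim to return.
let response: [String: Any] = [
    "version": "1.0",
    "response": [
        "outputSpeech": ["type": "PlainText", "text": maxim],
        "shouldEndSession": true
    ]
]
if let outputData = try? JSONSerialization.data(withJSONObject: response) {
    FileHandle.standardOutput.write(outputData)
}
```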


A video starts playing showing the Amazon Echo; you can hear Alexis's voice in the background.

Alexis: Echo, ask Rochefoucauld to soothe my soul.

Echo: Usually we only praise to be praised.

Alexis: Echo, ask Rochefoucauld to speak to my condition.

Echo: It is easier to govern others than to prevent being governed.


What to make of this? I say one conclusion is: Yes, Swift can go anywhere. Swift's awesome because you can use the same codebase to define a conversation that's gonna run in a Lambda function driving an Alexa custom skill as you use on iOS. You can use literally the same code both ways and that's fantastic.

What's going on? What does conversational UI mean? Well there are real advances in speech recognition and speech synthesis. The stuff around actually processing language only seems to work in very constrained domains. So I think the best we can do is program using good speech interfaces, but being very careful to word things so that we're working within constrained domains, and thinking also about designing the personality and the character.

In general, also when there's like a lot of hype in the news about AI becoming miraculous so we can talk to computers soon, don't believe the hype. If you dig into it, there's usually something there, the kernel of it, but it gets magnified and generalized on the way out. I still think it's very exciting and I think it's a fun thing to play with.

Thanks to other people who helped me think about this, especially the gentleman who worked out ELIZA in Objective-C once upon a time, and the IBM team's Kitura stuff was very helpful. Thank you all very much.