JAMES: Alright, are we about ready?
CHUCK: Yeah, but I’m kind of nebulous and confused about what we’re talking about today.
DAVID: Nebulous. I get it.
JAMES: Nebulous. Nice.
DAVID: I’m sort of cirrostratus about it all.
CHUCK: Yeah, there you go.
AMY: [Laughs] You guys.
JAMES: Alright, hopefully all those jokes are gone.
CHUCK: Oh. Yeah, the whole thing’s a little bit foggy.
[Hosting and bandwidth provided by the Blue Box Group. Check them out at BlueBox.net.]
[This podcast is sponsored by New Relic. To track and optimize your application performance, go to RubyRogues.com/NewRelic.]
[Does your application need to send emails? Did you know that 20% of all email doesn’t even get delivered to the inbox? SendGrid can help you get your message delivered every time. Go to RubyRogues.com/SendGrid, sign up for free and tell them thanks.]
[A special thanks to HoneyBadger.io for sponsoring Ruby Rogues. They do exception monitoring, uptime, and performance metrics that are an active part of the Ruby community.]
[Bandwidth for this segment is provided by CacheFly, the world’s fastest CDN. Deliver your content fast with CacheFly. Visit CacheFly.com to learn more.]
CHUCK: Hey everybody and welcome to episode 148 of the Ruby Rogues Podcast. This week on our panel, we have Avdi Grimm.
AVDI: Hello from Pennsylvania.
CHUCK: James Edward Gray.
JAMES: Hello from springtime.
CHUCK: David Brady.
DAVID: Uh, hi.
CHUCK: I’m Charles Max Wood from DevChat.TV. And we have a special guest and that’s Amy Palamountain.
AMY: Palamountain. You were close. [Laughs] Hi, everyone. I’m from New Zealand.
DAVID: So, you’re calling from tomorrow.
AMY: I am. I’m calling from the future.
DAVID: That’s awesome.
AMY: We have flying cars here. It’s pretty amazing.
CHUCK: Yeah, the podcast of tomorrow, today!
AMY: Oh, yeah. [Chuckles]
CHUCK: You want to introduce yourself real quick?
AMY: Yeah, sure. So, I’m Amy, obviously. I’m a programmer down here in New Zealand. I’ve been programming for not too long, maybe five years, I think. I was thinking about that last night. It seems to have gone quite quickly. I have spent a bit of time doing contracting and things like that. And then I started working for a company called Green Button. We’re doing high-performance compute in the cloud. And just recently in the last three months, I think, yeah it’s been three months, I’ve just joined GitHub. And I’m working on GitHub for Windows, so doing lots of native Windows programming.
CHUCK: You said you just joined GitHub but I wanted to say, “We’ve all been on GitHub for a while.”
AMY: Yeah. [Laughs] Well, thankfully I have too.
JAMES: But now she’s working on the other side.
DAVID: Yeah. There’s the free plan and the paid plan, and she’s on the get paid plan.
CHUCK: There you go.
JAMES: That’s right.
AMY: Yeah, so I haven’t been working there for too long, but so far it’s been amazing.
JAMES: That’s great.
CHUCK: Yeah, she gets all the new features before they break our setup.
DAVID: Yes, but on the other hand, she has to put up with all the new features while they’re still breaking setups.
AMY: [Chuckles] That’s true. It’s worth it though.
JAMES: That is very true. Well, I asked Amy on because she’s given this talk called ‘Cloud Confusions’. And there are ten points in there that I found very eye-opening going over. So Amy, you want to tell us a little bit about that talk and why you did it?
AMY: Sure. Well, actually that was the first talk that I ever gave. So, somebody I guess egged me into giving a lightning talk for the first time. And this was a topic that was relevant to me because it was when I was working at Green Button doing lots of distributed computing stuff on lots of different cloud platforms. So yeah, the idea was to give a 10-minute lightning talk and race through as quickly as I could 10 points about developing for the cloud and some of the things that perhaps cause people to come unstuck, or myths, or those sorts of things.
JAMES: Awesome. We’re starting the stopwatch. So, go. No, I’m just kidding.
AMY: My god.
AMY: Actually, giving a talk in 10 minutes is pretty difficult. You’ve got to get your sentences down to a pretty concise format.
JAMES: Yeah. I haven’t done too many lightning talks. A couple, but then I went to this conference one time and when they originally set it all up, they gave us 45 minutes. So, I wrote a 45-minute talk and I had it down pat. And it was perfect. And then toward the end, they cut them to 30 minutes.
JAMES: And I had to shave 15 off and there was nothing I wanted to cut because it was so perfect. And yeah, that was horrible.
AMY: [Groans] Oh, I feel you. That’s tough.
JAMES: So, let’s just go through your points. I think the very first one was development is painful, right?
AMY: Well, this was more about continuous integration and continuous deployment being a pain point. So, I feel like this is probably less of an issue these days, because I feel like the APIs that we have to integrate with on a lot of the cloud platforms, like AWS and Windows Azure and all of those big players, have become a lot more well-rounded. So, early on some of the problems that we were having when I was contracting were things like how do we get our deployments up? How do we do this in a pain-free way? And so, this point was really about myth-busting and saying that you can still have an automatic build process and an automatic deploy process. It’s just a matter of having a good look over those APIs. And now we’ve got things like Puppet and Chef. So, there’s some pretty amazing work going on in those areas to help you automate a lot of the provisioning of your environments right from the get-go.
JAMES: Yeah, I think that’s pretty cool. I was looking at DigitalOcean just the other day. They’re a new small, well I don’t guess they’re small anymore, VPS service. And even they have just a great API as far as being able to bring up instances and stuff. And it’s really getting to where everybody can do that now. It’s cool.
CHUCK: Well, yeah. And you have a lot of that with even some of the long-running ones like Amazon and things. You can provision a server from Chef provided you have all the automations set up. And it goes, it creates the instance, and then boom, boom, boom, you’re done. And it’s up and running, doing your app stuff.
JAMES: And that makes it so much cooler because you can, I think where I began to understand why it was so critical was the realization that you could bring things up in times of need. We’re doing this thing at work now where we have these TV commercials that hit every so often and our traffic spikes really big right around the time of the TV commercial. So, you can artificially increase your infrastructure for a while and then bring it back down when you don’t need that much load anymore.
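The scale-for-the-spike idea James describes boils down to a small decision function. This is a hypothetical sketch (the `desired_instances` helper, the capacity numbers, and the min/max thresholds are all invented for illustration); real code would feed the result to a provider's provisioning API:

```ruby
# Hypothetical sketch: decide how many instances to run from current
# traffic, clamped to a configured floor and ceiling. Real provisioning
# would then call a cloud API (AWS, DigitalOcean, etc.) to converge.
def desired_instances(requests_per_min, per_instance_capacity:, min: 2, max: 20)
  needed = (requests_per_min.to_f / per_instance_capacity).ceil
  needed.clamp(min, max)
end

# Normal traffic: the floor keeps a baseline running.
desired_instances(500, per_instance_capacity: 1000)    # => 2

# A TV commercial hits and traffic spikes: scale out, but never past the cap.
desired_instances(18_000, per_instance_capacity: 1000) # => 18
desired_instances(90_000, per_instance_capacity: 1000) # => 20
```

The point is that the "artificially increase your infrastructure" step can be automated as soon as the provider exposes a provisioning API, then reversed when the spike passes.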
DAVID: Is there a name for that now? Back in the 90s it was getting Slashdotted. And then it was getting Penny Arcaded or getting wanged. And then it was getting redditted. And the TV thing specifically, this has happened to me twice. And both times, [chuckles] our product got on Oprah, which is now also not around anymore. So, what are the kids calling that these days?
DAVID: We don’t have a word for it.
JAMES: That’s awesome. I like Amy’s. I think that’s a winner.
AMY: [Laughs] Yeah, I think these things are super important. Continuous deployment is something that everyone seems to be striving for these days. And there’s absolutely no reason why you can’t be checking in your code and seeing your changes, maybe not necessarily straight into live. You could be that bold, but you might have them being continuously deployed to a test environment or something like that. But there’s absolutely no reason why you can’t achieve those things with all the amazing APIs that these cloud platforms have these days.
JAMES: For sure. So, okay, point number two. I’m married to my cloud. What’s that about?
AMY: So, this is an interesting one. So, this is about building your software in a way that correctly abstracts all of the parts of the cloud platform so that you can move from cloud platform to cloud platform as you need to. And you might not think that this is a super important thing for you. You might think, “Well, Amazon is working great for me and I’m running my apps here. And it’s just perfect.” But who knows what’s going to happen? You might see a provider come out with a machine that has more RAM, and you need high-memory instances. Or you need more cores, or you need better network infrastructure, and your current cloud platform doesn’t actually support those. So, you want to have the option, really, to be able to move around. And so, this was particularly interesting for me at Green Button, because we had a platform that was designed to burst compute workloads. And the idea was that you could submit a compute-intensive job to us and we would route that job to potentially more than one cloud. So, we might say you need lots of memory or you need lots of CPU compute cycles. We can give you a competitive cost on Amazon, perhaps not such a competitive price point on another cloud provider, and you would be able to choose which platform it would run on. So, we had a piece of middleware that essentially ran on all, well not all, but many cloud platforms. So, we had OpenStack. We had, what did we have? We had Delphi vCloud. We had Amazon AWS, Windows Azure, and we had this platform that would run across all of those.
And so, the key point there was that for us to do that, we needed to abstract away quite a lot of the important infrastructure, things like all your queuing, all of your storage, all of your auto-provisioning logic, all of those kinds of things needed to be correctly abstracted away so that you could very quickly take your implementation, swap it out with another cloud platform, and just write new implementations against those new APIs.
JAMES: That’s a great idea. You have a point in here that I love that says, “If you don’t own it, abstract away the gory details.”
AMY: That’s right.
JAMES: That’s a great point.
AVDI: Can you expand a little bit on what those abstractions look like?
AMY: So, for example we had queuing providers which would have the standard queuing methods where you could peek, you could pop, you could push things onto a queue. You could do all of those kinds of things. And we looked at what a queue was and created these abstractions over the top of it. And then we had implementations that would be specific to the underlying provider itself. So, we did that for all of our queuing logic. We did that for all of our NoSQL non-relational stores. We did that for relational stores. Basically, anything that was an out-of-the-box integration point. And this seems obvious, but one of the things that you see is that people don’t consider that they might have to potentially move from one cloud to another. And so, the idea is that…
AVDI: Well, and it’s also something that often falls into the YAGNI category, or people feel like it’s in the “You aren’t gonna need it” category when…
AVDI: …when getting something off the ground. I’m curious. Did you see that need coming? Or did you get burned by being married to a cloud at some point?
AMY: Yes, for sure. So, I think for us we actually saw that need coming.
AMY: Because we wanted to be able to run on multiple cloud platforms. But I feel like the YAGNI thing, that’s a great point. And we shouldn’t be building software that’s bloated and has all this useless stuff that we might need maybe someday. But coding against generic interfaces is something that doesn’t actually cost you anything to do. Even if you just have one implementation, coding against a generic interface is a cheap and easy thing for you to do upfront.
AVDI: That’s a fantastic point.
JAMES: Also, it makes things like testing a lot easier, because you have that clear wall where you can just abstract away the service that you’re not really concerned with right now.
AMY: Absolutely. We had a case where our core processing was this huge state machine basically. And what was really nice about that is we could abstract away all of the dependencies, the integration points, and essentially run our entire process in memory and get these full-on integration smoke tests running in a very cheap, easy-to-replicate manner. Because we don’t want to be running smoke tests where we’re spinning up 10, 20 machines to test our code, which was essentially provisioning many machines and running workloads across many machines. We want to be able to do that quickly and in memory if we can.
JAMES: And I assume your abstractions don’t have to be ridiculously complicated. If you have a queue and it supports the normal queue actions, enqueue, dequeue, that kind of stuff, it may be as simple as one class that just wraps that one. So, you call queue.enqueue and it actually calls AWS, I can’t remember what they call their queue, SMS or something. Anyway, it calls the enqueue method on that. It can be simple.
AMY: That’s right. It’s about getting rid of all the bells and whistles and solving the core problem.
JAMES: That’s cool.
AVDI: Yeah. That’s a great point. I think that’s one thing that people miss a lot of the time with abstractions, is that you only need the abstraction for the services that you use in your app. You don’t need to build an abstraction that abstracts all the possible features that people use.
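The kind of wrapper James and Avdi are describing can be a minimal sketch like this in Ruby. The class names are invented; the AWS service James was reaching for is SQS, and an SQS-backed adapter would expose the same few methods:

```ruby
# The app codes against this tiny interface; only the adapter knows
# which provider is underneath. This one is purely in-memory.
class InMemoryQueue
  def initialize
    @items = []
  end

  def push(item)
    @items.push(item)
    self
  end

  def pop
    @items.shift
  end

  def peek
    @items.first
  end
end

# A hypothetical SQS-backed adapter would have the same shape,
# delegating to the AWS SDK internally (sketch only, not real code):
#
#   class SqsQueue
#     def push(item) = @client.send_message(queue_url: @url, message_body: item)
#     ...
#   end

queue = InMemoryQueue.new
queue.push("job-1").push("job-2")
queue.peek # => "job-1"
queue.pop  # => "job-1"
```

Swapping clouds then means writing one new adapter, and tests can run against the in-memory version, which is exactly the cheap in-memory smoke testing Amy describes.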
JAMES: Exactly. Okay, so number three is the tricky one, the definition of insanity. We could spend a lot of time on this. [Chuckles]
AMY: Yeah. [Chuckles]
DAVID: It’s like Josh and I wrote that sentence together.
JAMES: It is. Yeah [inaudible].
CHUCK: One part defini-, never mind.
AMY: Oh, I think the thing to keep in mind is if it can fail, it will. And if you’re just going to keep hammering away at something pretending that it’s going to succeed one day, that is the definition of insanity. If you’re encountering a fault, so something’s gone down and you’re just blindly retrying until it comes back up, you’re not going to have a [fun] time.
AVDI: That’s almost always my approach when GitHub is down.
AMY: F5. F5.
AMY: Well, we’re working on that. So, hopefully you won’t have to hit F5 as much. [Chuckles]
AVDI: I’m sorry. Go on.
JAMES: You talk in here about transient faults. How about that’s something we really think is…
AVDI: See, this is what happens when you hit F5 too many times.
AVDI: You can actually hurt the service that you’re trying to get through to.
JAMES: Awesome. Why don’t we talk about transient faults? What is a transient fault?
AMY: So, a transient fault is exactly what it sounds like. It happens transiently. So, you may one minute be able to get data from your data store and the next, you try and query it and you don’t get anything back. And this is particularly important when you’re in a shared cloud scenario, where there are things other than yourself affecting whether or not these services can respond to you. So, other people are querying, there’s high load. Who knows what’s going to happen? You just need to expect that you’re not going to get your data back. And just retrying may not be the best thing to do here. And so, this is about understanding the nature of the faults that can happen. So, quite often you find, especially with some of these REST APIs over the top of things like Azure table storage, for example, when things fail, they’ll return some status codes. Not just HTTP status codes; in the response body they’ll have more detailed codes about what they think is actually happening on the service itself. And so, you need to be looking at these codes and making some informed decisions about whether this is a fault that you should be able to retry soon, retry later, or, “Hey, actually this is quite catastrophic and it’s unlikely that you’ll be able to retrieve your data right now.” So, it’s really about understanding the nature of your faults, and that the faults are not all born equal, and that retrying again is not always the best thing to do. One of the things you can do is use an exponential back-off in your retry policy. So, you get an error message for example and you can’t get your data. So, you might back off for a second and then try again. And if it fails a second time, you might back off for say five seconds and then try again, to give the service some time to recover. But that again comes back to that error code thing.
You need to really understand the nature of the fault, because if it’s, I can’t think of a particular example right now, but if it’s something that even a retry policy isn’t going to handle, then retrying isn’t going to help you particularly much.
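A minimal Ruby sketch of the policy Amy describes: classify the fault first, then back off exponentially only for the transient ones. The error classes, delays, and retry counts here are invented for illustration:

```ruby
# Hypothetical fault classes: only TransientFault is worth retrying.
class TransientFault < StandardError; end
class FatalFault     < StandardError; end

# Retry a block with exponential back-off: wait base, base*2, base*4, ...
# A non-transient fault propagates immediately instead of being hammered.
def with_backoff(retries: 3, base_delay: 1, sleeper: method(:sleep))
  attempt = 0
  begin
    yield
  rescue TransientFault
    attempt += 1
    raise if attempt > retries
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Example: a service that is throttled twice, then recovers.
calls = 0
result = with_backoff(base_delay: 0.01) do
  calls += 1
  raise TransientFault, "throttled" if calls < 3
  "data"
end
result # => "data", after backing off 0.01s then 0.02s
```

Injecting `sleeper` keeps the back-off testable, and because `FatalFault` is not rescued, the "definition of insanity" case, blindly retrying something that will never succeed, is ruled out by construction.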
JAMES: Avdi, I think you had an awesome example in one of your error talks at a conference once where you were trying to do something with the network and that failed. So, you went into your error handling code which tried to report the error to some network service or something and it just snowballed from there.
AVDI: Oh, yeah. Well, the example there, yeah, it’s actually a classic example. It was a very young system where we had been emailing error reports to ourselves. And again, since it was a very young system, we’d just been using somebody’s Gmail account as the SMTP server for emailing to ourselves. And we started on something. We pushed up a new version of the code that had a problem. And so, I started sending a whole bunch of error reports, which then hit the Gmail SMTP rate limit. And so, SMTP started failing and that caused the Ruby SMTP module to start raising exceptions. And the error reporting code had not been written to handle SMTP exceptions. And that exception then propagated all the way up and shut down the whole process. And the interesting thing about this was that wholly unrelated systems started shutting down because they were using the same email account to report status.
AVDI: And so, they were exceptioning out as well and croaking. And that was an interesting one to unravel. So yeah, firewall your exception reporting code.
AMY: Yeah. [Laughter]
JAMES: I think it was, wasn’t it Oz that went down a while back? And some error reporting service was running on it. Was it Hoptoad or something? I can’t remember now, that went down with it and so a bunch of applications started having trouble [chuckles] because that was down. It’s funny how those dominos fall.
AMY: I think another important point is you need to be aware of where you are in your application’s flow. So, for example you don’t want to be doing something silly like trying an exponential back-off when you’re in the middle of serving a web request to somebody. That would be not so great. So, it’s not only about understanding the kinds of fault that the service that you’re integrating with might actually have, but it’s about understanding whether it’s appropriate for you to retry at any given time. I’ve seen some pretty awful code that does this. And you look at it and you’re like, “Well, that’s kind of obvious. Let’s get rid of this.” And it’s easy to, especially if you’re hiding everything behind abstractions, it’s easy to miss sometimes that these things are going to be handling these transient faults. And perhaps you don’t want them to handle those transient faults.
AVDI: That’s actually a fantastic point, because my first instinct with this stuff is, as soon as I handle a transient fault in one piece of the application like that, my instinct is to abstract that code out and have a generic URL requester that, wherever it’s used, will handle these kinds of transient faults using an exponential back-off. And it’s probably going to have some sort of default of try three times with a back-off or something like that. And yeah, there are going to be cases where it’s actually not appropriate.
AMY: I found that one of the things that I tend to do in that scenario is push up the error handling code as high as I can. And that might seem to go against the grain, but I found that often it is more helpful than hurtful. People worry about duplication, right? Oh, well I don’t want to duplicate this error handling code. It’s actually not such a big deal. Handle the errors as high up as appropriate, I would say.
JAMES: I feel like we should point out that Netflix has some pretty amazing open source tools for this kind of stuff. The SimianArmy that they use to test their own stuff, so there’s things like Chaos Monkey which will go through and randomly shut services down, so that you can see how your infrastructure handles that. What happens when a part of it just goes away? Or one of the other pieces that are very interesting is Latency Monkey. So, it will do similar things but increase the length of a request. So, you’ll make it and instead of being zippy as it usually is, it’ll just insert a nice pause in there. And you can see how well your system handles these kinds of things. And they run these tools on their system to make sure that it can cope with these kinds of problems.
AMY: Yeah. Actually, this relates to one of the points later in the talk that I gave, around basically accepting that things are going to fail, and that you need to embrace that failure and build for failure. So, you’re right. The Netflix Chaos Monkey is actually an amazing idea that I feel like we should be [chuckles] just putting into our environments by default, almost. Yeah, it’s incredible. I think when it goes around, it looks for groups of things like service groups, things that seem to be related, and will just use some configuration that you give it, basically around probabilities and numbers, to start terminating systems, which I think is absolutely a fantastic idea. It forces you to think in this way of, “Hey, things are going to fail. And that’s a normal state for my application to be in, failure mode, basically.”
JAMES: And handling it doesn’t always have to be complicated. I worked on this one application which was a financial system in the cloud. And we had these issues where a lot of the time, we had issues with syncing things up, making sure which order the requests came in and can we safely apply all of them, or can we safely apply this one but not that one, and things like that. And a lot of times we found the solution was much easier when we would run into a scenario where we detect something went wrong. A lot of times it amounted to just communicating that well to the user, just coming back to the user and saying, “Oh, you know what? We can’t do this right now because,” and then usually the user knew the right thing to do. Or, “This has been queued but it might take a while to get there, or show up in your list,” or something like that. Just giving them valuable feedback often solved the problem.
AMY: Yeah, and letting them make a decision about how they need to solve it. They know what they need to do next. That’s great.
JAMES: Right. And it didn’t require crazy complicated code infrastructure. Alright, number four, the limitations of storage.
AMY: So, this one’s a favorite of mine. [Sigh] These storage endpoints that we use, these REST storage endpoints that we use to store our data and query our data, they advertise that they will be able to scale at this amazing rate and you’ll just be able to throw things at them and they’ll handle whatever you throw at them. But there are always going to be upper limits as to what those endpoints are going to be able to handle, right? And you’re not going to be able to anticipate those. We would find at Green Button that we would frequently hit throttling. That would be something that would happen quite a lot. We were querying often and too often. And the thing that we were being told there in a nutshell was, “Hey, you can query this data in a more efficient way.” And so, you had to take a hard look at how you’re structuring your data, how you’re storing things, how you’re bucketing things, are you using caching effectively, all of those kinds of things. And you need to come up with data designs that actually help you avoid hitting those hard limits that these storage providers actually have.
JAMES: That’s a really good point.
AVDI: We’re on point number four and I already wanted to change careers.
JAMES: What career are you moving to? Just out of curiosity.
AVDI: [Chuckles] Hobo.
JAMES: No, it’s a great point, right? Just because you can have this, just because we have this storage and it seems super easy to access and all of that, doesn’t mean you should use three queries when you can use one.
AMY: For sure.
JAMES: Yeah. You’ll hit some kind of cap. Or you’ll hit the point where, one of the problems I’ve run into in the past was a simple file upload. But then whatever they uploaded, then this service then in turn uploaded to S3.
JAMES: So, it was double the upload. Whenever someone came and uploaded something, there was the time and effort to get it up to our server and then an equivalent amount of time and effort to get it up to S3. And just rearranging that so that the upload went straight to S3 turned out to be a huge win, because it took away all the time that our server was spending doing all that stuff.
AMY: I think one of the things that makes this really hard is that it often depends on how your users are using your application. So, you might have grand plans about how you think your users are going to use your application, and then you find you put it out into the wild and they start doing weird and wonderful things. And it turns out that data point A, they actually want to get with data point C, not data point B, which you grouped together. And you might have to do some rearranging around how you’re actually storing that data so that you’re querying A with C rather than A with B, which is useless according to the average usage pattern.
JAMES: That’s a good point. Don’t guess. [Chuckles] We need the actual data of how it’s being used.
AMY: Yeah, for sure. One thing that I seemed to learn over and over again is that the ways that you think people are going to use your software, that’s not how they’re going to, you know. [Chuckles] You can only hold so many paths in your head. And it can be surprising when you get your software out to people and they turn around and they have some pretty amazing interpretations on how to actually use it. Oh, an example that I’ll give is we thought we were building something for managing teams and organizations. So, putting people into team structures where there was a tree basically where permissions were inherited through a tree. And we gave it to our customer, and it turns out what they were actually doing was using it to structure projects. So, there weren’t teams of people, there were actually projects that had dependencies in a tree fashion. And it was just crazy, because we never thought that they would use the software in that way. But it turned out that that’s exactly what they were doing. And so, it comes back to that data thing. If the goal posts are constantly moving, which they should be – if people are using your app and you’re getting feedback from the people that are using your app, the goal posts should be constantly moving – then you need to be reassessing how you’re actually structuring your data to avoid some of these limitations that you’re going to hit in these cloud platforms, like throughput for example.
JAMES: Right. And you have those limits that say they’re infinite, just means they bill in the shape of a hockey stick.
AMY: Yup. [Chuckles]
JAMES: The farther you get down there, the higher the bill goes.
AMY: And there will be hard limits. You will get responses back that say, “Hey, we just can’t service your request anymore. You’re hammering us,” and that should be a warning sign for you to say, “Well, can we do better? Can we structure our data in a different way? Can we implement some caching? Can we spread our data out a little bit differently and be a little bit more efficient?”
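Amy's "can we implement some caching?" can start as small as a read-through cache. A hypothetical Ruby sketch, with a plain Hash standing in for the remote, throttled storage endpoint:

```ruby
# Read-through cache: hit the slow/throttled store only on a miss,
# so repeated reads of the same key cost exactly one backend query.
class ReadThroughCache
  attr_reader :backend_reads

  def initialize(store)
    @store = store        # stands in for a remote storage endpoint
    @cache = {}
    @backend_reads = 0
  end

  def fetch(key)
    return @cache[key] if @cache.key?(key)
    @backend_reads += 1
    @cache[key] = @store[key]
  end
end

store = { "user:1" => "Amy" }
cache = ReadThroughCache.new(store)
cache.fetch("user:1")  # backend read
cache.fetch("user:1")  # served from cache
cache.backend_reads    # => 1
```

A production version would add expiry and invalidation, but even this shape takes repeated hot-key reads off the endpoint, which is often enough to stay under a provider's throttling limits.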
JAMES: Number five, we all know this one. We’ve got to be ‘web scale’.
AMY: So, what does that mean? Does anyone know?
AMY: I don’t know what that means.
CHUCK: Did she just ask for a definition?
JAMES: I’ve seen the comic.
DAVID: No, I think she’s trolling us.
AMY: Yeah, that’s a funny one. I feel like there’s this epidemic of choice of technology as a function of fashion. And I don’t know. That just goes against so much. You really should be building an awesome product, not awesome tech. The people who are using your software don’t really care about whether you’re running on Redis or MySQL or all that. They don’t care. Yeah, just because Netflix is using it or Twitter is using it doesn’t mean that it’s right for you.
JAMES: That’s so awesome. Yeah.
CHUCK: But they’re web scale.
DAVID: So, I can, what do they call that, retread an old joke. I can tell you what web scale means. It means an extra $60 an hour in my billing rate.
JAMES: [inaudible] I really like this slide you have in here where you say if it works for you then it’s great, whatever.
JAMES: It doesn’t have to be built on whatever. It can be built on whatever works.
AMY: I feel like one of the things that people do is, well recently there’s been this whole NoSQL movement, right? Which is great, and it actually does solve a lot of problems and it can really get you quite far, if you’re using it for the right reasons. A lot of people see this NoSQL movement and they think, “Oh, well I can’t be using relational stores anymore,” and that’s just totally not true. You know, relational stores work just as well as non-relational stores for particular kinds of data. And you really need to be thinking about…
DAVID: Even better for some.
AMY: Exactly, exactly. And I feel like this function of fashion thing is clouding, clouding [chuckles] clouding, huh…
AMY: …clouding people’s judgments. Sorry, that was really lame.
DAVID: Actually, if you hadn’t called attention to it, it would have been really clever, because I… anyway, carry on.
AMY: Thinking about things like whether you need the properties of fully atomic transactions. You need ACID for this group of transactions. Do you actually need that? Do these bits of data actually need to be stored relationally? Those are the kinds of things that you should be thinking of, as opposed to, “I’m going to use Redis because it’s cool,” or, “I’m going to use whatever’s the next hotness.”
AVDI: Well, I think there’s usually a little, a slight layer of rationalization on top of that. And it’s usually, “I’m going to use Redis because it’s really, really fast for a given benchmark X.”
AMY: Oh, and that’s entirely appropriate.
JAMES: And sometimes you don’t consider the other side. “It’s really, really great at this, but.” [Chuckles]
AMY: Sure, and I feel like it’s fine for you to start with what you know. And then, if you’ve got great abstractions, it should be not too difficult for you to swap that out when you realize that you need one of these data stores whose benchmarks are in line with what you’re trying to achieve.
JAMES: Sometimes it’s just a matter of how much you need to throw at this particular problem, too. So, there’s a part of our application where we need document-database-like storage. We have these structures and they’re kind of arbitrary and nested and stuff. And we could bring in a document database and handle all of it that way. But then we have this other component in our infrastructure that we have to handle, and handle the failure rates of, as we were just talking about, and that kind of thing. Or we can shove a JSON column in a Postgres database. And that gets us pretty close. Is that as full-featured as a full document database? No. Does it meet our particular need? Yes, or maybe in this particular case. But sometimes it’s just knowing whether or not you actually need the latest, greatest, full-featured thing in a certain area, or whether you can get by with a semi-workable solution without too much pain.
DAVID: Has anybody ever used serialize in Rails, not JSONifying an object, as a document store?
JAMES: Well, by default it does YAML. So, I would say a lot of people use it that way.
DAVID: Oh, okay, alright.
JAMES: I think. But nowadays in Rails 4, you can just declare the column JSON, which is amazing.
DAVID: Oh, that’s cool.
JAMES: And you just shove some Ruby object tree in there. It JSONifies the whole thing, sticks it in there, and in Postgres you can index that. You can query it. It’s cool. See Ruby Tapas for more info.
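A rough sketch of what that JSON column does under the hood, using plain Ruby’s json library — the settings structure and the migration line in the comment are hypothetical examples, not from the show:

```ruby
require "json"

# What a Rails 4 json column does conceptually: dump the Ruby
# structure to a JSON string on write, parse it back on read.
# (In a migration this would be roughly: add_column :users, :settings, :json)
settings = { "theme" => "dark", "widgets" => ["feed", { "name" => "clock" }] }

stored = JSON.generate(settings)  # the string that lands in the column
loaded = JSON.parse(stored)       # what the model hands back on read

# The round-trip preserves arbitrary nesting, which is why a JSON
# column can stand in for a lightweight document store.
loaded == settings # => true
```

On top of this, Postgres can index and query inside the stored JSON, which is what makes the column more than a dumb text blob.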
AVDI: Although I really need to cover Active Record.
JAMES: Yeah, now you have to do the Active Record side. You did the SQL side, which was cool. Number six: Go Daddy goes down. [Chuckles] Should this be about why we don’t use Go Daddy? No, it’s…
CHUCK: I’m hoping they go down. I mean…
JAMES: Yeah, it’s about something else. What’s this about, Amy?
AMY: This was actually around the time, I think it was last year, when Go Daddy had a massive denial of service attack. And it seemed like half the internet went down. It was pretty catastrophic. And so, it made me think. DNS is a single point of failure, right? You’re trying to identify single points of failure in your app. And if you’re not identifying DNS as one of those, then you’re setting yourself up for some pretty significant hurt. And there are some things that you can do. You can pay for failover solutions. There are companies out there that do that. And the interesting thing is the way that they work: they lower the time to live. And the name servers, well not the name servers themselves, but they’ve got these services that are checking primary IPs and making sure that they’re up. And then if they’re down, they’re redirecting traffic to a secondary IP. And that only works when you’ve got a low-ish time to live, which goes against some of the benefits that you get when you’re using high times to live, right? [Chuckles] That’s the whole point.
AMY: So, it’s like, “Okay, well what can we do here?” And there’s actually something really simple that you can do that will help you stay up in a situation like this. And it’s just: manage the failover yourself. Have more than one domain name. Have it with two different providers. Have your name servers… You’re not depending on one set of name servers or two sets of name servers. Have that spread out. And then within your application, at an application level, if you detect failure, then fall back to that secondary URL that you have. And then say you’ve got a shop for example, an online store. And DNS goes down and nobody can access your website. Well, the thing that you can do there is you can say on Twitter, “Hey guys. It’s actually still up. It’s here.” And it’s not going to give you 100% uptime, but at least it puts it in your hands. You’re not worrying about poisoned caches all over the shop. And you’re not waiting for things to clear, and their time to live to expire, and things to naturally restore themselves. You’re able to control some of that downtime yourself.
DAVID: That’s genius. There’s a problem that I’ve seen. I first started seeing it in the late 90s and I’ve seen it about once a year ever since then. You try to request a domain name and the request propagates all the way up to the big six or seven ICANN servers, or whatever, the core central DNS things. And it finds your top-level domain registered at Go Daddy or wherever you’re registered. But if it goes to your DNS servers and they’re down, it will return failure. And this type of cache miss is the most expensive thing that DNS servers can do. And so, they automatically chill that request to prevent you from doing a denial of service. And so, somebody tries to get your domain and you’ve got a hiccup in your DNS server and you’ve got TTL set to 60 seconds. But guess what? Your ISP cached it for 86400 seconds, which is a full day. And so, you’re telling your VP of whatever that we’re up and we have this load balancing failover yadda, yadda, yadda. But the reality is that everybody that got frozen out during the service interruption is going to stay frozen out for the next 24 hours. But yeah, if you have a second domain name to fail over to, that’s genius. That’s actually going in my tool belt. Thank you.
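The application-level failover described here could be sketched like so; the hostnames and health-check path are hypothetical, and a real setup would register the two domains with different DNS providers:

```ruby
require "net/http"
require "uri"

# Try the primary domain first; if name resolution or the connection
# fails, fall back to the same app behind a second domain name.
def reachable_base_url(primary, secondary, timeout: 2)
  uri  = URI(primary)
  path = uri.path.empty? ? "/" : uri.path
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.head(path)
  end
  primary
rescue SocketError, Net::OpenTimeout, Net::ReadTimeout, SystemCallError
  secondary
end

# reachable_base_url("https://shop.example.com/health",
#                    "https://shop-fallback.example.net/health")
```

The key point is that the fallback decision lives in your own code, so it works even while downstream resolvers are still serving a stale or poisoned answer for the primary domain.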
AMY: What was I doing? Because we had some services on Go Daddy, I had to basically go around and manually fix these things, right?
AMY: And give them hard IPs to point to.
AMY: So that they could stay up. And I was like, “Well, hang on a second. I totally didn’t need to do this.” [Chuckles] This could have been automatic.
DAVID: Yeah. Talking a customer on the phone through how to edit the hosts file on Windows.
DAVID: Step one: realize that there actually is a hosts file on Windows. [Laughs]
AMY: Yeah, exactly.
JAMES: Don’t worry. This won’t be painful. [Chuckles]
CHUCK: I thought you were trying to say, step one: find a solid wall. Step two: bang head.
DAVID: There actually is a hosts file on Windows. And you have to shut down everything, all instances of Internet Exploder, and anything that has a DLL or, what are the kids calling it now, ATL or COM. Anything that’s got a web browser integrated into it, you have to shut it all down so that all of the IE DLLs shut down. And that’s way more tech support than I wanted to do. So, I’m going to stop talking.
DAVID: It’s a nightmare.
JAMES: And if you have trouble with any of this, please call David Brady at…
DAVID: No! No!
DAVID: No, I take it back. I know nothing. I know nothing!
CHUCK: Yeah, we did that once. I had a guy working for me when I ran the tech support or product support department at Mozy. He would surf Facebook all day, so we went and fixed his hosts file.
CHUCK: And so, he went to Facebook.com and it loaded up the internal web service that we used to do our work.
JAMES: That’s amazing.
CHUCK: But anyway, tangent ended. [Chuckles]
JAMES: Speaking of slow and Facebook. [Chuckles] Number seven, our app is slow.
AMY: Right. So, this is around realizing the need for scaling up. So, I’ll just define what I mean by scaling up. So, scaling out is when you fan out the number of application servers you have for example. And maybe you’ve got a website and you’re running it across four or five different machines to cope with load. Scale up is about using the resources within those individual machines as efficiently as possible.
JAMES: Or increasing the resources of those individual machines, right?
AMY: Totally. But remember, these things cost, right? Every time you increase the resources and you throw more resources at something, you’re paying for it. Every time you horizontally scale or vertically scale, you’re paying for it. So, spending time optimizing what you have to run in the most efficient way possible is fun and also super important. One of the things at Green Button, we would run customers’ jobs. So, for example, people would submit animations they had built, perhaps using 3ds Max. We would actually render it up on the cloud and would spin up a number of nodes and we’d render frame by frame. But the cost of that is directly related to the efficiency with which the shaders and the textures they were using could be rendered. So, if they could spend time making that more efficient, then they were going to save themselves some significant coin, because the cost of actually running on the cloud goes down because they’re not spending as much time using those CPU cycles.
JAMES: Yeah. Get farther on the same hardware.
AMY: Yeah, for sure.
JAMES: I do think you have to balance it with the cost of sometimes you have some piece of infrastructure in place. It’s working. It’s maybe not ideal in how it’s going through it. But if re-architecting that is significant, I do think you have to balance it with the development cost and the new infrastructure you’re going to introduce, which is obviously going to have some new bugs and issues you have to work through and things like that. I think you’re right, that a lot of the time, it’s very worth it, that typically you make that one-time payment and then you pay it off over the course of how long it makes everything better. But I think I have seen scenarios where it did make sense to take the box you’re on and just go up, because of the nature of the problem or something.
AMY: One of the things that I see happen a lot is SQL instances, for example, perhaps aren’t performing as well as they should. And people will throw more resources at a SQL instance when, probably, it would be more cost-effective, I would say most of the time, especially for SQL Server, to actually identify the bottleneck that you’re introducing, by perhaps adding an index or perhaps using a user-defined function, which is a little less intensive. Let’s not introduce deadlocks everywhere. Those things are going to cost you money. But you’re totally right. If you’re talking about micro-optimizations at an application level, then maybe that’s not worth your time. It really depends on the workload that you’re actually running. But something that I feel like you shouldn’t forget is that vertical scale is going to cost you just as much as horizontal scale.
JAMES: Oh yeah. I think it’s just a tradeoff you want to stay aware of: how much time am I going to spend to fix this? And how much bang will I get out of it? But a lot of the time, the formula looks great, because most of the time, whatever you do to fix it, you’re going to do once. You’re going to gain that benefit going forward for however long you go forward. It’s almost a no-brainer in a lot of cases. I agree.
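The “fix the bottleneck before buying hardware” point can be shown in miniature with plain Ruby: a hash acting as an index versus a full table scan, which is essentially the tradeoff a database index addresses. The toy rows here are invented for illustration:

```ruby
require "benchmark"

# A toy "table" of rows; the ids and sizes are arbitrary.
rows = (1..50_000).map { |id| { id: id, name: "user#{id}" } }

# Without an index: every lookup scans the whole table.
scan = ->(id) { rows.find { |r| r[:id] == id } }

# With an index: one upfront pass builds a hash keyed on id, after
# which lookups are constant time -- roughly what CREATE INDEX buys.
by_id  = rows.each_with_object({}) { |r, h| h[r[:id]] = r }
lookup = ->(id) { by_id[id] }

scan_time  = Benchmark.realtime { 50.times { scan.call(49_999) } }
index_time = Benchmark.realtime { 50.times { lookup.call(49_999) } }
# index_time comes out orders of magnitude smaller than scan_time,
# with no bigger machine involved.
```

Same data, same machine; the cheap structural change dwarfs what another tier of hardware would have bought.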
AMY: Totally agree. The next one is diagnostics. It’s around being able to tell what’s going on, on your infrastructure and within your application, in a super easy and quick way. I guess we all know diagnostics is extremely important. But I feel like it’s even more important when you don’t own the hardware that you’re running on. You can’t always rely on being able to jump in on a machine, especially in a heavily-distributed environment. Just jumping on one machine and checking the logs or checking the event logs is probably not going to be an efficient way of debugging and figuring out problems when they happen. So, one of the things that’s important there is having really great resource monitoring. This comes back to this vertical scale thing. You don’t want to be paying for things that you’re not using. And you don’t know that unless you’re actually monitoring those things. And you want to be able to see that at a glance, if you can. And then the other thing is logging. Logging becomes super, super important. And it’s not just logging for the sake of logging. It’s being able to correlate activities that are happening across different parts of your system so that when a user rings up and says, “I had this issue,” you can actually trace everything that they did through all the various parts of your system that may or may not be intimately connected.
JAMES: And that’s getting trickier and trickier these days as we move toward SOA architectures, small pieces talking to each other, because so many pieces may have touched an individual request.
AMY: That’s right. At Green Button, what we were doing is people would submit jobs and it would hit our REST API and various things would be logged. It would then get put in a queue and then processes would pick that message up and do various things on it, but then it would get fanned out across multiple machines where the processing work would actually be done. More queuing, more nodes, and eventually it would get back to the user. And so, one of the things that we did there is that we would log. We didn’t just have one system log. We had a process-level log. So, the net effect was that the user could download the logs, obviously for the things that we said that they could see. We had a flag where we could say that this log was a user message and this log wasn’t a user message. But the net effect was that the user was able to download a log and see, “Hey, it successfully submitted. Hey, it was successfully picked up by this next node. Hey, the processing that happened on node 2, for example, or node B or whatever, went through these steps and it errored here. Node C however, had a great time. Things were then packed up. And now you’re reading a log.” And so, we were able to do that by storing the log information against the actual user ID. And so, it wasn’t a log file. We were actually storing it in a non-relational store where somebody could just hit a table for the logs on their job. And they could query that and see the relevant pieces that occurred for their workload. Being able to correlate on user ID and query quickly on user ID I think is super, super important.
JAMES: You have a slide that says diagnostics needs to be a first-class concern. I really like that. I had someone very smart tell me once that logging is a feature like any other. And so, it needs to have a story card and you work on it and stuff. And we don’t tend to do that because servers come with automatic default logging where you turn it on and, “Oh, this request was handled with these parameters.” And so, we think, “Oh, logging’s done.” But that’s just about the HTTP request. That doesn’t tell you anything about the business of what that action did or the logic internal to it. And unless you take steps to put that in some kind of diagnostics, then it just won’t be.
AMY: Yeah, you’ve got an excellent point that your logs are only as useful as the information that you can derive out of them. And so, if you’re just logging everything to a flat file and you’re [chuckles] viewing, or just a flat format, and you’re viewing the things that are happening in your system in a flat fashion, you’re not going to be able to very easily deduce correlations between certain events. And I think that’s why it needs to be a first-class concern, because the correlation, the way that you draw causation between things, is something that is app-specific. It depends on what you’re actually building. And so, if you’re not thinking about how you’re going to be able to see that stuff upfront, then logging in a flat format is probably not going to be much help to you.
JAMES: Great point.
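The correlation idea described above might look something like this in miniature; the in-memory store, field names, and node names are all invented, and a real system would write to a shared, queryable store rather than an array:

```ruby
require "time"

# Structured log entries that always carry the user id, so one
# user's trail can be pulled out across nodes and processes.
LOG = []

def log_event(user_id:, node:, message:)
  LOG << { user_id: user_id, node: node,
           at: Time.now.utc.iso8601, message: message }
end

def trail_for(user_id)
  LOG.select { |e| e[:user_id] == user_id }
end

log_event(user_id: 42, node: "api",    message: "job submitted")
log_event(user_id: 42, node: "node-b", message: "render started")
log_event(user_id: 7,  node: "api",    message: "job submitted")

trail_for(42).map { |e| "#{e[:node]}: #{e[:message]}" }
# => ["api: job submitted", "node-b: render started"]
```

Because every entry is structured rather than a flat line of text, the correlation query is one filter instead of grepping log files on each machine.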
AMY: It’s also great to be able to see all the exceptions that are occurring as they’re occurring as well. I feel like that’s something that I’ve been burned by in the past, is you’ve deployed something up and you think that it’s working great. And if you’re not having logs come to a central place where you can quickly view things, you can very quickly miss these exceptions that are not catastrophic. They’re not causing your app to crash, but they’re undesirable and those are bugs that you should be fixing. And I feel like if you’re not surfacing this information in a central place, you’re going to miss all of that.
JAMES: Sure. Our app is still up. Everything’s working fine. But all of a sudden, nobody’s signed up since the last deploy. That’s weird. [Chuckles]
AMY: Yeah, exactly.
CHUCK: What a coincidence.
AMY: You don’t want to be leaving that.
JAMES: Okay, so number ten. We’ve covered a little, but let’s see if we have anything else to say about it, which is you’re going to fail.
AMY: Yeah. So, this is around building to break, the Chaos Monkey that you mentioned earlier. Netflix actually does some amazing stuff in this area. I think the Chaos Monkey’s open source. I think it’s on GitHub and you can pull it down and install it on your own environment. And it will just go around and wreak havoc and find things in your app that you didn’t expect to break and force you to deal with those issues. I think they’ve got it so that you can configure it for when it’s going to run. So, you might only have the Chaos Monkey actually running inside your environment in office hours, for example so that somebody’s on-call and able to deal with these things as they happen. But you’d rather find out about those things [chuckles] while you’re in the office as opposed to…
JAMES: [Chuckles] When you’re on a beach in Hawaii.
AMY: Yeah, exactly. [Laughs] So, forcing yourself to actually acknowledge that things are going to break and actively having them break is I think a great thing.
JAMES: I saw a really good presentation from some Netflix developers one time. And they were talking about how they ran these, the Simian Army. And it ran in production, not a testing environment, in production, so that they proved that the system was handling that. It had to handle that. And they said after they had run Chaos Monkey for a good period of time, it got to the point where Chaos Monkey didn’t typically find problems anymore. Their developers had adapted and they knew how to write in such a way that that service may not be there anymore and, “I’ve got to have a backup plan,” kind of thing. But then they introduced Latency Monkey and things got worse again, because nobody counts on the request that takes 28 seconds or something like that. So then, people would modify their code with timeouts and stuff. It was interesting how it changed things.
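The timeout habit described here (never assume a dependent service answers quickly) could be sketched like so, with a lambda standing in for the slow dependency; the service, fallback value, and timeout are all invented for illustration:

```ruby
require "timeout"

# A simulated dependency that stalls; in real life this would be a
# network call to another service.
slow_service = -> { sleep 5; "fresh data" }

# Cap how long we will wait, and have a fallback ready -- the
# lesson Latency Monkey teaches.
def call_with_timeout(service, fallback:, timeout: 0.1)
  Timeout.timeout(timeout) { service.call }
rescue Timeout::Error
  fallback
end

call_with_timeout(slow_service, fallback: "cached data")
# => "cached data"  (the 5-second stall is cut off at 100ms)
```

Timeout.timeout is fine for a sketch; for real network calls, prefer socket-level options like Net::HTTP’s open_timeout and read_timeout, since Timeout works by interrupting the thread.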
AMY: I think the other thing is software isn’t static. You see that it got better over time and I’m putting money on the fact that it did. But that patch you deployed last week might look pretty great, until this week when everything crashes.
AMY: Having something that goes around and causes chaos is going to help you find those things, because software isn’t static. And it is always changing.
DAVID: I got into computers because I thought they were logical and deterministic. And systems now have as many parts to them as large organic systems. And so, they’re fussy and hypochondriac. It’s like we’ve all worked with that server that’s just fussy. And there’s no reason why. Or a service that works forever suddenly stops working.
DAVID: And when you find the answer, it totally makes sense. Some idiot upgraded the version of Python on one of the Linux servers and it wasn’t backwards compatible or whatever. And you think that these are… there’s a joke in, I think, Code Complete that bitrot is a superstition. That software decays over time? Well, that’s just foolish superstition. That means you don’t understand your system. No.
DAVID: It’s actually an admission that you don’t understand your system and all the pieces that go into it. And there really is bitrot. Okay, the software didn’t change. But the bajillion dependencies, something in them did.
JAMES: You’re in real trouble if you have to recompile a Linux kernel. That’s all I’ve got to say.
DAVID: I [inaudible] my system right now that I have to recompile the graphics drivers for every time the server reboots.
DAVID: I’m in special kind of hell, yes.
DAVID: And I recognize that it’s my own sys admin idiocy that got me into this boat and my own bloody mindedness that keeps me from getting out of this boat. But you know, here I am. [Laughs]
AMY: You love it.
JAMES: Well, Amy, thanks. That was a cool set of points and awesome stuff.
JAMES: It made me want to go deploy some things to a cloud.
AMY: You should.
CHUCK: That’s right. Make it rain.
JAMES: And make it rain.
AMY: Thanks so much for having me.
JAMES: Thank you for coming on.
CHUCK: Alright, should we do some picks?
JAMES: Yeah, let’s do it.
AVDI: Oh, I have one more question before we go.
JAMES: Go for it.
JAMES: Fire away.
AVDI: How do I learn to make slides that look as gorgeous as yours?
JAMES: Yeah, good point. [Chuckles]
AMY: Oh, man. They take so long.
AMY: Sometimes, I get to the end of it and I’m like, “What am I doing with my life?”
DAVID: Step one: seven years in Japan studying calligraphy.
AMY: I think the biggest thing you can do there is just pick some nice colors with good contrast, some nice fonts, and then just make everything really big.
JAMES: I love the really big, yeah.
AVDI: Great point.
AMY: Yeah, I just like it super, super big.
JAMES: And use cartoon characters.
JAMES: Yeah. [Chuckles]
AMY: Yeah, but don’t spend hundreds of hours, because it’s not worth…
AMY: Well, I don’t know, debatable. It might be worth it. It is fun.
JAMES: For sure.
AVDI: Well, I thought they were really neat.
AMY: Thank you.
JAMES: They do look good.
CHUCK: Alright, should we do some picks now?
AMY: Sounds good.
CHUCK: Avdi, you want to start us with picks?
AVDI: I’ve got a pretty whole-hearted pick this time. I guess they’re always whole-hearted, but especially so. So, I’ve started traveling more, doing more international travel for conferences in the last two years. And probably the biggest headache when it comes to international travel, apart from the usual unavoidable running from plane to plane sort of thing, is dealing with having a working phone in a foreign country. And if you’ve ever done this, you know that if you’re using a US cellphone carrier, the usual situation is there are going to be exorbitant roaming fees, particularly when it comes to data. There are going to be incredible, insane roaming fees. If you use your phone normally and don’t pay attention, you will easily come home to literally hundreds to thousands of dollars in charges and overages, because the rates that they charge for roaming are just insane. And so, this has resulted in a lot of desperately caching Google Maps while I’m on Wi-Fi and then dashing from one Wi-Fi oasis to another trying to stay in contact with the world, because the truth is I have become pretty dependent on my phone, especially when it comes to navigating around places. So anyway, before my last trip, which was to Rotterdam, I switched cellphone carriers to T-Mobile. T-Mobile has recently totally changed up their international roaming policy. And the way it now works is in 100 different countries, international data roaming is free. Well, not free, but unlimited. Unlimited just like it is on the plan you have at home. It’s not high-speed. It’s not 4G LTE. You can pay extra for that. But as far as 3G goes, you get off the plane and you’ve got unlimited data. And that’s just breathtaking when you’ve gotten used to the alternative. So yeah, I wasn’t going to pick it until I tried it out, but it worked perfectly.
Got off the plane for my changeover in Heathrow, worked perfectly, got off the plane in Rotterdam, worked perfectly, and I just didn’t have to worry about being out of contact and not being able to find my way around or get in touch with people. So I was really, really impressed by that. They have a lot of other enlightened policies going on like the way they don’t do contracts and they don’t do subsidized phones. They’ll finance a phone for you, but they don’t do the usual subsidized phone stuff. And the phones they sell are unlocked and all kinds of cool stuff. But yeah, the international roaming thing really sealed it for me. They do have less data coverage in the states. I gave up my 3G coverage for where I live by going over to this plan. But to me, it’s worth it. So yeah, I’ve actually gone on so long raving about T-Mobile that I’m going to not do any of my other picks.
AVDI: And yield the floor.
CHUCK: Alright. David, what are your picks?
DAVID: Oh, speaking of people who ramble, right?
CHUCK: I did not say that.
CHUCK: I didn’t say I wasn’t thinking it.
DAVID: Yeah, there you go. Gosh. I’ve spent the past four days sick and in bed. And that’s when I catch up on my Netflix and my YouTube and my Hulu and all that TV stuff. And I wanted to pick a series that I was catching up on, but everybody’s seen everything I’ve seen. I’m three or four years behind everybody. I’m currently watching Dexter and I’m really, really enjoying that. That’s on Netflix now and it’s a lot of fun. But I just found on Netflix something that I used to watch on the web years ago and absolutely just laughed and laughed and laughed. Some discretion is advised. This is definitely a D. Brady pick. But there’s a show called Happy Tree Friends.
[DAVID’S WIFE]: Oh, not that!
JAMES: That’s got to be the best pick ever right there.
CHUCK: Don’t pick that! [Laughter]
DAVID: That’s my wife, ladies and gentlemen. And that’s everything you need to know about Happy Tree, well I have to say this…
JAMES: That sealed the deal right there.
DAVID: It’s a cute kiddie cartoon. [Hums] La, la, la. And it’s like the cartoons we watched as a kid and there’s giggling animals. And about 30 seconds in the dismemberments start. It is absolutely adult entertainment. It is basically, if you wanted a cartoon version of Saw and Saw II, this would be it, Happy Tree Friends. I am a bad, bad man for picking this. And I feel really, really guilty about it. And if cartoon violence turns you off, stay far, far, far away. But if the horrible, horrible juxtaposition of those two Venn diagram circles makes you giggle helplessly, you now have a new fix for that.
JAMES: That’s so awesome.
DAVID: That’s my pick. I’ll stop.
CHUCK: Okay. James, what are your picks?
JAMES: I don’t really have much this week. But I’ll just pick a quick YouTube video. It’s less than three minutes. One of the things that bugs me is people not knowing the difference between evolution and natural selection. So, if you don’t know the difference between the two, or don’t think you know, this is a less-than-three-minute video that will explain the difference to you. It’s a good thing to watch, just so you use the terms correctly and don’t bug people like me. That’s it. That’s all I got.
CHUCK: So, you mean the difference isn’t just the spelling?
JAMES: [Chuckles] Not just.
CHUCK: [Chuckles] Alright. I’ve got a couple of picks. I think I’ve picked this before, but I’m going to pick it again because it’s relevant to point one and some of the other things we talked about. I’ve really gotten into using chef-solo. In fact, I like it a lot better than the other Chef alternative, which is to have a server manage all your stuff. I’m sure once I have the gazillion servers in my billion dollar business, then that may change. But for right now, chef-solo is awesome. I’m also really liking Librarian, which is like Bundler for your Chef recipes and makes my life a little bit easier because if I want a recipe, I just pull it in.
JAMES: I don’t get it. Why Librarian? Wouldn’t it be like cookbook?
CHUCK: Cookbook’s already taken as the collection of recipes.
JAMES: I see. Hmm, I don’t know. It doesn’t fit.
CHUCK: And Librarian I guess manages your Cookbooks and not your recipes.
JAMES: It’s missing a metaphor. I don’t know. It seems weird.
DAVID: This is a very common namespace exhaustion problem.
CHUCK: Anyway, regardless of what it’s called, it’s cool. So, I think I’m just going to leave that for my picks. Amy, what are your picks?
AMY: And those are my picks.
CHUCK: Awesome. Well, thanks for coming on the show. We really appreciate you…
CHUCK: …taking the time. And a lot of these points, I’m going to have to go listen to a couple of times and let them sink in.
AMY: Thank you so much for having me on the show. It was good, great fun.
CHUCK: No problem. One other announcement, we are reading ‘Object Design: Roles, Responsibilities, and Collaborations’ for our book club book. So, make sure that you get a copy. There has been some discussion on how you get it. Apparently it’s on Safari Books Online if you have an account or access to an account like that. I did find it on Amazon and I paid about $50 to get it. So, if you want a hard copy, it’s going to be a little bit pricier maybe than you expected a book to be. And were there any others? There was some discussion on Parley on how to get it.
JAMES: I think Safari is what everybody’s saying is the easiest and best way to get it. And we have the episode scheduled for May 9th, I believe.
CHUCK: Yup. So, it’ll come out the following Wednesday. Alright, well thanks again. We’ll wrap up the show. We’ll catch you all next week.
[Would you like to join a conversation with the Rogues and their guests? Want to support the show? We have a forum that allows you to join the conversation and support the show at the same time. You can sign up at RubyRogues.com/Parley.]