RR Scaling Rails with Steve Corona
01:38 - Steve Corona Introduction
01:57 - Twitpic
03:51 - Learning on the Job vs. Skills and Experience
04:31 - Scaling
05:16 - TwitPic as Rails Application
- PHP to Rails
- Records/Database Issues
10:19 - Thumbnail API’s
10:58 - Restoring and Migrating Data
- Read-Only Functionality
12:02 - MySQL
- Building Apps
- Converting Tables
13:40 - Data Migration Timeline
- Rewriting PHP
- Creating Framework
- Converting to Rails
16:45 - Programmer vs. API Builder Mindset
17:30 - Go
17:31 - Node
18:24 - JRuby
21:34 - TorqueBox
23:12 - Background Functionality
26:27 - async
29:48 - Resque
29:57 - Github: Introducing Resque
32:41 - Turning Things Around
32:57 - Is Ruby Fast Yet?
40:49 - Algorithms
43:09 - N+1 in Rails
45:35 - Micro-Optimizations
51:26 - Profiler Tools
51:45 - statsd
53:01 - New Relic
[This episode is sponsored by Rackspace. Are you looking for a place to host your latest creation? Want terrific support, high performance all backed by the largest open source cloud? What if you could try it for free? Try out Rackspace at RubyRogues.com/Rackspace and get a $300 credit over six months. That’s $50 per month at RubyRogues.com/Rackspace.]
[Snap is a hosted CI and continuous delivery service that goes far beyond letting you do continuous deployment. Snap’s first class support for deployment pipelines lets you push any healthy build to multiple environments automatically and on demand. This means with Snap, you can deploy your staging environment today. Verify it works and later deploy the exact same build to production. Snap deploys your application to cloud services like Heroku, Digital Ocean, AWS, and many, many more. You can also use Snap to push your gems to RubyGems. Best of all, setting up your build is simple and intuitive. Try Snap free for 30 days. Sign up at SnapCI.com/RubyRogues.]
JAMES: Hello everyone and welcome to episode 162 of the Ruby Rogues Podcast. Chuck is out today, so I’m James Gray and I’ll be your host. With me today are Avdi Grimm…
AVDI: Hello from Pennsylvania.
JAMES: Saron Yitbarek.
SARON: You said my name correctly. I’m very proud of you.
JAMES: [Chuckles] I practiced. [Chuckles]
JAMES: And with us today, a special guest, Steve Corona.
STEVE: Hey, thanks for having me on. I’m really excited to [help] you guys.
JAMES: Thanks for taking the time. We appreciate it. Steve, since this is the first time you’ve been on, do you want to introduce yourself?
STEVE: Sure, yeah. So, I’m Steve Corona. I was CTO of Twitpic, the big photo sharing app for Twitter that everyone has used or heard about. Did that for a couple of years, and now I’m at a startup in San Francisco called BigCommerce. And I’m the principal architect there.
JAMES: Wow, cool. So, Twitpic, that’s why we asked you to come on today. It turned out to be kind of a big deal, didn’t it?
STEVE: Yeah, it did. And it was quite a rollercoaster. So, I started Twitpic, geez, back in 2008. So, it’s been, wow longer than I can even fathom how long it’s been. But yeah, it blew up overnight, really just took off, and got all this traffic coming in. And what I had started with helping Noah, who was the CEO at the time, we got in touch when the user base was really tiny, 30,000 users. And he really didn’t know what he was doing and needed some help. And I really had no idea what I was doing either. I dropped out of college. I was into programming and computer science-y stuff. But by no means was I super experienced or super classically trained, and had to spend the better part of almost five years just learning how to deal with all this traffic. And it was like drinking from the fire hose because it never slowed down. It actually ramped up more and more and more. And so, for the longest time, for the first year or two, I actually had to sleep next to my laptop because I couldn’t figure out how to get the site to stay up all night. And so, my solution for that was well, when my phone goes off because it’s down, I’ll wake up, roll over, do some command line magic to restart it, and go back to sleep. And for the longest time, that was my solution until I learned more and more and more and figured the scaling world out and how to do all this stuff automatically and how to have better response time and how to deal with millions and millions of users. And eventually, by the time I wrapped up at Twitpic and parted ways, it was all automatic. It was just magical scaling abilities where everything just handled itself. So, it was quite a rollercoaster over the years.
SARON: Wow. That’s incredible. You know, it makes me wonder. How much of what you did while you were there was learning on the job versus skills and information that you came in with?
STEVE: That’s a good question. So, I had a little bit of experience. So, before Twitpic I was at an ad network. And so, ad networks at the time, back in 2008 timeframe, were the only web properties that had to deal with scaling. They were the forerunners of big data, getting into all that stuff. So, I knew about Memcached and maybe some scaling terminology just from that exposure. But a lot of it was learning on the job. And at the time… nowadays there’s so much information on the web about scale. You can google how to scale X, Y, Z and you’ll find quite a bit of good information. But five, six years ago, that really was still on the forefront. There wasn’t a lot of good information. A lot of that information was locked up in big companies like Google and maybe Facebook at the time, where that kind of knowledge was built on the job there but they weren’t sharing it as much as they are now. So, there was definitely a lot of learning. I learned so much on the job. But it was just that that information at the time wasn’t available. It wasn’t common knowledge. Now everyone says, “Oh yeah, just throw Memcached in there and solve scaling problems yourself.” But [chuckles] at the time that information just didn’t exist.
JAMES: So, Twitpic was a Rails application, I’m assuming?
STEVE: So, Twitpic actually has a Rails history that’s quite interesting. And I’ll give you the short but long version of it. So, when Twitpic was first started, when I came on to Twitpic at that 30,000 user mark, it was a PHP app. And it was a really nasty PHP app. It was the kind of PHP that maybe your 14-year-old little brother might write. It’s spaghetti and lots of files and this whole nasty mess. And I had always been a Rails guy and was like, “Yeah. I want to do this in Rails.” But at the time, as we grew, we just couldn’t justify the switch. We didn’t have enough Rails programmers on our team. And the reality is that if you use Rails, or Ruby, Ruby on Rails, you need more hardware. PHP is really convenient in the sense that it’s the runtime is a pretty nasty thin wrapper over C. So, it has better performance at the cost of really crappy code. On the flipside, Rails just has more code and needs more hardware. So, as CTO I’m like, “Yeah, I want to use Rails. I want to push better technology.” But the guys with the money at the end of the day make the decision that, “Hey, we can’t double the amount of hardware that we have to support this technology that you want to use.” So, for the longest time, it was a PHP app. We built our own framework. We built our own ORM. And the technical debt just kept building and building and building and building. And it was becoming unmaintainable to add a very simple feature just because of all this technical debt that we had. It was very difficult. So, as a secret side, back project we had, was a Twitpic clone written in Rails that we maintained and updated but wasn’t public for quite a long time and had this just private Git repository that we contributed to. As we built out new features in the PHP app, we’d go in and contribute in the Rails version of it. And it was just our way of looking at it and saying, “Wow. Look how much less code we have for this fundamentally simple app.” Picture sharing, there’s not a lot there. 
You show the picture, you upload the picture, and have an API. And there’s not too much functionality. So, replicating that in Rails is not that much code. It’s pretty simple. And so, having that to look between and saying, “Wow, we have this really complicated 100,000, 200,000-line PHP app. But look, we can do the same thing with 5,000 lines of Ruby or something like that.” And we had that for the longest time. And we started learning more about JRuby. We started learning more about different technologies for deploying Ruby that were faster than PHP that would actually use less hardware. A year or so before I left, so that would be, put us around 2012, we started really focusing on converting that PHP app to a Rails app, to start migrating and moving that functionality in production to Rails. And so, at the time it was actually a lot harder than if we had done it when we had 30,000 users because in 2012 timeframe, we had 50 million users. We had 4 billion photos uploaded. And it was a ton of data. And then you have to start thinking, well the actual migration to Rails isn’t that difficult. It’s actually that data migration. Rails has a very opinionated way that database records are stored and what their name and the format. So, simple things like, “Hey, in your PHP app you’re storing your timestamp in the wrong format and it’s not called created_at or updated_at. You had a column like date_uploaded and date_changed or something like that.” Those small issues when you have billions of records actually become huge issues. Changing that database field from one timestamp format to the Rails opinionated format is quite hard with that many records. But I still, I really wanted to get Rails out there. I really wanted us to use Rails. PHP had lost a lot of the more forward-thinking development in the community. Rails and the other technologies out there, that’s where all the real big thinking was happening. And I really wanted us to get on Rails. 
So, I said, “Well, how can we do this in a way that isn’t that big switch? It’s not doing everything at once? How can we do it piece by piece?” So, I came upon this thought that we need to start thinking of ourselves instead of as programmers as just API developers, API builders. And that got me on the train of thinking that well, actually we can swap out pieces of functionality with a Rails service, one piece at a time, without having to do the whole thing at once. So, the first thing we did was we swapped out our thumbnail API which was the most amount of our traffic. You go on Twitter for example, and Twitter’s constantly pulling from our thumbnail API to get different image sizes. And when you post an image, they just pull the image from the API so they can show it online. So, we’re talking 10,000 requests per second or something like that for our thumbnail API. But it’s read-only. It’s pretty low-touch. And so, that’s the first piece of the application that we replaced with Rails. And we used JRuby and TorqueBox for that. And then from there, it cascaded and we replaced each feature of the site with a domain-specific Rails service and went from there.
JAMES: That’s interesting. Did you end up with spots where you had some data that was stored in the old PHP setup and some data was stored in the new Rails setup? Or as you would migrate that section over, did you also migrate the data over?
STEVE: Yeah, so I thought the way that we solved that was actually pretty interesting. So, what we did was the first piece I guess to mention is that all of the services that we brought over initially were all read-only functionality. I figured that was the easiest way, if we had all writes going through PHP and we have different parts that are read-only like showing it, displaying an image on the website. That’s pretty much read-only. Or the thumbnail API, well that’s read-only, or the profile page, read-only. If we did that first, it would be easier than trying to manage two writes in two different locations at the same time. So, we did the read-only aspect. But we still had that fundamental clash between how our data is stored and how Active Record wants the data to be stored. And the way that we solved that was we actually made, we use MySQL for our database, so we used MySQL database views to shoehorn the old format of the data into the new format of the data. And it was pretty efficient. There was a little bit of overhead there. But we could take our existing data and we can transform it into the format that Active Record expects. And so, then everyone’s happy. We can have a really beautiful data model on Active Record without having to have all these oddly-named fields that didn’t really mesh up with how a Rails app, the best practices for building a Rails app. So, we can have those nice data fields but at the same time, really lose that complexity of having a write come into the Rails app or to the PHP app. So, that was our migration strategy. All our writes went into PHP and reads started coming out of Rails. And then eventually we did that big migration where we said, “Okay. We’ve got most of the really high-risk.” All the read-only stuff, that’s where all of the traffic happens, and that’s the high-risk part. Serving 10,000 requests per second, that sort of thing, that’s where the high-risk is. 
The number of people uploading images, well it’s, I don’t know, something like 50 images per second getting uploaded. That’s very few compared to 10,000 per second. So, once we got all that read-only functionality out, we were able to go and then do that big, nasty data migration where we actually convert those fields. And that was just a lot of MySQL DBA magic to convert the tables to the new format, had a little maintenance downtime while we did that. Then we were able to take that write functionality and do that in Rails.
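[Editor's sketch: one way to picture the read-only mapping described above, assuming it was done with a MySQL view that exposes legacy columns under the names Active Record expects. The table name `legacy_images`, the view name `photos`, and the assumption that the old timestamps were stored as Unix epoch integers are all hypothetical; the episode only mentions column names "like date_uploaded and date_changed or something like that." The Ruby below just builds the SQL statement and does not touch a database.]

```ruby
# Hypothetical mapping from Rails-style column names to legacy expressions.
# FROM_UNIXTIME is a real MySQL function; using it here assumes the legacy
# columns held epoch integers, which the episode does not confirm.
LEGACY_COLUMNS = {
  "id"         => "id",
  "user_id"    => "user_id",
  "created_at" => "FROM_UNIXTIME(date_uploaded)",
  "updated_at" => "FROM_UNIXTIME(date_changed)"
}.freeze

# Build a CREATE VIEW statement that presents the legacy table
# under Rails-friendly column names.
def create_view_sql(view_name, legacy_table, columns)
  select_list = columns.map do |rails_name, legacy_expr|
    # Columns that already match need no alias.
    rails_name == legacy_expr ? legacy_expr : "#{legacy_expr} AS #{rails_name}"
  end.join(", ")
  "CREATE VIEW #{view_name} AS SELECT #{select_list} FROM #{legacy_table}"
end

PHOTOS_VIEW_SQL = create_view_sql("photos", "legacy_images", LEGACY_COLUMNS)
puts PHOTOS_VIEW_SQL
```

[A read-only Rails model could then point at the view while PHP keeps writing to the underlying table, which matches the "writes into PHP, reads out of Rails" split Steve describes.]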
SARON: Yeah, I’d love to get a sense of the timeline a little bit. So, how long were you messing with this Twitpic clone in Rails secretively and then moving on to actually changing different parts of it, and then doing the whole data migration? What was that, how long did all that take?
STEVE: I guess there’s two different phases there. The part where we were building this app or had our Rails clone of it while we were developing the PHP app day-to-day, that was several years before we actually made the migration. So, it actually started that we were going to re-write the Twitpic app. We knew that we needed a rewrite. So, we initially started rewriting it in Rails because we said, “Okay, this is the time. We’re going to do it all in Rails and we’re just going to do a big swap.” And as we started getting into it and building the core functionality out in Rails, it really became obvious pretty fast that with the amount of data that we were dealing with, there was really no way that we were going to be able to easily do a clean swap. It just was too much and too short of a time to take that big PHP app and swap it out with a complete Rails app. So, once we made that decision, we’re like, “Wow. This is not possible. This really stinks. We really would like to do this in Rails, but it’s just not feasible with the amount of resources we have right now.” That got put on the backburner and then we did a rewrite in PHP. And that’s when we created our own ORM layer, when we created our own framework, and that sort of thing. And we rebuilt the app in PHP. But that’s what created that mentality shift where, “Wow. It stinks that we’re having to do this in PHP. It stinks that we’re having to build out this big app that we really want to use Rails for and we can’t.” And that’s where that thinking of, “Hey, let’s start thinking about Rails. Let’s start learning more and more about it. Let’s get the team spun up on Rails.” And so, at that point, that was probably around 2009, 2010 when we first did that big rewrite. And then two years later is actually when we were able to deploy our first Rails service. But that in between time, we hadn’t had that realization of we can do this piece by piece. 
We don’t have to do this all in one big chunk, replace the whole thing at one time with the big Rails app. And I think that’s what took us so long, was getting over that hump of thinking that we can do it piece by piece, that we can do it as services. It was a mentality shift that took us a good year, year and a half to get to before we said, “Oh yeah, we can just do our thumbnail API. And that’s read-only and it’s a small subset of traffic and it’s actually not that much functionality. And that’s quite easy to do.” So, once we had that mentality shift, I’d say it probably only took about a month before we deployed that first service. It was quite quick once you had that mental realization. But getting to that mental realization, at least for me and my team at the time, I think that was the hardest part, was realizing you can do it piece by piece in a safe way.
SARON: And when you talk about that, you’re talking about the programmer versus API builder mindset that you mentioned. Is that right?
STEVE: That’s right, yeah. I think as a programmer you think of everything as, “How do I code this up? How do I write this together in an application?” But when you take a step back and you’re like, “Really, I’m just building APIs,” and where those APIs come from, what repository in GitHub or what program or what service I have running on my machine, that doesn’t actually matter. What matters is that at the end of the day, these APIs, they share the incoming user requests and they share the backend data. But the actual program on your server that’s responding to them, that doesn’t matter so much. So, it could be in Rails. It could be in PHP. It could be in Go. It could be in Node. It doesn’t matter that much where it’s coming from as long as you’re able to hook it all together. And so, that was the mindset that really allowed me to have this way of thinking that we can build services instead of everything needs to be tightly coupled in one big application.
JAMES: And URLs on the web are just another API, right? [Chuckles]
STEVE: That’s exactly it, yeah. Everything is just an API. And so, as long as you can publish an interface that your users expect in some format, whether it’s a URL or an API, like a REST API, as long as you publish this, this is our contract of what we’re going to respond to, it doesn’t matter who serves it.
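[Editor's sketch: a minimal illustration of the "published contract, interchangeable backend" idea, assuming a front-end router that sends already-migrated URL prefixes to a new service and everything else to the legacy app. The path prefix and backend names are hypothetical and are not taken from Twitpic's actual setup.]

```ruby
# Hypothetical routing table: URL prefixes that have been migrated,
# mapped to the service that now answers them.
MIGRATED_PREFIXES = {
  "/thumbnails/" => :rails_thumbnail_service
}.freeze

# Everything not yet migrated still goes to the old application.
DEFAULT_BACKEND = :legacy_php_app

# Pick the backend for a request path. Callers only see the URL contract;
# which program answers is an internal detail.
def backend_for(path)
  _prefix, backend = MIGRATED_PREFIXES.find { |prefix, _| path.start_with?(prefix) }
  backend || DEFAULT_BACKEND
end

puts backend_for("/thumbnails/abc123")
puts backend_for("/upload")
```

[Migrating another feature then means adding one entry to the table, which is the piece-by-piece replacement described in the conversation.]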
JAMES: That’s a great point. So, JRuby and TorqueBox, that’s how you deployed the Rails version. Why don’t you tell us about TorqueBox and what that is?
STEVE: Yeah. So, JRuby is really interesting and I’ll get to TorqueBox in a second. But everyone knows JRuby is Ruby on the JVM. And I think it gets a really bad rap because JVM has such untypical characteristics, especially coming from the Ruby or VM world where there’s just consistent performance. It’s either slow or it’s fast, but it’s that one performance aspect. And you’re like, “Okay, it’s slow. I tested it. Done,” where JRuby is really interesting because the performance dynamics change so vastly depending on so many different factors, tuning settings or even just how long it’s been running for, how much of the code it’s optimized and then compiled and run through the Hotspot compiler. So, initially we figured and we knew that we just didn’t have the hardware resources or the financial resources to support running on MRI. So, we had deployed. We were doing, so the backstory here is we were deploying a separate application. We were working a separate startup at the time. We had Twitpic but we also had a little test startup that we were playing with. And we had done that in Rails and we had done that with MRI. And we got a lot of traffic on that startup. It was called [Helo]. It was a social sharing website. And using MRI for that with a lot of traffic, it was our first foray into actually deploying Rails into production. It was low-risk because it wasn’t tied to the Twitpic brand. And when we got a lot of traffic on there, we said, “Wow. This is taking up a lot of hardware resources for us to deploy Rails with MRI. We’re not seeing the typical number of requests per box that we’d expect with PHP.” It was just, it used way more CPU. At the end of the day, it used more CPU. And so, from that experience we knew that when we went to Twitpic and we wanted to eventually deploy this Rails app, that we would need fast, more hardware to support doing it in Rails and to do it in MRI. 
And so, that always was a point for us, a sticking point, that wow, it’s really awesome to use this new technology but we really don’t have the money to dump into buying more hardware. And then I started learning more and more about JRuby. And okay, it has better performance characteristics and wanted to know, okay, so what’s the best way to deploy Rails on JRuby? And there’s a bunch of different servers you can use. With JRuby, you have this hybrid Ruby/Java library effect where you can deploy Rails on traditional Java servers like Tomcat and GlassFish I think is another one. You can use these traditional Java services. But I wasn’t too familiar with them. I had used Tomcat years and years and years ago and I didn’t have a good experience. So I’m like, well I really don’t want to deal with XML hell. I don’t want to be writing all these XML configuration files. That’s so un-Ruby-like. I really want to find something that is Ruby-ish. And you can look. And okay, there’s Puma, but Puma’s very lightweight and it was still very experimental at the time when we were doing this. And so, I was looking around and I stumbled upon this piece of software called TorqueBox. TorqueBox is a container, like Unicorn or Puma, for deploying a Rails app. But what it actually is, more than just a container for deploying a Rails app is it takes all of these amazing Java technologies and Java servers and Java caching servers and Java queue background worker services and it pulls them all into this central box and puts a really Ruby-like interface on top of it. So, you get all of this really amazing Java web serving technology that has been in development for years and years and years and years, probably 10, 15 years of development of really strong work and scaling lessons and optimization. And putting just a thin Ruby layer on top of it to coordinate it all together and give you this really nice Ruby-like deployment experience, I was sold. It did everything right. It did caching for you. 
It did serving the web requests. It also handled running background jobs. So, if you wanted to do something like resize an image in the background, you could easily do that, whereas traditionally, you’d have to get Resque or something like that set up to do the background jobs. TorqueBox has all that built in. So, it was a huge selling point. The fact that it was built on this awesome Java technology, another huge selling point. So, that’s what we used and that’s what we standardized on to deploy our Rails services when we eventually did that thumbnail service, that first service that we deployed. We used TorqueBox in production for that. And it worked fantastically. It was quite amazing to use.
JAMES: I would assume that Twitpic, you were talking about the ease of background functionality and stuff and I would assume a site like Twitpic has a fair bit of that for probably resizing images and stuff like that, right?
STEVE: That’s right, yeah. So, when we were using, even using PHP, we were using PHP Resque, which is the PHP port of the famous Resque backgrounding library. And so, we had always really had this fundamental shift of, and this happened years before the Rails app, but we were at, so at one point we were doing everything synchronously within the web request. And it just wasn’t scalable. When Twitter would go down, it would also take us down because we were trying to post comments and images to Twitter inline during the web request. And when Twitter was down, which back in 2009, 2010, it went down a lot. So, we were doing that all inline and inside the web request. And so, when Twitter was down, our web request would basically stall out waiting for the timeout to hit, waiting for Twitter to respond. And it would just create this huge domino effect of traffic coming in but we’re not putting anything out. And so, there’s all these people trying to hit you but all your servers are locked up trying to connect to Twitter, basically. So early on, we had a need for backgrounding jobs. So, we did things like when you posted a comment, the comment that we were going to post to Twitter, that was the background job. Or when you uploaded an image, that created actually several background jobs. A job to resize each size of the image we want to create. There’s a thumbnail and a mini-thumbnail and an iPhone optimized size and a retina size, and all these different sizes, each of those gets a background worker. And also, things like when you upload an image it actually includes a lot of personal data about you. It includes your GPS location if you take it with an iPhone. And we don’t want to share that on the web with everyone that is viewing your image. So, we need to have another background process that goes and pulls that data out and sanitizes the image so your location and security, or your location and everything is secure. We don’t want that data out there. 
But all of these mundane tasks, we had gotten really good at putting into background jobs. Another background job is comment spam. So, when you post a comment, we actually tick off a background job and we say, okay, is this comment spam? Are there a lot of links in it? All that kind of analysis was done in the background. Trying to do the least amount of work we could in a single web request so we can serve web requests really fast and we can do all this busy work in the background somewhere else on a different server farm that’s not sharing resources with generating your website and giving you that feedback. We want to get that to you as soon as possible. So, what TorqueBox offered was that using Java threads, because now with JRuby you actually get real life Java threads, real life actual operating system threads, not just green threads, no global interpreter lock, anything like that, you get true threads. So, TorqueBox had a really nice library that was very similar to Resque where you’d have some function that you wanted to call and you would call the model.the_function.async and bam, that would run in a separate thread in the background. So, you can have in your code, say you have a model and in that model you have, is comment spam or something like that, you can just run that asynchronously in the background by putting async on the end of the function name. And bam, TorqueBox grabs that, pulls it in a thread and runs that code there. So, it was really, really convenient because we had so much background work, because we had so much stuff we wanted to do asynchronously. Having it baked into the framework or the model that we were using, the deployment strategy, without having to deploy Resque or another queuing service and set all that up, was super, super convenient, especially for programmer productivity. When you take the more traditional route, okay, I have a new background job I want to write. 
You have to go and configure it and set up the queue channel for it and then set up a new background worker class and make sure it all gets distributed correctly and all this. There’s so many logistics that go into it, whereas when you use it and it’s baked right into your server, it’s so easy because you just pop the async part on the end of the function call and now it runs in a background thread. So, for just programmer productivity, it was super, super easy to continue doing more and more work in the background and not using the excuse that it’s a pain in the butt to set up to do it.
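[Editor's sketch: the general pattern of handing work to background threads, written with plain Ruby `Thread` and `Queue`. This is not TorqueBox's API, whose exact calls are not shown in the episode; the `BackgroundRunner` class, its `async` method, and the link-counting `spam?` heuristic are hypothetical stand-ins for the comment-spam example Steve mentions.]

```ruby
# A tiny in-process worker pool: jobs are blocks pushed onto a queue
# and executed by a fixed number of threads.
class BackgroundRunner
  def initialize(workers: 2)
    @queue = Queue.new
    @threads = Array.new(workers) do
      Thread.new do
        # Each worker pulls jobs until it sees the :stop marker.
        while (job = @queue.pop) != :stop
          job.call
        end
      end
    end
  end

  # Enqueue a block and return immediately, so the caller is not blocked.
  def async(&block)
    @queue << block
    nil
  end

  # Send one stop marker per worker, then wait for queued jobs to finish.
  def shutdown
    @threads.size.times { @queue << :stop }
    @threads.each(&:join)
  end
end

# Hypothetical heuristic: treat a comment with more than two links as spam.
def spam?(comment)
  comment.scan(%r{https?://}).size > 2
end

results = Queue.new
runner = BackgroundRunner.new(workers: 2)
comments = ["nice photo", "http://a.example http://b.example http://c.example"]
comments.each do |comment|
  runner.async { results << [comment, spam?(comment)] }
end
runner.shutdown

# Collect the finished checks into a hash of comment => spam flag.
SPAM_FLAGS = Array.new(results.size) { results.pop }.to_h
SPAM_FLAGS.each { |comment, flagged| puts "#{flagged ? 'spam' : 'ok  '} #{comment}" }
```

[On MRI these threads still share the global interpreter lock for Ruby code; on JRuby they map to real JVM threads, which is the property Steve highlights.]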
JAMES: I think that’s a really good point you’re making there, that Rails provides the web application UI database connectivity experience. But to any realistic app, you’re going to have other things outside of that, like some kind of background processing is common and you gave a ton of great examples. Or the ability to do periodic jobs to clean things out and stuff like that. So, it sounds to me like one of the big wins you feel like with TorqueBox was that that had already been thought out and you had these pretty Ruby interfaces that you could just use and get to those mechanisms.
STEVE: That’s right, yeah. And also the fact that there are so many ways to do queuing out there. There’s a bajillion different queuing libraries, queuing servers. It seems like every tech company has their own queuing server that they’re pushing. And that’s great. But a lot of those are so immature and so untested because you had just a handful of people actually using them. The beautiful thing about TorqueBox is that it’s building on top of all these existing Java technologies that a lot of big enterprise, Fortune 500 companies are using. And so, because of that you get this really great queuing technology that has been thought out for so long that there’s really good monitoring around. With TorqueBox, you can use Java MX, JMX, something like that, monitoring hooks to hook right in and get all that visibility and data that you want from your queuing system very traditionally. It has all that stuff baked in already. And so, you feel I think more, I guess you feel more secure in knowing that, okay my jobs are getting processed and I know which ones are failing and I know if there’s a problem pretty immediately. So, the fact that it’s just baked on such mature technology for me was a huge selling point, versus using something like, I don’t know. Resque is great. I love Resque and it’s come a long way. And it’s getting pretty mature now. But back a couple of years when we were using it, it was, GitHub wrote a blog post, Here is Resque and the world started using it. But was it really that mature? Were there edge cases in there? Yes, there were some problems. And there are problems with any immature piece of software you use. So, the fact that it was just baked in with this mature Java goodness but had a pretty layer on top was so key. It was really a gift, I think.
JAMES: Yeah, that’s a good point. We still see a slight hiccup sometimes with using something like Resque. Redis just seems to have, it just screams and screams and screams, but then every now and then we see Redis pause for four seconds while it sorts something out.
JAMES: And so, that backs traffic up. And then you have to be able to prepare to handle those kinds of short delays and stuff. It’s interesting.
STEVE: Yeah. And it’s quite a bad experience for your users as well. No one wants to load a page and then have it wait four seconds after you upload an image. If it takes four, five, ten seconds, whatever, for your image to get processed, that’s not a good experience. You’re confused. Why is my image not there? Why is it showing broken? And then, oh wait, it appeared because in reality the background job, Redis was paused and now the background job processed and your image got resized and uploaded. But that short span is a really bad user experience. So, I think it’s quite crucial to have a really strong, if you’re using background jobs, to have a really strong background job setup. It’s not an area where you want things to fail silently. Typically with backgrounding systems, there’s not a lot of visibility. Luckily, Resque has a really nice web UI where you can get some good visibility into what jobs have failed and how many jobs are backed up or processing and that sort of thing. But a lot of the queuing systems just don’t really have that. Visibility is an afterthought. It’s not really baked into the system. And it’s important that if you’re using a backgrounding library or backgrounding system, that you do have that visibility, because there’s a lot of things that can go wrong. There can be a huge influx of new jobs and there’s a 30-minute wait for a job that gets put on top of the queue to actually make it and get processed. And not having visibility into that is, I think a big mistake. And it’s something that can bite you in the butt if you’re not, if you don’t have monitoring and visibility set up into that.
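[Editor's sketch: a minimal example of the kind of queue visibility discussed above, tracking how many jobs are waiting and how long the oldest one has been waiting. The `MonitoredQueue` class and its injectable clock are hypothetical and are not part of Resque or TorqueBox; a real system would export these numbers to a monitoring tool rather than print them.]

```ruby
# An in-memory job queue that records when each job was enqueued,
# so backlog depth and oldest wait time can be reported.
class MonitoredQueue
  # The clock is injectable so the example below can simulate time passing.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @jobs = []
    @lock = Mutex.new
  end

  def push(job)
    @lock.synchronize { @jobs << [@clock.call, job] }
  end

  # Return the oldest job, or nil when the queue is empty.
  def pop
    @lock.synchronize do
      entry = @jobs.shift
      entry && entry[1]
    end
  end

  # Report backlog depth and how long the oldest job has waited, in seconds.
  def stats
    @lock.synchronize do
      oldest = @jobs.first
      { depth: @jobs.size, oldest_wait: oldest ? @clock.call - oldest[0] : 0.0 }
    end
  end
end

# Simulated clock: the first job is enqueued at t=0, and 30 minutes pass
# before the second arrives with nothing having been processed.
now = 0.0
queue = MonitoredQueue.new(clock: -> { now })
queue.push(:resize_thumbnail)
now = 1800.0
queue.push(:check_comment_spam)

QUEUE_STATS = queue.stats
puts "depth=#{QUEUE_STATS[:depth]} oldest_wait=#{QUEUE_STATS[:oldest_wait]}s"
```

[Alerting on `oldest_wait` crossing a threshold is one simple way to catch the silent 30-minute backlog described in the conversation.]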
JAMES: That’s a good point. So, shifting the subject a little, you took this app that had all these performance characteristic problems and slowly over time, you evolved it into something that was more efficient. Maybe you can tell us what were the big wins? What were the top three things you did that really turned things around?
STEVE: Yeah. That’s a good question. So, this touches back on what I had mentioned a little earlier, is that JRuby gets a really bad rap. There’s a website. I think it’s called Is JRuby Fast Yet? Or maybe it’s just Is Ruby Fast Yet? And it compares different Ruby VMs. And JRuby, on that benchmark, in a lot of benchmarks I’ve seen, consistently comes in last place. And so, your initial reaction is, wow, JRuby is slow. Why would anyone use that? Just use MRI. That’s what everyone uses. Blah, blah, blah. But what I actually noticed is that it takes a long, long time for JRuby, probably more JRuby/Rails combination than just JRuby in general, but it takes a long, long time for JRuby and Rails to actually become fast. There’s a lot of stuff that the JVM is doing. It’s doing Hotspot compilation. It’s pulling in, dealing with auto-loading classes. And then, okay now the class is auto-loaded. You have that in the cache. So, we would see startup times where you’d start up your TorqueBox server and start hitting it with traffic and the response time would be abysmal for the first 30 seconds, 45 seconds. It would be terrible. We’re talking, we’re sending it 10,000 requests per second. So, it’s not like it’s just getting a couple of hits here and there. It’s getting millions and millions of requests over that time span. And it’s just responding very, very slow, can take up to a second per page to render. But then over time, you start to see if you turn on the debugging logs you see that, okay, it’s doing all this stuff. You can have it print out every time it Hotspot compiles a class. And you can see, even five minutes into the process, it’s still finding pieces of code that it can optimize or that it can compile into machine code. And so, it takes a long time with a lot of traffic coming in for the JVM to even get warmed up. We’re talking five minutes in we start to see actual reasonable response times.
And actually, if you let it run for long enough, the response times get very low. We go from that one second per page, which is unacceptable, down to 20, 30 milliseconds per page. We’re talking about the same pages that were generating, just longer in the JVM startup process, further down that path where more and more and more and more of that Rails code, the code that Rails framework and your own code gets compiled and is more optimized. So, because of that JRuby gets a terrible rap because a lot of the benchmarks are things that happen relatively quickly. And so, you’re paying that huge load up penalty without ever getting to the part where the JVM is quite fast, where JRuby on the JVM with Rails is actually faster than MRI, at least in my experience, that you can shoot out pages at scale anyway. If you’re talking about, okay, I want to generate one page, MRI is going to be way faster because you don’t have to pay that penalty. But when we’re talking about generating millions and millions and millions of pages over the span of days or weeks, I think that the total aggregate time of time to generate those pages will be less on JRuby and it will take longer using a more traditional Ruby VM like MRI. So, because of that, it was hard at first to get over those performance characteristics and deal with things a little differently. You really have a huge penalty to pay on when you’re starting up or restarting TorqueBox. But there are still a lot of things you can optimize. So, if you were deploying with Unicorn, you would optimize the number of Unicorn processes you have. With TorqueBox, you optimize the number of threads you have. What’s the max amount of threads? And so, that was one big win, was playing around with, okay we have 32 cores on this app server. How many threads can we run with TorqueBox? And I think the optimal number we found was 3x the number of cores. But that’s application-dependent.
How many times is your application blocking while it hits the database or while it hits the file system? That sort of thing. So, we had to really play around with that. How many, what’s the max number of threads that we let TorqueBox have? And so, for a 32-core server, I think we settled on something like 96 or 100. Then the JVM. You can buy a book on tuning the JVM. And all of that stuff applies to JRuby. It’s not really any different than just running a Java app. So, you need to tune things like the garbage collector. And there are 20 different strategies for the JVM garbage collector and how you want it to run. You can turn on the concurrent garbage collector. There are all sorts of optimizations there. And that, you can do some research. And again, it depends on your application. How many dirty objects is your application generating? If you have a loop where you’re just creating a bunch of objects, millions of objects and then throwing them all away, well your garbage collector is going to have to run more often. That’s something you’d face with MRI as well. But there’s less tuning available there, where with the JVM, you can really hire someone that’s a JVM garbage collector tuning expert. There’s a lot to learn there. And it’s a huge, huge steep learning curve. I wouldn’t even, having played with it for so long with that, I would still call myself a novice because there are just so many different settings that have so many different fundamental impacts to the way it runs. And the easiest way to learn is just to test and break stuff and figure it out that way. Those were the two areas where on the Ops-y side we got more performance. The other side is just Rails and Ruby in general where sometimes the magic or the expressiveness of Ruby and Ruby on Rails can bite you on the butt where you do something that’s really beautiful but don’t know.
Or don’t really consider when you write this beautiful code what performance impacts that has or just the characteristics of this code that you’re writing and the way maybe it’s hitting Active Record in some kind of loop and you’re not doing it optimally, because the code just is so natural to write that way. But in reality you need to go back and de-normalize your Ruby, so to speak, and make it a little bit less elegant at the tradeoff of maybe only hitting the database once or doing something with fewer iterations or fewer loops that is less elegant. But at the same time, gives you better performance. So, on the Ruby side, we had to take a step back and look at our code and say, okay, what parts of this code can we clean up, can we optimize, can we have less function jumps where we jump between 20 different functions? How can we clean that up and optimize it a little bit so that we get better performance out of that? However, in that respect, the thing that I preach, when I talk about scaling whether it’s scaling Rails or scaling PHP which I talk about quite a bit, it’s fundamentally the same idea that, and you guys might disagree with me here, but your code doesn’t matter. It doesn’t matter at all. In reality, when you scale, 99% of the time, you’re going to get way more benefits out of scaling your infrastructure, out of doing backgrounding jobs, out of picking a better Ruby VM, picking a better Ruby server to use, whether it’s Unicorn, Puma, Passenger, et cetera, tuning your database. All of that stuff is going to have usually, unless you’re doing something horribly abysmal in your code, it’s going to have way bigger payoffs than spending all day trying to optimize your code to do something that’s going to be one or two milliseconds faster. Tuning all of those other subsystems at the ops-y layer has way bigger payoffs. So, in that respect we tuned TorqueBox.
But we didn’t really concentrate too much on scaling, or optimizing our code, because scaling your code in my mind just doesn’t matter that much.
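The thread-count tuning Steve describes in this answer (roughly 3x the core count, so about 96 threads on a 32-core server, depending on how often the app blocks) can be written down as a starting-point calculation. The sketch below is an assumption-laden illustration: the method names are hypothetical, and the second formula is a common rule of thumb relating threads to blocking time, not something stated in the episode.

```ruby
require "etc"

# Starting point Steve recalls: cores times a multiplier (3x in his case).
# Etc.nprocessors is only a default; pass the real core count when known.
def threads_from_multiplier(cores: Etc.nprocessors, multiplier: 3)
  cores * multiplier
end

# Rule-of-thumb variant (an assumption, not from the episode): the more time a
# request spends blocked on the database or file system relative to time on
# the CPU, the more threads each core can keep busy.
def suggested_threads(cores:, wait_ms:, compute_ms:)
  (cores * (1 + wait_ms.to_f / compute_ms)).round
end

threads_from_multiplier(cores: 32)                        # => 96
suggested_threads(cores: 32, wait_ms: 20, compute_ms: 10) # => 96
```

Either number is only a first guess to load-test against; as Steve says, the right value is application-dependent.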
JAMES: So, that’s an interesting claim. I’m trying to decide if I do agree with it or not. Let me talk it out a bit.
JAMES: So, my immediate response to that is well, if you have something like an O(n^2) algorithm or something, versus switching to just an O(log n) algorithm, you’re going to have a huge performance characteristic difference there that no amount of hardware whatever is going to fix. Or, it may fix it for certain sizes of course, but once you get to that point where you have a big enough exponent involved, you can’t fix that problem with hardware. So, I don’t think I would totally downplay what can be done in the software. I do think I understand what you were saying. And I think maybe the way I would say it is if you’re doing micro-benchmark kind of stuff, like, oh well it turns out making an array and using each is 50 milliseconds faster than just using inject or something like that. If you’re doing stuff like that, then you’re on such a small scale that the gains you’d make there versus the gains you’d make putting it on better hardware or better tuning the environment probably just can’t be significant. Plus things like programmer time is pretty expensive often, even compared to hardware. To have time for the programmer to go in there and re-architect it to remove a few calls or whatever can be expensive. So you have to weigh how much payout will that be versus just throwing one more processor core at this thing and that will just set us for a year or so with our current growth rate. So, I think there’s definitely a tradeoff to be made there and considered. But I think there’s definitely cases. N+1 queries are pretty common in Rails. So, if you have a query where you fetch the list of items and then you fetch each item individually, often you can just go in there and with a correct use of Rails includes or whatever, probably fetch them all in one query instead of ten separate queries. And I think that has a substantial difference and loosens your hardware requirements. Does that make sense?
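The difference James describes between fetching each item individually and fetching them all at once can be seen by counting queries. The sketch below uses a hypothetical `FakeTable` stand-in rather than a real database, so it runs without Rails; in a Rails app the batched shape is what `includes` on an Active Record relation is meant to give you.

```ruby
# Hypothetical stand-in for a database table that counts how many
# queries it has served. It is not Active Record.
class FakeTable
  attr_reader :query_count

  def initialize(rows)
    @rows = rows # { id => record }
    @query_count = 0
  end

  # One query per record: the shape that produces N+1 behaviour in a loop.
  def find(id)
    @query_count += 1
    @rows.fetch(id)
  end

  # One query for many records: the shape eager loading aims for.
  def where_ids(ids)
    @query_count += 1
    @rows.values_at(*ids)
  end
end

photos = FakeTable.new((1..10).to_h { |id| [id, "photo-#{id}"] })
ids = (1..10).to_a

# Fetching each item individually costs ten queries.
one_by_one = ids.map { |id| photos.find(id) }
n_plus_one_queries = photos.query_count

# Fetching the same items in a batch costs one query.
batched = photos.where_ids(ids)
batched_queries = photos.query_count - n_plus_one_queries
```

Both approaches return the same records; only the number of round trips to the database differs.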
STEVE: Yeah. Yeah, that’s right. I totally agree with you. And I think that you touched on a couple of good points. And the first thing, N+1 is weird because in Rails, that is just super, super, super common. I think it’s just the magic of Active Record makes you forget about what the actual database calls you’re making are, whereas if you use, if you were writing that SQL explicitly in Ruby without an ORM or if you’re using a different language where the ORMs aren’t so elegant, you don’t see that as often just because it’s more obvious what you’re doing. So, N+1 is weird because even the best programmers I think sometimes make that mistake where they’re thinking in Ruby and not in actual database or SQL terms. And the other part you touched on, I agree but I also disagree as far as the algorithm performance whether it’s log n or n^2 or whatnot. And I agree at the core terms, like yes, you want to optimize the algorithms. You want to pick a better algorithm, because you’re right. No matter how much hardware you throw at it, at a big enough data size, there are going to be substantial differences there. That being said, I think that we have to, as web programmers, think about it in the sense that we’re not really dealing with that much data. In a web request, there’s really not that much data where you can use the worst performing n^4 algorithm and probably, we have such a small amount of data per web request that we’re working with that it really doesn’t matter. Now granted, there are definitely cases where that’s going to make a huge difference. You have some API call that ingests a huge bunch of data that you need to process. And certainly, that is a case where it makes a lot of difference. But in general, a lot of web requests are just very tiny data exchanges. And the actual how that data is processed, at least in that single request context and not aggregating it over multiple requests, it doesn’t matter too much.
So, I think that there’s almost a disconnect between computer science thinking and web application development thinking, because we’re not dealing with a lot of the same problems that we think about when we think about computer science. We’re dealing with such a tiny piece of data at a time that generally I don’t see it mattering. But I do agree with your fundamental point, which is maybe I should take a step back and say, okay, you don’t scale your code, you scale your algorithms and you scale your ops. And that’s where the biggest payoffs are. But you see it a lot with PHP actually, where people argue. Like you were talking about micro-optimizations and you see all the time, people in PHP forums would say, “Well you know, single quotes are actually faster than double quotes, so I use single quotes everywhere because it’s one millisecond faster, if I do it a million times in an iteration.” People argue about these things. And so, when I look at it, I’m like, this stuff doesn’t matter. Most of that payoff is so, so, so, so, so tiny. That’s not how people scale their code. If you go to Facebook or some big app, they don’t have programmers there that are thinking about, well maybe I should use single quotes instead of double quotes and we should do a search and replace across the whole codebase. That stuff just is such a small micro-optimization that it doesn’t matter, because like you said, programmer time is so expensive. And hardware comparatively is so cheap. You can get, if you think about it in just terms of financial cost, a pretty fast server with 32 cores, you’re looking at if you rent it, which is cheaper than buying it. If you just rent that server, you’re talking four or five hundred bucks a month.
It’s pretty easy if you have a team of programmers that hit four or five hundred dollars in a very short span of time where you can just get more servers for cheaper than it’s going to cost you to put all these programmers on that problem to do these micro-optimizations. But I think people get confused when they think about scaling. And they think it’s about finding all these little tiny tips and tricks and these micro-optimizations, when really it’s that grander picture of how everything connects together and optimizing those connections instead of optimizing these really, really tiny non-important details.
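Steve's cost argument comes down to a break-even calculation: how many hours of engineering time cost as much as renting one more server for a month. A rough sketch follows; the server figure is the upper end of the "four or five hundred bucks a month" he quotes, while the hourly engineering rate is a made-up number used purely for illustration.

```ruby
# Hours of engineering time that cost the same as one month of a rented server.
def break_even_hours(server_cost_per_month:, engineer_cost_per_hour:)
  server_cost_per_month.to_f / engineer_cost_per_hour
end

# 500 comes from the episode; 100 per hour is a hypothetical rate,
# not a figure anyone on the show gave.
break_even_hours(server_cost_per_month: 500, engineer_cost_per_hour: 100) # => 5.0
```

Under those assumed numbers, a micro-optimization that takes more than about five hours of engineering time has to save more than a server-month to pay for itself.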
JAMES: Yeah, I think you have a good point there, that if you ever get to the point where you need to switch to single quotes because it’s one millisecond faster, you’ve already lost.
JAMES: You’re already on the wrong side, right? And that kind of stuff just doesn’t really matter. And I think I take your meaning on the differences in web apps. I saw an app recently that had this really complicated code and it was doing this really impressive caching trick from the database using a global variable. And it was really complicated. And I had to go in and change it for some reason. I’m like, “Wow. Why is all this complexity here?” and it was to optimize how often this list was being fetched from the database. And so, I was wondering, how big was this list? Is this a massive chunk of data we’re pulling out of the database or something like that? And so, I actually went and looked and queried the production table. How many entries are in here? And it was 40.
JAMES: It’s like, with 40 entries in the database, you can do about the worst thing you want and you can’t make performance go south. So, I was like, well that may be a little overkill for this case. And I definitely take your point of you can spend a ton of effort optimizing this thing that just doesn’t matter.
STEVE: That’s right, yeah.
JAMES: And there’s no point in that. But we tend to have, I think most apps have their set of tables, their few tables, where it is important, where you have enough entries in there. In Twitpic, I assume you have several photo entries or whatever that entry is that represents someone’s uploaded photo. And at that point, you probably don’t want to do in Rails .all and fetch all of them.
STEVE: Right, that’s right.
JAMES: That’s probably going to be bad. So yeah, I take your meaning. And I think this is actually part of what you’re talking about. I looked at that Is Ruby Fast site that you mentioned earlier and it was pretty funny because JRuby just does terrible in every one of those graphs. But those graphs are all these small, little benchmarks measuring this tiny thing that matters not as much, whereas JRuby’s playing for the long game. It’s playing for what can I do once I’ve figured everything out about this app? How screaming fast can we make it go after we have all the right information? And that’s a very different characteristic that those benchmarks aren’t really designed to measure, I think.
STEVE: That’s right, yeah. And not to knock whoever makes that site, but I do think it’s funny. If it’s still like it was when I last looked, at the bottom it says, “These benchmarks were run on my MacBook Pro. I tried to open it up so the fan, it doesn’t overheat or whatever.” And I’m like, wow. Even that alone is, when you’re benchmarking something to make this bold claim like “Is Ruby fast? How fast is JRuby?” I would like to see at least some, a little bit more of a standardized using a server with a lot of cores. A MacBook Pro at best has four cores. I would like to see a little bit more of a professional benchmark done there versus, “Hey, I’m making these bold claims on my two-year-old MacBook Pro that I may or may not be playing a game on at the time when the benchmarks are taken.”
JAMES: [Chuckles] Yeah, good point.
AVDI: I feel like with regard to optimization, the classic advice still applies. Shut up and use the profiler.
STEVE: [Laughs] I’d take that advice. I like that a lot.
AVDI: Performance issues can come from a variety of sources. But really, nothing that you think you know matters, because the only thing that matters is what the profiler says.
JAMES: That’s a good point. And I’m terrible about remembering to start at the profiler. So, I’ll guess which one’s eating all the time. And I’ll go optimize the heck out of that thing and shave three milliseconds off of what’s happening.
JAMES: And then go look in the profiler and it’s, oh yeah, it was all eaten up in this other thing that I didn’t even touch.
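The "measure before you optimize" advice can be sketched with Ruby's standard `Benchmark` module. This is plain wall-clock timing rather than a real profiler, and the step names and workloads below are hypothetical; the point is only that the step you guessed was slow is often not the one eating the time.

```ruby
require "benchmark"

# Times each named step once and returns [name, seconds] pairs, slowest first.
def slowest_steps(steps)
  steps.map { |name, work| [name, Benchmark.realtime(&work)] }
       .sort_by { |_name, seconds| -seconds }
end

steps = {
  # The code someone assumed was the bottleneck: a small in-memory loop.
  "guessed hot spot" => -> { 1_000.times { |i| i * 2 } },
  # Stand-in for a slow external call such as a database or file system hit.
  "actual hot spot"  => -> { sleep 0.05 }
}

slowest_steps(steps).each do |name, seconds|
  puts format("%-18s %.1f ms", name, seconds * 1000)
end
```

A real profiler gives a far more detailed picture; this only shows the habit of looking at measurements before touching code.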
AVDI: And in that vein, I’m curious. With JRuby and TorqueBox did you find good profiler type tools for hunting down bottlenecks?
STEVE: Yes. So actually, the way that we handled that, we took a couple of pages out of the way we were doing it with PHP. And we took more of a high-level look at it. And so, we really used a lot of statsd. We had statsd benchmarking and timings and just incrementers everywhere in our code. It was just like we really used a lot of statsd data aggregation to get those more high level views. So we can see, within statsd, okay this request in this particular function is taking X amount of time. And we can see that and graph that and see it over time and see how different deploys or changes in our code and whatnot influenced all of that. And so, because we had such granular statistics in that sense, we really didn’t have to do too much profiling on the JRuby side. Not to mention, my gut tells me it’s quite hard just because you pay all that Java startup time penalty. How do you know if something’s slow because it’s not quite optimized yet or if it’s because it’s actually truly slow. So, seeing that bigger picture over a larger timeframe really was our go-to tool for tracking down performance issues and optimizing things and just having that grand sense of how things change over time. That being said, I would have loved to use something like, for that same case, I would love to use something like New Relic. But at our scale, it was just too expensive to use. But really, for big scale applications, I think you get way more benefit out of profiling that aggregate over profiling in individual requests.
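The timings and incrementers Steve mentions map onto statsd's plain-text wire format, where a counter is sent as `name:1|c` and a timing as `name:<milliseconds>|ms` over UDP. The client below is a minimal sketch of that idea using only the standard library; `TinyStatsd` is a hypothetical class, not the API of any statsd gem, and it assumes a statsd daemon listening on the conventional port 8125.

```ruby
require "socket"

# Hypothetical minimal statsd-style client, for illustration only.
class TinyStatsd
  # The socket is injectable so the output can be inspected without a network.
  def initialize(host: "127.0.0.1", port: 8125, socket: UDPSocket.new)
    @host = host
    @port = port
    @socket = socket
  end

  # Counter: bump a named metric by one.
  def increment(name)
    deliver("#{name}:1|c")
  end

  # Timer: run the block, report how long it took in milliseconds,
  # and return the block's own result.
  def time(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
    deliver("#{name}:#{elapsed_ms}|ms")
    result
  end

  private

  # Metrics are fire-and-forget: a failed send should never break a request.
  def deliver(payload)
    @socket.send(payload, 0, @host, @port)
  rescue SystemCallError
    nil
  end
end
```

Wrapping a request handler in `stats.time("photo.show") { ... }` (a hypothetical metric name) is the kind of call that, graphed over time, shows how deploys change response times in aggregate.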
AVDI: That makes a lot of sense.
JAMES: Alright. Is there anything else we need to cover about Twitpic and optimizations and stuff, scaling?
STEVE: Oh, man. We got a lot of scaling talk in this podcast. I think that people listening to this are just going to… their brains are going to be melting by the end just because we’ve covered everything from PHP to JRuby to TorqueBox to backgrounding. So, I think we’ve covered quite a bit.
JAMES: Alright. Well, let’s get into some picks. Let’s do some picks. Avdi, you want to start us off?
JAMES: [Chuckles] Awesome. Saron?
SARON: Sure. I have two. One is I just finished reading the book ‘#GIRLBOSS’ by Sophia Amoruso who started Nasty Gal, just an online vintage clothing store. And I love this book so much. And if you’re not a boss and you’re not a girl, you should still read it. When I went to look at it, the first page had a picture or an illustration of I assume her smoking a cigarette. And it says, “The only thing I smoke is my competition.” And I was just like, mic drop. I have to get this book because it’s amazing. And it’s just as bad-ass as it sounds. So, that’s my first pick. My second pick is something just more fun. It’s a website that my friend Dan actually pointed me to. It’s called 808cube.com. And it is this, it looks like a Rubik’s cube and you can do drumbeats on it and music and it lights up. And it’s really interactive. And it is just a huge distraction and an awesome waste of time. So, I highly recommend that you check that out.
JAMES: Okay. I’ve got a couple of picks this week. Actually, I didn’t have hardly any, but a buddy of mine has shown me several cool things lately. So, thanks Paul for these picks. The first one in the spirit of the discussion we just had about scaling, it turns out if all you’re going to do is pull some data out of Postgres and JSONify it and shove it back down in some kind of API, you can actually skip a step by just having Postgres pull out what you want as JSON. And it can do that. And it’s amazing. It’s super cool. So, I’ll link to the article that shows how to do that. Neat stuff. And then for fun, there’s this YouTube channel called ScottBradleeLovesYa. And just these super cool songs, like vintage Call Me Maybe or a mash-up called Bohemian Rhapsody in Blue, so a mash-up between Queen and Gershwin. And that’s every bit as cool as it sounds. So yeah, total time-waster but I promise, you go to this site and you’ll be listening to a whole bunch of the songs on here. You definitely got to check it out. So, those are my picks. Steve, what have you got for us?
STEVE: Hey, yeah. So, I have a couple I can share on the topic of scaling. One is a configuration management service that I just saw recently. And I think it’s made by the guy that makes Vagrant. And it’s called Consul, Consul.io. And it’s this really neat way where you can dynamically configure your MySQL server IP address and it will turn that into a fully qualified DNS hostname that you can use in your application. But say, you have a MySQL failover and you need to change boxes, Consul.io lets you dynamically reset the IP of that MySQL server and then all the applications that are using the DNS hostname provided by Consul get that change without you having to change it in your Ruby configuration and redeploy your application. It’s this really nice config service. So, Consul.io. That looks really cool. The other thing is we talked about backgrounding jobs. And there’s this really awesome asynchronous work queue by the guys that make Bit.ly. It’s called NSQ. And it’s at NSQ.io. And NSQ is this really amazing distributed background work queue that you install on a bunch of servers. It’s really, really low overhead because it’s written in Go. And it has great visibility. It has this really awesome admin UI. And you put work into that queue and it can come back and post it to your web apps. And you can just write API calls, like an API call to resize an image, and you can put that work into NSQ and it will come back and post it to your API and do that work asynchronously via HTTP. It’s really, really, really slick. So, check that out. And the last one I’ll give, the last pick is not technical but it’s my favorite book of all time. I’ve always struggled with procrastination and resistance. And there’s this really awesome really, really, really short book called ‘The War of Art’, so a play on the ‘The Art of War’. ‘The War of Art’ by Steven Pressfield. And it’s this amazing book on procrastination and resistance. It’s a short read. It’s 80 pages.
And the chapters are only two or three pages long. And it talks about this guy who is now a very, very famous author and his struggle with writing for the past 20 years and how he just would fail over and over and over and over again, and how he finally broke through that resistance. And the book is so inspiring. You can read it over and over and over again every time you feel like, “Oh, I’m just not getting a lot of stuff done. Oh, I’m procrastinating a lot.” And like I said, I’ve always struggled with procrastination like I think many people in the tech field have. So for me, my favorite book, ‘The War of Art’. And those are all the picks that I have.
JAMES: Awesome. Thank you very much. I want to thank Steve for coming on the show and talking to us and telling us all about the great scaling that he did on Twitpic and everything we can learn from that. So, thank you Steve.
AVDI: It was a lot.
STEVE: Yeah, you’re so welcome.
SARON: Thank you.
STEVE: You guys are so welcome. By the way, anyone listening to this, if you want to come work with me at BigCommerce, we’re in San Francisco and we’re hiring engineers, Ruby engineers, Go engineers, and PHP engineers, like crazy. Send me an email or hit me on Twitter and I’ll get you the VIP treatment and you can come work with me in San Francisco.
JAMES: Sounds awesome. Before we close this ep, I want to mention we are doing our book club and we are reading ‘Refactoring: Ruby Edition’. That’s the Addison-Wesley series book, so it’s got the red cover you’re probably used to seeing if you read Ruby books. And there is a companion work book you may want to work through called ‘Refactoring in Ruby’. That one has spaghetti on the cover. So, I guess it’s the spaghetti book. And those are what we’re reading. We’re working on setting up a date for that to talk with the author. So, we’ll get that set and let you know when the episode’s going to be. But you’ve got time to pick up the books now and follow along. And I think that’s it for us. So, we will call it quits here and see you next week. Bye.